
Market basket analysis - University of Massachusetts Amherst, people.math.umass.edu/~anna/stat697F/Chapter10_part1.pdf · 2015-04-14



Market basket analysis

Find joint values of the variables X = (X_1, ..., X_p) that appear most frequently in the data base. It is most often applied to binary-valued data X_j.

- In this context the observations are sales transactions, such as those occurring at the checkout counter of a store. The variables represent all of the items sold in the store. For observation i, each variable X_j is assigned one of two values:

  $$x_{ij} = \begin{cases} 1 & \text{if the } j\text{th item is purchased in the } i\text{th transaction} \\ 0 & \text{otherwise} \end{cases}$$

- Those variables that frequently have joint values of one represent items that are frequently purchased together. This information can be quite useful for stocking shelves, cross-marketing in sales promotions, catalog design, and consumer segmentation based on buying patterns.

Market basket analysis

Assume X_1, ..., X_p are all binary variables. Market basket analysis aims to find a subset of the integers K ⊂ {1, ..., p} such that the following is large:

  $$\Pr\Big(\bigcap_{k \in K} \{X_k = 1\}\Big).$$

- The set K is called an item set. The number of items in K is called its size.

- The above probability is called the support or prevalence, T(K), of the item set K. It is estimated by

  $$\hat{T}(K) = \frac{1}{N} \sum_{i=1}^{N} \prod_{k \in K} x_{ik}.$$

- An observation i for which $\prod_{k \in K} x_{ik} = 1$ is said to contain the item set K.

- Given a lower bound t, the market basket analysis seeks all item sets K_l with support in the data base greater than this lower bound t, i.e., {K_l | T(K_l) > t}.


The Apriori algorithm

The solution to the market basket analysis problem can be obtained with feasible computation for very large data bases, provided the threshold t is adjusted so that the solution consists of only a small fraction of all 2^p possible item sets. The "Apriori" algorithm (Agrawal et al., 1995) exploits several aspects of the curse of dimensionality to solve the problem with a small number of passes over the data. Specifically, for a given support threshold t:

- The cardinality |{K | T(K) > t}| is relatively small.

- Any item set L consisting of a subset of the items in K must have support greater than or equal to that of K, i.e., if L ⊂ K, then T(L) ≥ T(K).


The Apriori algorithm

- The first pass over the data computes the support of all single-item sets. Those whose support is less than the threshold are discarded.

- The second pass computes the support of all item sets of size two that can be formed from pairs of the single items surviving the first pass.

- Each successive pass over the data considers only those item sets that can be formed by combining those that survived the previous pass with those retained from the first pass.

- Passes over the data continue until all candidate item sets from the previous pass have support less than the specified threshold.

- The Apriori algorithm requires only one pass over the data for each item set size |K|, which is crucial since we assume the data cannot be fitted into a computer's main memory. If the data are sufficiently sparse (or if the threshold t is high enough), then the process will terminate in reasonable time even for huge data sets.
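The level-wise passes described above can be sketched in a few lines. This is a minimal in-memory illustration only (a real implementation streams transactions from disk and prunes candidates more carefully); the function and variable names are ours:

```python
def apriori(transactions, t):
    """Return all item sets with support > t, found level by level."""
    N = len(transactions)
    baskets = [set(tr) for tr in transactions]
    support = lambda items: sum(items <= b for b in baskets) / N

    # First pass: frequent single-item sets; infrequent items are discarded.
    items = {i for b in baskets for i in b}
    frequent = {frozenset([i]) for i in items if support(frozenset([i])) > t}
    singles = set(frequent)
    result = {K: support(K) for K in frequent}

    # Successive passes: extend survivors by one surviving single item.
    while frequent:
        candidates = {K | s for K in frequent for s in singles if not s <= K}
        frequent = {K for K in candidates if support(K) > t}
        result.update({K: support(K) for K in frequent})
    return result

freq = apriori([["bread", "jelly"], ["bread", "jelly", "pb"],
                ["bread", "milk"], ["jelly", "pb"]], t=0.3)
```

The monotonicity property above is what makes this valid: no superset of a discarded item set can exceed the threshold, so each pass only needs to consider combinations of survivors.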


Association rule

Each high-support item set K returned by the Apriori algorithm is cast into a set of "association rules." The items X_k, k ∈ K, are partitioned into two disjoint subsets, A ∪ B = K, and written

  A => B.

The first item subset A is called the "antecedent" and the second, B, the "consequent."

- The "support" of the rule, T(A => B), is the support of the item set K from which the rule is derived.

- The "confidence" or "predictability" C(A => B) of the rule is its support divided by the support of the antecedent,

  $$C(A \Rightarrow B) = \frac{T(A \Rightarrow B)}{T(A)},$$

  which can be viewed as an estimate of P(B|A).


- The "expected confidence" is defined as the support of the consequent, T(B), which is an estimate of the unconditional probability P(B).

- The "lift" of the rule is defined as the confidence divided by the expected confidence,

  $$L(A \Rightarrow B) = \frac{C(A \Rightarrow B)}{T(B)}.$$


Association rule: example

Suppose the item set K = {peanut butter, jelly, bread} and consider the rule {peanut butter, jelly} => {bread}.

- A support value of 0.03 for this rule means that peanut butter, jelly, and bread appeared together in 3% of the market baskets.

- A confidence of 0.82 for this rule implies that when peanut butter and jelly were purchased, 82% of the time bread was also purchased.

- If bread appeared in 43% of all market baskets, then the rule {peanut butter, jelly} => {bread} would have a lift of 0.82/0.43 ≈ 1.91.
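Under these definitions the example's numbers can be checked directly. In the sketch below, the antecedent support 0.0366 is a hypothetical value chosen so that the stated rule support and confidence are mutually consistent; only the 0.03, 0.82, and 0.43 figures come from the example:

```python
def rule_stats(support_AB, support_A, support_B):
    """Confidence C(A=>B) = T(A=>B)/T(A); lift L(A=>B) = C(A=>B)/T(B)."""
    confidence = support_AB / support_A
    lift = confidence / support_B
    return confidence, lift

# Rule support 3%, hypothetical antecedent support 3.66%, bread support 43%.
conf, lift = rule_stats(0.03, 0.0366, 0.43)
```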


Association rule

- Sometimes the desired output of the entire analysis is a collection of association rules that satisfy the constraints

  T(A => B) > t and C(A => B) > c

  for some thresholds t and c. For example: display all transactions in which ice skates are the consequent that have confidence over 80% and support of more than 2%. Efficient algorithms based on the Apriori algorithm have been developed for this purpose.

- Association rules have become a popular tool for analyzing very large commercial data bases in settings where market basket analysis is relevant, that is, when the data can be cast in the form of a multidimensional contingency table. The output is in the form of conjunctive rules that are easily understood and interpreted.


- The Apriori algorithm allows this analysis to be applied to huge data bases, much larger than those amenable to other types of analyses. Association rules are among data mining's biggest successes.

- The number of solution item sets, their size, and the number of passes required over the data can grow exponentially as the support threshold decreases. Thus, rules with high confidence or lift, but low support, will not be discovered. For example, a high-confidence rule such as vodka => caviar will not be uncovered owing to the low sales volume of the consequent, caviar.


We illustrate the use of Apriori on a moderately sized demographics data base. This data set consists of N = 9409 questionnaires filled out by shopping mall customers in the San Francisco Bay Area (Impact Resources, Inc., Columbus OH, 1987). Here we use answers to the first 14 questions, relating to demographics, for illustration.


A freeware implementation of the Apriori algorithm due to Christian Borgelt is used.

- After removing observations with missing values, each ordinal predictor was cut at its median and coded by two dummy variables; each categorical predictor with k categories was coded by k dummy variables.

- This resulted in a 6876 × 50 matrix of 6876 observations on 50 dummy variables.

- The algorithm found a total of 6288 association rules, involving ≤ 5 predictors, with support of at least 10%. Understanding this large set of rules is itself a challenging data analysis task.


Here are three examples of association rules found by the Apriori algorithm (the rules themselves appeared as figures in the original slides):

- Association rule 1: support 25%, confidence 99.7%, and lift 1.03.

- Association rule 2: support 13.4%, confidence 80.8%, and lift 2.13.

- Association rule 3: support 26.5%, confidence 82.8%, and lift 2.15.

Cluster analysis

- Group or segment a collection of objects into subsets or "clusters," such that those within each cluster are more closely related to one another than objects assigned to different clusters.

- Central to all of the goals of cluster analysis is the notion of the degree of similarity (or dissimilarity) between the individual objects being clustered. A clustering method attempts to group the objects based on the definition of similarity supplied to it.

- The definition of similarity can only come from subject matter considerations. The situation is somewhat similar to the specification of a loss or cost function in prediction problems (supervised learning): there, the cost associated with an inaccurate prediction depends on considerations outside the data.


Proximity matrices

- Most algorithms presume a matrix of dissimilarities with nonnegative entries and zero diagonal elements: d_{ii} = 0, i = 1, 2, ..., N.

- If the original data were collected as similarities, a suitable monotone-decreasing function can be used to convert them to dissimilarities.

- Most algorithms also assume symmetric dissimilarity matrices, so if the original matrix D is not symmetric it must be replaced by (D + D^T)/2.
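The symmetrization step is a one-liner in practice; a sketch using plain lists (no external libraries, function name is ours):

```python
def symmetrize(D):
    """Replace D by (D + D^T)/2; the zero diagonal is preserved."""
    N = len(D)
    return [[(D[i][j] + D[j][i]) / 2 for j in range(N)] for i in range(N)]

# A hypothetical asymmetric dissimilarity matrix with zero diagonal.
D = [[0, 2, 5],
     [4, 0, 1],
     [3, 7, 0]]
S = symmetrize(D)
```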


Attribute dissimilarity

For the jth attribute of objects x_i and x_{i'}, let d_j(x_{ij}, x_{i'j}) be the dissimilarity between them on the jth attribute.

- Quantitative variables: d_j(x_{ij}, x_{i'j}) = (x_{ij} − x_{i'j})^2.

- Ordinal variables: error measures for ordinal variables are generally defined by replacing their M original values with

  $$\frac{i - 1/2}{M}, \quad i = 1, ..., M,$$

  in the prescribed order of their original values. They are then treated as quantitative variables on this scale.

- Nominal variables:

  $$d_j(x_{ij}, x_{i'j}) = \begin{cases} 1 & \text{if } x_{ij} \neq x_{i'j} \\ 0 & \text{otherwise} \end{cases}$$
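The three attribute-level dissimilarities translate directly into code; a sketch (the function names are ours, and the nominal case scores 1 for a mismatch):

```python
def d_quantitative(x, y):
    """Squared-error dissimilarity for quantitative attributes."""
    return (x - y) ** 2

def ordinal_score(i, M):
    """Map the i-th of M ordered values (i = 1..M) to (i - 1/2)/M."""
    return (i - 0.5) / M

def d_ordinal(i, j, M):
    """Ordinal attributes: rescale to (i - 1/2)/M, then treat as quantitative."""
    return d_quantitative(ordinal_score(i, M), ordinal_score(j, M))

def d_nominal(x, y):
    """Nominal attributes: 1 for a mismatch, 0 for a match."""
    return 0 if x == y else 1
```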


Object Dissimilarity

Combining the p individual attribute dissimilarities d_j(x_{ij}, x_{i'j}), j = 1, ..., p, into a single overall measure of dissimilarity D(x_i, x_{i'}) between two objects or observations is usually done through a convex combination:

  $$D(x_i, x_{i'}) = \sum_{j=1}^{p} w_j\, d_j(x_{ij}, x_{i'j}), \qquad \sum_{j=1}^{p} w_j = 1.$$

- It is important to realize that setting the weight w_j to the same value for each variable does not necessarily give all attributes equal influence. When the squared-error distance is used, the relative importance of each variable is proportional to its variance over the data.

- With the squared-error distance, setting each weight to the inverse of the corresponding variance leads to equal influence of all attributes on the overall dissimilarity between objects.
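The inverse-variance claim can be checked numerically: with weights proportional to 1/var(X_j), the two attributes below contribute equally to the overall dissimilarity despite very different scales. A sketch with made-up data:

```python
def variance(xs):
    """Population variance of a list of numbers."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def weighted_dissimilarity(xi, xj, w):
    """Convex combination of squared-error attribute dissimilarities."""
    return sum(wk * (a - b) ** 2 for wk, a, b in zip(w, xi, xj))

# Two attributes on very different scales.
data = [(1.0, 100.0), (2.0, 300.0), (3.0, 500.0)]
cols = list(zip(*data))
inv_var = [1 / variance(c) for c in cols]
total = sum(inv_var)
w = [v / total for v in inv_var]   # normalized so the weights sum to 1
d = weighted_dissimilarity(data[0], data[2], w)
```

Here the first attribute differs by 2 and the second by 400 between the two objects, yet each contributes the same amount to d.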


Standardization in clustering?

- If the goal is to discover natural groupings in the data, some attributes may exhibit more of a grouping tendency than others. Variables that are more relevant in separating the groups should be assigned a higher influence in defining object dissimilarity. Giving all attributes equal influence in this case will tend to obscure the groups, to the point where a clustering algorithm cannot uncover them.

- Although simple generic prescriptions for choosing the individual attribute dissimilarities d_j(x_{ij}, x_{i'j}) and their weights w_j can be comforting, there is no substitute for careful thought in the context of each individual problem. Specifying an appropriate dissimilarity measure is far more important to success with clustering than the choice of clustering algorithm. This aspect of the problem is emphasized less in the clustering literature than the algorithms themselves, since it depends on domain knowledge specifics and is less amenable to general research.


Standardization in clustering?

Figure: Simulated data. On the left, K-means clustering (with K = 2) has been applied to the raw data; the two colors indicate the cluster memberships. On the right, the features were first standardized before clustering; this is equivalent to using feature weights 1/[2 var(X_j)]. The standardization has obscured the two well-separated groups. Note that each plot uses the same units on the horizontal and vertical axes.


K-means clustering

The K-means algorithm is one of the most popular iterative descent clustering methods. It is intended for situations in which all variables are of the quantitative type, and squared Euclidean distance is chosen as the dissimilarity measure. The within-cluster point scatter is

  $$W(C) = \frac{1}{2} \sum_{k=1}^{K} \sum_{C(i)=k} \sum_{C(i')=k} \|x_i - x_{i'}\|^2 = \sum_{k=1}^{K} N_k \sum_{C(i)=k} \|x_i - \bar{x}_k\|^2,$$

where $\bar{x}_k$ is the mean vector associated with the kth cluster under the clustering rule C(i), and N_k is the number of observations belonging to cluster k.


The K-means clustering algorithm aims to find a clustering rule C* such that

  $$C^* = \arg\min_{C} \sum_{k=1}^{K} N_k \sum_{C(i)=k} \|x_i - \bar{x}_k\|^2.$$

Notice that

  $$\bar{x}_S = \arg\min_{m} \sum_{i \in S} \|x_i - m\|^2.$$

Hence we can obtain C* by solving the enlarged optimization problem

  $$\min_{C,\, m_1, m_2, ..., m_K} \sum_{k=1}^{K} N_k \sum_{C(i)=k} \|x_i - m_k\|^2.$$

This can be minimized by the alternating optimization procedure given in the next algorithm.
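The alternating procedure (assign each point to its nearest mean, then recompute each mean as its cluster average) can be sketched as follows; this is a minimal version of Lloyd's algorithm, and the function name and toy data are ours:

```python
def kmeans(X, means, iters=20):
    """Alternate assignment and mean-update steps of K-means."""
    for _ in range(iters):
        # Step 1: for fixed means, assign each point to its closest mean.
        C = [min(range(len(means)),
                 key=lambda k: sum((x - m) ** 2 for x, m in zip(xi, means[k])))
             for xi in X]
        # Step 2: for fixed assignments, set each mean to its cluster average.
        for k in range(len(means)):
            members = [X[i] for i in range(len(X)) if C[i] == k]
            if members:
                means[k] = [sum(col) / len(members) for col in zip(*members)]
    return C, means

# Two well-separated clumps; deliberately poor starting means.
X = [[0.0, 0.0], [0.5, 0.0], [10.0, 10.0], [10.5, 10.0]]
C, means = kmeans(X, means=[[0.0, 0.0], [1.0, 1.0]])
```

Each step can only decrease the objective, which is why the procedure converges (though possibly to a local minimum, as noted below).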


- K-means is guaranteed to converge. However, the result may represent a suboptimal local minimum.

- One should start the algorithm with many different random choices for the starting means, and choose the solution having the smallest value of the objective function.

- K-means clustering has shortcomings. For one, it does not give a linear ordering of objects within a cluster.

- Secondly, as the number of clusters K is changed, the cluster memberships can change in arbitrary ways; that is, with say four clusters, the clusters need not be nested within those obtained with three. For these reasons, hierarchical clustering may be preferable in such applications.


Figure: Successive iterations of the K-means clustering algorithm for simulated data.


K-medoids clustering


K-medoids clustering

- Medoids clustering does not require all variables to be of the quantitative type.

- Squared Euclidean distance can be replaced with distances robust to outliers.

- Finding the center of each cluster with medoids clustering costs O(N_k^2) flops, while with K-means it costs O(N_k). Thus, K-medoids is far more computationally intensive than K-means.


Initialization of K centers

1. Choose one center uniformly at random from among the data points.

2. For each data point x, compute D(x), the distance between x and the nearest center that has already been chosen.

3. Choose one new data point at random as a new center, using a weighted probability distribution where a point x is chosen with probability proportional to D(x)^2.

4. Repeat Steps 2 and 3 until K centers have been chosen.

5. Now that the initial centers have been chosen, proceed using standard K-means clustering.

This seeding method yields considerable improvement in the final error of K-means. Although the initial selection takes extra time, the K-means part itself converges very quickly after this seeding, so the algorithm actually lowers the total computation time.
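The seeding steps above (this is the k-means++ scheme) can be sketched as follows; `kmeanspp_init` is a hypothetical name, and a fixed seed is used only for reproducibility:

```python
import random

def kmeanspp_init(X, k, seed=0):
    """Choose k initial centers by D(x)^2-weighted sampling."""
    rng = random.Random(seed)
    centers = [rng.choice(X)]                  # Step 1: uniform choice
    while len(centers) < k:
        # Step 2: squared distance from each point to its nearest center.
        d2 = [min(sum((a - b) ** 2 for a, b in zip(x, c)) for c in centers)
              for x in X]
        # Step 3: sample a new center with probability proportional to D(x)^2.
        centers.append(rng.choices(X, weights=d2, k=1)[0])
    return centers

# Two clumps: the weighting strongly favors a second center far from the first.
X = [[0.0, 0.0], [0.1, 0.0], [9.0, 9.0], [9.1, 9.0]]
centers = kmeanspp_init(X, k=2)
```

An already-chosen center has weight zero, so the same point is never picked twice.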


Choice of K

- Cross-validation chooses a large K, because the within-cluster dissimilarity W_K decreases as K increases, even for test data!

- An estimate K̂ of the optimal K* can be obtained by identifying a "kink" in the plot of W_K as a function of K, that is, a sharp decrease of W_K followed by only slight decreases.

- The Gap statistic chooses the K* where the data look most clustered when compared to uniformly distributed data:

  - For each K, compute the within-cluster dissimilarity W̃_K of m sets of randomly generated, uniformly distributed data, and the standard deviation s_K of log W̃_K over those sets.

  - Choose the smallest K such that $G(K) \ge G(K+1) - s_{K+1}\sqrt{1 + 1/m}$, with $G(K) = \log \tilde{W}_K - \log W_K$.
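The selection rule can be sketched numerically. The rule below follows Tibshirani et al.'s form of the gap statistic; all numeric values are made up so that the gap G(K) = log W̃_K − log W_K widens sharply up to K = 2 and then flattens:

```python
import math

def choose_k(logW, logW_ref, s, m):
    """Smallest K with G(K) >= G(K+1) - s[K+1]*sqrt(1 + 1/m).

    Lists are 0-indexed: entry i corresponds to K = i + 1 clusters.
    """
    G = [wr - w for w, wr in zip(logW, logW_ref)]
    for i in range(len(G) - 1):
        if G[i] >= G[i + 1] - s[i + 1] * math.sqrt(1 + 1 / m):
            return i + 1
    return len(G)

logW = [5.0, 3.0, 2.8, 2.7]        # observed log W_K for K = 1..4
logW_ref = [5.1, 4.6, 4.3, 4.1]    # mean log W~_K from uniform reference data
s = [0.1, 0.1, 0.1, 0.1]           # std dev of log W~_K over the m sets
k = choose_k(logW, logW_ref, s, m=10)
```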


Gap statistic


Hierarchical clustering (Agglomerative)

- Produce hierarchical representations in which the clusters at each level of the hierarchy are created by merging clusters at the next lower level.

- Do not require an initial configuration assignment or an initial choice of the number of clusters.

- Do require the user to specify a measure of dissimilarity between (disjoint) groups of observations, based on the pairwise dissimilarities among the observations in the two groups.


Agglomerative clustering

- Begin with every observation representing a singleton cluster.

- At each of the N − 1 steps, the closest two (least dissimilar) clusters are merged into a single cluster, producing one less cluster at the next higher level, according to a measure of dissimilarity between two clusters. Let G and H represent two groups.

- Single linkage (SL) clustering:

  $$d_{SL}(G, H) = \min_{i \in G,\, i' \in H} d_{ii'}$$


- Complete linkage (CL) clustering:

  $$d_{CL}(G, H) = \max_{i \in G,\, i' \in H} d_{ii'}$$

- Group average (GA) clustering:

  $$d_{GA}(G, H) = \frac{1}{N_G N_H} \sum_{i \in G} \sum_{i' \in H} d_{ii'}$$

  where N_G and N_H are the respective numbers of observations in each group.
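The three group dissimilarities are direct translations of the formulas above; a sketch on a hypothetical 4-object dissimilarity matrix, with function names of our choosing:

```python
def d_single(G, H, d):
    """Single linkage: smallest pairwise dissimilarity across the groups."""
    return min(d[i][j] for i in G for j in H)

def d_complete(G, H, d):
    """Complete linkage: largest pairwise dissimilarity across the groups."""
    return max(d[i][j] for i in G for j in H)

def d_average(G, H, d):
    """Group average: mean of all cross-group dissimilarities."""
    return sum(d[i][j] for i in G for j in H) / (len(G) * len(H))

# Hypothetical symmetric dissimilarity matrix for 4 objects.
d = [[0, 1, 6, 8],
     [1, 0, 5, 7],
     [6, 5, 0, 2],
     [8, 7, 2, 0]]
G, H = [0, 1], [2, 3]
```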


Comparison of different cluster dissimilarity measures

- If the data exhibit a strong clustering tendency, with each of the clusters being compact and well separated from the others, then all three methods produce similar results. Clusters are compact if all of the observations within them are relatively close together (small dissimilarities) as compared with observations in different clusters.

- Single linkage has a tendency to combine, at relatively low thresholds, observations linked by a series of close intermediate observations. This phenomenon, referred to as chaining, is often considered a defect of the method. The clusters produced by single linkage can violate the "compactness" property.

- Complete linkage will tend to produce compact clusters. However, it can produce clusters that violate the "closeness" property; that is, observations assigned to a cluster can be much closer to members of other clusters than they are to some members of their own cluster.

- Group average clustering represents a compromise between the two.


Comparison of different cluster dissimilarity measures

- Single linkage and complete linkage clustering are invariant to monotone transformations of the distance function, but group average clustering is not.

- The group average dissimilarity is an estimate of

  $$\int \int d(x, x')\, p_G(x)\, p_H(x')\, dx\, dx',$$

  which is a kind of distance between the two densities p_G for group G and p_H for group H. On the other hand, the single linkage dissimilarity approaches zero and the complete linkage dissimilarity approaches infinity as N → ∞. Thus, it is not clear what aspects of the population distribution are being estimated by those two group dissimilarities.
