Zhangxi Lin, Texas Tech University
Lecture Notes 10: CRM Segmentation - Introduction
Slide 2
Outline
- CRM and Segmentation
- Review of Clustering Data Mining
- Types of Clustering
- K-Means Clustering
- Hierarchical Clustering
Slide 3
Segmentation in the Context of CRM
- Segmentation: subdividing the population according to known good discriminators
- Applying clustering data mining can help segmentation
Slide 4
Segmentation Types and Methods
- In the context of CRM, the terms segmentation and record classification are used interchangeably
- Customer profiling: to gain insight into the four Ws (who, what, where, and when)
- Customer likeness clustering
- RFM cell classification grouping
- Purchase affinity clustering
Slide 5
Mass Customization vs. Mass Marketing
- Mass customization: tailoring the product/service/promotion to each individual customer, a few customers, or a segment
- Fact: 29% of all marketing services are classified as mass marketing
Slide 6
Promotions or Communications by Segment Groups
Case: three customer profile groups
- Medium-sized companies: good customers, purchasing from direct channels
- Small-sized companies: purchase only a few products; need specific services
- Large companies: not loyal enough
Slide 7
Long Tail Theory
The phrase "The Long Tail" (as a proper noun, with capitalized letters) was first coined by Chris Anderson in an October 2004 Wired magazine article to describe the niche strategy of businesses, such as Amazon.com or Netflix, that sell a large number of unique items, each in relatively small quantities. The low distribution and inventory costs of these businesses allow them to realize significant profit from selling small volumes of hard-to-find items to many customers, instead of selling only large volumes of a reduced number of popular items. The group of people who buy the hard-to-find or "non-hit" items is the customer demographic called the Long Tail.
Given a large enough availability of choice, a large population of customers, and negligible stocking and distribution costs, the selection and buying pattern of the population results in a power-law distribution curve, or Pareto distribution. This suggests that a market with a high freedom of choice will create a certain degree of inequality by favoring the upper 20% of the items ("hits" or "head") over the other 80% ("non-hits" or "long tail"). A small simulation of this pattern follows below.
Slide 8
Long Tail Theory
Slide 9
Demonstration
- Dataset: Buytest
- Tasks: distributions of variables
Slide 10
Types of Clustering
Slide 11
Types of Clustering
- A clustering is a set of clusters
- An important distinction is between hierarchical and partitional sets of clusters
- Partitional clustering: a division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset
- Hierarchical clustering: a set of nested clusters organized as a hierarchical tree
Slide 12
Partitional Clustering
[Figure: original points and a partitional clustering of them]
Slide 13
Hierarchical Clustering
[Figure: traditional and non-traditional hierarchical clusterings with their corresponding dendrograms]
Slide 14
Other Distinctions Between Sets of Clusters
- Exclusive versus non-exclusive: in non-exclusive clustering, points may belong to multiple clusters; this can represent multiple classes or border points
- Fuzzy versus non-fuzzy: in fuzzy clustering, a point belongs to every cluster with some weight between 0 and 1, and the weights must sum to 1; probabilistic clustering has similar characteristics
- Partial versus complete: in some cases, we only want to cluster some of the data
- Heterogeneous versus homogeneous: clusters of widely different sizes, shapes, and densities
Slide 15
Types of Clusters (the outcomes of clustering)
- Well-separated clusters
- Center-based clusters
- Contiguous clusters
- Density-based clusters
- Property or conceptual clusters
- Clusters described by an objective function
Slide 16
Types of Clusters: Well-Separated
- Well-separated clusters: a cluster is a set of points such that any point in a cluster is closer (or more similar) to every other point in the cluster than to any point not in the cluster
[Figure: 3 well-separated clusters]
Slide 17
Types of Clusters: Center-Based
- Center-based: a cluster is a set of objects such that an object in a cluster is closer (more similar) to the center of its own cluster than to the center of any other cluster
- The center of a cluster is often a centroid (the average of all the points in the cluster) or a medoid (the most representative point of the cluster)
[Figure: 4 center-based clusters]
Slide 18
Types of Clusters: Contiguity-Based
- Contiguous cluster (nearest neighbor or transitive): a cluster is a set of points such that a point in a cluster is closer (or more similar) to one or more other points in the cluster than to any point not in the cluster
[Figure: 8 contiguous clusters]
Slide 19
Types of Clusters: Density-Based
- Density-based: a cluster is a dense region of points, separated from other regions of high density by low-density regions
- Used when the clusters are irregular or intertwined, and when noise and outliers are present
[Figure: 6 density-based clusters]
Slide 20
Types of Clusters: Conceptual Clusters
- Shared property or conceptual clusters: find clusters that share some common property or represent a particular concept
[Figure: 2 overlapping circles]
Slide 21
Types of Clusters: Objective Function
- Clusters defined by an objective function: find clusters that minimize or maximize an objective function
- Enumerating all possible ways of dividing the points into clusters and evaluating the 'goodness' of each potential set of clusters with the given objective function is NP-hard
- Objectives can be global or local: hierarchical clustering algorithms typically have local objectives, while partitional algorithms typically have global objectives
- A variation of the global objective function approach is to fit the data to a parameterized model, with the model's parameters determined from the data; mixture models, for example, assume that the data is a 'mixture' of a number of statistical distributions
Slide 22
Types of Clusters: Objective Function
- Map the clustering problem to a different domain and solve a related problem in that domain: the proximity matrix defines a weighted graph, where the nodes are the points being clustered and the weighted edges represent the proximities between points
- Clustering is then equivalent to breaking the graph into connected components, one for each cluster
- We want to minimize the edge weight between clusters and maximize the edge weight within clusters (see the sketch below)
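A minimal sketch of this graph view: threshold a proximity (distance) matrix into a graph and read the clusters off as connected components. The sample points and the threshold value of 1.0 are assumptions for illustration.

```python
# Graph view of clustering: edges connect nearby points; clusters are
# the connected components of the resulting graph.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from scipy.spatial.distance import cdist

points = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
dist = cdist(points, points)          # proximity matrix

adjacency = csr_matrix(dist < 1.0)    # connect points closer than 1.0
n_clusters, labels = connected_components(adjacency, directed=False)
print(n_clusters, labels)             # 2 clusters: [0 0 1 1]
```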
Slide 23
Distance Measures for Clustering
Slide 24
Manhattan Distance
For two points $(u_1, v_1)$ and $(u_2, v_2)$:
$L_1 = |u_1 - u_2| + |v_1 - v_2|$
Slide 25
Euclidean Distance
For two points $(u_1, v_1)$ and $(u_2, v_2)$:
$L_2 = \sqrt{(u_1 - u_2)^2 + (v_1 - v_2)^2}$
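A quick numeric check of the two formulas, with example coordinates assumed for illustration:

```python
# Manhattan (L1) and Euclidean (L2) distance between two assumed points.
u1, v1 = 1, 2
u2, v2 = 4, 6

l1 = abs(u1 - u2) + abs(v1 - v2)               # 3 + 4 = 7
l2 = ((u1 - u2) ** 2 + (v1 - v2) ** 2) ** 0.5  # sqrt(9 + 16) = 5.0
print(l1, l2)  # 7 5.0
```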
Slide 26
Euclidean Distance
In general, for $n$-dimensional points $p$ and $q$:
$d(p, q) = \sqrt{\sum_{k=1}^{n} (p_k - q_k)^2}$
[Figure: sample points and their Euclidean distance matrix]
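A sketch of building such a distance matrix in numpy; the four 2-D sample points are assumed for illustration:

```python
# Pairwise Euclidean distance matrix for four assumed points.
import numpy as np

points = np.array([[0, 2], [2, 0], [3, 1], [5, 1]])
diff = points[:, None, :] - points[None, :, :]   # all pairwise differences
dist_matrix = np.sqrt((diff ** 2).sum(axis=-1))  # Euclidean lengths
print(np.round(dist_matrix, 3))
```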
Slide 27
Minkowski Distance
Minkowski distance is a generalization of Euclidean distance:
$d(p, q) = \left( \sum_{k=1}^{n} |p_k - q_k|^r \right)^{1/r}$
where $r$ is a parameter, $n$ is the number of dimensions (attributes), and $p_k$ and $q_k$ are, respectively, the $k$th attributes (components) of data objects $p$ and $q$.
Slide 28
Minkowski Distance: Examples
- $r = 1$: city block (Manhattan, taxicab, $L_1$ norm) distance. A common example is the Hamming distance, which is just the number of bits that differ between two binary vectors
- $r = 2$: Euclidean ($L_2$ norm) distance
- $r \to \infty$: supremum ($L_{\max}$ norm, $L_\infty$ norm) distance. This is the maximum difference between any components of the two vectors
- Do not confuse $r$ with $n$: all these distances are defined for any number of dimensions
A sketch of the three cases follows below.
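The following sketch evaluates the three special cases on two assumed vectors:

```python
# Minkowski distance for r = 1, r = 2, and r -> infinity (supremum).
import numpy as np

p = np.array([0.0, 2.0])
q = np.array([5.0, 1.0])

def minkowski(p, q, r):
    return (np.abs(p - q) ** r).sum() ** (1 / r)

print(minkowski(p, q, 1))   # L1 / Manhattan: 6.0
print(minkowski(p, q, 2))   # L2 / Euclidean: ~5.099
print(np.abs(p - q).max())  # supremum (L-infinity): 5.0
```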
Slide 29
Minkowski Distance
[Figure: sample points and their $L_1$, $L_2$, and $L_\infty$ distance matrices]
Slide 30
Cosine Similarity
If $d_1$ and $d_2$ are two document vectors, then
$\cos(d_1, d_2) = (d_1 \cdot d_2) \,/\, (\|d_1\| \, \|d_2\|)$
where $\cdot$ indicates the vector dot product and $\|d\|$ is the length of vector $d$.
Example:
$d_1 = (3, 2, 0, 5, 0, 0, 0, 2, 0, 0)$
$d_2 = (1, 0, 0, 0, 0, 0, 0, 1, 0, 2)$
$d_1 \cdot d_2 = 3{\cdot}1 + 2{\cdot}0 + 0{\cdot}0 + 5{\cdot}0 + 0{\cdot}0 + 0{\cdot}0 + 0{\cdot}0 + 2{\cdot}1 + 0{\cdot}0 + 0{\cdot}2 = 5$
$\|d_1\| = \sqrt{3^2 + 2^2 + 0^2 + 5^2 + 0^2 + 0^2 + 0^2 + 2^2 + 0^2 + 0^2} = \sqrt{42} \approx 6.481$
$\|d_2\| = \sqrt{1^2 + 0^2 + 0^2 + 0^2 + 0^2 + 0^2 + 0^2 + 1^2 + 0^2 + 2^2} = \sqrt{6} \approx 2.449$
$\cos(d_1, d_2) = 5 / (6.481 \times 2.449) \approx 0.315$
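The worked example can be verified with a few lines of numpy:

```python
# Verify the cosine-similarity example above.
import numpy as np

d1 = np.array([3, 2, 0, 5, 0, 0, 0, 2, 0, 0])
d2 = np.array([1, 0, 0, 0, 0, 0, 0, 1, 0, 2])

cos = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(round(float(cos), 4))  # 0.315
```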
Slide 31
K-Means Clustering
Slide 32
K-means Clustering
- Partitional clustering approach
- Each cluster is associated with a centroid (center point)
- Each point is assigned to the cluster with the closest centroid
- The number of clusters, K, must be specified
- The basic algorithm is very simple (a sketch follows below)
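A minimal sketch of the basic algorithm: select K initial centroids, assign each point to the closest centroid, recompute the centroids, and repeat until they stop changing. Random initialization and Euclidean distance are the usual choices; empty clusters are deliberately not handled here.

```python
# Minimal K-means sketch; illustrative, not production code.
import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Select K points as the initial centroids (randomly)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # 2. Assign each point to the closest centroid (Euclidean)
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :],
                               axis=-1)
        labels = dists.argmin(axis=1)
        # 3. Recompute the centroid (mean) of each cluster
        #    (an empty cluster would yield NaNs; not handled here)
        new_centroids = np.array([points[labels == i].mean(axis=0)
                                  for i in range(k)])
        # 4. Stop when the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```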
Slide 33
K-means Clustering Details
- Initial centroids are often chosen randomly; the clusters produced vary from one run to another
- The centroid is (typically) the mean of the points in the cluster
- 'Closeness' is measured by Euclidean distance, cosine similarity, correlation, etc.
- K-means will converge for the common similarity measures mentioned above; most of the convergence happens in the first few iterations, so the stopping condition is often changed to 'until relatively few points change clusters'
- Complexity is $O(n \times K \times I \times d)$, where n = number of points, K = number of clusters, I = number of iterations, d = number of attributes
Slide 34
Two Different K-means Clusterings
[Figure: original points, a sub-optimal clustering, and an optimal clustering]
Slide 35
Importance of Choosing Initial Centroids
[Figure: K-means iterations from one choice of initial centroids]
Slide 36
Importance of Choosing Initial Centroids (continued)
[Figure: the remaining K-means iterations for the same initial centroids]
Slide 37
Evaluating K-means Clusters
- The most common measure is the Sum of Squared Error (SSE): for each point, the error is its distance to the nearest cluster center; to get SSE, we square these errors and sum them:
$SSE = \sum_{i=1}^{K} \sum_{x \in C_i} dist^2(m_i, x)$
where $x$ is a data point in cluster $C_i$ and $m_i$ is the representative point for cluster $C_i$; one can show that $m_i$ corresponds to the center (mean) of the cluster
- Given two clusterings, we can choose the one with the smallest error
- One easy way to reduce SSE is to increase K, the number of clusters; still, a good clustering with smaller K can have a lower SSE than a poor clustering with higher K
A code sketch of SSE follows below.
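A sketch of SSE in code, reusing the output of the kmeans sketch given earlier (the variable names are assumptions carried over from that sketch):

```python
# Sum of squared Euclidean distances from each point to its centroid.
import numpy as np

def sse(points, labels, centroids):
    return float(((points - centroids[labels]) ** 2).sum())

# Usage with the earlier sketch:
#   centroids, labels = kmeans(points, k=3)
#   print(sse(points, labels, centroids))
```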
Slide 38
Importance of Choosing Initial Centroids
[Figure: K-means iterations from another choice of initial centroids]
Slide 39
Importance of Choosing Initial Centroids (continued)
[Figure: the remaining K-means iterations for the same initial centroids]
Slide 40
Problems with Selecting Initial Points
- If there are K 'real' clusters, then the chance of selecting one centroid from each cluster is small; the chance is especially small when K is large
- If the clusters are the same size, n, then
$P = \dfrac{K! \, n^K}{(Kn)^K} = \dfrac{K!}{K^K}$
- For example, if K = 10, the probability is $10!/10^{10} = 0.00036$ (checked below)
- Sometimes the initial centroids will readjust themselves in the right way, and sometimes they don't
- Consider the following example of five pairs of clusters
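The arithmetic is easy to check:

```python
# P(one initial centroid lands in each of K equal-size clusters)
#   = K! * n^K / (K*n)^K = K! / K^K
import math

K = 10
print(math.factorial(K) / K**K)  # 0.00036288
```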
Slide 41
10 Clusters Example
Starting with two initial centroids in one cluster of each pair of clusters
[Figure: K-means iterations]
Slide 42
10 Clusters Example (continued)
Starting with two initial centroids in one cluster of each pair of clusters
[Figure: the remaining K-means iterations]
Slide 43
10 Clusters Example
Starting with some pairs of clusters having three initial centroids, while others have only one
[Figure: K-means iterations]
Slide 44
10 Clusters Example (continued)
Starting with some pairs of clusters having three initial centroids, while others have only one
[Figure: the remaining K-means iterations]
Slide 45
Hierarchical Clustering
Slide 46
Hierarchical Clustering
- Produces a set of nested clusters organized as a hierarchical tree
- Can be visualized as a dendrogram: a tree-like diagram that records the sequences of merges or splits
Slide 47
Strengths of Hierarchical Clustering
- No need to assume any particular number of clusters: any desired number of clusters can be obtained by cutting the dendrogram at the proper level
- The clusters may correspond to meaningful taxonomies, for example in the biological sciences (e.g., the animal kingdom, phylogeny reconstruction, ...)
Slide 48
Hierarchical Clustering
Two main types of hierarchical clustering:
- Agglomerative: start with the points as individual clusters; at each step, merge the closest pair of clusters until only one cluster (or k clusters) is left
- Divisive: start with one, all-inclusive cluster; at each step, split a cluster until each cluster contains a single point (or there are k clusters)
Traditional hierarchical algorithms use a similarity or distance matrix and merge or split one cluster at a time.
Slide 49
Agglomerative Clustering Algorithm
- The more popular hierarchical clustering technique
- The basic algorithm is straightforward:
1. Compute the proximity matrix
2. Let each data point be a cluster
3. Repeat
4. Merge the two closest clusters
5. Update the proximity matrix
6. Until only a single cluster remains
- The key operation is the computation of the proximity of two clusters; different approaches to defining the distance between clusters distinguish the different algorithms (a sketch follows below)
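A minimal sketch of the algorithm above using single-link (MIN) proximity. For brevity it recomputes cluster-to-cluster proximities from the original point-to-point matrix instead of maintaining matrix updates, which keeps the logic visible but makes it O(n^3); illustration only.

```python
# Naive agglomerative clustering sketch (single link).
import numpy as np
from scipy.spatial.distance import cdist

def agglomerative(points, k=1):
    prox = cdist(points, points)                  # 1. proximity matrix
    clusters = [[i] for i in range(len(points))]  # 2. one cluster per point
    while len(clusters) > k:                      # 3. repeat ...
        # 4. find the two closest clusters under MIN (single-link) proximity
        best = (np.inf, None, None)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = prox[np.ix_(clusters[a], clusters[b])].min()
                if d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # 5. merge them (the proximity "update" is implicit, since step 4
        #    always recomputes from the original matrix)
        clusters[a] += clusters.pop(b)
    return clusters                               # 6. until k clusters remain
```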
Slide 50
Starting Situation
Start with clusters of individual points and a proximity matrix.
[Figure: points p1-p5 and their proximity matrix]
Slide 51
Intermediate Situation
After some merging steps, we have some clusters.
[Figure: clusters C1-C5 and their proximity matrix]
Slide 52
Intermediate Situation
We want to merge the two closest clusters (C2 and C5) and update the proximity matrix.
[Figure: clusters C1-C5 and their proximity matrix]
Slide 53
After Merging
The question is: how do we update the proximity matrix?
[Figure: clusters C1, C2 ∪ C5, C3, C4; the proximity-matrix entries involving the merged cluster C2 ∪ C5 are unknown ('?')]
Slide 54
How to Define Inter-Cluster Similarity
- MIN
- MAX
- Group average
- Distance between centroids
- Other methods driven by an objective function (Ward's method uses squared error)
[Figure: points p1-p5 and their proximity matrix, asking 'Similarity?']
Slide 59
Cluster Similarity: MIN or Single Link
- Similarity of two clusters is based on the two most similar (closest) points in the different clusters
- Determined by one pair of points, i.e., by one link in the proximity graph
[Figure: five points with the single-link proximity highlighted]
Slide 60
Hierarchical Clustering: MIN
[Figure: nested clusters and the corresponding dendrogram]
Slide 61
Strength of MIN
Can handle non-elliptical shapes.
[Figure: original points and the two clusters found]
Slide 62
Limitations of MIN
Sensitive to noise and outliers.
[Figure: original points and the two clusters found]
Slide 63
Cluster Similarity: MAX or Complete Linkage
- Similarity of two clusters is based on the two least similar (most distant) points in the different clusters
- Determined by all pairs of points in the two clusters
[Figure: five points with the complete-link proximity highlighted]
Slide 64
Hierarchical Clustering: MAX
[Figure: nested clusters and the corresponding dendrogram]
Slide 65
Strength of MAX
Less susceptible to noise and outliers.
[Figure: original points and the two clusters found]
Slide 66
Limitations of MAX
Tends to break large clusters; biased towards globular clusters.
[Figure: original points and the two clusters found]
Slide 67
Cluster Similarity: Group Average
- Proximity of two clusters is the average of the pairwise proximities between points in the two clusters:
$proximity(C_i, C_j) = \dfrac{\sum_{p_i \in C_i,\; p_j \in C_j} proximity(p_i, p_j)}{|C_i| \times |C_j|}$
- Need to use average connectivity for scalability, since total proximity favors large clusters
[Figure: five points with the group-average proximity highlighted]
Slide 68
Hierarchical Clustering: Group Average
[Figure: nested clusters and the corresponding dendrogram]
Slide 69
Hierarchical Clustering: Group Average
- A compromise between single link and complete link
- Strength: less susceptible to noise and outliers
- Limitation: biased towards globular clusters
Slide 70
Cluster Similarity: Ward's Method
- Similarity of two clusters is based on the increase in squared error when the two clusters are merged
- Similar to group average if the distance between points is the squared distance
- Less susceptible to noise and outliers
- Biased towards globular clusters
- The hierarchical analogue of K-means; can be used to initialize K-means
A comparison of the linkage methods follows below.
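To see the four linkage criteria side by side, here is a sketch using scipy's hierarchical clustering on two assumed Gaussian blobs: linkage builds the merge sequence (the dendrogram data) and fcluster cuts it into a requested number of clusters. The toy data and parameters are assumptions for illustration.

```python
# Compare single, complete, average, and Ward linkage on toy data.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0, 0.3, (20, 2)),   # blob 1
                    rng.normal(3, 0.3, (20, 2))])  # blob 2

for method in ["single", "complete", "average", "ward"]:
    Z = linkage(points, method=method)               # merge history
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut into 2 clusters
    print(method, np.bincount(labels)[1:])           # cluster sizes
```

On well-separated blobs like these, all four methods recover the same two clusters; their behavior diverges on noisy, elongated, or unevenly sized clusters, as the preceding slides illustrate.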