Clustering analysis workshop CITM, Lab 3 18, Oct 2014 Facilitator: Hosam Al-Samarraie, PhD.

Outline

• The basic concepts of cluster analysis.
• The different types of clustering procedures.
• How to execute and generate clustering results.
• The SPSS clustering outputs.
• The learning machine outputs.

What Does Data Mining Do?

• Data mining extracts patterns from data
– Pattern? A mathematical (numeric and/or symbolic) relationship among data items

• Types of patterns
– Association
– Prediction
– Cluster (segmentation)

Knowledge Discovery

Steps in a Knowledge Discovery process

Supervised vs. Unsupervised Learning

• Supervised learning (classification)
– Supervision: I know the output and I want to examine the effect of the independent variable on the dependent one.

• Unsupervised learning (clustering)
– The class or the nature of the variables is unknown
– Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data

The concept of cluster analysis

Cluster analysis is an unsupervised learning technique for identifying homogeneous groups of objects, called clusters.

Objects in a cluster share many characteristics, but are very dissimilar to objects not belonging to that cluster.

Cont…

• Measuring distances (differences or dissimilarities between subjects)

• Measuring proximities (similarity between subjects)

Types of Data!!

• Numeric: length, count
• Not numeric: gender, age group

Typical research questions the Cluster Analysis answers are as follows:

• Medicine – What are the diagnostic clusters?
– To answer this question, the researcher would devise a diagnostic questionnaire that covers the symptoms (for example, in psychology, standardized scales for anxiety, depression, etc.). The cluster analysis can then identify groups of patients that present with similar symptoms while maximizing the difference between the groups.

• Marketing – What are the customer segments?
– To answer this question, a market researcher conducts a survey, most commonly covering needs, attitudes, demographics, and behavior of customers. The researcher then uses cluster analysis to identify homogeneous groups of customers that have similar needs and attitudes but are distinctly different from other customer segments.

• Education – What are the student groups that need special attention?
– The researcher measures several psychological, aptitude, and achievement characteristics. A cluster analysis then identifies what homogeneous groups exist among students (for example, high achievers in all subjects, or students that excel in certain subjects but fail in others, etc.).
– A discriminant analysis then profiles these performance clusters and tells us what psychological, environmental, aptitudinal, affective, and attitudinal factors characterize these student groups.

Types of clustering

Hierarchical Clustering

1. Uses agglomerative ("bottom-up") algorithms, which begin with each element as a separate cluster and merge them into successively larger clusters.

2. Handles continuous data.

Cont…

• Can be visualized as a dendrogram
– A tree-like diagram that records the sequences of merges or splits

[Figure: dendrogram of six objects, leaves ordered 1, 3, 2, 5, 4, 6, with an axis running from 0 to 0.2]
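The bottom-up merging that a dendrogram records can be sketched in a few lines of plain Python. This is only an illustration: the points are made up, and single linkage (distance between the two closest members) is an assumed linkage choice, not necessarily the one used in the workshop.

```python
from itertools import combinations
from math import dist

def agglomerate(points):
    """Bottom-up clustering with single linkage (an assumed linkage choice).

    Returns the merge history as (members_a, members_b, distance) tuples,
    which is the information a dendrogram draws.
    """
    # Start with every point in its own cluster, identified by its index.
    clusters = [(i,) for i in range(len(points))]
    history = []
    while len(clusters) > 1:
        # Single linkage: cluster distance = distance of the closest pair of members.
        def linkage(pair):
            a, b = pair
            return min(dist(points[i], points[j]) for i in a for j in b)

        a, b = min(combinations(clusters, 2), key=linkage)
        history.append((a, b, linkage((a, b))))
        # Replace the two merged clusters with their union.
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return history

# Hypothetical 2-D observations (not the workshop data).
points = [(1.0, 1.0), (1.0, 2.0), (5.0, 5.0), (5.0, 6.0), (9.0, 1.0)]
history = agglomerate(points)
for a, b, height in history:
    print(f"merge {a} + {b} at distance {height:.2f}")
```

Each printed line corresponds to one join in the dendrogram, and the distance is the height at which the join is drawn.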

Non-hierarchical: K-means clustering

1. Begin with two starting center points.

2. Allocate each item to the nearest cluster center.
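The allocation step can be sketched as below. The sketch also includes the usual update step of textbook k-means (move each center to the mean of its items and repeat until nothing changes), which is not spelled out on the slide. The data and the choice of starting centers are hypothetical.

```python
from math import dist

def k_means(points, centers, max_iter=100):
    """Plain k-means: assign items to the nearest center, then move centers to the mean."""
    centers = list(centers)
    for _ in range(max_iter):
        # Allocation step: each item goes to its nearest cluster center.
        labels = [min(range(len(centers)), key=lambda k: dist(p, centers[k]))
                  for p in points]
        # Update step: each center moves to the mean of its items.
        new_centers = []
        for k, old in enumerate(centers):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                new_centers.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centers.append(old)  # an empty cluster keeps its center
        if new_centers == centers:
            break  # centers stopped moving, so the allocation is stable
        centers = new_centers
    return labels, centers

# Hypothetical (height in cm, weight in kg) data with two starting centers.
people = [(150.0, 50.0), (152.0, 53.0), (155.0, 51.0),
          (180.0, 82.0), (183.0, 85.0), (185.0, 80.0)]
labels, centers = k_means(people, centers=[people[0], people[3]])
print(labels)
print(centers)
```

Different starting centers can lead to different final clusters, which is one reason to rerun the procedure and compare solutions.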

Mixed: Two-Step Clustering

1. Designed to handle very large data sets.
2. Can handle both continuous and categorical variables or attributes.
3. Automatically selects the number of clusters.

Generate clustering

1. Decide on cluster variables

• At the beginning of the clustering process, we have to select appropriate variables for clustering.

Note!!!

• It is important to avoid using an abundance of clustering variables, as this increases the odds that the variables are no longer dissimilar.

• Meaning? If highly correlated variables are used for cluster analysis, specific aspects covered by these variables will be overrepresented in the clustering solution.

• In this regard, absolute correlations above 0.90 are always problematic.

• For example, measures of a person's happiness and joy are likely to be highly correlated.
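One way to screen candidate clustering variables against the 0.90 rule is to compute all pairwise correlations and list the pairs that exceed it. The sketch below uses made-up survey scores and hypothetical variable names; in SPSS the same check would come from a correlation table.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def redundant_pairs(variables, threshold=0.90):
    """Return pairs of variable names whose absolute correlation exceeds the threshold."""
    return [(a, b) for a, b in combinations(variables, 2)
            if abs(pearson(variables[a], variables[b])) > threshold]

# Hypothetical scores (1 to 10) for six respondents.
variables = {
    "happiness": [7, 8, 3, 9, 4, 6],
    "joy":       [6, 8, 2, 9, 5, 6],
    "income":    [3, 9, 5, 2, 8, 4],
}
flagged = redundant_pairs(variables)
print(flagged)
```

For any flagged pair, keeping only one of the two variables avoids giving that aspect double weight in the clustering solution.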

Insight!!

• When we use factor analysis, we usually get a factor solution that does not explain a certain amount of the variance;

• As such, information is discarded before the segments are identified.

• Moreover, removing variables with low loadings on all the extracted factors means that some potential information for the identification of segments is discarded.

• This in turn reduces the possibility of identifying different groups.

• Finally, the resulting factors based on the original variables become questionable.

2. Decide on the Clustering Procedure

• Refers to the process of forming the clusters.

Dataset

• Let's say I have different people with different measures of height and weight (variables).

• Now, if I want to group those people by weight and height into different groups, then I need to use cluster analysis.

The SPSS clustering

• Variables: the measures used for clustering. They can be performance, achievement, etc.

• Cases: the people to be clustered.

Cont…

Hierarchical methods: if there is a limited number of observations, usually < 200.

▸ Analyze ▸ Classify ▸ Hierarchical Cluster

K-Means: if there are many observations, usually > 500.

▸ Analyze ▸ Classify ▸ K-Means Cluster

Two-step cluster: if there are many observations and the clustering variables are measured on different scale levels (5-point Likert scale, nominal, ordinal, etc.).

▸ Analyze ▸ Classify ▸ Two-Step Cluster

In Hierarchical: Select a Clustering Algorithm

• Ward’s method (only hierarchical clustering)
• ▸ Analyze ▸ Classify ▸ Hierarchical Cluster ▸ Method ▸ Cluster Method

Select measure of Similarity

In hierarchical

• Only applies to hierarchical and two-step methods

Euclidean distance is the most commonly used measure when analyzing ratio- or interval-scaled data.
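As a small illustration of what the Euclidean measure computes, the sketch below builds the pairwise distance matrix for three made-up cases. The function names are illustrative and are not part of SPSS.

```python
from math import sqrt

def euclidean(a, b):
    """Square root of the summed squared differences across variables."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def distance_matrix(cases):
    """Symmetric matrix of pairwise Euclidean distances between cases."""
    return [[euclidean(a, b) for b in cases] for a in cases]

# Hypothetical cases measured on two interval-scaled variables.
cases = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
matrix = distance_matrix(cases)
for row in matrix:
    print([round(d, 2) for d in row])
```

The diagonal is zero (each case compared with itself) and the matrix is symmetric, so only the entries below the diagonal need to be inspected.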

Select measure of Similarity

In Two-step

Two-step clustering:
• ▸ Analyze ▸ Classify ▸ Two-Step Cluster ▸ Distance Measure

Standardize in hierarchical only.

In both methods, convert variables with multiple categories (to a range of 0 to 1 or −1 to 1, or use Z scores).
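To make the three options concrete, here is a rough sketch of Z scores and of a linear rescaling to a chosen range. The rescaling is a plain min-max mapping and the Z scores use the sample standard deviation. Both are assumptions for illustration, and the exact formulas behind the SPSS options may differ in detail.

```python
from statistics import mean, stdev

def z_scores(values):
    """Center on the mean and divide by the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def rescale(values, low=0.0, high=1.0):
    """Linearly map the observed minimum to `low` and the maximum to `high`."""
    smallest, largest = min(values), max(values)
    return [low + (v - smallest) * (high - low) / (largest - smallest)
            for v in values]

# Hypothetical heights in cm.
heights = [150.0, 160.0, 170.0, 180.0, 190.0]
print(z_scores(heights))
print(rescale(heights))               # range 0 to 1
print(rescale(heights, -1.0, 1.0))    # range -1 to 1
```

Without a step like this, a variable measured in large units (such as height in cm) would dominate a distance computed together with a variable on a 1 to 5 scale.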

3. Identifying the number of clusters

• For hierarchical clustering, by examining the dendrogram:
• ▸ Analyze ▸ Classify ▸ Hierarchical Cluster
• ▸ Plots ▸ Dendrogram

Not always recommended

Alternative solution

• Draw a scree plot (e.g., using Microsoft Excel) based on the coefficients in the agglomeration schedule (elbow method). In this example, a 2-cluster solution is possible.

[Figure: scree plot of the agglomeration coefficient; one axis runs from 0 to 25, the coefficient axis from 0 to 9000]
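Instead of reading the elbow off a plot, the jump can also be located numerically. The sketch below assumes the coefficients are listed in stage order, one per merge, and simply picks the largest increase between consecutive stages. The numbers are hypothetical, not the values from the workshop output.

```python
def suggest_clusters(coefficients):
    """Suggest a cluster count from agglomeration coefficients (one per merge stage).

    The largest jump between consecutive stages marks the elbow; stopping
    just before that jump leaves n_cases - stage clusters.
    """
    n_cases = len(coefficients) + 1
    jumps = [later - earlier
             for earlier, later in zip(coefficients, coefficients[1:])]
    # 1-based number of the last merge performed before the biggest jump.
    stage = jumps.index(max(jumps)) + 1
    return n_cases - stage

# Hypothetical coefficients for 8 cases (7 merge stages).
coefficients = [12.0, 30.0, 55.0, 90.0, 140.0, 210.0, 900.0]
print(suggest_clusters(coefficients))
```

This is only a rule of thumb: when several jumps are of similar size, the plot and the interpretability of the clusters should decide.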

For two-step and k-means

• Note: two-step clustering identifies the number of clusters automatically.

• However, K-means uses a default of 2 clusters. The most commonly recommended choice is 3-4 clusters.

• So you need to try both and see which one provides useful output.

Save membership

• After identifying the number of clusters, we save the cluster membership of each case.

• Click Save and enter 2 as the number of clusters.

Membership to be used

Here is the membership

Assess the solution’s stability

• Use other clustering methods on the same data and compare the resulting solutions with each other.
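One possible way to put a number on such a comparison is the Rand index, the share of case pairs that two solutions treat the same way (both together or both apart). The slides do not name a specific agreement measure, so this choice is an assumption, and the memberships below are made up.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Share of case pairs on which two clusterings agree (same vs. different cluster)."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum((labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
                for i, j in pairs)
    return agree / len(pairs)

# Hypothetical memberships saved from two methods for the same eight cases.
# Cluster numbers are arbitrary, so 1/2 in one method may be 2/1 in the other.
hierarchical = [1, 1, 1, 2, 2, 2, 2, 1]
k_means_solution = [2, 2, 2, 1, 1, 1, 2, 2]
print(rand_index(hierarchical, k_means_solution))
```

A value of 1 means the two methods produced the same grouping; lower values mean more cases changed groups between methods.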

Assess the solution’s validity

• Criterion validity: evaluate whether there are significant differences between the segments resulting from the membership step.

• p < 0.05: we are doing well…

Interpret the cluster solution

• Examine cluster centroids and assess whether these differ significantly from each other (e.g., by means of t-tests or ANOVA). As we did earlier.

• Identify names or labels for each cluster and characterize each cluster by means of observable variables, if necessary (cluster profiling).
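The centroid comparison can be sketched as follows: compute the mean of each variable per cluster, then a one-way ANOVA F statistic for one variable across the clusters. The data and memberships are hypothetical. The sketch stops at the F statistic, and the p-value would be read from the SPSS ANOVA output or an F table.

```python
from statistics import mean

def centroids(rows, labels):
    """Mean of every variable within each cluster."""
    result = {}
    for label in sorted(set(labels)):
        members = [row for row, lab in zip(rows, labels) if lab == label]
        result[label] = tuple(mean(col) for col in zip(*members))
    return result

def anova_f(values, labels):
    """One-way ANOVA F statistic for one variable across clusters."""
    groups = {}
    for v, lab in zip(values, labels):
        groups.setdefault(lab, []).append(v)
    grand = mean(values)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
    ss_within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical (height cm, weight kg) cases with saved cluster membership.
rows = [(150.0, 50.0), (152.0, 53.0), (155.0, 51.0),
        (180.0, 82.0), (183.0, 85.0), (185.0, 80.0)]
membership = [1, 1, 1, 2, 2, 2]
centers = centroids(rows, membership)
f_height = anova_f([row[0] for row in rows], membership)
print(centers)
print(round(f_height, 2))
```

A large F means the cluster means differ a lot relative to the spread inside each cluster, which supports giving the clusters distinct labels such as "short and light" and "tall and heavy".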

SPSS

• That’s all… now let’s try it in SPSS.

Another example

• Let's say I want to explore children who need special learning.

• So I collected some data about children's reading and cognitive performance gain.

Now I ask the question:
• Which groups of children need extra learning?

• For the data, go to this URL:
• www.hosamspace.com/data
• Download the cluster children data.
• Open the file in SPSS (or just double-click it).
• Now observe the data.