Introduction to Clementine

Tutors: Cecia Chan & Gabriel Fung

Data Mining Tutorial

A Brief Review of Data Mining (I)

Data mining is…
– A process of extracting previously unknown, valid and actionable knowledge from large databases

A rule of thumb:
– If we know clearly the shape and likely content of what we are looking for, we are probably not dealing with data mining

A Brief Review of Data Mining (II)

Therefore, data mining is not…
– SQL queries against any number of disparate databases or data warehouses
– SQL queries in a parallel or massively parallel environment
– Information retrieval, for example, through intelligent agents
– Multidimensional database analysis (MDA)
– OLAP
– Exploratory data analysis (EDA)
– Graphical visualization
– Traditional statistical processing against a data warehouse

However, they are all related to data mining

Data Mining Process

1. Business objective(s) determination
– What is your goal?

2. Data collection
– You can learn nothing without data…

3. Data preprocessing (or data preparation)
– Remove outliers / filter noise / modify fields / etc.

4. Modeling
– The core part of data mining

5. Evaluation
– See what you have learned!

Data Mining Software

Existing data mining software:
– Clementine from SPSS (we have this software), Enterprise Miner from SAS (we have this software), Intelligent Miner from IBM (we have this software), MineSet from Silicon Graphics, K-wiz from Compression Sciences Ltd., DBMiner from DBMiner Tech. Inc., PolyAnalyst from Megaputer Intelligence, StatServer from Mathsoft

Problem Statement

Situation:
– You are a researcher compiling data for a medical study
– You have collected data about a set of patients, all of whom suffered from the same illness
– Each patient responded to one of five drug treatments

Step 1: Business objective

Figure out which drug might be appropriate for a future patient with the same illness

Here are the data collected:
– Age
– Sex (M or F)
– BP (Blood pressure: High, normal, or low)
– Weight (The weight of the patient)
– Cholesterol (Blood cholesterol: Normal or high)
– Na (Blood sodium concentration)
– K (Blood potassium concentration)
– Drug (Drug to which the patient responded)

Using Clementine (1)

Clementine is located in…
– Start > All Programs > Clementine 6.0.2

[Screenshot: Clementine main window, with the Models, Nodes, and Work-Space areas labeled]

Using Clementine (2)

Nodes in the workspace represent different objects and actions. You connect the nodes to form streams, which, when executed, let you visualize relationships and draw conclusions.

Step 2: Data Collection (1)

Double Click

Nodes for inputting the collected data

Data Collection (2)

Location of your file

How many columns to use from the file

Whether the first row specifies the names of the fields

Other details
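The options in this dialog correspond roughly to what any delimited-file reader needs to know. As a rough illustration outside Clementine, here is a minimal Python sketch using the standard csv module. The sample rows, the column order, the drug names, and the helper read_records are all made up for this example; only the field names come from the list in Step 1.

```python
import csv
import io

# Made-up sample standing in for the data file; the first row holds field names.
SAMPLE = """Age,Sex,BP,Weight,Cholesterol,Na,K,Drug
23,F,HIGH,58,HIGH,0.79,0.03,drugY
47,M,LOW,72,HIGH,0.74,0.06,drugC
"""

def read_records(handle, has_header=True, max_columns=None):
    """Read delimited rows into dicts, mirroring the dialog's options."""
    rows = list(csv.reader(handle))
    if has_header:
        # First row gives the field names.
        names, rows = rows[0], rows[1:]
    else:
        # No header row: generate placeholder names field1, field2, ...
        names = [f"field{i + 1}" for i in range(len(rows[0]))]
    if max_columns is not None:
        # Keep only the first max_columns columns.
        names = names[:max_columns]
    return [dict(zip(names, row)) for row in rows]

records = read_records(io.StringIO(SAMPLE), has_header=True, max_columns=3)
print(records[0])
```

Note that every value is read as text here; deciding which fields are numeric is a separate step.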

Step 3: Data Preparation – Explore the Data (1)

Nodes for exploration/visualization:
– Table (in the Output panel)
– Plot (in the Graphs panel)
– Histogram (in the Graphs panel)
– Distribution (in the Graphs panel)
– Web (in the Graphs panel)


Note: Connect the nodes by clicking and dragging with the middle mouse button

Double Click

Connect the nodes:

Execution:

Note: Right-click on the Table node to display this menu


Other nodes (please try the other nodes yourself):
– Histogram:

Step 3: Data Preparation – Modify the Data (1)

Replacing values:
– Use Filler node:
» Suppose we want to transform every weight to its log value (Note: we usually transform a variable to its log only when it is highly skewed):
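To make the note about skewness concrete, here is a small Python sketch, independent of Clementine, that measures skewness before and after a log transform. The weights are made-up numbers with a long right tail, and skewness is a helper written for this illustration, not a Clementine function.

```python
import math

def skewness(values):
    """Skewness as the third standardized moment (population form)."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    third = sum((v - mean) ** 3 for v in values) / n
    return third / var ** 1.5

# Made-up weights: most values are close together, two are much larger.
weights = [52.0, 55.0, 58.0, 60.0, 61.0, 63.0, 65.0, 70.0, 95.0, 140.0]

# Replace each value by its natural log, as the Filler node example does.
log_weights = [math.log(w) for w in weights]

print("skewness before:", skewness(weights))
print("skewness after: ", skewness(log_weights))
```

For right-skewed positive data like this, the log pulls in the large values, so the second figure comes out smaller than the first.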


Derive a new value:
– Use Derive node:
» Suppose we want to combine Na and K:


Remove some fields:
– Use Filter node:
» Suppose we have derived a new field Na_Over_K; now we need to remove the fields Na and K:
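The derive-then-filter idea can be sketched in a few lines of Python. This assumes that Na_Over_K is the ratio Na / K, as the field name suggests. The record values and the helper derive_and_filter are made up for illustration.

```python
def derive_and_filter(record):
    """Add Na_Over_K, then drop the original Na and K fields."""
    out = dict(record)                        # copy, so the input is left untouched
    out["Na_Over_K"] = out["Na"] / out["K"]   # the "Derive" step
    del out["Na"], out["K"]                   # the "Filter" step
    return out

# Made-up patient record.
patient = {"Age": 23, "Na": 0.75, "K": 0.0625, "Drug": "drugY"}
prepared = derive_and_filter(patient)
print(prepared)
```

The model then sees one combined field instead of two separate ones.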

Step 4: Modeling – Define fields

Define the fields:
– Use Type node:

Step 4: Modeling – Build a Model (1)

It is the core part of data mining.

Supervised Learning:
– Train Net (Neural Network)
– C5.0 (C5.0 Decision Tree)
– Linear Reg. (Linear regression)
– C & R Tree (Classification and Regression Tree, CART)

Unsupervised Learning:
– Train Kohonen (Self-Organized Map, SOM)
– Train KMeans (K-means Clustering)
– TwoStep (A kind of Hierarchical Clustering)

Others:
– GRI (Association Rule mining)
– Apriori (Association Rule mining)
– Factor / PCA (Factor analysis, attribute selection technique)

Step 4: Modeling – Build a Model (2)

Build what model?
– Recall that our objective is to determine which type of drug is suitable for a specific patient.
– Thus, it is a classification problem (supervised learning)

In this tutorial, we use:
– C5.0 and C & R Tree
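The tutorial does not go into how these two tree learners work internally. As general background, decision-tree learners pick a split by how much it reduces class impurity: entropy-based measures are associated with the C4.5/C5.0 family, and the Gini index with CART. The Python sketch below computes both for one candidate split. The labels and the split are made up, and real implementations add refinements (gain ratio, pruning, and so on) that are not shown here.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of the class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity of the class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def impurity_reduction(parent, children, measure):
    """Parent impurity minus the size-weighted impurity of the child groups."""
    n = len(parent)
    weighted = sum(len(child) / n * measure(child) for child in children)
    return measure(parent) - weighted

# Made-up labels: eight patients split by some hypothetical test on one field.
parent = ["drugY"] * 4 + ["drugX"] * 4
children = [["drugY"] * 4, ["drugX"] * 4]   # a perfect split

print("information gain:", impurity_reduction(parent, children, entropy))
print("Gini reduction:  ", impurity_reduction(parent, children, gini))
```

A split that leaves the children as mixed as the parent would score zero on both measures.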

Step 4: Modeling – Build a Model (3)

Note:
– There are many complex settings for each model
– In this tutorial, we use the default settings
– Fine-tuning a model requires solid experience in data mining

Step 5: Evaluation (1)

Having learned SOMETHING means NOTHING until the knowledge we have learned is ACTIONABLE and VALID

Remember:
– The data sets for training and testing are ALWAYS different (why?)
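One way to see why is to look at a deliberately over-fitted "model" that simply memorizes its training rows. The Python sketch below is an illustration only: the rows, the drug names, and the helpers train_memorizer and accuracy are made up, and this is not how any Clementine model works.

```python
from collections import Counter

def train_memorizer(rows):
    """Remember every training row; fall back to the most common label."""
    table = {features: label for features, label in rows}
    default = Counter(label for _, label in rows).most_common(1)[0][0]
    return lambda features: table.get(features, default)

def accuracy(model, rows):
    """Fraction of rows whose predicted label matches the actual label."""
    return sum(model(features) == label for features, label in rows) / len(rows)

# Made-up rows of the form ((BP, Cholesterol), Drug).
train = [(("HIGH", "HIGH"), "drugA"), (("LOW", "HIGH"), "drugC"),
         (("NORMAL", "HIGH"), "drugX"), (("NORMAL", "NORMAL"), "drugX")]
test = [(("HIGH", "NORMAL"), "drugA"), (("LOW", "NORMAL"), "drugX"),
        (("NORMAL", "HIGH"), "drugX"), (("LOW", "HIGH"), "drugC")]

model = train_memorizer(train)
print("accuracy on training data:", accuracy(model, train))
print("accuracy on testing data: ", accuracy(model, test))
```

The memorizer looks perfect on the rows it has already seen, so only the score on unseen rows says anything about future patients.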

Step 5: Evaluation (2)

Create the following flow

Note: The testing stage must use the same flow as the training stage
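In code terms, "the same flow" means that one preparation routine is applied to both data sets, so the model sees identically shaped records in training and in testing. The Python sketch below reuses the Step 3 examples (log of Weight, and Na_Over_K taken as Na / K, which is an assumption based on the field name). The rows and the helper prepare are made up for illustration.

```python
import math

def prepare(record):
    """One preparation routine shared by the training and testing data."""
    out = dict(record)
    out["Weight"] = math.log(out["Weight"])           # replace Weight by its log
    out["Na_Over_K"] = out.pop("Na") / out.pop("K")   # derive ratio, drop Na and K
    return out

# Made-up rows standing in for the two data files.
train_rows = [{"Weight": 60.0, "Na": 0.75, "K": 0.0625, "Drug": "drugY"}]
test_rows = [{"Weight": 80.0, "Na": 0.5, "K": 0.0625, "Drug": "drugX"}]

train_ready = [prepare(r) for r in train_rows]
test_ready = [prepare(r) for r in test_rows]

# Both sets now carry exactly the same fields.
print(sorted(train_ready[0]))
print(sorted(test_ready[0]))
```

If the testing data skipped a step, the model would be asked about fields it never saw during training.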

Step 5: Evaluation (3)

Different results:
– Different models can yield completely different results
– Choosing and tuning a good model is a difficult job
– In this tutorial, we introduce only the process of data mining
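A simple way to compare two models on the same testing data is to count correct predictions and to tabulate which classes get confused. The Python sketch below uses made-up actual values and made-up predictions for two hypothetical models. It illustrates the comparison only and does not reproduce any Clementine output.

```python
from collections import Counter

def accuracy(actual, predicted):
    """Fraction of records where the prediction matches the actual class."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

def confusion_counts(actual, predicted):
    """Count each (actual, predicted) pair; off-diagonal pairs are errors."""
    return Counter(zip(actual, predicted))

# Made-up testing labels and predictions from two hypothetical models.
actual = ["drugA", "drugX", "drugX", "drugC", "drugY"]
model_1 = ["drugA", "drugX", "drugC", "drugC", "drugY"]
model_2 = ["drugA", "drugY", "drugX", "drugX", "drugY"]

for name, predicted in (("model 1", model_1), ("model 2", model_2)):
    errors = {pair: n for pair, n in confusion_counts(actual, predicted).items()
              if pair[0] != pair[1]}
    print(name, "accuracy:", accuracy(actual, predicted), "errors:", errors)
```

The two models disagree on which patients they get wrong, which overall accuracy alone does not show.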

Assignment 1

Assignment 1 – Problem Statement

Situation:
– You are a financial analyst at a bank
– You have to predict whether a customer is Good or Bad based on some demographic information

Data Set:
– A data set about your past customers has been collected
– Each customer is either Good or Bad

Assignment 1 – Field definitions

VARIABLE   ROLE     DEFINITION   DESCRIPTION
CHECKING   input    Nominal      Checking account status
HISTORY    input    Nominal      Credit history
AMOUNT     input    Interval     Amount in Bank
SAVINGS    input    Nominal      No. of Savings (bonds, stocks, etc.)
EMPLOYED   input    Nominal      Employment Type (Gov., private, etc.)
INSTALLP   input    Nominal      Type of installment rate
MARITAL    input    Nominal      Marital status
PROPERTY   input    Nominal      Type of Property
AGE        input    Interval     Age in years
OTHER      input    Nominal      Type of other installment plan
HOUSING    input    Nominal      Type of House
EXISTCR    input    Interval     Number of existing credits
JOB        input    Nominal      Job Nature
FOREIGN    input    Binary       Foreign worker or Local worker
GOOD_BAD   output   Binary       Good or bad credit rating

Assignment 1 – Data Mining Process

Data Collection
– Please download the CreditRisk data set from http://www.se.cuhk.edu.hk/~ect7470/
– Two data sets:
(i) creditRisk1.csv is for training
(ii) creditRisk2.csv is for testing

Data Preprocessing
– Please explore the data and think critically about whether any data preprocessing is necessary
» Hint: Two of the interval variables are highly skewed

Assignment 1 – Data Mining Process

Modeling
– Please build a prediction model using the default settings:
» C5.0 Decision Tree

Model Assessment
– Please use the testing data set to evaluate the performance of the prediction model

Assignment 1 – Submission

Save the stream as “id.str”
– E.g., 00123456.str

Upload your stream to the course account

Deadline:
– 4 April 2004

This is an individual assignment

Note: We strongly encourage you to submit this assignment during the class!
