Introduction to Clementine
Tutors: Cecia Chan & Gabriel Fung
Data Mining Tutorial
A Brief Review of Data Mining (I)
Data mining is…
– A process of extracting previously unknown, valid and actionable knowledge from large databases
A rule of thumb:
– If we know clearly the shape and likely content of what we are looking for, we are probably not dealing with data mining
A Brief Review of Data Mining (II)
Therefore, data mining is not…
– SQL queries against any number of disparate databases or data warehouses
– SQL queries in a parallel or massively parallel environment
– Information retrieval, for example, through intelligent agents
– Multidimensional database analysis (MDA)
– OLAP
– Exploratory data analysis (EDA)
– Graphical visualization
– Traditional statistical processing against a data warehouse
However, they are all related to data mining
Data Mining Process
1. Business objective(s) determination
– What is your goal?
2. Data collection
– You can learn nothing without data…
3. Data preprocessing (or data preparation)
– Remove outliers / filter noise / modify fields / etc.
4. Modeling
– The core part of data mining
5. Evaluation
– See what you have learned!
Data Mining Software
Existing data mining software:
– Clementine from SPSS (we have this software)
– Enterprise Miner from SAS (we have this software)
– Intelligent Miner from IBM (we have this software)
– MineSet from Silicon Graphics
– K-wiz from Compression Sciences Ltd.
– DBMiner from DBMiner Tech. Inc.
– PolyAnalyst from Megaputer Intelligence
– StatServer from Mathsoft
– …
Problem Statement
Situation:
– You are a researcher compiling data for a medical study
– You have collected data about a set of patients, all of whom suffered from the same illness
– Each patient responded to one of five drug treatments
Step 1: Business objective
Figure out which drug might be appropriate for a future patient with the same illness
Here are the data collected:
– Age
– Sex (M or F)
– BP (Blood pressure: high, normal, or low)
– Weight (The weight of the patient)
– Cholesterol (Blood cholesterol: normal or high)
– Na (Blood sodium concentration)
– K (Blood potassium concentration)
– Drug (Drug to which the patient responded)
Using Clementine (1)
Clementine is located in…
– Start > All Programs > Clementine 6.0.2
[Screenshot of the Clementine window; labeled areas: Models, Nodes, Workspace]
Using Clementine (2)
Nodes in the workspace represent different objects and actions. You connect the nodes to form streams, which, when executed, let you visualize relationships and draw conclusions.
Step 2: Data Collection (1)
Nodes for inputting the collected data
[Screenshot callout: Double Click]
Data Collection (2)
[Screenshot callouts on the file import dialog:]
– Location of your file
– How many columns to use from the file
– Whether the first row specifies the names of the fields
– Other details
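The import settings listed above (file location, number of columns, header row) have close equivalents when the same file is read in code. The following is a minimal Python sketch, assuming a comma-separated file; the sample rows, the drug names, and the helper name `read_records` are hypothetical and not taken from the tutorial's data set.

```python
import csv
import io

# Hypothetical sample standing in for the patient file. A real file would be
# opened with open(path, newline="") and passed in the same way.
SAMPLE = """Age,Sex,BP,Weight,Cholesterol,Na,K,Drug
23,F,HIGH,60.5,HIGH,0.79,0.03,drugY
47,M,LOW,82.0,HIGH,0.74,0.06,drugC
"""

def read_records(handle, has_header=True, n_columns=None):
    """Read delimited rows, optionally keeping only the first n_columns."""
    rows = list(csv.reader(handle))
    if n_columns is not None:
        rows = [row[:n_columns] for row in rows]
    if has_header:
        # The first row carries the field names.
        names, rows = rows[0], rows[1:]
    else:
        # No header row: generate placeholder field names.
        names = [f"field{i + 1}" for i in range(len(rows[0]))]
    return [dict(zip(names, row)) for row in rows]

records = read_records(io.StringIO(SAMPLE))
print(records[0]["Drug"])
```

All values arrive as strings; converting Age, Weight, Na and K to numbers would be a separate step.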
Step 3: Data Preparation – Explore the Data (1)
Nodes for exploration/visualization:
– Table (in the Output panel)
– Plot (in the Graphs panel)
– Histogram (in the Graphs panel)
– Distribution (in the Graphs panel)
– Web (in the Graphs panel)
Step 3: Data Preparation – Explore the Data (2)
Connect the nodes:
[Screenshot callout: Double Click]
Note: Connect the nodes by clicking and dragging with the middle mouse button
Step 3: Data Preparation – Explore the Data (3)
Execution:
Note: Right-click on the Table node to display this menu
Step 3: Data Preparation – Explore the Data (4)
Other nodes (please try the other nodes yourself):
– Histogram:
Step 3: Data Preparation – Modify the Data (1)
Replacing values:
– Use the Filler node:
» Suppose we want to transform all weights to their log values (Note: we usually only log-transform a variable when it is highly skewed):
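The effect of this replacement can be illustrated outside Clementine. The Python sketch below assumes the natural logarithm (the slide does not say which base is used) and a hypothetical list of weights. It writes the transformed values to a new list for clarity, whereas the slide describes replacing the existing values.

```python
import math

# Hypothetical weights; the single large value makes the list right-skewed.
weights = [52.0, 58.5, 61.0, 64.2, 70.0, 145.0]

# Natural log of every weight (assumed base; the slide does not specify one).
log_weights = [math.log(w) for w in weights]

# The log scale pulls the extreme value much closer to the rest.
print(max(weights) / min(weights))
print(max(log_weights) / min(log_weights))
```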
Step 3: Data Preparation – Modify the Data (2)
Derive a new value:
– Use the Derive node:
» Suppose we want to combine Na and K:
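The tutorial later names the derived field Na_Over_K, which suggests the combination is the ratio of sodium to potassium. Here is a minimal Python sketch under that assumption; the records and the helper name `derive_na_over_k` are hypothetical.

```python
# Hypothetical patient records; only the Na and K fields matter here.
patients = [
    {"Na": 0.79, "K": 0.03},
    {"Na": 0.74, "K": 0.06},
]

def derive_na_over_k(record):
    """Return a copy of the record with a new Na_Over_K field added."""
    derived = dict(record)
    derived["Na_Over_K"] = record["Na"] / record["K"]
    return derived

patients = [derive_na_over_k(p) for p in patients]
print(patients[0]["Na_Over_K"])
```

Unlike a replacement, deriving keeps the original Na and K fields and adds the new one alongside them.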
Step 3: Data Preparation – Modify the Data (3)
Remove some fields:
– Use the Filter node:
» Suppose we have derived a new field Na_Over_K, and now we need to remove the fields Na and K:
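In code, the same filtering amounts to dropping two keys from every record. This is a small Python sketch; the helper name `drop_fields` and the sample record are hypothetical.

```python
def drop_fields(record, unwanted):
    """Return a copy of the record without the named fields."""
    return {name: value for name, value in record.items() if name not in unwanted}

# Hypothetical record after the Na_Over_K field has been derived.
record = {"Age": 23, "Na": 0.79, "K": 0.03, "Na_Over_K": 26.3, "Drug": "drugY"}

filtered = drop_fields(record, {"Na", "K"})
print(sorted(filtered))
```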
Step 4: Modeling – Define fields
Define the fields:
– Use the Type node:
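Defining fields means telling the modeling step what kind of value each field holds and which field is to be predicted. The Python sketch below expresses that idea as a plain dictionary. The type and role labels ("numeric", "categorical", "input", "target") are generic assumptions for illustration, not the exact terms used in the Type node dialog.

```python
# Hypothetical schema: each field gets a measurement type and a role.
FIELD_TYPES = {
    "Age": ("numeric", "input"),
    "Sex": ("categorical", "input"),
    "BP": ("categorical", "input"),
    "Weight": ("numeric", "input"),
    "Cholesterol": ("categorical", "input"),
    "Na_Over_K": ("numeric", "input"),
    "Drug": ("categorical", "target"),
}

def split_roles(schema):
    """Separate the input field names from the target field names."""
    inputs = [name for name, (_, role) in schema.items() if role == "input"]
    targets = [name for name, (_, role) in schema.items() if role == "target"]
    return inputs, targets

inputs, targets = split_roles(FIELD_TYPES)
print(inputs, targets)
```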
Step 4: Modeling – Build a Model (1)
It is the core part of data mining.
Supervised Learning:
– Train Net (Neural Network)
– C5.0 (C5.0 Decision Tree)
– Linear Reg. (Linear Regression)
– C & R Tree (Classification and Regression Tree, CART)
Unsupervised Learning:
– Train Kohonen (Self-Organizing Map, SOM)
– Train KMeans (K-means Clustering)
– TwoStep (a kind of hierarchical clustering)
Others:
– GRI (Association Rule Mining)
– Apriori (Association Rule Mining)
– Factor / PCA (Factor analysis, attribute selection technique)
Step 4: Modeling – Build a Model (2)
Build what model?
– Recall that our objective is to determine which type of drug is suitable for a specific patient.
– Thus, it is a classification problem (supervised learning)
In this tutorial, we use:
– C5.0 and C & R Tree
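Decision-tree learners in the C4.5/C5.0 family choose which field to split on using entropy-based measures. The Python sketch below computes plain information gain for a categorical field. It is a simplified illustration, not Clementine's implementation, and the rows and drug names are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, field, target):
    """Reduction in target entropy obtained by splitting rows on field."""
    base = entropy([row[target] for row in rows])
    groups = {}
    for row in rows:
        groups.setdefault(row[field], []).append(row[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder

# Hypothetical rows: BP separates the two drugs perfectly, Sex not at all.
rows = [
    {"BP": "HIGH", "Sex": "F", "Drug": "drugA"},
    {"BP": "HIGH", "Sex": "M", "Drug": "drugA"},
    {"BP": "LOW", "Sex": "F", "Drug": "drugC"},
    {"BP": "LOW", "Sex": "M", "Drug": "drugC"},
]

print(information_gain(rows, "BP", "Drug"))
print(information_gain(rows, "Sex", "Drug"))
```

A tree builder would split on the field with the higher gain (here BP) and repeat the procedure within each branch.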
Step 4: Modeling – Build a Model (3)
Note:
– There are many complex settings for each model
– In this tutorial, we use the default settings
– Fine-tuning a model requires solid experience in data mining
Step 5: Evaluation (1)
It means NOTHING even if we have learned SOMETHING, until the knowledge that we have learned is ACTIONABLE and VALID
Remember:
– The training and testing data sets are ALWAYS different (why?)
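One way to see why the two data sets must differ: a model that simply memorizes its training rows scores perfectly on them, yet that score says nothing about unseen patients. The Python sketch below illustrates this with hypothetical data and a deliberately naive "memorizer" model; none of it comes from the tutorial's stream.

```python
from collections import Counter

def features(row, target):
    """Hashable key built from every field except the target."""
    return tuple(sorted((k, v) for k, v in row.items() if k != target))

def train_memorizer(rows, target):
    """Memorize each training row; fall back to the majority class."""
    table = {features(r, target): r[target] for r in rows}
    majority = Counter(r[target] for r in rows).most_common(1)[0][0]

    def predict(row):
        return table.get(features(row, target), majority)

    return predict

def accuracy(predict, rows, target):
    """Fraction of rows whose predicted label matches the actual label."""
    return sum(predict(r) == r[target] for r in rows) / len(rows)

# Hypothetical training and testing rows.
train = [
    {"BP": "HIGH", "Age": 23, "Drug": "drugA"},
    {"BP": "HIGH", "Age": 31, "Drug": "drugA"},
    {"BP": "LOW", "Age": 47, "Drug": "drugC"},
]
test = [
    {"BP": "LOW", "Age": 52, "Drug": "drugC"},
    {"BP": "HIGH", "Age": 40, "Drug": "drugA"},
]

model = train_memorizer(train, "Drug")
print(accuracy(model, train, "Drug"))
print(accuracy(model, test, "Drug"))
```

The training score is perfect by construction, so only the score on the separate testing rows indicates how the model might behave on a future patient.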
Step 5: Evaluation (2)
Create the following flow
Note: Must have the same flow as the training stage
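In code, "the same flow" means that every preparation step applied to the training data is applied, unchanged, to the testing data. A simple way to guarantee this is to put the steps in one function and call it on both sets. The Python sketch below assumes the log-weight and Na_Over_K steps from the earlier slides; the records and the function name `prepare` are hypothetical.

```python
import math

def prepare(record):
    """Apply every preparation step used in training to a single record."""
    prepared = dict(record)
    # Log-transform the weight (natural log is an assumption).
    prepared["Weight"] = math.log(prepared["Weight"])
    # Derive Na_Over_K and drop the original Na and K fields.
    prepared["Na_Over_K"] = prepared.pop("Na") / prepared.pop("K")
    return prepared

# Hypothetical raw records for the two stages.
raw_training = [{"Weight": 60.5, "Na": 0.79, "K": 0.03, "Drug": "drugY"}]
raw_testing = [{"Weight": 82.0, "Na": 0.74, "K": 0.06, "Drug": "drugC"}]

training = [prepare(r) for r in raw_training]
testing = [prepare(r) for r in raw_testing]
print(sorted(training[0]) == sorted(testing[0]))
```

If the testing records skipped a step, the model would receive fields it was never trained on, or would be missing ones it expects.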
Step 5: Evaluation (3)
Different results:
– Different models can yield completely different results
– Choosing and tuning a good model is a difficult job
– In this tutorial, we only introduce the process of data mining
Assignment 1
Assignment 1 – Problem Statement
Situation:
– You are a financial analyst of a bank
– You have to predict whether a customer is Good or Bad based on some demographic information
Data Set:
– A data set about your past customers has been collected
– Each customer is either Good or Bad
Assignment 1 – Field definitions
VARIABLE   ROLE     DEFINITION   DESCRIPTION
CHECKING   input    Nominal      Checking account status
HISTORY    input    Nominal      Credit history
AMOUNT     input    Interval     Amount in bank
SAVINGS    input    Nominal      No. of savings (bonds, stocks, etc.)
EMPLOYED   input    Nominal      Employment type (gov., private, etc.)
INSTALLP   input    Nominal      Type of installment rate
MARITAL    input    Nominal      Marital status
PROPERTY   input    Nominal      Type of property
AGE        input    Interval     Age in years
OTHER      input    Nominal      Type of other installment plan
HOUSING    input    Nominal      Type of house
EXISTCR    input    Interval     Number of existing credits
JOB        input    Nominal      Job nature
FOREIGN    input    Binary       Foreign worker or local worker
GOOD_BAD   output   Binary       Good or bad credit rating
Assignment 1 – Data Mining Process
Data Collection
– Please download the CreditRisk data set from http://www.se.cuhk.edu.hk/~ect7470/
– Two data sets:
(i) creditRisk1.csv is for training
(ii) creditRisk2.csv is for testing
Data Preprocessing
– Please explore the data and think critically about whether any data preprocessing is necessary
» Hint: Two of the interval variables are highly skewed
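Besides inspecting histograms, skewness can be checked numerically. The Python sketch below computes the moment-based skewness coefficient for a list of values and compares it before and after a log transform. The values are hypothetical and are not drawn from the CreditRisk data.

```python
import math
import statistics

def skewness(values):
    """Moment-based skewness: the mean cubed z-score (population form)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

# Hypothetical right-skewed values: each one is double the previous.
values = [100, 200, 400, 800, 1600, 3200, 6400]
log_values = [math.log(v) for v in values]

print(skewness(values))      # clearly positive: long right tail
print(skewness(log_values))  # close to zero: symmetric on the log scale
```

A coefficient near zero indicates a roughly symmetric distribution; a large positive value indicates a long right tail, the case where a log transform is usually considered.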
Assignment 1 – Data Mining Process
Modeling
– Please build a prediction model using default settings:
» C5.0 Decision Tree
Model Assessment
– Please use the testing data set to evaluate the performance of the prediction model
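Performance on the testing set is commonly summarized by counting how often each actual class is predicted as each class, and by the overall accuracy. The Python sketch below shows this for the Good/Bad target, using hypothetical actual and predicted labels rather than real model output.

```python
from collections import Counter

def confusion_counts(actual, predicted):
    """Count each (actual, predicted) label pair."""
    return Counter(zip(actual, predicted))

# Hypothetical labels for five testing customers.
actual = ["Good", "Good", "Bad", "Bad", "Good"]
predicted = ["Good", "Bad", "Bad", "Good", "Good"]

counts = confusion_counts(actual, predicted)
accuracy = sum(n for (a, p), n in counts.items() if a == p) / len(actual)

for pair, n in sorted(counts.items()):
    print(pair, n)
print(accuracy)
```

The off-diagonal pairs matter as much as the accuracy: for a bank, rating a Bad customer as Good may cost far more than the reverse.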
Assignment 1 – Submission
Save the stream as “id.str”
– E.g., 00123456.str
Upload your stream to the course account
Deadline:
– 4 April 2004
This is an individual assignment
Note: We strongly encourage you to submit this assignment during the class!!!