www.company.com Lab4 CPIT 440 Data Mining and Warehouse


Lab4

CPIT 440 Data Mining and Warehouse


Lab4: Outline
• Data Mining Process
• Data Gathering and Preparation (Preprocessing)
  – Data Preprocessing Techniques
    • Data Integration Techniques
    • Data Cleaning Techniques
    • Data Transformation Techniques
    • Data Discretization Techniques
  – Definitions and Exercises


Data Mining Process


Data Gathering and Preparation
• The data understanding phase involves data collection and exploration.
• As you take a closer look at the data, you can determine how well it addresses the business problem.
• You might decide to remove some of the data or add additional data.
• Data preparation can significantly improve the information that can be discovered through data mining.


Data Gathering and Preparation
• The data preparation phase covers all the tasks involved in creating the case table you will use to build the model.
• Tasks include data cleansing, binning, and transformation.
• For example:
  – you might transform a DATE_OF_BIRTH column to AGE;
  – you might insert the average income in cases where the INCOME column is null.
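The two examples on this slide can be sketched in plain Python. This is only an illustration, not how ODM performs the steps: the helper names `age_on` and `prepare_cases`, the sample rows, and the reference date are all assumptions made for the example.

```python
from datetime import date
from statistics import mean


def age_on(birth: date, today: date) -> int:
    """Whole years elapsed between birth and today."""
    # Subtract one year if this year's birthday has not happened yet.
    not_yet = (today.month, today.day) < (birth.month, birth.day)
    return today.year - birth.year - not_yet


def prepare_cases(rows, today):
    """Build case rows: derive AGE and replace a null INCOME with the mean."""
    known = [r["INCOME"] for r in rows if r["INCOME"] is not None]
    avg_income = mean(known)  # assumes at least one non-null income
    return [
        {
            "AGE": age_on(r["DATE_OF_BIRTH"], today),
            "INCOME": avg_income if r["INCOME"] is None else r["INCOME"],
        }
        for r in rows
    ]


# Hypothetical sample data, not taken from the lab files.
rows = [
    {"DATE_OF_BIRTH": date(1990, 5, 17), "INCOME": 40000},
    {"DATE_OF_BIRTH": date(1985, 11, 2), "INCOME": None},
    {"DATE_OF_BIRTH": date(2000, 1, 30), "INCOME": 60000},
]
cases = prepare_cases(rows, today=date(2024, 6, 1))
```

The mean is computed from the non-null incomes only, so the filled value does not change the column average.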


Data Preprocessing Techniques
• Data Integration Techniques:
  – Correlation (Numerical Data) by using Excel
  – Correlation (Categorical Data, Chi-Square Test) by using Excel
• Data Cleaning Techniques:
  – Fill the Missing Values by using ODM
  – Outlier Treatment for Reducing Noise by using ODM
• Data Transformation Technique:
  – Normalization by using ODM
• Data Discretization Technique:
  – Discretization by using ODM
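For the categorical case, the chi-square test compares the observed counts in a contingency table with the counts expected if the two attributes were independent. The lab does this in Excel; the following is only a minimal Python sketch of the statistic itself, with a hypothetical function name (`chi_square`) and made-up data.

```python
from collections import Counter


def chi_square(xs, ys):
    """Return (chi-square statistic, degrees of freedom) for two categorical columns."""
    n = len(xs)
    observed = Counter(zip(xs, ys))  # counts per (row category, column category)
    row_totals = Counter(xs)
    col_totals = Counter(ys)
    stat = 0.0
    for r, r_total in row_totals.items():
        for c, c_total in col_totals.items():
            # Expected count under independence of the two attributes.
            expected = r_total * c_total / n
            stat += (observed[(r, c)] - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof


# Hypothetical data: 60 people, gender versus preferred reading.
gender = ["M"] * 30 + ["F"] * 30
reads = ["fiction"] * 10 + ["nonfiction"] * 20 + ["fiction"] * 20 + ["nonfiction"] * 10
stat, dof = chi_square(gender, reads)
```

With one degree of freedom, a statistic above 3.841 rejects independence at the 0.05 significance level, which suggests the two attributes are correlated.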


Data Integration Technique
Definition:
• Sometimes too much information can reduce the effectiveness of data mining.
• Data sets with many attributes may contain groups of attributes that are problematic:
• Irrelevant attributes, which simply add noise to the data and affect model accuracy.
  – Noise increases the size of the model and the time and system resources needed for model building and scoring.


Data Integration Technique
• Or correlated attributes, which may actually be measuring the same underlying feature.
  – Their presence together in the build data can skew the logic of the algorithm and affect the accuracy of the model.
• To minimize the effects of noise, a technique such as correlation analysis is sometimes a desirable preprocessing step for data mining.


Data Integration Technique
Exercises:
• Correlation (Numerical Data) by using Excel.
• Open the Excel file Corr.xlsx.
• The correlation result is always between -1 and 1:
  – 1 = perfect positive correlation
  – 0 = no linear correlation
  – -1 = perfect negative correlation
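The coefficient described here is the Pearson correlation coefficient, which Excel exposes as the CORREL function. A minimal Python sketch of the same calculation is shown below. The contents of Corr.xlsx are not reproduced in these slides, so the three columns are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric columns."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Sum of products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (dev_x * dev_y)  # assumes neither column is constant


# Hypothetical columns, not the contents of Corr.xlsx.
hours = [1, 2, 3, 4, 5]
score = [2, 4, 6, 8, 10]     # rises with hours
discount = [10, 8, 6, 4, 2]  # falls as hours rise

r_positive = pearson(hours, score)
r_negative = pearson(hours, discount)
```

Two attributes whose coefficient is close to 1 or -1 carry largely the same information, so one of them is a candidate for removal before model building.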


Data Cleaning Technique
1. Fill the Missing Values by using ODM:
– When building or applying a model, Oracle Data Mining (ODM) automatically replaces missing values of:
  – numerical attributes
    • with the mean (average), the maximum or minimum, a specific value, or zero;
  – categorical attributes
    • with the mode.
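The replacement rules above can be illustrated with a short Python sketch. It is not ODM's implementation; the function name `fill_missing`, the strategy names, and the sample columns are assumptions chosen for the example.

```python
from statistics import mean, mode


def fill_missing(values, strategy="mean", constant=None):
    """Replace None entries using the chosen strategy (illustrative names)."""
    present = [v for v in values if v is not None]
    if strategy == "mean":
        fill = mean(present)
    elif strategy == "mode":
        fill = mode(present)  # most frequent value, suits categorical data
    elif strategy == "min":
        fill = min(present)
    elif strategy == "max":
        fill = max(present)
    elif strategy == "constant":
        fill = constant       # a specific value, for example zero
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]


# Hypothetical columns with gaps marked as None.
length_of_residence = [2, None, 4, 6, None]
marital_status = ["married", None, "single", "married"]

numeric_filled = fill_missing(length_of_residence, "mean")
categorical_filled = fill_missing(marital_status, "mode")
zero_filled = fill_missing(length_of_residence, "constant", constant=0)
```

The mean only makes sense for numerical attributes, which is why categorical attributes fall back to the mode.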


Exercise
– Open ODM and import the file demo_missing.csv.
  • View the file: in the attribute length_of_residence, some values are missing.
– Now we will apply a data cleaning technique to fill in the missing data.
  • From ODM, open Data > Transform > Missing Value.


Exercise
– This opens the Missing Value Transformation Wizard.


Exercise
– In the 4th step of the wizard, select the column (attribute) to which you are going to apply the missing value technique, and then press the Transform button.
– You will see three options; select Replace With – Mean.
– Continue with the Next button until you finish.


Exercise
Use a histogram to compare the data with missing values against the data after the missing values are filled in.
[Figure: two histograms, "With Missing" and "After solving Missing"]


Data Cleaning Technique
2. Outlier Treatment for Reducing Noise by using ODM:
– A value is considered an outlier if it deviates significantly from most other values in the column.
– The presence of outliers can skew the data and result in an inaccurate model.
– Outlier treatment methods such as trimming or clipping can be implemented to minimize the effect of outliers.
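A minimal Python sketch of both methods follows, using a cutoff of k standard deviations around the mean. Here clipping replaces an outlier with the nearest edge value and trimming replaces it with a null. The function name `treat_outliers`, the choice k = 2, and the sample column are assumptions for illustration, not ODM defaults.

```python
from statistics import mean, stdev


def treat_outliers(values, k=2.0, method="clip"):
    """Treat values farther than k sample standard deviations from the mean."""
    mu = mean(values)
    sigma = stdev(values)
    low, high = mu - k * sigma, mu + k * sigma
    if method == "clip":
        # Clipping: pull each outlier back to the nearest edge value.
        return [min(max(v, low), high) for v in values]
    if method == "trim":
        # Trimming: replace each outlier with a null (None).
        return [v if low <= v <= high else None for v in values]
    raise ValueError(f"unknown method: {method}")


# Hypothetical column with one value far from the rest.
years_listed = [1, 2, 2, 3, 3, 3, 4, 4, 5, 60]

clipped = treat_outliers(years_listed, k=2.0, method="clip")
trimmed = treat_outliers(years_listed, k=2.0, method="trim")
```

In a small sample a single extreme value inflates the standard deviation itself, so a wide cutoff may fail to flag it. That is why this sketch uses k = 2 on ten values.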


Exercise
– Import the file demo_outliers.csv.
  • View the file: in the attribute years_details_listed there are some outliers (noise), that is, values that lie very far from the others.


Exercise
– Now we will apply a data cleaning technique to reduce this noise in the data.
– Open Data > Transform > Outlier Treatment.


Exercise
– This opens the Outlier Treatment Transformation Wizard.
– In the 4th step of the wizard, select the column (attribute) to which you are going to apply the outlier treatment technique.
– Then press the Std. Deviation button and select whether outliers are replaced with edge values or null values.


Exercise

– Continue with the Next button until you finish.


Exercise
Use a histogram to compare the noisy data with the data after outlier treatment is applied.


Data Transformation Technique:
• Normalization by using ODM:
  – Normalization is a technique that transforms numerical values into a specific range, such as [-1.0, 1.0] or [0.0, 1.0].
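One common way to do this is min-max normalization, which maps the smallest value to the bottom of the target range and the largest to the top. The sketch below is an illustration in Python, with a hypothetical function name and sample column.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so they span [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot normalize a constant column")
    scale = (new_max - new_min) / (hi - lo)
    # v' = new_min + (v - lo) * (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values]


# Hypothetical column values.
income = [20, 35, 50, 80]

unit_range = min_max_normalize(income)                  # target range [0.0, 1.0]
symmetric_range = min_max_normalize(income, -1.0, 1.0)  # target range [-1.0, 1.0]
```

The relative spacing between values is preserved; only the scale and offset change.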


Exercise
– Import the file demo_original.csv.
  • View the file: we will apply the normalization technique to the attribute family_income_indicator.


Exercise
– Open Data > Transform > Normalize.
– This opens the Normalize Transformation Wizard.
– In the 3rd step of the wizard, select the column (attribute) to which you are going to apply the normalization technique.
– Then press the Define button and select the min-max transformation algorithm.


Exercise

• Continue with the Next button until you finish.


Exercise
Use a histogram to compare the data before and after normalization.


Data Discretization Technique:
• Discretization by using ODM
  – Discretization, also called binning, is a technique for reducing the cardinality of continuous and discrete data.
  – It groups related values together in bins to reduce the number of distinct values.
  – Discretization can improve resource utilization and model build response time dramatically without significant loss in model quality.


Exercise
– Import the file demo_original.csv.
  • View the file: we will apply the discretization technique to the attribute family_income_indicator.
– Open Data > Transform > Discretize.
– This opens the Discretize Transformation Wizard.
– In the 4th step of the wizard, select the column (attribute) to which you are going to apply the discretization technique.
– Then press the Equal Width button and enter 10 as the number of bins.
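Equal-width binning splits the range between the minimum and maximum value into bins of the same width and replaces each value with the index of its bin. The following Python sketch illustrates the idea with 10 bins. The function name and the sample values are assumptions, and ODM may label its bins differently.

```python
def equal_width_bins(values, n_bins=10):
    """Map each value to a bin index from 0 to n_bins - 1 using equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        raise ValueError("cannot bin a constant column")
    # The maximum value would land in bin n_bins, so cap it at the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical column values spanning 0 to 100.
values = [0, 5, 12, 37, 50, 99, 100]
bins = equal_width_bins(values, n_bins=10)
```

However many distinct values the column holds, the binned column has at most 10, which is the reduction in cardinality that binning is meant to achieve.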


Exercise

• Continue with the Next button until you finish.


Exercise
Use a histogram to compare the data before and after discretization.
[Figure: two histograms, "Before" and "After"]