Overview of the Data Mining Process


Data Mining for Business Analytics (Shmueli, Patel & Bruce)

Core Ideas in Data Mining

Classification
Prediction
Association Rules
Predictive Analytics
Data Reduction and Dimension Reduction
Data Exploration and Visualization
Supervised and Unsupervised Learning

Supervised Learning

Goal: Predict a single "target" or "outcome" variable

Training data, where the target value is known

Score data where the target value is not known

Methods: Classification and Prediction

Unsupervised Learning

Goal: Segment data into meaningful groups; detect patterns

There is no target (outcome) variable to predict or classify

Methods: Association rules, data reduction & exploration, visualization

Supervised: Classification

Goal: Predict a categorical target (outcome) variable

Examples: purchase/no purchase, fraud/no fraud, creditworthy/not creditworthy

Each row is a case (customer, tax return, applicant)

Each column is a variable

Target variable is often binary (yes/no)

Supervised: Prediction

Goal: Predict a numerical target (outcome) variable

Examples: sales, revenue, performance

As in classification: each row is a case (customer, tax return, applicant); each column is a variable

Taken together, classification and prediction constitute "predictive analytics"

Unsupervised: Association Rules

Goal: Produce rules that define "what goes with what"

Example: "If X was purchased, Y was also purchased"

Rows are transactions

Used in recommender systems: "Our records show you bought X, you may also like Y"

Also called "affinity analysis"
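The core of affinity analysis can be sketched in a few lines: count how often item pairs appear together across transactions. This is a minimal illustration with made-up basket data, not a full association-rule algorithm such as Apriori.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transaction data: each row is one market basket.
transactions = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"milk", "eggs"},
    {"bread", "milk"},
]

# Count how often each item pair is bought together ("what goes with what").
pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Support of "X and Y together" = fraction of transactions containing both.
support = {pair: n / len(transactions) for pair, n in pair_counts.items()}
```

Pairs with high support become candidate rules; a recommender would surface the most frequent co-purchases.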

Unsupervised: Data Reduction

Distillation of complex/large data into simpler/smaller data

Reducing the number of variables/columns (e.g., principal components)

Reducing the number of records/rows (e.g., clustering)
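As a sketch of column reduction, principal components can be computed by eigen-decomposing the covariance matrix of centered data. The toy dataset below (two nearly redundant variables) is invented for illustration.

```python
import numpy as np

# Toy data: two highly correlated variables; PCA condenses them to one.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
X = np.column_stack([x, 2 * x + rng.normal(scale=0.1, size=100)])

# Center the data, then eigen-decompose its covariance matrix.
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1]            # sort descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Share of variance explained by each component.
explained = eigvals / eigvals.sum()

# Keep only the first principal component: 2 columns reduced to 1.
scores = Xc @ eigvecs[:, :1]
```

Here the first component captures nearly all the variance, so one column can stand in for two.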

Unsupervised: Data Visualization

Graphs and plots of data

Histograms, boxplots, bar charts, scatterplots

Especially useful for examining relationships between pairs of variables

Data Exploration

Data sets are typically large, complex & messy

Need to review the data to help refine the task

Use techniques of Reduction and Visualization

The Process of Data Mining

Steps in Data Mining

1. Define/understand the purpose
2. Obtain the data (may involve random sampling)
3. Explore, clean, and pre-process the data
4. Reduce the data; transform variables
5. Determine the DM task (classification, clustering, etc.)
6. Partition the data (for supervised tasks)
7. Choose the techniques (regression, neural networks, etc.)
8. Perform the task
9. Interpret the results; compare models
10. Deploy the best model

SAS SEMMA

Sample: Take a sample and partition it
Explore: Examine the data statistically and graphically
Modify: Transform variables and impute missing values
Model: Fit predictive models
Assess: Compare models using a validation dataset

Obtaining Data: Sampling

Data mining typically deals with huge databases

Algorithms and models are typically applied to a sample from a database, which is enough to produce statistically valid results

XLMiner, e.g., limits the "training" partition to 10,000 records

Once you develop and select a final model, you use it to "score" the observations in the larger database

Rare Event Oversampling

Often the event of interest is rare

Examples: response to a mailing, tax fraud, ...

Sampling may yield too few "interesting" cases to effectively train a model

A popular solution: oversample the rare cases to obtain a more balanced training set

Later, the results need to be adjusted for the oversampling
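A minimal sketch of oversampling, using an invented dataset with a 1% event rate: the rare cases are resampled with replacement until the training set is roughly balanced.

```python
import random

random.seed(1)

# Hypothetical training data: y = 1 is the rare event (e.g., fraud).
records = [{"y": 1}] * 10 + [{"y": 0}] * 990

rare = [r for r in records if r["y"] == 1]
common = [r for r in records if r["y"] == 0]

# Oversample the rare cases (sampling with replacement) until the
# training set is roughly 50/50. Model results must be adjusted back
# to the true event rate afterwards.
oversampled_rare = random.choices(rare, k=len(common))
training = common + oversampled_rare
random.shuffle(training)
```

The balanced set gives the algorithm enough "interesting" cases to learn from; the later adjustment step corrects the inflated event rate.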

Pre-processing Data

Types of Variables

Determine the types of pre-processing needed, and the algorithms to use

Main distinction: categorical vs. numeric

Numeric: continuous or integer

Categorical: ordered (low, medium, high) or unordered (male, female)

Variable Handling

Numeric:
Most algorithms in XLMiner can handle numeric data
May occasionally need to "bin" into categories

Categorical:
Naive Bayes can use categorical variables as-is
Most other algorithms require binary dummies (number of dummies = number of categories - 1)
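The dummy-coding rule can be sketched directly: with k categories, create k - 1 binary columns and let the dropped category serve as the baseline. The REMODEL-style values below are illustrative.

```python
# Hypothetical categorical variable with k = 3 categories.
remodel = ["None", "Recent", "Old", "None", "Recent"]

# k - 1 = 2 dummies; the dropped category ("None") is the baseline,
# encoded implicitly as all zeros.
categories = sorted(set(remodel))          # ['None', 'Old', 'Recent']
dummy_cats = categories[1:]                # drop the first as baseline
dummies = [
    {f"remodel_{c}": int(value == c) for c in dummy_cats}
    for value in remodel
]
```

Dropping one category avoids redundancy: the k-th dummy would be perfectly determined by the other k - 1.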

Variable Selection

Parsimony: use as few variables as possible

Rules of thumb: at least 10 records per variable, or at least 6 × m × p records, where
m = number of outcome classes
p = number of variables

Redundancy must be avoided

Domain experts should be consulted

Detecting Outliers

An outlier is an observation that is "extreme": distant from the rest of the data (the definition of "distant" is deliberately vague)

Outliers can have a disproportionate influence on models (a problem if the outlier is spurious)

An important step in data pre-processing is detecting outliers

Once detected, domain knowledge is required to determine whether the value is an error or truly extreme

Detecting Outliers (cont.)

In some contexts, finding outliers is the purpose of the DM exercise (e.g., airport security screening). This is called "anomaly detection".
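One common (and deliberately crude) screen for "distant" values flags observations far from the mean in standard-deviation units. The data and the threshold of 2 below are illustrative choices, not a universal rule; flagged values still need domain review.

```python
import statistics

# Hypothetical measurements; 250 looks suspicious relative to the rest.
values = [98, 102, 101, 99, 100, 97, 103, 250]

mean = statistics.mean(values)
sd = statistics.stdev(values)

# Flag values more than 2 standard deviations from the mean.
# (Note: a large outlier inflates the sd itself, so very strict
# thresholds can miss it in small samples.)
outliers = [v for v in values if abs(v - mean) / sd > 2]
```

Whether 250 is a data-entry error or a genuine extreme is a question for domain knowledge, not the code.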

Handling Missing Data

Most algorithms will not process records with missing values. The default is to drop those records.

Solution 1: Omission

If a small number of records have missing values, omit them

If many records are missing values on a small set of variables, drop those variables (or use proxies)

If many records have missing values, omission is not practical

Solution 2: Imputation

Replace missing values with reasonable substitutes

Lets you keep the record and use the rest of its (non-missing) information
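A simple imputation sketch: fill each missing value with the median of the observed values for that variable, so the record can be kept. The BEDROOMS-style values are made up; median is just one common substitute (mean or a model-based value are alternatives).

```python
import statistics

# Hypothetical records with one missing BEDROOMS value (None).
bedrooms = [3, 4, 4, 5, None, 3, 3]

# Impute the median of the observed values so the record can be kept.
observed = [v for v in bedrooms if v is not None]
fill = statistics.median(observed)
imputed = [fill if v is None else v for v in bedrooms]
```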

Normalizing (Standardizing) Data

Used in some techniques where variables with the largest scales would otherwise dominate and skew results

Puts all variables on the same scale

Normalizing function: subtract the mean and divide by the standard deviation (used in XLMiner)

Alternative function: scale to 0-1 by subtracting the minimum and dividing by the range
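Both rescaling functions from the slide, applied to a small illustrative vector (population standard deviation is used here; XLMiner's exact convention may differ):

```python
import statistics

values = [2.0, 4.0, 6.0, 8.0]

# z-score: subtract the mean, divide by the standard deviation.
mean, sd = statistics.mean(values), statistics.pstdev(values)
z = [(v - mean) / sd for v in values]

# 0-1 scaling: subtract the minimum, divide by the range.
lo, hi = min(values), max(values)
scaled = [(v - lo) / (hi - lo) for v in values]
```

After z-scoring, the variable has mean 0 and standard deviation 1; after 0-1 scaling, its minimum is 0 and its maximum is 1.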

Partitioning the Data

Problem: How well will our model perform with new data?

Solution: Separate the data into two parts:
Training partition, used to develop the model
Validation partition, used to implement the model and evaluate its performance on "new" data

Addresses the issue of overfitting

Test Partition

When a model is developed on training data, it can overfit the training data (hence the need to assess it on validation data)

Assessing multiple models on the same validation data can overfit the validation data

Some methods use the validation data to choose a parameter; this too can lead to overfitting the validation data

Solution: the final selected model is applied to a test partition to give an unbiased estimate of its performance on new data
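A three-way partition can be sketched as a shuffle followed by slicing. The 60/25/15 proportions below are an illustrative choice, not a fixed rule.

```python
import random

random.seed(42)

# Hypothetical dataset of 1,000 record IDs, shuffled then split into
# training / validation / test partitions (60% / 25% / 15% here).
records = list(range(1000))
random.shuffle(records)

n_train = int(0.60 * len(records))
n_valid = int(0.25 * len(records))

train = records[:n_train]
valid = records[n_train:n_train + n_valid]
test = records[n_train + n_valid:]
```

Shuffling before slicing ensures each partition is a random sample; the test slice is touched only once, to score the final selected model.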

Overfitting

Statistical models can produce highly complex explanations of relationships between variables

The "fit" may be excellent

When used with new data, models of great complexity do not do so well

A 100% fit is not useful for new data

[Figure: scatterplot of Revenue vs. Expenditure, showing a curve that fits the training points perfectly but would not generalize to new data]

Overfitting (cont.)

Causes:
Too many predictors
A model with too many parameters
Trying many different models

Consequence: the deployed model will not work as well as expected with completely new data
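The Revenue/Expenditure figure above can be imitated numerically: a model with as many parameters as training points fits the training data essentially perfectly, while a simple model generalizes better. The data below are simulated for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: the "truth" is y = 3x plus noise.
x_train = np.linspace(0, 1, 10)
y_train = 3 * x_train + rng.normal(scale=0.5, size=10)
x_new = np.linspace(0, 1, 100)
y_new = 3 * x_new + rng.normal(scale=0.5, size=100)

def rmse(y, yhat):
    return float(np.sqrt(np.mean((y - yhat) ** 2)))

# A degree-9 polynomial has 10 parameters for 10 points: a "100% fit".
overfit = np.polyfit(x_train, y_train, 9)
simple = np.polyfit(x_train, y_train, 1)    # a plain straight line

train_err_overfit = rmse(y_train, np.polyval(overfit, x_train))
new_err_overfit = rmse(y_new, np.polyval(overfit, x_new))
new_err_simple = rmse(y_new, np.polyval(simple, x_new))
```

The complex model's training error is near zero, yet its error on new data is far larger; the simple line cannot do better than the noise level, but it does not collapse on new data.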

Building a Predictive Model with XLMiner

Example: West Roxbury Home Value dataset

TOTAL VALUE  TAX   LOT SQ FT  YR BUILT  GROSS AREA  LIVING AREA  FLOORS  ROOMS  BEDROOMS  FULL BATH  HALF BATH  KITCHEN  FIREPLACE  REMODEL
344.2        4330  9965       1880      2436        1352         2       6      3         1          1          1        0          None
412.6        5190  6590       1945      3108        1976         2       10     4         2          1          1        0          Recent
330.1        4152  7500       1890      2294        1371         2       8      4         1          1          1        0          None
498.6        6272  13773      1957      5032        2608         1       9      5         1          1          1        1          None
331.5        4170  5000       1910      2370        1438         2       7      3         2          0          1        0          None
337.4        4244  5142       1950      2124        1060         1       6      3         1          0          1        1          Old
359.4        4521  5000       1954      3220        1916         2       7      3         1          1          1        0          None
320.4        4030  10000      1950      2208        1200         1       6      3         1          0          1        0          None
333.5        4195  6835       1958      2582        1092         1       5      3         1          0          1        1          Recent
409.4        5150  5093       1900      4818        2992         2       8      4         2          0          1        0          None

TOTAL VALUE: Total assessed value of the property, in thousands of USD
TAX: Tax bill amount, based on the total assessed value multiplied by the tax rate, in USD
LOT SQ FT: Total lot size of the parcel, in square feet
YR BUILT: Year the property was built
GROSS AREA: Gross floor area
LIVING AREA: Total living area for residential properties (ft²)
FLOORS: Number of floors
ROOMS: Total number of rooms
BEDROOMS: Total number of bedrooms
FULL BATH: Total number of full baths
HALF BATH: Total number of half baths
KITCHEN: Total number of kitchens
FIREPLACE: Total number of fireplaces
REMODEL: When the house was remodeled (Recent/Old/None)

Modeling Process

1. Determine the purpose
Predict the value of homes in West Roxbury

2. Obtain the data
West Roxbury Housing.XLSX

3. Explore, clean, and pre-process the data
a. TAX is circular (derived from the target), so it is not useful as a predictor
b. Generate descriptive statistics and look for unusual values
c. Generate plots and examine them

Modeling Process

4. Reduce the dimension
a. Condense the number of categories
b. Consolidate multiple numerical variables using Principal Component Analysis

5. Determine the data mining task

6. Partition the data (for supervised tasks)

7. Choose the technique
For this example: multiple linear regression
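Multiple linear regression fits coefficients by least squares. As an illustrative sketch (not XLMiner's implementation), the fit can be computed with numpy on a hand-made slice of West Roxbury-style data, with TAX excluded because it is derived from the target:

```python
import numpy as np

# Hypothetical mini-sample in the spirit of the West Roxbury data:
# predict TOTAL VALUE from LIVING AREA and ROOMS.
living_area = np.array([1352, 1976, 1371, 2608, 1438, 1060])
rooms = np.array([6, 10, 8, 9, 7, 6])
total_value = np.array([344.2, 412.6, 330.1, 498.6, 331.5, 337.4])

# Design matrix with an intercept column; solve by least squares.
X = np.column_stack([np.ones(len(rooms)), living_area, rooms])
coef, *_ = np.linalg.lstsq(X, total_value, rcond=None)

# Fitted values for the training cases.
predicted = X @ coef
```

The three entries of `coef` are the intercept and the per-unit effects of living area and room count; scoring a new home is one dot product.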

Modeling Process

8. Use the algorithm to perform the task

9. Interpret the results

10. Deploy the model

Step 6: Partitioning the data

Step 8: Using XLMiner for Multiple Linear Regression

Summary of errors

RMS Error

Error = actual - predicted

RMS error = root-mean-squared error = square root of the average squared error

In the previous example, the sizes of the training and validation sets differ, so only the RMS error and average error are comparable
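The two summary measures follow directly from the definitions above; the actual/predicted values here are made up for illustration.

```python
import math

# Error = actual - predicted; RMSE = square root of the mean squared error.
actual = [344.2, 412.6, 330.1]
predicted = [350.0, 400.0, 335.0]

errors = [a - p for a, p in zip(actual, predicted)]
avg_error = sum(errors) / len(errors)
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
```

Average error shows systematic over- or under-prediction (it can be near zero even when individual errors are large); RMSE measures the typical magnitude of the errors. Both are averages, so they remain comparable across partitions of different sizes.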

Using Excel and XLMiner for Data Mining

Excel is limited in data capacity

However, the training and validation of DM models can be handled within the modest limits of Excel and XLMiner

Models can then be used to score larger databases

XLMiner has functions for interacting with various databases (taking samples from a database, and scoring a database with a developed model)

Summary

Data mining consists of supervised methods (classification and prediction) and unsupervised methods (association rules, data reduction, data exploration, and visualization)

Before algorithms can be applied, data must be characterized and pre-processed

To evaluate performance and avoid overfitting, data partitioning is used

Data mining methods are usually applied to a sample from a large database, and then the best model is used to score the entire database
