robert-stephens
Chapter 5
Data Mining: A Closer Look
Chapter 5 - Data Warehouse and Data Mining
Chapter Objectives
Determine an appropriate data mining strategy for a specific problem.
Know about several data mining techniques and how each technique builds a generalized model to represent data.
Understand how a confusion matrix is used to help evaluate supervised learner models.
Understand basic techniques for evaluating supervised learner models with numeric output.
Know how measuring lift can be used to compare the performance of several competing supervised learner models.
Understand basic techniques for evaluating unsupervised learner models.
Data Mining Strategies
Classification is probably the best understood of all data mining strategies.
Classification tasks have three common characteristics.
• Learning is supervised.
• The dependent variable is categorical.
• The emphasis is on building models able to assign new instances to one of a set of well-defined classes.
Some example classification tasks include the following:
• Determine those characteristics that differentiate individuals who have suffered a heart attack from those who have not.
• Develop a profile of a “successful” person.
• Determine if a credit card purchase is fraudulent.
• Classify a car loan applicant as a good or a poor credit risk.
• Develop a profile to differentiate female and male stroke victims.
34% are healthy within this max heart rate range.
Supervised Data Mining Techniques
Association Rules
Clustering Techniques
Evaluating Performance
Chapter Summary
Data mining strategies include classification, estimation, prediction, unsupervised clustering, and market basket analysis.
Classification and estimation strategies are similar in that each strategy is employed to build models able to generalize current outcome.
However, the output of a classification strategy is categorical, whereas the output of an estimation strategy is numeric.
A predictive strategy differs from a classification or estimation strategy in that it is used to design models for predicting future outcome rather than current behavior.
Unsupervised clustering strategies are employed to discover hidden concept structures in data as well as to locate atypical data instances.
The purpose of market basket analysis is to find interesting relationships among retail products.
Discovered relationships can be used to design promotions, arrange shelf or catalog items, or develop cross-marketing strategies.
A data mining technique applies a data mining strategy to a set of data.
Data mining techniques are defined by an algorithm and a knowledge structure.
Common features that distinguish the various techniques are whether learning is supervised or unsupervised and whether their output is categorical or numeric.
Familiar supervised data mining techniques include decision tree methods, production rule generators, neural networks, and statistical methods.
Association rules are a favorite technique for marketing applications.
Clustering techniques employ some measure of similarity to group instances into disjoint partitions.
Clustering methods are frequently used to help determine a best set of input attributes for building supervised learner models.
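As a rough illustration of grouping instances into disjoint partitions by a similarity measure, the sketch below uses K-Means with Euclidean distance. Both choices are assumptions made for the example, since the summary does not name a specific clustering algorithm or distance; the data points and starting centers are invented.

```python
import math

def euclidean(a, b):
    # Similarity measure (an assumption): smaller distance means more similar.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_means(points, centers, iterations=10):
    # Repeatedly assign each point to its nearest center, then move each
    # center to the mean of the points assigned to it.
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: euclidean(p, centers[i]))
            clusters[nearest].append(p)
        centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else center  # keep the old center if a cluster is empty
            for cluster, center in zip(clusters, centers)
        ]
    return clusters, centers

# Hypothetical two-attribute instances and hand-picked starting centers.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
clusters, centers = k_means(points, centers=[(1.0, 1.0), (9.0, 8.5)])
```

Each instance ends up in exactly one cluster, so the partitions are disjoint. In practice the starting centers are usually chosen at random, and different starts can give different partitions.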
Performance evaluation is probably the most critical of all the steps in the data mining process.
Supervised model evaluation is often performed using a training/test set scenario.
Supervised models with numeric output can be evaluated by computing average absolute or average squared error differences between computed and desired outcome.
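The average absolute and average squared error measures mentioned above can be sketched in a few lines. The function names and the sample values are made up for illustration; the root of the mean squared error is included because it is expressed in the same units as the output attribute.

```python
import math

def mean_absolute_error(actual, predicted):
    # Average of |actual - predicted| over all instances.
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mean_squared_error(actual, predicted):
    # Average of (actual - predicted)^2 over all instances.
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def root_mean_squared_error(actual, predicted):
    # Square root of the mean squared error.
    return math.sqrt(mean_squared_error(actual, predicted))

# Hypothetical test set: desired (actual) versus computed (predicted) output.
actual = [3.0, 5.0, 2.0, 7.0]
predicted = [2.0, 5.0, 4.0, 8.0]

mae = mean_absolute_error(actual, predicted)       # (1 + 0 + 2 + 1) / 4
mse = mean_squared_error(actual, predicted)        # (1 + 0 + 4 + 1) / 4
rmse = root_mean_squared_error(actual, predicted)
```

Squaring penalizes large individual errors more heavily than the absolute measure does, which is one reason the two can rank competing models differently.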
Marketing applications that focus on mass mailings are interested in developing models for increasing response rates to promotions.
A marketing application measures the goodness of a model by its ability to lift response rate thresholds to levels well above those achieved by naïve (mass) mailing strategies.
Unsupervised models support some measure of cluster quality that can be used for evaluative purposes.
Supervised learning can also be employed to evaluate the quality of the clusters formed by an unsupervised model.
Key Terms
Classification. A supervised learning strategy where the output attribute is categorical. The emphasis is on building models able to assign new instances to one of a set of well-defined classes.
Association rule. A production rule whose consequent may contain multiple conditions and attribute relationships. An output attribute in one association rule can be an input attribute in another rule.
Confusion matrix. A matrix used to summarize the results of a supervised classification. Entries along the main diagonal represent the total number of correct classifications. Entries other than those on the main diagonal represent classification errors.
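To make the confusion matrix definition concrete, here is a minimal sketch that builds one from actual and predicted class labels. The layout (rows for the actual class, columns for the predicted class) is an assumed convention, and the credit-risk labels are invented for the example.

```python
def confusion_matrix(actual, predicted, classes):
    # Rows index the actual class, columns the predicted class (assumed layout).
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix

def accuracy(matrix):
    # Main-diagonal entries are correct classifications; the rest are errors.
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# Hypothetical test-set results for a two-class credit risk model.
classes = ["good", "poor"]
actual = ["good", "good", "poor", "poor", "good", "poor"]
predicted = ["good", "poor", "poor", "good", "good", "poor"]

matrix = confusion_matrix(actual, predicted, classes)
# matrix is [[2, 1], [1, 2]]: four correct classifications, two errors.
model_accuracy = accuracy(matrix)
```

The off-diagonal cells show which classes are being confused with which, information that a single accuracy figure hides.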
Data mining strategy. An outline of an approach for problem solution.
Data mining technique. One or more algorithms together with an associated knowledge structure.
Dependent variable. A variable whose value is determined by a combination of one or more independent variables.
Estimation. A supervised learning strategy where the output attribute is numeric. Emphasis is on determining current rather than future outcome.
Independent variable. An input attribute used for building supervised or unsupervised learner models.
Lift. The probability of class C_i given a sample taken from population P divided by the probability of C_i given the entire population P.
Lift chart. A graph that displays the performance of a data mining model as a function of sample size.
Linear regression. A supervised learning technique that generalizes numeric data as a linear equation. The equation defines the value of an output attribute as a linear sum of weighted input attribute values.
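The lift definition above reduces to a ratio of two proportions. The sketch below assumes a mass-mailing setting in which a model has selected a sample of customers from the population; the function name, labels, and counts are all hypothetical.

```python
def lift(sample_labels, population_labels, target):
    # P(target | sample) divided by P(target | population).
    p_sample = sample_labels.count(target) / len(sample_labels)
    p_population = population_labels.count(target) / len(population_labels)
    return p_sample / p_population

# Hypothetical population of 10 customers, 2 of whom respond to a promotion.
population = ["yes", "no", "no", "yes", "no", "no", "no", "no", "no", "no"]
# Hypothetical sample of 4 customers picked by a model, containing both responders.
sample = ["yes", "no", "yes", "no"]

model_lift = lift(sample, population, "yes")  # (2 / 4) / (2 / 10)
```

A lift above 1 means the model's sample is richer in the target class than the population as a whole; a lift of 1 is no better than mailing at random.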
Market basket analysis. A data mining strategy that attempts to find interesting relationships among retail products.
Mean absolute error. For a set of training or test set instances, the mean absolute error is the average absolute difference between classifier predicted output and actual output.
Mean squared error. For a set of training or test set instances, the mean squared error is the average of the squared differences between classifier predicted output and actual output.
Neural network. A set of interconnected nodes designed to imitate the functioning of the human brain.
Outliers. Atypical data instances.
Prediction. A supervised learning strategy designed to determine future outcome.
Root mean squared error. The square root of the mean squared error.
Rule Maker. A supervised learner model for generating production rules from data.
Statistical regression. A supervised learning technique that generalizes numerical data as a mathematical equation. The equation defines the value of an output attribute as a sum of weighted input attribute values.
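The regression definitions describe the output attribute as a weighted sum of input attribute values. As a small sketch under simplifying assumptions (a single input attribute, ordinary least squares as the fitting method, invented data), the weights can be computed in closed form:

```python
def fit_simple_linear_regression(xs, ys):
    # Ordinary least squares for one input attribute: y = slope * x + intercept.
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(x, slope, intercept):
    # The output is a weighted sum: slope * x plus a constant term.
    return slope * x + intercept

# Hypothetical training instances that lie exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_simple_linear_regression(xs, ys)
estimate = predict(5.0, slope, intercept)
```

With several input attributes the same idea applies, but the weights are usually found with matrix methods rather than this one-variable formula.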