Introduction to Data Mining

What Is Data Mining?
• A tool
• Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
• Core of KDD
• Integration of multiple technologies

Part of KDD (Knowledge Discovery in Databases)
[Figure: the KDD process. Data → Selection → Target Data → Preprocessing → Preprocessed Data → Data Mining → Patterns → Interpretation/Evaluation → Knowledge]
Adapted from: U. Fayyad, et al. (1995), “From Knowledge Discovery to Data Mining: An Overview,” Advances in Knowledge Discovery and Data Mining, U. Fayyad et al. (Eds.), AAAI/MIT Press

Integration of Multiple Technologies
[Figure: Data Mining at the center, drawing on Machine Learning, Database Management, Artificial Intelligence, Statistics, Visualization, Algorithms, and other knowledge]

Why Data Mining?
• We are drowning in data (the data explosion problem), but starving for knowledge!
• Solution: data warehousing and data mining
– Data warehousing and on-line analytical processing
– Mining interesting knowledge (rules, regularities, patterns, constraints) from data in large databases
• A lot of potential applications
– Market analysis and management: target marketing, customer relationship management (CRM), market basket analysis, cross selling, market segmentation
– Risk analysis and management: forecasting, customer retention, improved underwriting, quality control, competitive analysis
– Health care …

Data Mining Process
[Figure: the data mining process. Labels: State the problem + hypothesis; Operational Database; Data Warehouse; SQL Queries; Data Mining; Interpretation & Evaluation; Result Application; Knowledge-base]

Knowledge from Data Mining
• Association rules
• Sequential association
• Classification rules
• Clustering
• Deviation detection
• …

Association Rules
• Identify associations in the data (correlation [A, B] and causality [A -> B])
• Indicate the significance of each association (a rule is only interesting if its confidence exceeds a certain threshold)
• Not all associations are interesting (some are too trivial, some are negative associations)
E.g. market-basket analysis
“Find groups of items commonly purchased together”
– People who purchase fish are likely to purchase wine
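
The confidence measure mentioned above can be made concrete with a small sketch. The baskets are hypothetical illustration data (not from the slides), `support` and `confidence` are helper names chosen for this example, and support is the usual companion measure to confidence rather than something the slide defines.

```python
# Hypothetical market baskets (illustration only, not from the slides).
baskets = [
    {"fish", "wine", "bread"},
    {"fish", "wine"},
    {"fish", "milk"},
    {"bread", "milk"},
    {"wine", "cheese"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    hits = sum(1 for b in baskets if itemset <= b)
    return hits / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Estimate of P(consequent | antecedent): support(A and B) / support(A)."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)

rule_support = support({"fish", "wine"}, baskets)
rule_confidence = confidence({"fish"}, {"wine"}, baskets)
print(f"fish -> wine: support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

A rule would then be kept only if its confidence clears whatever threshold the analyst picks. The sketch assumes the antecedent occurs in at least one basket; otherwise the division fails.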

Sequential Associations
• Find event sequences that are unusually likely
• Requires a “training” event list and known “interesting” events
• Must be robust in the face of additional “noise” events
Uses:
• Failure analysis and prediction
Technologies:
• Dynamic programming (dynamic time warping)
• “Custom” algorithms
“Find common sequences of warnings/faults within 10 minute periods”
– Warn 2 on Switch C preceded by Fault 21 on Switch B
– Fault 17 on any switch preceded by Warn 2 on any switch

Time   Switch  Event
21:10  B       Fault 21
21:11  A       Warn 2
21:13  C       Warn 2
21:20  A       Fault 17
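
As a rough sketch of the first step of such an analysis, the code below lists every "preceded by" pair in the event log above that falls within a 10 minute window. The function name `preceded_pairs` is made up for this example. A real miner would go on to count how often each pair recurs across a long log, and this sketch ignores logs that cross midnight.

```python
from datetime import datetime, timedelta

# Event log from the slide: (time, switch, event).
log = [
    ("21:10", "B", "Fault 21"),
    ("21:11", "A", "Warn 2"),
    ("21:13", "C", "Warn 2"),
    ("21:20", "A", "Fault 17"),
]

def preceded_pairs(log, window_minutes=10):
    """Return (earlier, later) event pairs that fall within the window."""
    parsed = [(datetime.strptime(t, "%H:%M"), sw, ev) for t, sw, ev in log]
    parsed.sort(key=lambda row: row[0])
    window = timedelta(minutes=window_minutes)
    pairs = []
    for i, (t1, sw1, ev1) in enumerate(parsed):
        for t2, sw2, ev2 in parsed[i + 1:]:
            if t2 - t1 > window:
                break  # the log is sorted, so later events are even further away
            pairs.append(((sw1, ev1), (sw2, ev2)))
    return pairs

pairs = preceded_pairs(log)
for earlier, later in pairs:
    print(f"{later[1]} on Switch {later[0]} preceded by {earlier[1]} on Switch {earlier[0]}")
```

Both example patterns from the slide appear among the candidate pairs for this log.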

Classification Rules
• Classify a set of data based on their values in certain attributes
• Requires “training data” with predefined classes
Uses:
• Profiling
Technologies:
• Generate decision trees (results are human understandable)
• Neural nets
“Route documents to most likely interested parties”
– English or non-English?
– Domestic or foreign?
[Figure: training data → tool produces classifier → groups]

Clustering
• Group a set of data based on the conceptual clustering principle (i.e. maximizing the intraclass similarity and minimizing the interclass similarity)
• No “training data”: no predefined classes
Uses:
• Demographic analysis
Technologies:
• Self-organizing maps
• Probability densities
• Conceptual clustering
“Group people with similar travel profiles”
– George, Patricia
– Jeff, Evelyn, Chris
– Rob
[Figure: clusters]

Deviation Detection
• Find unexpected values, outliers
Uses:
• Failure analysis
• Anomaly discovery for analysis
Technologies:
• Clustering/classification methods
• Statistical techniques
• Visualization
“Find unusual occurrences in IBM stock prices”

Date      Close   Volume  Spread
58/07/02  369.50  314.08  .022561
58/07/03  369.25  313.87  .022561
58/07/04  Market Closed
58/07/07  370.00  314.50  .022561

Sample date  Event            Occurrences
58/07/04     Market closed    317 times
59/01/06     2.5% dividend    2 times
59/04/04     50% stock split  7 times
73/10/09     not traded       1 time
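
One of the statistical techniques listed above can be sketched as a z-score test. The closing prices below are hypothetical (only loosely modeled on the table), `find_outliers` is a name chosen for this example, and the threshold of 2.0 is an arbitrary choice rather than anything the slides prescribe.

```python
from statistics import mean, stdev

# Hypothetical daily closing prices (illustration only).
closes = [369.50, 369.25, 370.00, 369.75, 370.25, 355.00, 369.50]

def find_outliers(values, threshold=2.0):
    """Return (index, value) pairs whose z-score magnitude exceeds threshold."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation
    return [
        (i, v) for i, v in enumerate(values)
        if sigma > 0 and abs(v - mu) / sigma > threshold
    ]

outliers = find_outliers(closes)
print(outliers)
```

With so few points a single extreme value inflates the standard deviation, so robust alternatives (for example, median-based measures) are often preferred in practice.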

Popular Data Mining Techniques
• Supervised
– Decision trees
– Rule induction
– Regression models
– Neural networks
– …
• Unsupervised
– K-means clustering
– Self-organizing maps
– …

Supervised vs. Unsupervised
• Supervised algorithms
» Learning by example:
– Use training data for which the value of the response variable is already known
– Create a model by running the algorithm on the training data
– Identify a class label for incoming new data
» Driven by real business problems and historical data
• Unsupervised algorithms
» Do not use training data with a known response variable
» Patterns may not be known in advance

Supervised Algorithms

Decision Trees
• A tree structure where non-terminal nodes represent tests on one or more attributes and terminal nodes reflect decision outcomes
• Advantages of decision trees
– Understandable
– Relatively fast
– Easy to translate into SQL queries
• Disadvantages of decision trees
– Limited to one output attribute
– Decision tree algorithms can be unstable (small changes in the training data can produce a different tree)
• Types of decision trees
– CHAID: Chi-Square Automatic Interaction Detection
– CART: Classification and Regression Trees
– …

Table 1.1 • Hypothetical Training Data for Disease Diagnosis

Patient ID#  Sore Throat  Fever  Swollen Glands  Congestion  Headache  Diagnosis
1            Yes          Yes    Yes             Yes         Yes       Strep throat
2            No           No     No              Yes         Yes       Allergy
3            Yes          Yes    No              Yes         No        Cold
4            Yes          No     Yes             No          No        Strep throat
5            No           Yes    No              Yes         No        Cold
6            No           No     No              Yes         No        Allergy
7            No           No     Yes             No          No        Strep throat
8            Yes          No     No              Yes         Yes       Allergy
9            No           Yes    No              Yes         Yes       Cold
10           Yes          Yes    No              Yes         Yes       Cold

Figure 1.1 A decision tree for the data in Table 1.1

Swollen Glands?
– Yes: Diagnosis = Strep Throat
– No: Fever?
   – Yes: Diagnosis = Cold
   – No: Diagnosis = Allergy

Table 1.2 • Data Instances with an Unknown Classification

Patient ID#  Sore Throat  Fever  Swollen Glands  Congestion  Headache  Diagnosis
11           No           No     Yes             Yes         Yes       ?
12           Yes          Yes    No              No          Yes       ?
13           No           No     No              No          Yes       ?
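
As a small sketch, the tree in Figure 1.1 can be written as nested tests and applied to the unclassified instances in Table 1.2. The function name `diagnose` and the dictionary keys are choices made for this example; only the two attributes the tree actually tests are carried over from the table.

```python
def diagnose(patient):
    """Walk the tree in Figure 1.1: test Swollen Glands first, then Fever."""
    if patient["swollen_glands"] == "Yes":
        return "Strep throat"
    if patient["fever"] == "Yes":
        return "Cold"
    return "Allergy"

# Unclassified instances from Table 1.2 (only the attributes the tree tests).
unknown = {
    11: {"fever": "No", "swollen_glands": "Yes"},
    12: {"fever": "Yes", "swollen_glands": "No"},
    13: {"fever": "No", "swollen_glands": "No"},
}

predictions = {pid: diagnose(p) for pid, p in unknown.items()}
for pid, label in predictions.items():
    print(f"Patient {pid}: {label}")
```

Sore throat, congestion and headache never appear in the tree, so they play no part in the prediction.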

Rule Induction
• The extraction of useful, independent if-then rules from data based on statistical significance
• If rules produce conflicting predictions, resolve the conflict according to confidence
• Advantage and disadvantage
– Understandable
– May not cover all possible situations
E.g.
IF Swollen Glands = Yes
THEN Diagnosis = Strep Throat
IF Swollen Glands = No & Fever = Yes
THEN Diagnosis = Cold
IF Swollen Glands = No & Fever = No
THEN Diagnosis = Allergy
IF = Antecedent
THEN = Consequent
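
To illustrate how a rule's confidence might be measured, the sketch below checks each of the three rules against the training data in Table 1.1, reduced to the two attributes the rules test. `rule_stats` is a hypothetical helper that reports how many training rows a rule covers and what fraction of those rows it labels correctly.

```python
# Table 1.1, patients 1 to 10, reduced to the attributes the rules test.
training = [
    {"swollen_glands": "Yes", "fever": "Yes", "diagnosis": "Strep throat"},
    {"swollen_glands": "No", "fever": "No", "diagnosis": "Allergy"},
    {"swollen_glands": "No", "fever": "Yes", "diagnosis": "Cold"},
    {"swollen_glands": "Yes", "fever": "No", "diagnosis": "Strep throat"},
    {"swollen_glands": "No", "fever": "Yes", "diagnosis": "Cold"},
    {"swollen_glands": "No", "fever": "No", "diagnosis": "Allergy"},
    {"swollen_glands": "Yes", "fever": "No", "diagnosis": "Strep throat"},
    {"swollen_glands": "No", "fever": "No", "diagnosis": "Allergy"},
    {"swollen_glands": "No", "fever": "Yes", "diagnosis": "Cold"},
    {"swollen_glands": "No", "fever": "Yes", "diagnosis": "Cold"},
]

# Each rule is (antecedent, consequent).
rules = [
    ({"swollen_glands": "Yes"}, "Strep throat"),
    ({"swollen_glands": "No", "fever": "Yes"}, "Cold"),
    ({"swollen_glands": "No", "fever": "No"}, "Allergy"),
]

def rule_stats(antecedent, consequent, rows):
    """Return (rows covered, fraction of covered rows with the consequent)."""
    covered = [r for r in rows if all(r[k] == v for k, v in antecedent.items())]
    correct = sum(1 for r in covered if r["diagnosis"] == consequent)
    return len(covered), correct / len(covered)

stats = [rule_stats(a, c, training) for a, c in rules]
for (antecedent, consequent), (n, conf) in zip(rules, stats):
    print(f"IF {antecedent} THEN {consequent}: covers {n}, confidence {conf:.2f}")
```

On this small table every rule is correct for every row it covers. On real data, confidences below 1.0 are what the conflict-resolution step would compare.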

Neural Networks
• Non-linear predictive models that learn through training and resemble biological neural networks in structure
• A means of efficiently modeling large and complex problems in which there may be hundreds of predictor variables that have many interactions
• Disadvantages
– Difficult to understand
– Can require significant amounts of time to train and to prepare data
– …
Figure 2.2 A multilayer fully connected neural network
[Figure: input layer → hidden layer → output layer]
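
The layered structure can be sketched as a forward pass through a fully connected network. Everything specific here is an assumption made for illustration: the layer sizes (3 inputs, 2 hidden nodes, 1 output), the weights, and the sigmoid activation. Training, which is where the weights would actually come from, is not shown.

```python
import math

def sigmoid(x):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def layer(inputs, weights, biases):
    """Fully connected layer: every output node sees every input."""
    return [
        sigmoid(sum(w * x for w, x in zip(node_weights, inputs)) + b)
        for node_weights, b in zip(weights, biases)
    ]

# Hypothetical weights: 3 inputs -> 2 hidden nodes -> 1 output node.
hidden_w = [[0.2, -0.4, 0.1], [0.7, 0.3, -0.5]]
hidden_b = [0.0, 0.1]
output_w = [[0.6, -0.3]]
output_b = [0.05]

def forward(inputs):
    """Input layer -> hidden layer -> output layer."""
    hidden = layer(inputs, hidden_w, hidden_b)
    return layer(hidden, output_w, output_b)

prediction = forward([1.0, 0.0, 1.0])
print(prediction)
```

The non-linearity comes from the sigmoid at each node. It is also part of why the fitted model is hard to interpret compared with a decision tree.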

Regression Models
• Statistical techniques
• Use existing values to forecast what other values will be
Y = a + b1(X1) + b2(X2) + b3(X3) + b4(X4) + b5(X5) …
• Many types of regression (linear regression, logistic regression, …)
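
For the simplest case of the formula above, a single predictor (Y = a + b1(X1)), the coefficients can be estimated by ordinary least squares. The sketch below assumes that one-variable case; the data points and the names `fit_line` and `predict` are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares estimates of intercept a and slope b for y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def predict(a, b, x):
    """Forecast y for a new x using the fitted line."""
    return a + b * x

# Hypothetical data lying exactly on y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

a, b = fit_line(xs, ys)
print(f"a={a:.2f}, b={b:.2f}, forecast for x=5: {predict(a, b, 5.0):.2f}")
```

With several predictors the same idea applies, but the coefficients are usually found with matrix methods rather than these closed-form sums.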

K-Means Clustering
• Unsupervised algorithm
• Steps of the algorithm
1. Choose a value for K, the total number of clusters.
2. Randomly choose K points as cluster centers.
3. Assign the remaining instances to their closest cluster center.
4. Calculate a new cluster center for each cluster.
5. Repeat steps 3-4 until the cluster centers do not change.
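
The steps above can be sketched in plain Python as follows. This is a minimal illustration, not a tuned implementation. The sample points are hypothetical; squared Euclidean distance is assumed as the closeness measure; every instance (including the initial centers) is assigned in step 3; and the optional `centers` argument is an addition that makes the example reproducible.

```python
import random

def kmeans(points, k, centers=None, max_iter=100):
    """Cluster points (tuples of numbers) into k groups."""
    # Step 2: pick K points as initial centers (random unless supplied).
    centers = list(centers) if centers is not None else random.sample(points, k)
    for _ in range(max_iter):
        # Step 3: assign every instance to its closest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            clusters[distances.index(min(distances))].append(p)
        # Step 4: recompute each center as the mean of its cluster
        # (an empty cluster keeps its previous center).
        new_centers = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centers)
        ]
        # Step 5: stop once the centers no longer change.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

# Hypothetical two-dimensional data with two obvious groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
centers, clusters = kmeans(points, k=2, centers=[(1.0, 1.0), (8.0, 8.0)])
print(centers)
print(clusters)
```

With random starting centers the result can differ from run to run, which is why K-means is often restarted several times and the best clustering kept.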

Table 2.3 • The Credit Card Promotion Database

Income Range ($)  Magazine Promotion  Watch Promotion  Life Insurance Promotion  Credit Card Insurance  Sex     Age
40–50K            Yes                 No               No                        No                     Male    45
30–40K            Yes                 Yes              Yes                       No                     Female  40
40–50K            No                  No               No                        No                     Male    42
30–40K            Yes                 Yes              Yes                       Yes                    Male    43
50–60K            Yes                 No               Yes                       No                     Female  38
20–30K            No                  No               No                        No                     Female  55
30–40K            Yes                 No               Yes                       Yes                    Male    35
20–30K            No                  Yes              No                        No                     Male    27
30–40K            Yes                 No               No                        No                     Male    43
30–40K            Yes                 Yes              Yes                       No                     Female  41
40–50K            No                  Yes              Yes                       No                     Female  43
20–30K            No                  Yes              Yes                       No                     Male    29
50–60K            Yes                 Yes              Yes                       No                     Female  39
40–50K            No                  Yes              No                        No                     Male    55
20–30K            No                  No               Yes                       Yes                    Female  19

A Hypothesis for the Credit Card Promotion Database
A combination of one or more of the dataset attributes differentiates Acme Credit Card Company card holders who have taken advantage of the life insurance promotion from those card holders who have chosen not to participate in the promotional offer.

Figure 2.3 An unsupervised clustering of the credit card database
[Figure: three clusters, labeled Cluster 1, Cluster 2 and Cluster 3, with the summaries below]

# Instances: 5
Sex: Male => 3, Female => 2
Age: 37.0
Credit Card Insurance: Yes => 1, No => 4
Life Insurance Promotion: Yes => 2, No => 3

# Instances: 3
Sex: Male => 3, Female => 0
Age: 43.3
Credit Card Insurance: Yes => 0, No => 3
Life Insurance Promotion: Yes => 0, No => 3

# Instances: 7
Sex: Male => 2, Female => 5
Age: 39.9
Credit Card Insurance: Yes => 2, No => 5
Life Insurance Promotion: Yes => 7, No => 0

Choosing a Data Mining Technique
• Know which kind of knowledge you want to get
• Know your data
– What is the interaction between input and output attributes?
– What is the distribution of the data?
– Which attributes best define the data?
• Know the differences among the data mining techniques

Questions to Determine Data Mining Applicability
1. Can the problem be clearly defined?
2. Does potentially meaningful data exist?
3. Does the data contain hidden knowledge or is it just filled with facts?
4. Is the “juice worth the squeeze?”

Data Mining vs. OLAP
Data mining:
• Discovery-based (inductive process)
• Mines the data warehouse and other sources
• Can provide information you didn’t expect
OLAP:
• Verification-based (deductive process)
• DSS (decision support system) tool for the data warehouse
• Pre-defined queries

Data Mining vs. Data Query
Data mining:
• For hidden knowledge
• Tries to get an answer that is as accurate as possible
• Results are an analysis of the data
• Data needs to be prepared before producing results
Data query:
• For a specific question
• Answer to the query is 100% accurate if the data is correct
• Results are a subset of the data
• Data need not be prepared