Introduction To Data Mining

What Is Data Mining?

• A tool
• Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
• Core of KDD
• Integration of multiple technologies

Part of KDD (Knowledge Discovery in Databases)

[Figure: the KDD process. Data → Selection → Target Data → Preprocessing → Preprocessed Data → Data Mining → Patterns → Interpretation/Evaluation → Knowledge]

Adapted from: U. Fayyad, et al. (1995), “From Knowledge Discovery to Data Mining: An Overview,” Advances in Knowledge Discovery and Data Mining, U. Fayyad et al. (Eds.), AAAI/MIT Press

Integration of Multiple Technologies

[Figure: Data Mining at the center, drawing on Machine Learning, Database Management, Artificial Intelligence, Statistics, Visualization, Algorithms, and other knowledge]

Why Data Mining?

• We are drowning in data (data explosion problem), but starving for knowledge!
• Solution: Data warehousing and data mining
  – Data warehousing and on-line analytical processing
  – Mining interesting knowledge (rules, regularities, patterns, constraints) from data in large databases
• A lot of potential applications
  – Market analysis and management
    • Target marketing, customer relationship management (CRM), market basket analysis, cross selling, market segmentation
  – Risk analysis and management
    • Forecasting, customer retention, improved underwriting, quality control, competitive analysis
  – Health care …

Data mining process

[Figure: process diagram with the labels State the problem + hypothesis, Operational Database, SQL Queries, Data Warehouse, Data Mining, Interpretation & Evaluation, Result Application, Knowledge-base]

Knowledge from Data Mining

• Association rules
• Sequential association
• Classification rules
• Clustering
• Deviation detection
• …

Association Rules

• Identify associations in the data (correlation [A,B] and causality [A->B])
• Indicate the significance of each association (an association is only interesting if its confidence exceeds a certain threshold)
• Not every association is interesting (too trivial, negative association)

E.g. market-basket analysis

“Find groups of items commonly purchased together”
– People who purchase fish are likely to purchase wine
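The significance of a rule such as "fish -> wine" is commonly measured by its support and confidence. The sketch below shows one way to compute both. The baskets are hypothetical and invented purely for illustration, and the function names `support` and `confidence` are assumptions, not part of the slides.

```python
def support(baskets, items):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in baskets) / len(baskets)


def confidence(baskets, antecedent, consequent):
    """Estimate of P(consequent | antecedent): support(A and B) / support(A)."""
    both = set(antecedent) | set(consequent)
    return support(baskets, both) / support(baskets, antecedent)


# Hypothetical market baskets (illustrative data only).
baskets = [
    {"fish", "wine", "bread"},
    {"fish", "wine"},
    {"fish", "milk"},
    {"bread", "milk"},
    {"wine", "cheese"},
]

rule_support = support(baskets, {"fish", "wine"})          # 2 of 5 baskets
rule_confidence = confidence(baskets, {"fish"}, {"wine"})  # 2 of the 3 fish baskets
```

A rule would be reported only if `rule_confidence` exceeds a threshold chosen by the analyst.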

Sequential Associations

• Find event sequences that are unusually likely
• Requires “training” event list, known “interesting” events
• Must be robust in the face of additional “noise” events

Uses:
• Failure analysis and prediction

Technologies:
• Dynamic programming (Dynamic time warping)
• “Custom” algorithms

“Find common sequences of warnings/faults within 10 minute periods”
– Warn 2 on Switch C preceded by Fault 21 on Switch B
– Fault 17 on any switch preceded by Warn 2 on any switch

Time   Switch  Event
21:10  B       Fault 21
21:11  A       Warn 2
21:13  C       Warn 2
21:20  A       Fault 17
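As a rough illustration of the "within 10 minute periods" query, the sketch below counts ordered pairs of events from the log above that occur at most 10 minutes apart. It is a simple "custom" approach, not dynamic time warping. Treating exactly 10 minutes as inside the window and assuming the log does not cross midnight are both assumptions, and the function names are hypothetical.

```python
from collections import Counter

WINDOW_MINUTES = 10  # assumption: a gap of exactly 10 minutes still counts


def to_minutes(hhmm):
    """Convert an 'HH:MM' string to minutes since midnight."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def ordered_pairs_within_window(events, window=WINDOW_MINUTES):
    """Count (earlier, later) event pairs at most `window` minutes apart."""
    events = sorted(events, key=lambda e: to_minutes(e[0]))
    counts = Counter()
    for i, (t1, switch1, event1) in enumerate(events):
        for t2, switch2, event2 in events[i + 1:]:
            if to_minutes(t2) - to_minutes(t1) > window:
                break  # events are sorted, so later ones are even further away
            counts[((switch1, event1), (switch2, event2))] += 1
    return counts


# Event log from the slide: (time, switch, event).
log = [
    ("21:10", "B", "Fault 21"),
    ("21:11", "A", "Warn 2"),
    ("21:13", "C", "Warn 2"),
    ("21:20", "A", "Fault 17"),
]

pairs = ordered_pairs_within_window(log)
```

On a longer log, pairs with unusually high counts would be candidates for sequential associations such as "Warn 2 on Switch C preceded by Fault 21 on Switch B".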

Classification rules

• Classify a set of data based on their values in certain attributes
• Requires “training data”: have predefined attributes

Uses:
• Profiling

Technologies:
• Generate decision trees (results are human understandable)
• Neural Nets

“Route documents to most likely interested parties”
– English or non-English?
– Domestic or Foreign?

[Figure: Training Data → tool produces classifier → Groups]

Clustering

• Group a set of data based on the conceptual clustering principle (i.e. maximizing the intraclass similarity and minimizing the interclass similarity)
• No “training data”: without predefined attributes

Uses:
• Demographic analysis

Technologies:
• Self-Organizing Maps
• Probability Densities
• Conceptual Clustering

“Group people with similar travel profiles”
– George, Patricia
– Jeff, Evelyn, Chris
– Rob

[Figure: Clusters]

Deviation Detection

• Find unexpected values, outliers

Uses:
• Failure analysis
• Anomaly discovery for analysis

Technologies:
• Clustering/classification methods
• Statistical techniques
• Visualization

“Find unusual occurrences in IBM stock prices”

Date      Close   Volume  Spread
58/07/02  369.50  314.08  .022561
58/07/03  369.25  313.87  .022561
58/07/04  Market Closed
58/07/07  370.00  314.50  .022561

Sample date  Event            Occurrences
58/07/04     Market closed    317 times
59/01/06     2.5% dividend    2 times
59/04/04     50% stock split  7 times
73/10/09     Not traded       1 time
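One simple statistical technique for deviation detection is to flag values that lie far from the mean, measured in standard deviations. The sketch below is only an assumption about how such a check might look. The threshold of 2.0 is arbitrary, and the price series is hypothetical: the first three closes echo the table above, and the rest are invented for illustration.

```python
from statistics import mean, pstdev


def find_deviations(values, threshold=2.0):
    """Return (index, value) pairs lying more than `threshold` standard
    deviations away from the mean of `values`."""
    center = mean(values)
    spread = pstdev(values)
    if spread == 0:
        return []  # all values identical, nothing deviates
    return [
        (i, v)
        for i, v in enumerate(values)
        if abs(v - center) / spread > threshold
    ]


# Hypothetical closing prices (illustrative only).
closes = [369.50, 369.25, 370.00, 369.75, 370.25, 185.00, 369.90]

outliers = find_deviations(closes)
```

A flagged value is only a candidate: an analyst still has to decide whether it reflects an event such as a stock split or a data error.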

Popular Data Mining Techniques

• Supervised
  – Decision trees
  – Rule induction
  – Regression models
  – Neural networks
  – …
• Unsupervised
  – K-means clustering
  – Self-organizing maps
  – …

Supervised vs. Unsupervised

• Supervised algorithms
  » Learning by example:
    – Use training data for which the value of the response variable is already known
    – Create a model by running the algorithm on the training data
    – Identify a class label for incoming new data
  » Driven by real business problems and historical data
• Unsupervised algorithms
  » Do not use training data
  » Patterns may not be known in advance

Supervised Algorithms

Decision Trees

• A tree structure where non-terminal nodes represent tests on one or more attributes and terminal nodes reflect decision outcomes.
• Advantages of decision trees
  – Understandable
  – Relatively fast
  – Easy to translate into SQL queries
• Disadvantages of decision trees
  – Limited to one output attribute
  – Decision tree algorithms can be unstable (a small change in the training data can produce a different tree)
• Types of decision trees
  – CHAID: Chi-Square Automatic Interaction Detection
  – CART: Classification and Regression Trees
  – …

Table 1.1 • Hypothetical Training Data for Disease Diagnosis

Patient ID#  Sore Throat  Fever  Swollen Glands  Congestion  Headache  Diagnosis
1            Yes          Yes    Yes             Yes         Yes       Strep throat
2            No           No     No              Yes         Yes       Allergy
3            Yes          Yes    No              Yes         No        Cold
4            Yes          No     Yes             No          No        Strep throat
5            No           Yes    No              Yes         No        Cold
6            No           No     No              Yes         No        Allergy
7            No           No     Yes             No          No        Strep throat
8            Yes          No     No              Yes         Yes       Allergy
9            No           Yes    No              Yes         Yes       Cold
10           Yes          Yes    No              Yes         Yes       Cold

Figure 1.1 A decision tree for the data in Table 1.1

Swollen Glands
  Yes: Diagnosis = Strep Throat
  No: Fever
    Yes: Diagnosis = Cold
    No: Diagnosis = Allergy

Table 1.2 • Data Instances with an Unknown Classification

Patient ID#  Sore Throat  Fever  Swollen Glands  Congestion  Headache  Diagnosis
11           No           No     Yes             Yes         Yes       ?
12           Yes          Yes    No              No          Yes       ?
13           No           No     No              No          Yes       ?
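The tree in Figure 1.1 tests only Swollen Glands and Fever, so it can be written as a small function and applied to the unclassified instances in Table 1.2. This is a minimal sketch: the function name `diagnose` and the dictionary layout are assumptions made for illustration.

```python
def diagnose(swollen_glands, fever):
    """Walk the decision tree of Figure 1.1 for one patient."""
    if swollen_glands == "Yes":
        return "Strep throat"
    if fever == "Yes":
        return "Cold"
    return "Allergy"


# The two attributes the tree uses, taken from Table 1.2.
unknown = {
    11: {"Fever": "No", "Swollen Glands": "Yes"},
    12: {"Fever": "Yes", "Swollen Glands": "No"},
    13: {"Fever": "No", "Swollen Glands": "No"},
}

predictions = {
    patient_id: diagnose(row["Swollen Glands"], row["Fever"])
    for patient_id, row in unknown.items()
}
```

Under this tree, patient 11 is classified as strep throat, patient 12 as a cold, and patient 13 as an allergy.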

Rule induction

• The extraction of useful, independent if-then rules from data based on statistical significance
• If rules produce conflicting predictions, resolve the conflict according to confidence
• Advantage and disadvantage
  – Understandable
  – May not cover all possible situations

E.g.

IF Swollen Glands = Yes
THEN Diagnosis = Strep Throat

IF Swollen Glands = No & Fever = Yes
THEN Diagnosis = Cold

IF Swollen Glands = No & Fever = No
THEN Diagnosis = Allergy

IF = Antecedent
THEN = Consequent
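To make "confidence" concrete, the sketch below checks each of the three example rules against the training data of Table 1.1, counting how many patients match the antecedent and how often the consequent is then correct. It also shows one possible way to resolve conflicts by picking the matching rule with the highest confidence. The data layout and the names `rule_stats` and `predict` are assumptions for illustration, not a specific rule-induction algorithm.

```python
# Table 1.1 reduced to the attributes the rules test:
# (patient ID, swollen glands, fever, diagnosis)
training = [
    (1, "Yes", "Yes", "Strep throat"),
    (2, "No", "No", "Allergy"),
    (3, "No", "Yes", "Cold"),
    (4, "Yes", "No", "Strep throat"),
    (5, "No", "Yes", "Cold"),
    (6, "No", "No", "Allergy"),
    (7, "Yes", "No", "Strep throat"),
    (8, "No", "No", "Allergy"),
    (9, "No", "Yes", "Cold"),
    (10, "No", "Yes", "Cold"),
]

# Each rule: (antecedent as a predicate on (swollen_glands, fever), consequent).
rules = [
    (lambda sg, fever: sg == "Yes", "Strep throat"),
    (lambda sg, fever: sg == "No" and fever == "Yes", "Cold"),
    (lambda sg, fever: sg == "No" and fever == "No", "Allergy"),
]


def rule_stats(antecedent, consequent, rows):
    """Return (patients matching the antecedent, confidence of the rule)."""
    matched = [diag for _, sg, fever, diag in rows if antecedent(sg, fever)]
    if not matched:
        return 0, 0.0
    correct = sum(diag == consequent for diag in matched)
    return len(matched), correct / len(matched)


stats = [rule_stats(antecedent, consequent, training) for antecedent, consequent in rules]


def predict(swollen_glands, fever):
    """Among the rules that fire, return the consequent with the highest confidence."""
    fired = [
        (confidence, consequent)
        for (antecedent, consequent), (_, confidence) in zip(rules, stats)
        if antecedent(swollen_glands, fever)
    ]
    return max(fired)[1] if fired else None  # None: no rule covers this case
```

On this small table every rule is correct for every patient it covers, so no conflict arises; `predict` returning `None` corresponds to a situation the rule set does not cover.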

Neural Networks

• Non-linear predictive models that learn through training and resemble biological neural networks in structure
• Means of efficiently modeling large and complex problems in which there may be hundreds of predictor variables that have many interactions
• Disadvantages
  – Difficult to understand
  – Can require significant amounts of time to train and to prepare data
  – …

Figure 2.2 A multilayer fully connected neural network

[Figure: Input Layer → Hidden Layer → Output Layer]

Regression Models

• Statistical techniques
• Using existing values to forecast what other values will be.

Y = a + b1(X1) + b2(X2) + b3(X3) + b4(X4) + b5(X5) …

• Many types of regression (linear regression, logistic regression, …)
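For the simplest case of the formula above, a single predictor (Y = a + b1(X1)), the coefficients can be estimated by ordinary least squares. The sketch below uses the standard closed-form solution. The data points are hypothetical and were chosen to lie exactly on a line so the result is easy to check by hand; the function name is an assumption.

```python
def fit_simple_regression(xs, ys):
    """Least-squares estimates (a, b1) for the model Y = a + b1 * X1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx            # slope
    a = mean_y - b1 * mean_x  # intercept
    return a, b1


# Hypothetical existing values (illustrative only): y = 1 + 2x exactly.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]

a, b1 = fit_simple_regression(xs, ys)
forecast = a + b1 * 6.0  # forecast Y for a new value X1 = 6
```

With several predictors the same idea applies, but the coefficients are usually found with a linear algebra routine rather than by hand.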

K-Means Clustering

• Unsupervised algorithm
• Steps of algorithm
  1. Choose a value for K, the total number of clusters.
  2. Randomly choose K points as cluster centers.
  3. Assign the remaining instances to their closest cluster center.
  4. Calculate a new cluster center for each cluster.
  5. Repeat steps 3-4 until the cluster centers do not change.
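The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the sample points are hypothetical, the fixed random seed and the iteration cap are assumptions added for repeatability, and an empty cluster simply keeps its previous center.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def k_means(points, k, seed=0, max_iterations=100):
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # step 2: random initial centers
    clusters = []
    for _ in range(max_iterations):
        # Step 3: assign every instance to its closest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centers[i]))
            clusters[nearest].append(p)
        # Step 4: recompute each center (an empty cluster keeps its old one).
        new_centers = [
            mean_point(cluster) if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
        # Step 5: stop once the centers no longer change.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters


# Hypothetical 2-D instances forming two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]

centers, clusters = k_means(points, k=2)  # step 1: K = 2
```

Because the initial centers are random, different seeds can give different results on less clearly separated data, which is why K-means is often run several times.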

Table 2.3 • The Credit Card Promotion Database

Income Range ($)  Magazine Promotion  Watch Promotion  Life Insurance Promotion  Credit Card Insurance  Sex     Age
40–50K            Yes                 No               No                        No                     Male    45
30–40K            Yes                 Yes              Yes                       No                     Female  40
40–50K            No                  No               No                        No                     Male    42
30–40K            Yes                 Yes              Yes                       Yes                    Male    43
50–60K            Yes                 No               Yes                       No                     Female  38
20–30K            No                  No               No                        No                     Female  55
30–40K            Yes                 No               Yes                       Yes                    Male    35
20–30K            No                  Yes              No                        No                     Male    27
30–40K            Yes                 No               No                        No                     Male    43
30–40K            Yes                 Yes              Yes                       No                     Female  41
40–50K            No                  Yes              Yes                       No                     Female  43
20–30K            No                  Yes              Yes                       No                     Male    29
50–60K            Yes                 Yes              Yes                       No                     Female  39
40–50K            No                  Yes              No                        No                     Male    55
20–30K            No                  No               Yes                       Yes                    Female  19

A Hypothesis for the Credit Card Promotion Database

A combination of one or more of the dataset attributes differentiates Acme Credit Card Company card holders who have taken advantage of the life insurance promotion from those card holders who have chosen not to participate in the promotional offer.

Figure 2.3 An unsupervised cluster of the credit card database

[Figure: summaries of Cluster 1, Cluster 2 and Cluster 3. Only one cluster summary survives in full:]

# Instances: 5
Sex: Male => 3, Female => 2
Age: 37.0
Credit Card Insurance: Yes => 1, No => 4
Life Insurance Promotion: Yes => 2, No => 3

[Surviving fragments of the other summaries: No => 3; No => 0]

Choosing a Data Mining Technique

• Know which kind of knowledge you want to obtain
• Know your data
  – What is the interaction between input and output attributes?
  – What is the distribution of the data?
  – Which attributes best define the data?
• Know the differences among data mining techniques

Questions to Determine Data Mining Applicability

1. Can the problem be clearly defined?
2. Does potentially meaningful data exist?
3. Does data contain hidden knowledge or is it just filled with facts?
4. Is the “juice worth the squeeze?”

Data Mining vs. OLAP

Data Mining
• Discovery-based (inductive process)
• Mine data warehouse and others
• Can provide information you didn’t expect

OLAP
• Verification-based (deductive process)
• DSS tool for data warehouse
• Pre-defined queries

Data Mining vs. Data Query

Data Mining
• For hidden knowledge
• Tries to get an answer that is as accurate as possible
• Results are an analysis of the data
• Data need to be prepared before producing results

Data Query
• For a specific question
• Answer to the query is 100% accurate if the data are correct
• Results are a subset of the data
• No need to prepare the data
