Data Mining
CS157B Fall 04
Professor Lee
By Yanhua Xue
Overview
What is Data Mining?
Why do we need Data Mining?
Major tasks of Data Mining
Here is a problem
You are a marketing manager for a brokerage company
– Problem: Churn is too high
– Turnover (after the six-month introductory period ends) is 40%
– Customers receive incentives (average cost: $160) when an account is opened
– Giving new incentives to everyone who might leave is very expensive
– Bringing back a customer after they leave is both difficult and costly
A solution
One month before the end of the introductory period, predict which customers will leave
– If you want to keep a customer that is predicted to churn, offer them something based on their predicted value
– The ones that are not predicted to churn need no attention
– If you don’t want to keep the customer, do nothing
Data Mining Definition
The automatic discovery of relationships in typically large databases and, in some instances, the use of the discovery results in predicting relationships.
An essential process where intelligent methods are applied in order to extract data patterns.
Data mining lets you be proactive
– Prospective rather than retrospective
Why Mine Data? Commercial Viewpoint
Lots of data is being collected and warehoused.
Computing has become affordable.
Competitive pressure is strong
– Provide better, customized services for an edge.
– Information is becoming a product in its own right.
Why Mine Data? Scientific Viewpoint
Data is collected and stored at enormous speeds
– Remote sensors on a satellite
– Telescopes scanning the skies
– Microarrays generating gene expression data
– Scientific simulations generating terabytes of data
Traditional techniques are infeasible for raw data
Data mining for data reduction
– Cataloging, classifying, segmenting data
– Helps scientists in hypothesis formation
Major Data Mining Tasks
Classification: predicting an item class
Association Rule Discovery: descriptive
Clustering: descriptive, finding groups of items
Sequential Pattern Discovery: descriptive
Deviation Detection: predictive, finding changes
Forecasting: predicting a parameter value
Description: describing a group
Link Analysis: finding relationships and associations
Classification: Definition
Given a collection of records (training set)
– Each record contains a set of attributes; one of the attributes is the class.
Find a model for the class attribute as a function of the values of the other attributes.
Goal: previously unseen records should be assigned a class as accurately as possible.
– A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
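The training/test division described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the 25% test fraction, and the helper names (`train_test_split`, `accuracy`, `majority_class_model`) are hypothetical and do not come from the slides, and the "model" is a deliberately trivial baseline that always predicts the most common training class.

```python
import random

def train_test_split(records, test_fraction=0.25, seed=0):
    """Shuffle a copy of the records and return (training set, test set)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(model, test_set):
    """Fraction of test records whose predicted class equals the true class."""
    correct = sum(1 for attributes, label in test_set if model(attributes) == label)
    return correct / len(test_set)

def majority_class_model(training_set):
    """Baseline model: always predict the most common class seen in training."""
    labels = [label for _, label in training_set]
    most_common = max(sorted(set(labels)), key=labels.count)
    return lambda attributes: most_common

# Hypothetical records: (attributes, class) pairs, invented for this sketch.
records = [({"age": a}, "High" if a > 40 else "Low") for a in range(18, 58)]

training_set, test_set = train_test_split(records)
model = majority_class_model(training_set)  # built from the training set only
print(len(training_set), "training records,", len(test_set), "test records")
print("accuracy on unseen records:", accuracy(model, test_set))
```

The model never sees the test records while it is being built, so the accuracy printed at the end estimates how well it handles previously unseen records.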
Classification: Application
Direct Marketing
– Goal: Reduce cost of mailing by targeting a set of customers likely to buy a new cell-phone product.
– Approach:
Use the data for a similar product introduced before.
We know which customers decided to buy and which decided otherwise. This {buy, don’t buy} decision forms the class attribute.
Collect various demographic, lifestyle, and company-interaction related information about all such customers.
– Type of business, where they stay, how much they earn, etc.
Use this information as input attributes to learn a classifier model.
Classification (Cont’d)
A sample table, used to identify the risk of a group of insurance applicants.

Age   Smoke   Risk
20    No      Low
25    Yes     High
44    Yes     High
18    No      Low
55    No      High
35    No      Low

The classes here are:
Risk = Low
Risk = High
Classification (Cont’d)
The following techniques could be used:
– Decision tree
– Naïve Bayesian classifiers
– Association rules
– Neural networks
– etc.
Decision Tree
A widely used technique for classification.
Each leaf node of the tree has an associated class.
Each internal node has a predicate (or, more generally, a function) associated with it.
To classify a new instance, we start at the root and traverse the tree to reach a leaf; at an internal node we evaluate the predicate (or function) on the data instance to find which child to go to.
A decision tree is equivalent to a series of nested if/then rules.
Decision Tree
[Figure: decision tree for Insurance Risk, shown next to the sample table (Age, Smoke, Risk). Internal nodes: Smoke (Yes / No) and Age (0-35 / 36-100). Leaves: High, High, Low.]
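A rough Python sketch of root-to-leaf traversal on the insurance sample table follows. The exact shape of the tree is an assumption: Smoke is tested at the root and Age under its "No" branch, which is one arrangement consistent with the node labels (Smoke, Age, 0-35, 36-100) and with every row of the table. The tuple-based node layout and the `classify` helper are likewise hypothetical, chosen only for illustration.

```python
# Sample table from the slides: (age, smoke, risk).
SAMPLE = [
    (20, "No", "Low"),
    (25, "Yes", "High"),
    (44, "Yes", "High"),
    (18, "No", "Low"),
    (55, "No", "High"),
    (35, "No", "Low"),
]

# Internal node: (predicate, {outcome: child}). Leaf: a class label string.
# Assumption: Smoke at the root, Age below the "No" branch.
TREE = (
    lambda r: r["smoke"],
    {
        "Yes": "High",
        "No": (
            lambda r: "0-35" if r["age"] <= 35 else "36-100",
            {"0-35": "Low", "36-100": "High"},
        ),
    },
)

def classify(node, record):
    """Start at the root; at each internal node evaluate the predicate
    on the record and move to the matching child until a leaf is reached."""
    while not isinstance(node, str):
        predicate, children = node
        node = children[predicate(record)]
    return node

for age, smoke, risk in SAMPLE:
    predicted = classify(TREE, {"age": age, "smoke": smoke})
    print(age, smoke, "->", predicted, "(table says " + risk + ")")
```

Written out by hand, the same tree is the nested rule "if Smoke = Yes then High, else if Age <= 35 then Low, else High".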
Benefits of Decision Tree
Understandable
Relatively fast
Easy to translate to SQL queries
Associations
I = {i1, i2, …, im}: a set of literals, called items.
Transaction d: a set of items such that d ⊆ I
Database D: a set of transactions
A transaction d contains X, a set of some items in I, if X ⊆ d.
An association rule is an implication of the form X ⇒ Y, where X, Y ⊂ I.
Association Rule
Used to find all rules in basket data
Basket data is also called transaction data
Analyze how items purchased by customers in a shop are related
Discover all rules that have:
– support greater than minsup specified by user
– confidence greater than minconf specified by user
Example of transaction data:
– CD player, music CD, music book
– CD player, music CD
– music CD, music book
– CD player
Association Rule
Let I = {i1, i2, …, im} be the total set of items
D is a set of transactions
d is one transaction, consisting of a set of items
– d ⊆ I
Association rule:
– X ⇒ Y, where X ⊂ I, Y ⊂ I and X ∩ Y = ∅
– support = (# of transactions containing X ∪ Y) / |D|
– confidence = (# of transactions containing X ∪ Y) / (# of transactions containing X)
Association Rule
Example of transaction data:
– CD player, music CD, music book
– CD player, music CD
– music CD, music book
– CD player
I = {CD player, music CD, music book}
|D| = 4
# of transactions containing both CD player and music CD = 2
# of transactions containing CD player = 3
CD player ⇒ music CD (sup = 2/4, conf = 2/3)
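The support and confidence arithmetic for this example can be checked with a short Python sketch. The function names `support` and `confidence` are illustrative choices, and transactions are modeled as Python sets; neither detail comes from the slides.

```python
def support(transactions, itemset):
    """Fraction of all transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(transactions, x, y):
    """Among transactions containing X, the fraction that also contain Y."""
    x, y = set(x), set(y)
    containing_x = [t for t in transactions if x <= t]
    containing_both = [t for t in containing_x if y <= t]
    return len(containing_both) / len(containing_x)

# The four example transactions.
D = [
    {"CD player", "music CD", "music book"},
    {"CD player", "music CD"},
    {"music CD", "music book"},
    {"CD player"},
]

sup = support(D, {"CD player", "music CD"})        # 2 of 4 transactions
conf = confidence(D, {"CD player"}, {"music CD"})  # 2 of the 3 with a CD player
print("support =", sup, " confidence =", round(conf, 3))
```

Support is measured against the whole database, while confidence is measured only against the transactions that contain the left-hand side X.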
Association Rule
How are association rules mined from large databases?
Two-step process:
– find all frequent item sets
– generate strong association rules from the frequent item sets
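The two steps can be sketched in Python as below. Step 1 here is a brute-force enumeration of every candidate item set, which is only practical for tiny examples; it is not the pruning-based algorithm (such as Apriori) that a large database would need. The thresholds (minsup = 0.5, minconf = 0.6), the inclusive `>=` comparisons, and the function names are assumptions made for the sketch.

```python
from itertools import combinations

def frequent_itemsets(transactions, minsup):
    """Step 1 (brute force): map each item set with support >= minsup to its support."""
    items = sorted(set().union(*transactions))
    n = len(transactions)
    frequent = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            count = sum(1 for t in transactions if set(candidate) <= t)
            if count / n >= minsup:
                frequent[frozenset(candidate)] = count / n
    return frequent

def strong_rules(frequent, minconf):
    """Step 2: split each frequent item set into X => Y and keep the
    rules whose confidence is at least minconf."""
    rules = []
    for itemset, sup in frequent.items():
        for size in range(1, len(itemset)):
            for x in combinations(sorted(itemset), size):
                x = frozenset(x)
                # Every subset of a frequent item set is itself frequent,
                # so its support is already in the dictionary.
                conf = sup / frequent[x]
                if conf >= minconf:
                    rules.append((set(x), set(itemset - x), sup, conf))
    return rules

D = [
    {"CD player", "music CD", "music book"},
    {"CD player", "music CD"},
    {"music CD", "music book"},
    {"CD player"},
]

frequent = frequent_itemsets(D, minsup=0.5)
rules = strong_rules(frequent, minconf=0.6)
for x, y, sup, conf in rules:
    print(sorted(x), "=>", sorted(y), "sup =", sup, "conf =", round(conf, 3))
```

Separating the steps means the expensive counting pass over the database happens only in step 1; step 2 works entirely from the stored supports.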
Classification vs. Association
Classification
– mines a small set of rules existing in the data to form a classifier or predictor
– has a target attribute
– datasets are in the form of a relational table
Association
– datasets are transaction data
– has no fixed target
– the target can be fixed, so association rules can also be used for classification
– A=a, B=b ⇒ Class = yes
– A=c ⇒ Class = no
Clustering Definition
Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that
– Data points in one cluster are more similar to one another.
– Data points in separate clusters are less similar to one another.
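One way to make this definition concrete is k-means, sketched below in Python. The slides do not name a clustering algorithm, so k-means is a swapped-in example, and Euclidean distance stands in for the similarity measure (a smaller distance means more similar). The customer data points and the starting centroids are invented for illustration.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Repeat: assign each point to its nearest centroid, then move each
    centroid to the mean of the points assigned to it."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return clusters, centroids

# Hypothetical customers described by two attributes: (age, income in $1000s).
points = [(22, 30), (25, 35), (27, 32), (52, 90), (55, 95), (58, 88)]

# Assumption: one starting centroid is picked by hand from each apparent group.
clusters, centroids = kmeans(points, centroids=[points[0], points[3]])
for cluster, centroid in zip(clusters, centroids):
    print(cluster, "centroid:", centroid)
```

Points that end up in the same cluster are closer to their own centroid than to the other one, which is the "more similar within, less similar across" property the definition asks for.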
Clustering Application
Market Segmentation:
– Goal: subdivide a market into distinct subsets of customers, where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.
Approach:
– Collect different attributes of customers based on their geographical and lifestyle related information.
– Find clusters of similar customers.
– Measure the clustering quality by observing buying patterns of customers in the same cluster vs. those from different clusters.
References
Professor Lee’s lectures
– http://www.cs.sjsu.edu/~lee/cs157b/cs157b.html
Website
– http://www.thearling.com/dmintro/dmintro.pdf