Data Mining CS157B Fall 04 Professor Lee By Yanhua Xue

Overview

What is Data Mining?
Why do we need Data Mining?
Major tasks of Data Mining

Here is a problem

You are a marketing manager for a brokerage company
– Problem: churn is too high. Turnover (after the six-month introductory period ends) is 40%
– Customers receive incentives (average cost: $160) when an account is opened
– Giving new incentives to everyone who might leave is very expensive
– Bringing back a customer after they leave is both difficult and costly

A solution

One month before the end of the introductory period, predict which customers will leave
– If you want to keep a customer who is predicted to churn, offer them something based on their predicted value
– The ones that are not predicted to churn need no attention
– If you don’t want to keep the customer, do nothing

Data Mining Definition

The automatic discovery of relationships in typically large databases and, in some instances, the use of the discovered results to predict relationships.
An essential process where intelligent methods are applied in order to extract data patterns.
Data mining lets you be proactive
– Prospective rather than retrospective

Why Mine Data? Commercial Viewpoint

Lots of data is being collected and warehoused.
Computing has become affordable.
Competitive pressure is strong
– Provide better, customized services for an edge.
– Information is becoming a product in its own right.

Why Mine Data? Scientific Viewpoint

Data collected and stored at enormous speeds
– Remote sensor on a satellite
– Telescope scanning the skies
– Microarrays generating gene expression data
– Scientific simulations generating terabytes of data
Traditional techniques are infeasible for raw data
Data mining for data reduction
– Cataloging, classifying, segmenting data
– Helps scientists in hypothesis formation

Major Data Mining Tasks

Classification: predicting an item class
Association Rule Discovery: descriptive
Clustering: descriptive, finding groups of items
Sequential Pattern Discovery: descriptive
Deviation Detection: predictive, finding changes
Forecasting: predicting a parameter value
Description: describing a group
Link Analysis: finding relationships and associations

Classification: Definition

Given a collection of records (training set)
– Each record contains a set of attributes; one of the attributes is the class.
Find a model for the class attribute as a function of the values of the other attributes.
Goal: previously unseen records should be assigned a class as accurately as possible.
– A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
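The split into training and test sets can be sketched in a few lines of Python. This is only an illustration: the records, the single attribute (income level), the 75/25 split, and the helper names split, build_model and accuracy are all assumptions made for the example, and the "model" is just the most common class per attribute value.

```python
from collections import Counter, defaultdict

# Hypothetical labeled records (attribute value, class); not from the slides.
records = [
    ("high", "buy"), ("high", "buy"), ("low", "no"), ("low", "no"),
    ("high", "buy"), ("low", "buy"), ("low", "no"), ("high", "no"),
]

def split(data, train_fraction):
    """Divide the data into a training set and a test set (no shuffling)."""
    cut = int(len(data) * train_fraction)
    return data[:cut], data[cut:]

def build_model(training_set):
    """Learn the most common class for each attribute value."""
    counts = defaultdict(Counter)
    for value, label in training_set:
        counts[value][label] += 1
    return {value: c.most_common(1)[0][0] for value, c in counts.items()}

def accuracy(model, test_set):
    """Fraction of test records whose predicted class matches the true class."""
    correct = sum(1 for value, label in test_set if model.get(value) == label)
    return correct / len(test_set)

training_set, test_set = split(records, 0.75)   # build on 6 records, validate on 2
model = build_model(training_set)
test_accuracy = accuracy(model, test_set)
```

The model never sees the test records while it is being built, so test_accuracy estimates how well it handles previously unseen records.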

Classification: Application

Direct Marketing
– Goal: Reduce the cost of mailing by targeting a set of customers likely to buy a new cell-phone product.
– Approach:
  Use the data for a similar product introduced before.
  We know which customers decided to buy and which decided otherwise. This {buy, don’t buy} decision forms the class attribute.
  Collect various demographic, lifestyle, and company-interaction related information about all such customers.
  – Type of business, where they stay, how much they earn, etc.
  Use this information as input attributes to learn a classifier model.

Classification (Cont’d)

A sample table

Age  Smoke  Risk
20   No     Low
25   Yes    High
44   Yes    High
18   No     Low
55   No     High
35   No     Low

To identify the risk of a group of insurance applicants.
The classes here are: Risk = Low, Risk = High

Classification (Cont’d)

The following techniques could be used:
– Decision tree
– Naïve Bayesian classifiers
– Association rules
– Neural networks
– etc.

Decision Tree

A widely used technique for classification.
Each leaf node of the tree has an associated class.
Each internal node has a predicate (or, more generally, a function) associated with it.
To classify a new instance, we start at the root and traverse the tree to reach a leaf; at an internal node we evaluate the predicate (or function) on the data instance to find which child to go to.
A series of nested if/then rules

Decision Tree

Age  Smoke  Risk
20   No     Low
25   Yes    High
44   Yes    High
18   No     Low
55   No     High
35   No     Low

[Figure: decision tree for InsuranceRisk. Nodes: Smoke (branches Yes / No) and Age (branches 0-35 / 36-100); leaves: High, High, Low.]
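The insurance tree can be written out as nested if/then rules. The sketch below assumes one reading of the figure: Smoke at the root, Smoke = Yes leading straight to High, and Smoke = No leading to the Age test (0-35 gives Low, 36-100 gives High). That reading is an assumption, chosen because it agrees with all six sample records; the function name classify_risk is hypothetical.

```python
# The six sample records: (age, smoke, risk).
sample = [
    (20, "No", "Low"),
    (25, "Yes", "High"),
    (44, "Yes", "High"),
    (18, "No", "Low"),
    (55, "No", "High"),
    (35, "No", "Low"),
]

def classify_risk(age, smoke):
    """Walk the assumed tree from the root to a leaf."""
    if smoke == "Yes":   # root predicate: Smoke
        return "High"
    if age <= 35:        # internal node: Age in 0-35
        return "Low"
    return "High"        # Age in 36-100

predictions = [classify_risk(age, smoke) for age, smoke, _ in sample]
matches = sum(p == risk for p, (_, _, risk) in zip(predictions, sample))
```

Each call evaluates at most two predicates, which is why classification with a tree is fast and easy to explain.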

Benefits of Decision Tree

Understandable
Relatively fast
Easy to translate to SQL queries

Associations

I = {i1, i2, …, im}: a set of literals, called items.
Transaction d: a set of items such that d ⊆ I
Database D: a set of transactions
A transaction d contains X, a set of some items in I, if X ⊆ d.
An association rule is an implication of the form X ⇒ Y, where X, Y ⊆ I.

Association Rule

Used to find all rules in basket data
Basket data is also called transaction data
Analyze how items purchased by customers in a shop are related
Discover all rules that have:
– support greater than minsup specified by the user
– confidence greater than minconf specified by the user
Example of transaction data:
– CD player, music CD, music book
– CD player, music CD
– music CD, music book
– CD player

Association Rule

Let I = {i1, i2, …, im} be the total set of items
D is a set of transactions
d is one transaction, consisting of a set of items
– d ⊆ I
Association rule:
– X ⇒ Y, where X ⊆ I, Y ⊆ I and X ∩ Y = ∅
– support = (# of transactions containing X ∪ Y) / |D|
– confidence = (# of transactions containing X ∪ Y) / (# of transactions containing X)


Example of transaction data:
– CD player, music CD, music book
– CD player, music CD
– music CD, music book
– CD player

I = {CD player, music CD, music book}
|D| = 4
# of transactions containing both CD player and music CD = 2
# of transactions containing CD player = 3
CD player ⇒ music CD (sup = 2/4, conf = 2/3)
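The support and confidence figures for this example can be checked with a short Python sketch. The four transactions are the ones listed above (item names lightly normalized); the helper names count_containing, support and confidence are illustrative assumptions.

```python
# The four example transactions, each a set of items.
transactions = [
    {"CD player", "music CD", "music book"},
    {"CD player", "music CD"},
    {"music CD", "music book"},
    {"CD player"},
]

def count_containing(itemset, db):
    """Number of transactions d in db with itemset as a subset of d."""
    return sum(1 for d in db if itemset <= d)

def support(x, y, db):
    """Fraction of all transactions that contain every item of X and Y."""
    return count_containing(x | y, db) / len(db)

def confidence(x, y, db):
    """Among transactions containing X, the fraction that also contain Y."""
    return count_containing(x | y, db) / count_containing(x, db)

x, y = {"CD player"}, {"music CD"}
sup = support(x, y, transactions)       # 2 of 4 transactions
conf = confidence(x, y, transactions)   # 2 of the 3 containing CD player
```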


How are association rules mined from large databases?

Two-step process:
– find all frequent item sets
– generate strong association rules from frequent item sets
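A minimal Python sketch of the two steps, run on the four CD-player transactions. It enumerates every candidate item set by brute force, which only works for tiny examples; it is not an optimized algorithm such as Apriori. The thresholds MINSUP = 0.4 and MINCONF = 0.6 and all function names are assumptions made for illustration.

```python
from itertools import combinations

transactions = [
    {"CD player", "music CD", "music book"},
    {"CD player", "music CD"},
    {"music CD", "music book"},
    {"CD player"},
]
MINSUP = 0.4    # hypothetical user-specified minimum support
MINCONF = 0.6   # hypothetical user-specified minimum confidence

def support(itemset, db):
    """Fraction of transactions that contain every item in itemset."""
    return sum(1 for d in db if itemset <= d) / len(db)

def frequent_itemsets(db, minsup):
    """Step 1: enumerate every candidate item set and keep the frequent ones."""
    items = sorted(set().union(*db))
    frequent = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            s = support(frozenset(candidate), db)
            if s > minsup:
                frequent[frozenset(candidate)] = s
    return frequent

def strong_rules(frequent, db, minconf):
    """Step 2: split each frequent item set into X => Y, keep confident rules."""
    rules = []
    for itemset, sup in frequent.items():
        for size in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), size):
                x = frozenset(lhs)
                conf = sup / support(x, db)
                if conf > minconf:
                    rules.append((set(x), set(itemset - x), sup, conf))
    return rules

frequent = frequent_itemsets(transactions, MINSUP)
rules = strong_rules(frequent, transactions, MINCONF)
```

With these thresholds, five item sets are frequent and four rules are strong, including music book ⇒ music CD with confidence 1.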

Classification vs. Association

Classification
– mines a small set of rules existing in the data to form a classifier or predictor
– has a target attribute
– datasets are in the form of a relational table
Association
– datasets are transaction data
– has no fixed target
– the target can be fixed, so association rules can also be used for classification
– A=a, B=b ⇒ Class = yes
– A=c ⇒ Class = no

Clustering Definition

Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that
– Data points in one cluster are more similar to one another.
– Data points in separate clusters are less similar to one another.
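The two conditions can be checked numerically for a given grouping. The sketch below does not find clusters; it only compares average distances within and between clusters. The 2-D points, the cluster labels A and B, and the use of Euclidean distance (smaller distance meaning more similar) are all assumptions made for illustration.

```python
from itertools import combinations
from math import dist

# Hypothetical 2-D points with an assumed cluster assignment.
clusters = {
    "A": [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0)],
    "B": [(8.0, 8.0), (9.0, 8.0), (8.0, 9.0)],
}

def mean(values):
    values = list(values)
    return sum(values) / len(values)

# Average distance between points that share a cluster.
intra = mean(
    dist(p, q)
    for points in clusters.values()
    for p, q in combinations(points, 2)
)

# Average distance between points that lie in different clusters.
inter = mean(
    dist(p, q)
    for (_, ps), (_, qs) in combinations(clusters.items(), 2)
    for p in ps
    for q in qs
)
```

A grouping that satisfies the definition has intra well below inter, as it does for these points.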

Clustering Application

Market Segmentation:
– Goal: subdivide a market into distinct subsets of customers, where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.
Approach:
– Collect different attributes of customers based on their geographical and lifestyle related information.
– Find clusters of similar customers.
– Measure the clustering quality by observing buying patterns of customers in the same cluster vs. those from different clusters.

References

Professor Lee’s lectures
– http://www.cs.sjsu.edu/~lee/cs157b/cs157b.html
Website
– http://www.thearling.com/dmintro/dmintro.pdf

Documents
