Data Mining – Day 2
Fabiano Dalpiaz
Department of Information and
Communication Technology
University of Trento - Italy
http://www.dit.unitn.it/~dalpiaz
Databases and Business Intelligence
Academic Year 2007-2008
© P. Giorgini, F. Dalpiaz
Knowledge Discovery (KDD) Process
[Figure: the KDD process pipeline: Databases → Data Cleaning and Data Integration → Data Warehouse → Selection of task-relevant data → Data Mining → Pattern Evaluation. The steps up to the Data Warehouse were presented yesterday; today covers Data Mining and Pattern Evaluation.]
Outline
• Data Mining techniques
  • Frequent patterns, association rules: support and confidence
  • Classification and prediction: decision trees, Bayesian classifiers, Support Vector Machines, lazy learning
  • Cluster Analysis
• Visualization of the results
• Summary
Frequent pattern analysis: what is it?
• Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
• Frequent pattern analysis: searching for frequent patterns
• Motivation: finding inherent regularities in data
  • Which products are bought together? (yesterday's wine and spaghetti example)
  • What are the subsequent purchases after buying a PC?
  • Can we automatically classify web documents?
• Applications: basket data analysis, cross-marketing, catalog design, sale campaign analysis
Basic Concepts: Frequent Patterns and Association Rules (1)
Transaction-id Items bought
1 Wine, Bread, Spaghetti
2 Wine, Cocoa, Spaghetti
3 Wine, Spaghetti, Cheese
4 Bread, Cheese, Sugar
5 Bread, Cocoa, Spaghetti, Cheese, Sugar
Itemsets (= transactions in this example)

Goal: find all rules of type X ⇒ Y between items in an itemset, with minimum:
• Support s: probability that an itemset contains X ∪ Y
• Confidence c: conditional probability that an itemset containing X also contains Y
Basic Concepts: Frequent Patterns and Association Rules (2)
Transaction-id Items bought
1 Wine, Bread, Spaghetti
2 Wine, Cocoa, Spaghetti
3 Wine, Spaghetti, Cheese
4 Bread, Cheese, Sugar
5 Bread, Cocoa, Spaghetti, Cheese, Sugar
Suppose: support s = 50%, confidence c = 50%

Support is used to define frequent patterns (sets of products contained in at least s% of the itemsets):
• {Wine} in itemsets 1, 2, 3 (support = 60%)
• {Bread} in itemsets 1, 4, 5 (support = 60%)
• {Spaghetti} in itemsets 1, 2, 3, 5 (support = 80%)
• {Cheese} in itemsets 3, 4, 5 (support = 60%)
• {Wine, Spaghetti} in itemsets 1, 2, 3 (support = 60%)
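To make the definition concrete, here is a minimal Python sketch (the transaction data comes from the table above; the `support` helper is our own illustration, not part of the slides):

```python
# The five example transactions from the table above.
transactions = [
    {"Wine", "Bread", "Spaghetti"},
    {"Wine", "Cocoa", "Spaghetti"},
    {"Wine", "Spaghetti", "Cheese"},
    {"Bread", "Cheese", "Sugar"},
    {"Bread", "Cocoa", "Spaghetti", "Cheese", "Sugar"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

print(support({"Spaghetti"}, transactions))          # 0.8
print(support({"Wine", "Spaghetti"}, transactions))  # 0.6
```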
Basic Concepts: Frequent Patterns and Association Rules (3)
Transaction-id Items bought
1 Wine, Bread, Spaghetti
2 Wine, Cocoa, Spaghetti
3 Wine, Spaghetti, Cheese
4 Bread, Cheese, Sugar
5 Bread, Cocoa, Spaghetti, Cheese, Sugar
Suppose: support s = 50%, confidence c = 50%

Confidence defines association rules: X ⇒ Y rules in frequent patterns whose confidence is greater than c.
Suggestion: {Wine, Spaghetti} is the only frequent pattern to be considered. Why?

Association rules:
• Wine ⇒ Spaghetti (support = 60%, confidence = 100%)
• Spaghetti ⇒ Wine (support = 60%, confidence = 75%)
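A companion sketch, reusing the `transactions` list and `support` helper from the previous example: the confidence of X ⇒ Y is support(X ∪ Y) divided by support(X).

```python
def confidence(X, Y, transactions):
    """Conditional probability that a transaction containing X also contains Y."""
    return support(X | Y, transactions) / support(X, transactions)

print(confidence({"Wine"}, {"Spaghetti"}, transactions))  # 1.0  -> Wine => Spaghetti
print(confidence({"Spaghetti"}, {"Wine"}, transactions))  # 0.75 -> Spaghetti => Wine
```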
Advanced concepts in Association Rules discovery
• Algorithms must face scalability problems
  • Apriori property: if an itemset is infrequent, its supersets should not be generated/tested!
• Advanced problems
  • Boolean vs. quantitative associations
    age(x, “30..39”) and income(x, “42..48K”) ⇒ buys(x, “car”) [s=1%, c=75%]
  • Single-level vs. multiple-level analysis
    What brands of wine are associated with what brands of spaghetti?

Are support and confidence clear?
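The Apriori idea can be illustrated with a short Python sketch, a simplified and unoptimized illustration that reuses the `transactions` and `support` helpers defined earlier: candidate (k+1)-itemsets are built only from frequent k-itemsets, so no superset of an infrequent itemset is ever tested.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Simplified level-wise Apriori with superset pruning."""
    items = sorted(set().union(*transactions))
    # Level 1: frequent single items.
    frequent = [frozenset([i]) for i in items
                if support({i}, transactions) >= min_support]
    result = list(frequent)
    k = 1
    while frequent:
        # Join frequent k-itemsets into candidate (k+1)-itemsets; a candidate
        # survives only if all its k-subsets are frequent (Apriori property).
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k + 1}
        frequent = [c for c in candidates
                    if all(frozenset(s) in result for s in combinations(c, k))
                    and support(c, transactions) >= min_support]
        result.extend(frequent)
        k += 1
    return result

print(apriori(transactions, 0.5))
# Singles {Wine}, {Bread}, {Spaghetti}, {Cheese} plus the pair {Wine, Spaghetti}
```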
Another example for association rules
Transaction-id Items bought
1 Margherita, Beer, Coke
2 Margherita, Beer
3 Quattro stagioni, Coke
4 Margherita, Coke
Frequent itemsets:
• {Margherita} = 75%
• {Beer} = 50%
• {Coke} = 75%
• {Margherita, Beer} = 50%
• {Margherita, Coke} = 50%

Support s = 40%, confidence c = 70%

Association rules:
• Beer ⇒ Margherita [s=50%, c=100%]
Classification vs. Prediction
• Classification
  • Characterizes (describes) a set of items belonging to a training set; these items are already classified according to a label attribute
  • The characterization is a model
  • The model can be applied to classify new data (predict the class they should belong to)
• Prediction
  • Models continuous-valued functions, i.e., predicts unknown or missing values
• Applications
  • Credit approval, target marketing, fraud detection
Classification: the process
1. Model construction
  • The class label attribute defines the class each item should belong to
  • The set of items used for model construction is called the training set
  • The model is represented as classification rules, decision trees, or mathematical formulae
2. Model usage
  • Estimate the accuracy of the model
    • On the training set
    • On a separate test set, to measure how well the model generalizes
  • If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known
Classification: the process – Model construction

Training data:
NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

The classification algorithm is run on the training data and produces the classifier (model):
IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
Classification: the process – Model usage

Testing data:
NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

The classifier
IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
is evaluated on the testing data and can then be applied to unseen data, e.g. (Jeff, Professor, 4): Tenured?
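Since the model here is a single rule, its usage can be shown directly in Python (a toy sketch; the names come from the slide):

```python
def tenured(rank, years):
    """The learned model: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    return "yes" if rank == "Professor" or years > 6 else "no"

# The unseen tuple from the slide:
print(tenured("Professor", 4))  # yes -> Jeff is predicted to be tenured
```

Note that on the testing data the rule misclassifies Merlisa (years = 7 but not tenured), which is exactly the kind of error that accuracy estimation on a test set is meant to reveal.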
Supervised vs. Unsupervised Learning
• Supervised learning (classification)
  • Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  • New data is classified based on the training set
• Unsupervised learning (clustering)
  • The class labels of the training data are unknown
  • Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
Evaluating generated models
• Accuracy
  • Classifier accuracy: predicting class labels
  • Predictor accuracy: guessing values of predicted attributes
• Speed
  • Time to construct the model (training time)
  • Time to use the model (classification/prediction time)
• Robustness: handling noise and missing values
• Scalability: efficiency in disk-resident databases
• Interpretability: understanding and insight provided by the model
Classification techniques – Decision Trees (1)

[Figure: a decision tree for investment type choice. Internal nodes test attributes (Income > 20K€, Age > 60, Married?) with yes/no branches; the leaves assign a risk class (low, mid, or high risk).]
Classification techniques – Decision Trees (2)

How are the attributes in decision trees selected? Two well-known indexes are used:
• Information gain: selects the attribute that is most informative in distinguishing the items between the classes
  • It is biased towards attributes with a large set of values
• Gain ratio: addresses this limitation of information gain
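As an illustration of how information gain could be computed, here is a from-scratch Python sketch (the toy labels and the two-branch split below are invented for illustration):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(labels, partitions):
    """Entropy of the whole set minus the weighted entropy after a split."""
    n = len(labels)
    remainder = sum(len(p) / n * entropy(p) for p in partitions)
    return entropy(labels) - remainder

# Hypothetical attribute that splits 10 items into two fairly pure branches:
labels = ["yes"] * 5 + ["no"] * 5
partitions = [["yes", "yes", "yes", "yes", "no"], ["yes", "no", "no", "no", "no"]]
print(information_gain(labels, partitions))  # ~0.28 bits
```

The attribute with the highest gain becomes the next split node; gain ratio divides this value by the entropy of the split itself, penalizing attributes with many values.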
Classification techniques – Bayesian classifiers (2)

• Bayesian classification
  • A statistical classification technique
  • Predicts class membership probabilities
  • Founded on Bayes' theorem (below)
  • What if X = “Red and rounded” and H = “Apple”?
• Performance
  • The simplest implementation (Naïve Bayes) is comparable to decision trees and neural networks
• Incremental
  • Each training example can increase/decrease the probability that a hypothesis is correct
P(H|X) = P(X|H) · P(H) / P(X)
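A small numeric sketch of the theorem applied to the slide's question (all probabilities below are invented purely for illustration):

```python
# Bayes' theorem: P(H|X) = P(X|H) * P(H) / P(X)
# H = "the fruit is an apple", X = "the fruit is red and rounded".
p_h = 0.30          # prior: 30% of the fruits are apples (made-up value)
p_x_given_h = 0.80  # 80% of apples are red and rounded (made-up value)
p_x = 0.40          # 40% of all fruits are red and rounded (made-up value)

p_h_given_x = p_x_given_h * p_h / p_x
print(p_h_given_x)  # 0.6 -> observing X raises P(apple) from 0.30 to 0.60
```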
Classification techniques – Support Vector Machines

• One of the most advanced classification techniques
• Left figure: a small margin between the classes is found
• Right figure: the largest margin is found
• Support vector machines (SVMs) identify the largest-margin separator, as in the right figure
Classification techniques – SVMs + Kernel Functions

• Is data always linearly separable? NO!
• Solution: SVMs + kernel functions

[Figure: a dataset that a plain SVM cannot split linearly, but which an SVM with a kernel function separates.]
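A minimal sketch of this contrast, assuming scikit-learn is available (the concentric-circles dataset is synthetic, much like the figure):

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two classes arranged as concentric circles: not linearly separable.
X, y = make_circles(n_samples=200, factor=0.4, noise=0.05, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)
kernel_svm = SVC(kernel="rbf").fit(X, y)  # the kernel implicitly maps to a higher-dimensional space

print("linear SVM accuracy:", linear_svm.score(X, y))      # near chance (~0.5)
print("RBF-kernel SVM accuracy:", kernel_svm.score(X, y))  # near 1.0
```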
Classification techniques – Lazy learning

• Lazy learning
  • Simply stores the training data (or does only minor processing) and waits until it is given a test tuple
  • Less time in training but more time in predicting
  • Uses a richer hypothesis space (many local linear functions), and hence the accuracy can be higher
• Instance-based learning
  • A subcategory of lazy learning
  • Stores training examples and delays the processing (“lazy evaluation”) until a new instance must be classified
  • An example: the k-nearest neighbor approach
Classification techniques – k-nearest neighbor

• All instances correspond to points in the n-dimensional space; x is the instance to be classified
• The nearest neighbors are defined in terms of Euclidean distance, dist(X1, X2)
• For discrete-valued targets, k-NN returns the most common value among the k training examples nearest to x

[Figure: which class should the green circle belong to? It depends on k: with k = 3 it is classified as red, with k = 5 as blue.]
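A from-scratch sketch of the k-NN decision rule just described (the 2-D training points are invented for illustration):

```python
from collections import Counter
from math import dist  # Euclidean distance (Python 3.8+)

def knn_classify(x, training, k):
    """Return the most common label among the k training points nearest to x."""
    nearest = sorted(training, key=lambda pair: dist(pair[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Toy training set: (point, label) pairs.
training = [((0, 0), "red"), ((1, 0), "red"), ((0, 1), "red"),
            ((3, 3), "blue"), ((4, 3), "blue"), ((3, 4), "blue"), ((4, 4), "blue")]

print(knn_classify((1, 1), training, k=3))      # red
print(knn_classify((2.6, 2.6), training, k=3))  # blue
```

As the figure suggests, the choice of k can flip the decision for points near the class boundary.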
Prediction techniques – An overview

• Prediction is different from classification
  • Classification predicts a categorical class label
  • Prediction models continuous-valued functions
• Major method for prediction: regression
  • Models the relationship between one or more independent (predictor) variables and a dependent (response) variable
• Regression analysis
  • Linear and multiple regression
  • Non-linear regression
  • Other regression methods: generalized linear model, Poisson regression, log-linear models, regression trees
  • No details here
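Although the slides give no details, a one-variable least-squares fit is short enough to sketch here (the data points are made up for illustration):

```python
# Simple linear regression: fit y ≈ a*x + b by least squares.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]  # made-up points lying close to y = 2x

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
a = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
     / sum((x - mean_x) ** 2 for x in xs))
b = mean_y - a * mean_x

print(f"y = {a:.2f}*x + {b:.2f}")          # roughly y = 1.99*x + 0.09
print("prediction for x = 6:", a * 6 + b)  # a continuous value, not a class label
```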
What is cluster analysis?

• Cluster: a collection of data objects
  • Similar to one another within the same cluster
  • Dissimilar to the objects in other clusters
• Cluster analysis
  • Finding similarities between data according to the characteristics found in the data, and grouping similar data objects into clusters
  • It belongs to unsupervised learning
• Typical applications
  • As a stand-alone tool to get insight into the data distribution
  • As a preprocessing step for other algorithms (day 1 slides)
Examples of cluster analysis

• Marketing: help marketers discover distinct groups in their customer bases
• Land use: identification of areas of similar land use in an earth observation database
• Insurance: identifying groups of motor insurance policy holders with a high average claim cost
• City planning: identifying groups of houses according to their house type, value, and geographical location
Good clustering

• A good clustering method will produce high-quality clusters with
  • high intra-class similarity
  • low inter-class similarity
• Dissimilarity/similarity metric: similarity is expressed in terms of a distance function, typically a metric d(i, j)
  • The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal, ratio, and vector variables
  • It is hard to define “similar enough” or “good enough”
A small example: how to cluster this data?

This process is not easy in practice. Why?
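One classic answer for data like this is k-means, sketched here from scratch (the 2-D points are invented for illustration, and k-means is only one of many clustering methods; real data rarely separates this cleanly, which is part of why clustering is hard in practice):

```python
from math import dist
from statistics import mean

def kmeans(points, k, iters=20):
    """Plain k-means: assign each point to the nearest centroid, then recompute centroids."""
    centroids = points[:k]  # naive initialization: the first k points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [(mean(x for x, _ in c), mean(y for _, y in c)) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return clusters

# Two visually separated toy groups:
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
for cluster in kmeans(points, k=2):
    print(cluster)  # [(1, 1), (1, 2), (2, 1)] and [(8, 8), (8, 9), (9, 8)]
```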
Visualization of the results
Presentation of the results or knowledge obtained from data mining in visual forms.

Examples:
• Scatter plots
• Association rules
• Decision trees
• Clusters