Data Mining – Day 2
Fabiano Dalpiaz
Department of Information and
Communication Technology
University of Trento - Italy
http://www.dit.unitn.it/~dalpiaz
Databases and Business Intelligence
Academic Year 2007-2008
© P. Giorgini, F. Dalpiaz
Knowledge Discovery (KDD) Process
[Figure: the KDD process pipeline: Databases → Data Cleaning and Data Integration → Data Warehouse → Selection of task-relevant data → Data Mining → Pattern Evaluation. The steps up to the Data Warehouse were presented yesterday; today covers Data Mining and Pattern Evaluation.]
Outline
• Data Mining techniques
  • Frequent patterns, association rules: support and confidence
  • Classification and prediction: decision trees, Bayesian classifiers, Support Vector Machines, lazy learning
  • Cluster Analysis
• Visualization of the results
• Summary
Frequent pattern analysis: what is it?
• Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
• Frequent pattern analysis: searching for frequent patterns
• Motivation: finding inherent regularities in data
  • Which products are bought together? (yesterday's wine and spaghetti example)
  • What are the subsequent purchases after buying a PC?
  • Can we automatically classify web documents?
• Applications: basket data analysis, cross-marketing, catalog design, sale campaign analysis
Basic Concepts: Frequent Patterns and Association Rules (1)
Transaction-id Items bought
1 Wine, Bread, Spaghetti
2 Wine, Cocoa, Spaghetti
3 Wine, Spaghetti, Cheese
4 Bread, Cheese, Sugar
5 Bread, Cocoa, Spaghetti, Cheese, Sugar
Itemsets (= transactions in this example)

Goal: find all rules of type X ⇒ Y between items in an itemset, with minimum:
• Support s: probability that an itemset contains X ∪ Y
• Confidence c: conditional probability that an itemset containing X also contains Y
Basic Concepts: Frequent Patterns and Association Rules (2)
Transaction-id Items bought
1 Wine, Bread, Spaghetti
2 Wine, Cocoa, Spaghetti
3 Wine, Spaghetti, Cheese
4 Bread, Cheese, Sugar
5 Bread, Cocoa, Spaghetti, Cheese, Sugar
Suppose: support s = 50%, confidence c = 50%

Support is used to define frequent patterns (sets of products contained in at least s% of the itemsets):
• {Wine} in itemsets 1, 2, 3 (support = 60%)
• {Bread} in itemsets 1, 4, 5 (support = 60%)
• {Spaghetti} in itemsets 1, 2, 3, 5 (support = 80%)
• {Cheese} in itemsets 3, 4, 5 (support = 60%)
• {Wine, Spaghetti} in itemsets 1, 2, 3 (support = 60%)
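To make the definition concrete, here is a minimal Python sketch (the transaction data comes from the table above; the `support` helper is our own illustration, not part of the slides):

```python
# The five example transactions from the table above.
transactions = [
    {"Wine", "Bread", "Spaghetti"},
    {"Wine", "Cocoa", "Spaghetti"},
    {"Wine", "Spaghetti", "Cheese"},
    {"Bread", "Cheese", "Sugar"},
    {"Bread", "Cocoa", "Spaghetti", "Cheese", "Sugar"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

print(support({"Spaghetti"}, transactions))          # 0.8
print(support({"Wine", "Spaghetti"}, transactions))  # 0.6
```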
Basic Concepts: Frequent Patterns and Association Rules (3)
Transaction-id Items bought
1 Wine, Bread, Spaghetti
2 Wine, Cocoa, Spaghetti
3 Wine, Spaghetti, Cheese
4 Bread, Cheese, Sugar
5 Bread, Cocoa, Spaghetti, Cheese, Sugar
Suppose: support s = 50%, confidence c = 50%

Confidence defines association rules: X ⇒ Y rules in frequent patterns whose confidence is greater than c.
Suggestion: {Wine, Spaghetti} is the only frequent pattern to be considered. Why?

Association rules:
• Wine ⇒ Spaghetti (support = 60%, confidence = 100%)
• Spaghetti ⇒ Wine (support = 60%, confidence = 75%)
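A companion sketch, reusing the `transactions` list and `support` helper from the previous example: the confidence of X ⇒ Y is support(X ∪ Y) divided by support(X).

```python
def confidence(X, Y, transactions):
    """Conditional probability that a transaction containing X also contains Y."""
    return support(X | Y, transactions) / support(X, transactions)

print(confidence({"Wine"}, {"Spaghetti"}, transactions))  # 1.0  -> Wine => Spaghetti
print(confidence({"Spaghetti"}, {"Wine"}, transactions))  # 0.75 -> Spaghetti => Wine
```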
Advanced concepts in Association Rules discovery
• Algorithms must face scalability problems
  • Apriori property: if an itemset is infrequent, its supersets should not be generated/tested!
• Advanced problems
  • Boolean vs. quantitative associations
    age(x, “30..39”) and income(x, “42..48K”) ⇒ buys(x, “car”) [s=1%, c=75%]
  • Single-level vs. multiple-level analysis
    What brands of wine are associated with what brands of spaghetti?

Are support and confidence clear?
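The Apriori idea can be illustrated with a short Python sketch, a simplified and unoptimized illustration that reuses the `transactions` and `support` helpers defined earlier: candidate (k+1)-itemsets are built only from frequent k-itemsets, so no superset of an infrequent itemset is ever tested.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Simplified level-wise Apriori with superset pruning."""
    items = sorted(set().union(*transactions))
    # Level 1: frequent single items.
    frequent = [frozenset([i]) for i in items
                if support({i}, transactions) >= min_support]
    result = list(frequent)
    k = 1
    while frequent:
        # Join frequent k-itemsets into candidate (k+1)-itemsets; a candidate
        # survives only if all its k-subsets are frequent (Apriori property).
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k + 1}
        frequent = [c for c in candidates
                    if all(frozenset(s) in result for s in combinations(c, k))
                    and support(c, transactions) >= min_support]
        result.extend(frequent)
        k += 1
    return result

print(apriori(transactions, 0.5))
# Singles {Wine}, {Bread}, {Spaghetti}, {Cheese} plus the pair {Wine, Spaghetti}
```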
Another example for association rules
Transaction-id Items bought
1 Margherita, Beer, Coke
2 Margherita, Beer
3 Quattro stagioni, Coke
4 Margherita, Coke
Frequent itemsets:
• {Margherita} = 75%
• {Beer} = 50%
• {Coke} = 75%
• {Margherita, Beer} = 50%
• {Margherita, Coke} = 50%

Support s = 40%, confidence c = 70%

Association rules:
• Beer ⇒ Margherita [s=50%, c=100%]
Classification vs. Prediction
• Classification
  • Characterizes (describes) a set of items belonging to a training set; these items are already classified according to a label attribute
  • The characterization is a model
  • The model can be applied to classify new data (predict the class they should belong to)
• Prediction
  • Models continuous-valued functions, i.e., predicts unknown or missing values
• Applications
  • Credit approval, target marketing, fraud detection
Classification: the process
1. Model construction
  • The class label attribute defines the class each item should belong to
  • The set of items used for model construction is called the training set
  • The model is represented as classification rules, decision trees, or mathematical formulae
2. Model usage
  • Estimate the accuracy of the model
    • On the training set
    • On a separate test set, to measure how well the model generalizes
  • If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known
Classification: the process – Model construction

Training data:
NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

The classification algorithm is run on the training data and produces the classifier (model):
IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
Classification: the process – Model usage

Testing data:
NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

The classifier
IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
is evaluated on the testing data and can then be applied to unseen data, e.g. (Jeff, Professor, 4): Tenured?
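Since the model here is a single rule, its usage can be shown directly in Python (a toy sketch; the names come from the slide):

```python
def tenured(rank, years):
    """The learned model: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    return "yes" if rank == "Professor" or years > 6 else "no"

# The unseen tuple from the slide:
print(tenured("Professor", 4))  # yes -> Jeff is predicted to be tenured
```

Note that on the testing data the rule misclassifies Merlisa (years = 7 but not tenured), which is exactly the kind of error that accuracy estimation on a test set is meant to reveal.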
Supervised vs. Unsupervised Learning
• Supervised learning (classification)
  • Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  • New data is classified based on the training set
• Unsupervised learning (clustering)
  • The class labels of the training data are unknown
  • Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
Evaluating generated models
• Accuracy
  • Classifier accuracy: predicting class labels
  • Predictor accuracy: guessing values of predicted attributes
• Speed
  • Time to construct the model (training time)
  • Time to use the model (classification/prediction time)
• Robustness: handling noise and missing values
• Scalability: efficiency in disk-resident databases
• Interpretability: understanding and insight provided by the model
Classification techniques – Decision Trees (1)

[Figure: a decision tree for investment type choice. Internal nodes test attributes (Income > 20K€, Age > 60, Married?) with yes/no branches; the leaves assign a risk class (low, mid, or high risk).]
Classification techniques – Decision Trees (2)

How are the attributes in decision trees selected? Two well-known indexes are used:
• Information gain: selects the attribute that is most informative in distinguishing the items between the classes
  • It is biased towards attributes with a large set of values
• Gain ratio: addresses this limitation of information gain
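As an illustration of how information gain could be computed, here is a from-scratch Python sketch (the toy labels and the two-branch split below are invented for illustration):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(labels, partitions):
    """Entropy of the whole set minus the weighted entropy after a split."""
    n = len(labels)
    remainder = sum(len(p) / n * entropy(p) for p in partitions)
    return entropy(labels) - remainder

# Hypothetical attribute that splits 10 items into two fairly pure branches:
labels = ["yes"] * 5 + ["no"] * 5
partitions = [["yes", "yes", "yes", "yes", "no"], ["yes", "no", "no", "no", "no"]]
print(information_gain(labels, partitions))  # ~0.28 bits
```

The attribute with the highest gain becomes the next split node; gain ratio divides this value by the entropy of the split itself, penalizing attributes with many values.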
Classification techniques – Bayesian classifiers (2)

• Bayesian classification
  • A statistical classification technique
  • Predicts class membership probabilities
  • Founded on Bayes' theorem (below)
  • What if X = “Red and rounded” and H = “Apple”?
• Performance
  • The simplest implementation (Naïve Bayes) is comparable to decision trees and neural networks
• Incremental
  • Each training example can increase/decrease the probability that a hypothesis is correct
P(H|X) = P(X|H) · P(H) / P(X)
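A small numeric sketch of the theorem applied to the slide's question (all probabilities below are invented purely for illustration):

```python
# Bayes' theorem: P(H|X) = P(X|H) * P(H) / P(X)
# H = "the fruit is an apple", X = "the fruit is red and rounded".
p_h = 0.30          # prior: 30% of the fruits are apples (made-up value)
p_x_given_h = 0.80  # 80% of apples are red and rounded (made-up value)
p_x = 0.40          # 40% of all fruits are red and rounded (made-up value)

p_h_given_x = p_x_given_h * p_h / p_x
print(p_h_given_x)  # 0.6 -> observing X raises P(apple) from 0.30 to 0.60
```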
Classification techniques – Support Vector Machines

• One of the most advanced classification techniques
• Left figure: a small margin between the classes is found
• Right figure: the largest margin is found
• Support vector machines (SVMs) identify the largest-margin separator, as in the right figure
Classification techniques – SVMs + Kernel Functions

• Is data always linearly separable? NO!
• Solution: SVMs + kernel functions

[Figure: a dataset that a plain SVM cannot split linearly, but which an SVM with a kernel function separates.]
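A minimal sketch of this contrast, assuming scikit-learn is available (the concentric-circles dataset is synthetic, much like the figure):

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two classes arranged as concentric circles: not linearly separable.
X, y = make_circles(n_samples=200, factor=0.4, noise=0.05, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)
kernel_svm = SVC(kernel="rbf").fit(X, y)  # the kernel implicitly maps to a higher-dimensional space

print("linear SVM accuracy:", linear_svm.score(X, y))      # near chance (~0.5)
print("RBF-kernel SVM accuracy:", kernel_svm.score(X, y))  # near 1.0
```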
Classification techniques – Lazy learning

• Lazy learning
  • Simply stores the training data (or does only minor processing) and waits until it is given a test tuple
  • Less time in training but more time in predicting
  • Uses a richer hypothesis space (many local linear functions), and hence the accuracy can be higher
• Instance-based learning
  • A subcategory of lazy learning
  • Stores training examples and delays the processing (“lazy evaluation”) until a new instance must be classified
  • An example: the k-nearest neighbor approach
Classification techniques – k-nearest neighbor

• All instances correspond to points in the n-dimensional space; x is the instance to be classified
• The nearest neighbors are defined in terms of Euclidean distance, dist(X1, X2)
• For discrete-valued targets, k-NN returns the most common value among the k training examples nearest to x

[Figure: which class should the green circle belong to? It depends on k: with k = 3 it is classified as red, with k = 5 as blue.]
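A from-scratch sketch of the k-NN decision rule just described (the 2-D training points are invented for illustration):

```python
from collections import Counter
from math import dist  # Euclidean distance (Python 3.8+)

def knn_classify(x, training, k):
    """Return the most common label among the k training points nearest to x."""
    nearest = sorted(training, key=lambda pair: dist(pair[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Toy training set: (point, label) pairs.
training = [((0, 0), "red"), ((1, 0), "red"), ((0, 1), "red"),
            ((3, 3), "blue"), ((4, 3), "blue"), ((3, 4), "blue"), ((4, 4), "blue")]

print(knn_classify((1, 1), training, k=3))      # red
print(knn_classify((2.6, 2.6), training, k=3))  # blue
```

As the figure suggests, the choice of k can flip the decision for points near the class boundary.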
Prediction techniques – An overview

• Prediction is different from classification
  • Classification predicts a categorical class label
  • Prediction models continuous-valued functions
• Major method for prediction: regression
  • Models the relationship between one or more independent (predictor) variables and a dependent (response) variable
• Regression analysis
  • Linear and multiple regression
  • Non-linear regression
  • Other regression methods: generalized linear model, Poisson regression, log-linear models, regression trees
  • No details here
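Although the slides give no details, a one-variable least-squares fit is short enough to sketch here (the data points are made up for illustration):

```python
# Simple linear regression: fit y ≈ a*x + b by least squares.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]  # made-up points lying close to y = 2x

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
a = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
     / sum((x - mean_x) ** 2 for x in xs))
b = mean_y - a * mean_x

print(f"y = {a:.2f}*x + {b:.2f}")          # roughly y = 1.99*x + 0.09
print("prediction for x = 6:", a * 6 + b)  # a continuous value, not a class label
```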
What is cluster analysis?

• Cluster: a collection of data objects
  • Similar to one another within the same cluster
  • Dissimilar to the objects in other clusters
• Cluster analysis
  • Finding similarities between data according to the characteristics found in the data, and grouping similar data objects into clusters
  • It belongs to unsupervised learning
• Typical applications
  • As a stand-alone tool to get insight into the data distribution
  • As a preprocessing step for other algorithms (day 1 slides)
Examples of cluster analysis

• Marketing: help marketers discover distinct groups in their customer bases
• Land use: identification of areas of similar land use in an earth observation database
• Insurance: identifying groups of motor insurance policy holders with a high average claim cost
• City planning: identifying groups of houses according to their house type, value, and geographical location
Good clustering

• A good clustering method will produce high-quality clusters with
  • high intra-class similarity
  • low inter-class similarity
• Dissimilarity/similarity metric: similarity is expressed in terms of a distance function, typically a metric d(i, j)
  • The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal, ratio, and vector variables
  • It is hard to define “similar enough” or “good enough”
A small example: how to cluster this data?

This process is not easy in practice. Why?
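One classic answer for data like this is k-means, sketched here from scratch (the 2-D points are invented for illustration, and k-means is only one of many clustering methods; real data rarely separates this cleanly, which is part of why clustering is hard in practice):

```python
from math import dist
from statistics import mean

def kmeans(points, k, iters=20):
    """Plain k-means: assign each point to the nearest centroid, then recompute centroids."""
    centroids = points[:k]  # naive initialization: the first k points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [(mean(x for x, _ in c), mean(y for _, y in c)) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return clusters

# Two visually separated toy groups:
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
for cluster in kmeans(points, k=2):
    print(cluster)  # [(1, 1), (1, 2), (2, 1)] and [(8, 8), (8, 9), (9, 8)]
```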
Visualization of the results
Presentation of the results or knowledge obtained from data mining in visual forms.

Examples:
• Scatter plots
• Association rules
• Decision trees
• Clusters