Data Mining and WEKA Fabiano Dalpiaz Dipartimento di Ingegneria dei Sistemi e dell’Informazione Università di Trento - Italy http://www.disi.unitn.it/~dalpiaz Database e Business Intelligence A.A. 2009-2010 © P. Giorgini, F. Dalpiaz


Data Mining and WEKA

Fabiano Dalpiaz

Dipartimento di Ingegneria dei Sistemi e dell’Informazione

Università di Trento - Italy

http://www.disi.unitn.it/~dalpiaz

Database e Business Intelligence

A.A. 2009-2010

© P. Giorgini, F. Dalpiaz


Acknowledgements

This presentation is partially based on the slides for the book:

Data Mining: Concepts and Techniques, 2nd ed., Jiawei Han and Micheline Kamber


Outline

1. Data Mining and KDD

2. Applied Data Mining

3. WEKA: A tool for Data Mining

4. German credit: a case study

5. Data Preprocessing

6. Data Mining techniques

7. Summary


1. Data Mining and KDD


Looking for knowledge

The Explosive Growth of Data

The World Wide Web

Business: e-commerce, transactions, stocks, …

Science: Remote sensing, bioinformatics, scientific simulation

Society and everyone: news, digital cameras, YouTube, forums, blogs, Google & Co

We are drowning in data, but starving for knowledge!

Avoid data tombs

Data mining: “Automated analysis of massive data sets”.


What is Data Mining?

Data mining (knowledge discovery from data)
• Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data

Alternative names
• Knowledge discovery in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, etc.

Questions: Are simple search engines data mining? Are queries data mining?


Knowledge Discovery (KDD) Process

Figure: the KDD process as a pipeline
Data sources → Data Cleaning / Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation


Data Mining and Business Intelligence

Figure: pyramid of layers, from top (highest potential support to business decisions) to bottom (largest quantity of data)
• Decision Making
• Data Presentation: Visualization Techniques
• Data Mining: Information Discovery
• Data Exploration: Statistical Summary, Querying, and Reporting
• Data Preprocessing/Integration, Data Warehouses
• Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems

Roles, from top to bottom: End User, Business Analyst, Data Analyst, DBA


Data Mining: confluence of multiple disciplines

Figure: Data Mining at the center, surrounded by Database Technology, Statistics, Machine Learning, Pattern Recognition, Algorithms, Visualization, and Other Disciplines


Why Data Mining?

Tremendous amount of data
• Walmart: customer buying patterns; a data warehouse 7.5 terabytes large in 1995
• VISA: detecting credit card interoperability issues; 6800 payment transactions per second

High dimensionality of data
• Many dimensions to be combined together
• Data cube example: time, location, product sales

High complexity of data
• Time-series data, temporal data, sequence data
• Spatial, spatiotemporal, multimedia, text and Web data


2. Applied Data Mining


Market Analysis and Management

Data sources
• Credit card transactions, loyalty cards, smart cards, discount coupons, ...

Target marketing
• Find clusters of “model” customers who share the same characteristics:
  • Geographics (lives in Rome, lives in Trentino)
  • Demographics (married, aged 21-35, at least one child, family income above 40.000€/year)
  • Psychographics (likes new products, consistently uses the Web)
  • Behaviors (searches for information on the Internet, always defends her decisions)
• Determine customer purchasing patterns over time


Market Analysis and Management

Cross-market analysis
• Find associations between product sales, and predict based on such associations
• Compare the sales in the US and in Italy, find associations among old products, and predict whether new ones will succeed

Customer profiling
• What types of customers buy what products
• Customers aged 20-30 with income > 20K€ will buy product A

Customer requirement analysis
• Identify the best products for different groups of customers
• Predict what factors will attract new customers


Corporate Analysis

Finance planning and asset evaluation
• Cash flow prediction and analysis
• Cross-sectional and time-series analysis (financial ratio, trend analysis)

Resource planning
• Summarize and compare the resources and spending

Competition
• Monitor competitors and market directions
• Group customers into classes and define a class-based pricing procedure
• Set the pricing strategy in a highly competitive market


3. WEKA: a tool for Data Mining


What is WEKA?

WEKA = Waikato Environment for Knowledge Analysis
• University of Waikato, New Zealand
• Accompanies the book “Data Mining” by Witten & Frank

Main features:
• Complete set of tools for data preprocessing, learning, and evaluation
• Graphical user interfaces
• Environment to compare different algorithms

http://www.cs.waikato.ac.nz/ml/weka/


WEKA supported file formats
• ARFF is WEKA's native format
• CSV, C4.5, binary
• Pre-processing might be required

Data can also be read
• From URLs
• From an SQL database (via JDBC)


4. German credit: a case study


A real case study: German credit

Go to: http://disi.unitn.it/~dalpiaz

2 versions: noisy and clean

1000 instances, 20 attributes; class: approved vs. not approved

@relation german_credit
@attribute checking_status { '<0', '0<=X<200', '>=200', 'no checking'}
@attribute duration real
@attribute credit_history { 'no credits/all paid', 'all paid', 'existing paid', 'delayed previously', 'critical/other existing credit'}
@attribute purpose { 'new car', 'used car', furniture/equipment, radio/tv, 'domestic appliance', repairs, education, vacation, retraining, business, other}
@attribute credit_amount real
...
@data
'<0',6,'critical/other existing credit',radio/tv,1169,'no known savings','>=7',4,'male single',none,4,'real estate',67,none,own,2,skilled,1,yes,yes,good
'0<=X<200',48,'existing paid',radio/tv,5951,'<100','1<=X<4',2,'female
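The header excerpt above follows the usual ARFF layout: one `@relation` line, one `@attribute name type` line per attribute, then `@data`. As a rough illustration of how such a header can be read, here is a minimal Python sketch. `parse_arff_header` is a hypothetical helper written for this example (it is not part of WEKA), and it assumes one declaration per line and attribute names without spaces.

```python
import csv

def parse_arff_header(text):
    """Return (relation, attributes) from an ARFF header.

    attributes is a list of (name, type), where type is either a string
    such as 'real' or a list of nominal values.
    """
    relation, attributes = None, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):   # skip blanks and comments
            continue
        lower = line.lower()
        if lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, spec = line.split(None, 2)
            if spec.startswith("{"):
                inner = spec.strip("{} ")
                # csv copes with quoted values that contain commas or spaces
                values = next(csv.reader([inner], quotechar="'",
                                         skipinitialspace=True))
                attributes.append((name, [v.strip() for v in values]))
            else:
                attributes.append((name, spec))
        elif lower.startswith("@data"):
            break                               # header ends here
    return relation, attributes

header = """@relation german_credit
@attribute checking_status { '<0', '0<=X<200', '>=200', 'no checking'}
@attribute duration real
@data
"""
relation, attrs = parse_arff_header(header)
```

Here `attrs` holds one nominal attribute with four values and one numeric attribute; the data rows after `@data` are deliberately left unparsed.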


German credit - attributes

A1: Status of existing checking account
• < 0 DM
• 0 <= X < 200 DM
• >= 200 DM
• no checking account

A2: Duration of the requested credit (in months)

A3: Credit history
• no credits taken / all credits paid back duly
• all credits at this bank paid back duly
• existing credits paid back duly till now
• delay in paying off in the past
• critical account / other credits existing (not at this bank)


German credit - attributes

A4: Purpose
• car (new)
• car (used)
• furniture/equipment
• radio/television
• domestic appliances
• repairs
• education
• vacation
• retraining
• business
• other


German credit - attributes

A5: Credit amount

A6: Savings account/bonds
• X < 100 DM
• 100 <= X < 500 DM
• 500 <= X < 1000 DM
• X >= 1000 DM
• unknown / no savings account

A7: Present employment since
• unemployed
• X < 1 year
• 1 <= X < 4 years
• 4 <= X < 7 years
• X >= 7 years


German credit - attributes

A8: Installment rate in percentage of disposable income

A9: Personal status and sex
• male: divorced/separated
• female: divorced/separated/married
• male: single
• male: married/widowed
• female: single

A10: Other debtors / guarantors
• none
• co-applicant
• guarantor


German credit - attributes

A11: Present residence since

A12: Property (the most relevant)
• real estate
• building society savings agreement / life insurance
• car or other
• no property

A13: Age

A14: Other installment plans
• bank
• stores
• none


German credit - attributes

A15: Housing
• rent
• own
• for free

A16: Number of existing credits at this bank

A17: Job
• unemployed / unskilled, non-resident
• unskilled, resident
• skilled employee / official
• management / self-employed / highly qualified employee / officer


German credit - attributes

A18: Number of people being liable to provide maintenance for

A19: Telephone
• none
• yes

A20: Foreign worker
• yes
• no

When you find this symbol, use WEKA!

HANDS ON

So far: a simple KDD cycle

Figure: Raw data → (pre-processing) → Pre-processed data → (data mining) → Information

Tool support: e.g., WEKA


5. Data Preprocessing

HANDS ON: open the German credit data set and visualize it with the Explorer


Why Data Preprocessing?

Data in the real world is dirty
• incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  • e.g., occupation=“ ”, birthdate=“31/12/2099”
• noisy: containing errors or outliers
  • e.g., Salary=“-10”
• inconsistent: containing discrepancies in codes or names
  • e.g., Age=“42” Birthday=“03/07/1997” (we are in 2009!!)
  • e.g., Was rating “1,2,3”, now rating “A, B, C”
  • e.g., discrepancy between duplicate records: in one copy of the data customer A has to pay 200.000€, in the second copy A does not have to pay anything
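The three kinds of dirt listed above can each be checked mechanically. The following Python sketch applies one check per kind to a single record. The field names, the `find_problems` helper, and the reading of “03/07/1997” as 3 July 1997 are assumptions made for illustration only.

```python
from datetime import date

def find_problems(record, today):
    """Return a list of data-quality problems found in one record (a dict)."""
    problems = []
    # Incomplete: empty or whitespace-only attribute values.
    for key, value in record.items():
        if isinstance(value, str) and not value.strip():
            problems.append(f"incomplete: {key} is empty")
    # Noisy: a value outside its plausible range.
    salary = record.get("salary")
    if isinstance(salary, (int, float)) and salary < 0:
        problems.append("noisy: negative salary")
    # Inconsistent: the stated age does not match the birthday.
    age, birthday = record.get("age"), record.get("birthday")
    if isinstance(age, int) and isinstance(birthday, date):
        had_birthday = (today.month, today.day) >= (birthday.month, birthday.day)
        expected = today.year - birthday.year - (0 if had_birthday else 1)
        if age != expected:
            problems.append(f"inconsistent: age {age}, expected {expected}")
    return problems

record = {"occupation": " ", "salary": -10, "age": 42,
          "birthday": date(1997, 7, 3)}
problems = find_problems(record, today=date(2009, 12, 1))
```

For this record the function reports one problem of each kind. Real checks would of course depend on the schema and on domain knowledge about plausible ranges.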


Why is data dirty?

Incomplete data may come from
• “Not applicable” data value when collected
• Different considerations between the time when the data was collected and when it is analyzed
• Human/hardware/software problems

Noisy data (incorrect values) may come from
• Faulty data collection instruments
• Human or computer error at data entry
• Errors in data transmission

Inconsistent data may come from
• Different data sources
• Functional dependency violation (e.g., modify some linked data)


Data cleaning – missing values

“Data cleaning is one of the three biggest problems in data warehousing” (Ralph Kimball)

Fill in missing values
• Example: Name=“John”, Occupation=“Lawyer”, Age=“28”, Salary=“”
• Ignore the record (is it always feasible?)
• Manually fill in the missing attributes
• Automatically insert a constant
• Automatically insert the mean value (relative to the record class)
• Most probable value: inference!
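To make the “mean value relative to the record class” option concrete, here is a small Python sketch. It uses the occupation as a stand-in for the class label; apart from John's record, the people and salaries are invented for the example, and `fill_with_class_mean` is a hypothetical helper.

```python
from collections import defaultdict

def fill_with_class_mean(records, attribute, class_attribute):
    """Replace None in `attribute` with the mean of that attribute
    over the records that share the same class label."""
    totals = defaultdict(lambda: [0.0, 0])  # label -> [sum, count]
    for r in records:
        if r[attribute] is not None:
            totals[r[class_attribute]][0] += r[attribute]
            totals[r[class_attribute]][1] += 1
    filled = []
    for r in records:
        r = dict(r)  # copy so the input is left untouched
        if r[attribute] is None:
            s, n = totals[r[class_attribute]]
            if n:  # leave the gap if the class has no known values
                r[attribute] = s / n
        filled.append(r)
    return filled

people = [
    {"name": "John", "occupation": "Lawyer", "salary": None},
    {"name": "Ann", "occupation": "Lawyer", "salary": 60000},
    {"name": "Bob", "occupation": "Lawyer", "salary": 80000},
    {"name": "Eve", "occupation": "Clerk", "salary": 30000},
]
filled = fill_with_class_mean(people, "salary", "occupation")
```

John's salary is filled with the mean of the other lawyers only; Eve's salary does not influence it, which is the point of conditioning on the class.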


Missing values in WEKA

Manually insert missing values (“Edit” button)

HANDS ON: attribute “Purpose”


Missing values in WEKA

ReplaceMissingValues filter: mean / mode value

HANDS ON: attribute “Purpose”


Data Integration

Data Integration combines data from multiple sources into a coherent store

Schema integration
• Integrate metadata from different sources
• A.cust-id ≡ B.cust-number

Entity identification problem
• Identify real world entities from multiple data sources, e.g., Bill Clinton = William Clinton

Detecting and resolving data value conflicts
• For the same real world entity, attribute values from different sources are different (e.g., cm vs. inch)

Figure: sources D1, D2, D3 integrated into a single store D1,2,3


Data Integration

Data integration can lead to redundant attributes
• Same object (A.house = B.residence)
• Derived attributes (A.annualIncome = B.salary + C.rentalIncome)

Redundant attributes can be discovered via correlation analysis
• A mathematical method for detecting the correlation between two attributes
• Correlation coefficient (Pearson’s product moment coefficient): the closer its absolute value is to 1, the stronger the correlation between the attributes
• Χ2 (chi-square) test
• No details on these methods here
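For readers who want to try correlation analysis on a pair of numeric attributes, the sketch below computes Pearson's coefficient as the sum of products of deviations from the means, divided by the square root of the product of the two sums of squared deviations. The salary and rental figures are invented, and the attribute names simply echo the derived-attribute example above.

```python
import math

def pearson(a, b):
    """Pearson product-moment correlation coefficient of two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

# Hypothetical values: annual income is derived from salary plus rental income.
salary = [20, 30, 40, 50]
rental = [5, 5, 10, 10]
annual_income = [s + r for s, r in zip(salary, rental)]
r = pearson(salary, annual_income)
```

A value of `r` close to 1 (or to -1) suggests that one of the two attributes adds little information once the other is known. The function assumes neither attribute is constant, since a constant attribute would make the denominator zero.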


Data transformation

Aggregation
• Sum the sales of different branches (in different data sources) to compute the company sales

Generalization
• From integer attribute age to classes of age (children, adult, old)

Normalization: scaled to fall within a small, specified range
• Change the range from [-∞, +∞] to [-1, +1]
• {-13, -6, -3, 10, 100} → {-0.13, -0.06, -0.03, 0.1, 1}
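The numbers in the normalization example are consistent with dividing every value by the largest absolute value in the set (here 100). That reading is an assumption about how the slide's figures were obtained; under it, the transformation can be sketched in Python as follows.

```python
def scale_by_max_abs(values):
    """Scale values into [-1, +1] by dividing by the largest absolute value."""
    peak = max(abs(v) for v in values)
    if peak == 0:
        return [0.0 for _ in values]  # all zeros: nothing to scale
    return [v / peak for v in values]

scaled = scale_by_max_abs([-13, -6, -3, 10, 100])
```

This keeps the sign and the relative spacing of the values, but a single extreme value (such as 100 here) squeezes all the others towards zero.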


Data transformation in WEKA

Generalization
• Discretize filter->unsupervised->attribute

Normalization
• Normalize filter->unsupervised->attribute, between [0,1]

HANDS ON: attribute “Age”
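To see what the two transformations do to an attribute such as age, the following Python sketch implements min-max scaling to [0,1] and equal-width binning by hand. It only mirrors the idea behind the filters and is not WEKA's implementation; the ages and the choice of three bins are invented for the example.

```python
def min_max(values):
    """Rescale numeric values linearly so that min -> 0.0 and max -> 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def equal_width_bins(values, bins):
    """Assign each value the index (0..bins-1) of its equal-width interval."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / bins
    # The maximum falls on the upper edge, so clamp it into the last bin.
    return [min(int((v - lo) / width), bins - 1) for v in values]

ages = [19, 25, 33, 47, 75]
normalized = min_max(ages)
age_bins = equal_width_bins(ages, bins=3)
```

With equal-width intervals most of these ages land in the first bin, which illustrates why the number of bins (and sometimes equal-frequency binning instead) is worth experimenting with in the hands-on exercise.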


Data reduction

Data reduction
• Obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results
• Different reduction types (dimensionality, numerosity, discretization)

Dimensionality: attribute subset selection
• Example with a decision tree (left branches True, right False)
• Initial attribute set: {A1, A2, A3, A4, A5, A6}
• Figure: decision tree whose internal nodes test A4?, A1?, and A6?, with leaves Class 1 and Class 2
• Reduced attribute set: {A1, A4, A6}


Data reduction

Dimensionality: Principal Components Analysis
• Given N data vectors in n dimensions, find k ≤ n orthogonal vectors (principal components) that can best be used to represent the data
• Works for numeric data only
• Used when the number of dimensions is large


Attribute subset selection in WEKA

HANDS ON: play with the “Select attributes” tab


Data reduction

Numerosity: clustering
• Partition the data set into clusters based on similarity, and store only the cluster representation (e.g., centroid and diameter)

Figure: one panel with 2 clusters; one panel where sparse data leads to many clusters (not effective)


Clustering in WEKA

HANDS ON: use the “Cluster” tab


Data reduction

Numerosity: sampling
• Obtain a small sample s to represent the whole data set N
• Problem: how to select a representative sample
• Random sampling is not enough: representative samples should be preserved
• Stratified sampling: approximate the percentage of each class (or subpopulation of interest) in the overall database

Figure: random sampling (one region yields no samples) vs. stratified sampling
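The difference between the two strategies is easy to reproduce. The Python sketch below samples each class separately so that the class proportions of the full data set are approximately preserved; the 70/30 split between “good” and “bad” records, the `stratified_sample` helper, and the fixed seed are assumptions chosen for illustration.

```python
import random
from collections import defaultdict

def stratified_sample(records, label_of, fraction, seed=0):
    """Sample about `fraction` of the records from every class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for r in records:
        by_class[label_of(r)].append(r)
    sample = []
    for label, group in by_class.items():
        k = max(1, round(len(group) * fraction))  # keep at least one per class
        sample.extend(rng.sample(group, k))
    return sample

# Hypothetical data set: 70 "good" and 30 "bad" records.
data = [("good", i) for i in range(70)] + [("bad", i) for i in range(30)]
sample = stratified_sample(data, label_of=lambda r: r[0], fraction=0.1)
counts = {label: sum(1 for r in sample if r[0] == label)
          for label in ("good", "bad")}
```

The sample contains 7 “good” and 3 “bad” records regardless of the seed, whereas a plain random sample of 10 records could by chance contain very few or no “bad” ones.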


Sampling in WEKA

Random sampling

Stratified sampling

HANDS ON: reduce the numerosity of the data set


Discretization

Three types of attributes
• Nominal: values from an unordered set (color, profession)
• Ordinal: values from an ordered set (military or academic rank)
• Continuous: numbers (integer or real numbers)

Discretization
• Divide the range of a continuous attribute into intervals
• Reduces data size and its complexity
• Some data mining algorithms do not support continuous types, and in those cases discretization is mandatory
• WEKA: we already introduced the Discretize filter


6. Data Mining techniques


Frequent pattern analysis

What is it?
• Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
• Frequent pattern analysis: searching for frequent patterns

Motivation: finding inherent regularities in data
• Which products are bought together? (Yesterday’s wine and spaghetti example)
• What are the subsequent purchases after buying a PC?
• Can we automatically classify web documents?

Applications
• Basket data analysis
• Cross-marketing
• Catalog design
• Sale campaign analysis


Association rules (theory)

Transaction-id   Items bought
1                Wine, Bread, Spaghetti
2                Wine, Cocoa, Spaghetti
3                Wine, Spaghetti, Cheese
4                Bread, Cheese, Sugar
5                Bread, Cocoa, Spaghetti, Cheese, Sugar

Itemsets (= transactions in this example)

Goal: find all rules of type X → Y between items in an itemset with minimum:
• Support s: probability that an itemset contains X ∪ Y
• Confidence c: conditional probability that an itemset containing X also contains Y

Association rules:
• Wine → Spaghetti (support=60%, confidence=100%)
• Spaghetti → Wine (support=60%, confidence=75%)
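The support and confidence figures of the two rules can be recomputed directly from the five transactions. The Python sketch below does so with plain set operations; the function names are chosen for this example only.

```python
transactions = [
    {"Wine", "Bread", "Spaghetti"},
    {"Wine", "Cocoa", "Spaghetti"},
    {"Wine", "Spaghetti", "Cheese"},
    {"Bread", "Cheese", "Sugar"},
    {"Bread", "Cocoa", "Spaghetti", "Cheese", "Sugar"},
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(x, y):
    """Estimated P(Y | X): support of X and Y together over support of X."""
    return support(x | y) / support(x)

s = support({"Wine", "Spaghetti"})
c_wine_spaghetti = confidence({"Wine"}, {"Spaghetti"})
c_spaghetti_wine = confidence({"Spaghetti"}, {"Wine"})
```

Wine and Spaghetti appear together in 3 of 5 transactions (support 60%). Every transaction with Wine also has Spaghetti (confidence 100%), while only 3 of the 4 transactions with Spaghetti have Wine (confidence 75%, up to floating-point rounding).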


Association rules in WEKA

Apriori algorithm
• Does not work with numeric attributes

HANDS ON: use Apriori in the “Associate” tab


Classification and Prediction

Classification
• Characterizes (describes) a set of items belonging to a training set; these items are already classified according to a label attribute
• The characterization is a model
• The model can be applied to classify new data (predict the class they should belong to)
• Meta-algorithms are used to enhance results (e.g., cost matrix)

Prediction
• Models continuous-valued functions, i.e., predicts unknown or missing values


Classification: model construction

Figure: Training Data → Classification Algorithms → Classifier (Model)

Training Data
NAME   RANK             YEARS   TENURED
Mike   Assistant Prof   3       no
Mary   Assistant Prof   7       yes
Bill   Professor        2       yes
Jim    Associate Prof   7       yes
Dave   Assistant Prof   6       no
Anne   Associate Prof   3       no

Classifier (Model): IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’


Classification: model usage

Figure: the Classifier is applied to the Testing Data and then to Unseen Data

Testing Data
NAME      RANK             YEARS   TENURED
Tom       Assistant Prof   2       no
Merlisa   Associate Prof   7       no
George    Professor        5       yes
Joseph    Assistant Prof   7       yes

Unseen Data: (Jeff, Professor, 4) → Tenured?

Classifier: IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
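Model usage can be replayed in a few lines of Python: the learned rule is applied to each testing record, the predictions are compared with the known labels, and the rule is then applied to the unseen record. The `tenured` function is a direct transcription of the rule on the slide; everything else is scaffolding for the example.

```python
def tenured(rank, years):
    """The rule from the model: professor OR more than 6 years."""
    return "yes" if rank == "Professor" or years > 6 else "no"

testing_data = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

# Count the testing records on which the rule agrees with the known label.
correct = sum(1 for _, rank, years, label in testing_data
              if tenured(rank, years) == label)
accuracy = correct / len(testing_data)

# Unseen data: (Jeff, Professor, 4)
jeff = tenured("Professor", 4)
```

The rule misclassifies Merlisa (7 years, yet not tenured), so it is right on 3 of the 4 testing records. For Jeff it answers “yes”, because his rank is Professor.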


Decision Trees

Figure: decision tree for the choice of investment type. Internal nodes: Income > 20K€, Age > 60, Married?. Branches are labeled yes/no. Leaves: Low risk, Mid risk, High risk.


Decision Trees

How are the attributes in decision trees selected?

Two well-known indexes are used
• Information gain selects the most informative attribute in distinguishing the items between the classes
• It is biased towards attributes with a large set of values
• Gain ratio addresses this limitation of information gain
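The bias of information gain, and the way gain ratio counters it, can be shown on a toy data set. In the Python sketch below, gain ratio is taken to be information gain divided by the entropy of the attribute's own value distribution (the usual definition, stated here as an assumption since the slides give no formula). The four rows, with an `id` attribute that is unique per row, are invented for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy of the target minus its expected entropy after the split."""
    total = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return total - remainder

def gain_ratio(rows, attribute, target):
    """Information gain divided by the entropy of the attribute's values."""
    split_info = entropy([r[attribute] for r in rows])
    if split_info == 0:
        return 0.0  # attribute has a single value: no useful split
    return information_gain(rows, attribute, target) / split_info

rows = [
    {"id": 1, "married": "yes", "risk": "low"},
    {"id": 2, "married": "yes", "risk": "low"},
    {"id": 3, "married": "no", "risk": "high"},
    {"id": 4, "married": "no", "risk": "high"},
]
ig_id = information_gain(rows, "id", "risk")
ig_married = information_gain(rows, "married", "risk")
gr_id = gain_ratio(rows, "id", "risk")
gr_married = gain_ratio(rows, "married", "risk")
```

Both attributes reach the maximum information gain of 1 bit, so information gain cannot tell the useless `id` from the meaningful `married`. Gain ratio halves the score of `id` because its four distinct values carry 2 bits of split information, and therefore prefers `married`.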


Decision Trees in WEKA

HANDS ON: use ADTree and see the results

HANDS ON: use J48 and apply a cost matrix

Cost matrix (Model vs. Truth)
        Good   Bad
Good    0      1
Bad     5      0


Bayesian classifiers

Bayesian classification
• A statistical classification technique
  • Predicts class membership probabilities
• Founded on Bayes' theorem

  $$P(H \mid X) = \frac{P(X \mid H)\, P(H)}{P(X)}$$

  • What if X = “Red and rounded” and H = “Apple”?
• Performance
  • The simplest implementation (Naïve Bayes) can be compared to decision trees and neural networks
• Incremental
  • Each training example can increase/decrease the probability that a hypothesis is correct

HANDS ON: use the NaiveBayes algorithm
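As a worked instance of Bayes' theorem for the apple question, the Python sketch below plugs in three probabilities and computes P(H|X). All three numbers are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def posterior(prior_h, likelihood_x_given_h, evidence_x):
    """Bayes' theorem: P(H|X) = P(X|H) * P(H) / P(X)."""
    return likelihood_x_given_h * prior_h / evidence_x

# Hypothetical numbers, for illustration only:
p_apple = 0.2               # P(H): a fruit is an apple
p_red_round_if_apple = 0.6  # P(X|H): an apple is red and rounded
p_red_round = 0.15          # P(X): any fruit is red and rounded

p_apple_if_red_round = posterior(p_apple, p_red_round_if_apple, p_red_round)
```

Under these assumed numbers, observing “red and rounded” raises the probability of “apple” from the prior 0.2 to about 0.8, because red and rounded fruit is much more common among apples than among fruit in general.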


Classification techniques: Support Vector Machines
• One of the most advanced classification techniques
• Left figure: a small margin between the classes is found
• Right figure: the largest margin is found
• Support vector machines (SVMs) are able to identify the margin in the right figure


Classification techniques: SVMs + Kernel Functions
• Is data always linearly separable? NO!!!
• Solution: SVMs + Kernel Functions

Figure: “How to split this?”, with one panel for SVM and one for SVM + Kernel Functions

HANDS ON: try WEKA’s SMO algorithm!
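The intuition behind kernel functions can be illustrated without an SVM at all: data that no single threshold separates in its original space may become separable after a mapping to a richer feature space. The Python sketch below uses an explicit map x → (x, x²) as a stand-in for a kernel; the one-dimensional points and both helper functions are invented for this example and do not reflect how SMO works internally.

```python
def feature_map(x):
    """Map a 1-D point to 2-D: (x, x^2). An explicit stand-in for a kernel."""
    return (x, x * x)

# Class +1 sits between two groups of class -1, so no single
# threshold on x separates the classes.
points = [(-3, -1), (-2, -1), (-1, +1), (0, +1), (1, +1), (2, -1), (3, -1)]

def separable_by_threshold(values_with_labels):
    """True if one threshold puts all +1 on one side and all -1 on the other."""
    pos = [v for v, y in values_with_labels if y == +1]
    neg = [v for v, y in values_with_labels if y == -1]
    return max(pos) < min(neg) or max(neg) < min(pos)

in_input_space = separable_by_threshold(points)
in_feature_space = separable_by_threshold(
    [(feature_map(x)[1], y) for x, y in points])
```

In the input space the check fails, while on the x² coordinate any threshold between 1 and 4 splits the classes. A kernel achieves the same effect implicitly, without ever computing the mapped coordinates.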


7. Summary

Why Data Mining?

Data Mining and KDD

Data preprocessing

Classification

Clustering

Application areas
