
Page 1: Intelligent Data Mining

Ethem Alpaydın
Department of Computer Engineering
Boğaziçi University
[email protected]

Intelligent Data Mining

Page 2: Intelligent Data Mining

What is Data Mining?

• Search for very strong patterns (correlations, dependencies) in big data that can generalise to accurate future decisions.

• Also known as knowledge discovery in databases (KDD) or business intelligence

Page 3: Intelligent Data Mining

Example Applications

• Association (basket analysis): "30% of customers who buy diapers also buy beer."

• Classification: "Young women buy small inexpensive cars." "Older wealthy men buy big cars."

• Regression: credit scoring

Page 4: Intelligent Data Mining

Example Applications

• Sequential Patterns: "Customers who pay late on two or more of the first three installments have a 60% probability of defaulting."

• Similar Time Sequences: "The value of the stocks of company X has been similar to that of company Y's."

Page 5: Intelligent Data Mining

Example Applications

• Exceptions (Deviation Detection): "Is any of my customers behaving differently than usual?"

• Text Mining (Web Mining): "Which documents on the internet are similar to this document?"

Page 6: Intelligent Data Mining

IDIS – US Forest Service

• Identifies forest stands (areas similar in age, structure and species composition)

• Predicts how different stands would react to fire and what preventive measures should be taken

Page 7: Intelligent Data Mining

GTE Labs

• KEFIR (Key Findings Reporter)

• Evaluates health-care utilization costs

• Isolates groups whose costs are likely to increase in the next year

• Finds medical conditions for which there is a known procedure that improves health condition and decreases costs

Page 8: Intelligent Data Mining

Lockheed

• RECON: stock portfolio selection

• Creates a portfolio of 150-200 securities from an analysis of a database of the performance of 1,500 securities over a seven-year period

Page 9: Intelligent Data Mining

VISA

• Credit card fraud detection

• CRIS: neural network software that learns to recognize the spending patterns of card holders and scores transactions by risk. "If a card holder normally buys gas and groceries and the account suddenly shows a purchase of stereo equipment in Hong Kong, CRIS sends a notice to the bank, which in turn can contact the card holder."

Page 10: Intelligent Data Mining

ISL Ltd (Clementine) - BBC

• Audience prediction

• Program schedulers must be able to predict the likely audience for a program and the optimum time to show it.

• Type of program, time, competing programs, other events affect audience figures.

Page 11: Intelligent Data Mining

Data Mining is NOT Magic!

Data mining draws on the concepts and methods of databases, statistics, and machine learning.

Page 12: Intelligent Data Mining

From the Warehouse to the Mine

[Flow: Transactional databases → (extract, transform, cleanse data) → Data warehouse → (define goals, data transformations) → Standard form]

Page 13: Intelligent Data Mining

How to mine?

Verification: computer-assisted, user-directed, top-down; query and report, OLAP (Online Analytical Processing) tools

Discovery: automated, data-driven, bottom-up

Page 14: Intelligent Data Mining

Steps: 1. Define Goal

• Associations between products?

• New market segments or potential customers?

• Buying patterns over time or product sales trends?

• Discriminating among classes of customers?

Page 15: Intelligent Data Mining

Steps: 2. Prepare Data

• Integrate, select and preprocess existing data (already done if there is a warehouse)

• Any other data relevant to the objective which might supplement existing data

Page 16: Intelligent Data Mining

Steps: 2. Prepare Data (Cont'd)

• Select the data: identify relevant variables

• Data cleaning: errors, inconsistencies, duplicates, missing data

• Data scrubbing: mappings, data conversions, new attributes

• Visual inspection: data distribution, structure, outliers, correlations between attributes

• Feature analysis: clustering, discretization
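A minimal sketch of this preparation step in Python with pandas, assuming a hypothetical customer table; the file name, column names, and derived attribute are illustrative assumptions, not part of the original slides.

```python
import pandas as pd

# Load the raw extract (hypothetical file and columns).
df = pd.read_csv("customers.csv")

# Data cleaning: drop exact duplicates and rows with a missing target.
df = df.drop_duplicates()
df = df.dropna(subset=["default"])

# Data scrubbing: a simple mapping/conversion and a new derived attribute.
df["owns_house"] = df["owns_house"].map({"Yes": 1, "No": 0})
df["savings_rate"] = df["savings"] / df["income"]

# Visual inspection: distributions, outliers, correlations between attributes.
print(df.describe())
print(df[["income", "savings", "savings_rate"]].corr())
```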

Page 17: Intelligent Data Mining

Steps: 3. Select Tool

• Identify the task class: clustering/segmentation, association, classification, pattern detection/prediction in time series

• Identify the solution class: explanation (decision trees, rules) vs. black box (neural network)

• Model assessment, validation and comparison: k-fold cross-validation, statistical tests

• Combination of models

Page 18: Intelligent Data Mining

Steps: 4. Interpretation

• Are the results (explanations/predictions) correct, significant?

• Consultation with a domain expert

Page 19: Intelligent Data Mining

Example

• Data as a table of attributes

Name | Income   | Owns a house? | Marital status | Default
Ali  | 25,000 $ | Yes           | Married        | No
Veli | 18,000 $ | No            | Married        | Yes

We would like to be able to explain the value of one attribute in terms of the values of other attributes that are relevant.

Page 20: Intelligent Data Mining

Modelling Data

Attributes x are observable

y = f(x), where f is unknown and probabilistic

[Diagram: x → f → y]

Page 21: Intelligent Data Mining

Building a Model for Data

[Diagram: x → f → y; a model f* is fit to the data and its prediction is compared with the observed y]

Page 22: Intelligent Data Mining

Learning from Data

Given a sample X = {x^t, y^t}_t,

we build f*(x^t), a predictor of f(x^t),

that minimizes the difference between our prediction and the actual value:

E = Σ_t [y^t − f*(x^t)]²

Page 23: Intelligent Data Mining

Types of Applications

• Classification: y in {C1, C2, …, CK}

• Regression: y in ℝ

• Time-Series Prediction: x temporally dependent

• Clustering: group x according to similarity

Page 24: Intelligent Data Mining

Example

[Scatter plot: customers plotted by yearly income vs. savings, each labelled OK or DEFAULT]

Page 25: Intelligent Data Mining

Example Solution

RULE: IF yearly-income > 1 AND savings > 2 THEN OK ELSE DEFAULT

[Plot: the (x1: yearly income, x2: savings) plane split by the two thresholds into OK and DEFAULT regions]

Page 26: Intelligent Data Mining

Decision Trees

x1: yearly income, x2: savings; y = 0: DEFAULT, y = 1: OK

x1 > 1?
├─ no  → y = 0
└─ yes → x2 > 2?
          ├─ no  → y = 0
          └─ yes → y = 1
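A minimal sketch of learning such a tree, assuming scikit-learn is available; the tiny hand-made data set below is illustrative and not taken from the slides.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Columns: [yearly income, savings]; label 1 = OK, 0 = DEFAULT (toy data).
X = [[0.5, 1.0], [1.5, 1.0], [1.5, 3.0], [2.0, 2.5], [0.8, 3.0], [2.5, 0.5]]
y = [0, 0, 1, 1, 0, 0]

tree = DecisionTreeClassifier(max_depth=2)   # depth 2: one threshold test per attribute
tree.fit(X, y)

# Inspect the learned threshold tests on the two attributes.
print(export_text(tree, feature_names=["yearly_income", "savings"]))
print(tree.predict([[2.0, 3.0]]))            # expected: OK (1)
```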

Page 27: Intelligent Data Mining

Clustering

[Scatter plot: customers plotted by yearly income vs. savings, grouped into Type 1, Type 2, and Type 3 clusters, with OK/DEFAULT labels]

Page 28: Intelligent Data Mining

Time-Series Prediction

[Timeline: monthly values Jan through Dec for the past and present, with the following Jan to be predicted (future)]

Discovery of frequent episodes

Page 29: Intelligent Data Mining

Methodology

[Flow diagram:]

• Data reduction: value and feature reductions take the initial standard form to train and test sets

• Train alternative predictors (Predictor 1, Predictor 2, ..., Predictor L) on the train set

• Test the trained predictors on the test data and choose the best

• Accept the best predictor if it is good enough

Page 30: Intelligent Data Mining

Data Visualisation

• Plot data in fewer dimensions (typically 2) to allow visual analysis

• Visualisation of structure, groups and outliers

Page 31: Intelligent Data Mining

Data Visualisation

[Plot: yearly income vs. savings, with a rule boundary and exceptions marked]

Page 32: Intelligent Data Mining

Techniques for Training Predictors

• Parametric multivariate statistics

• Memory-based (case-based) models

• Decision trees

• Artificial neural networks

Page 33: Intelligent Data Mining

Classification

• x: d-dimensional vector of attributes

• C1, C2, ..., CK: K classes

• Reject or doubt

• Compute P(Ci|x) from data and choose k such that P(Ck|x) = max_j P(Cj|x)

Page 34: Intelligent Data Mining

Bayes' Rule

P(Cj|x) = p(x|Cj) P(Cj) / p(x)

• p(x|Cj): likelihood that an object of class j has features x

• P(Cj): prior probability of class j

• p(x): probability of an object (of any class) having features x

• P(Cj|x): posterior probability that an object with features x is of class j

Page 35: Intelligent Data Mining

Statistical Methods

• Parametric, e.g., Gaussian, model for the class densities p(x|Cj)

Univariate:

p(x|Cj) = 1 / (√(2π) σj) · exp( −(x − μj)² / (2σj²) )

Multivariate (d-dimensional):

p(x|Cj) = 1 / ((2π)^(d/2) |Σj|^(1/2)) · exp( −½ (x − μj)^T Σj⁻¹ (x − μj) )

Page 36: Intelligent Data Mining

Training a Classifier

• Given data {x^t}_t of class Cj:

Univariate: p(x|Cj) is N(μj, σj²)

P̂(Cj) = nj / n

μ̂j = (1/nj) Σ_{x^t ∈ Cj} x^t

σ̂j² = (1/nj) Σ_{x^t ∈ Cj} (x^t − μ̂j)²

Multivariate: p(x|Cj) is N_d(μj, Σj)

μ̂j = (1/nj) Σ_{x^t ∈ Cj} x^t

Σ̂j = (1/nj) Σ_{x^t ∈ Cj} (x^t − μ̂j)(x^t − μ̂j)^T
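A minimal NumPy sketch of these estimators for the univariate case, with classification by the posterior (Bayes' rule); the toy sample and class labels are illustrative assumptions.

```python
import numpy as np

# Toy 1D sample: x values with class labels 0 / 1 (illustrative only).
x = np.array([1.0, 1.2, 0.8, 3.1, 2.9, 3.3])
c = np.array([0, 0, 0, 1, 1, 1])

classes = np.unique(c)
priors = {j: np.mean(c == j) for j in classes}       # P̂(Cj) = nj / n
means = {j: x[c == j].mean() for j in classes}       # μ̂j
variances = {j: x[c == j].var() for j in classes}    # σ̂j² (divides by nj)

def log_posterior(x_new, j):
    """log p(x|Cj) + log P(Cj), dropping the common log p(x) term."""
    m, v = means[j], variances[j]
    log_lik = -0.5 * np.log(2 * np.pi * v) - (x_new - m) ** 2 / (2 * v)
    return log_lik + np.log(priors[j])

x_new = 2.0
pred = max(classes, key=lambda j: log_posterior(x_new, j))   # pick the class with largest posterior
print(pred)
```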

Page 37: Intelligent Data Mining

Example: 1D Case

Page 38: Intelligent Data Mining

Example: Different Variances

Page 39: Intelligent Data Mining

Example: Many Classes

Page 40: Intelligent Data Mining

2D Case: Equal Spheric Classes

Page 41: Intelligent Data Mining

Shared Covariances

Page 42: Intelligent Data Mining

Different Covariances

Page 43: Intelligent Data Mining

Actions and Risks

αi: action i

λ(αi|Cj): loss of taking action αi when the true class is Cj

R(αi|x) = Σ_j λ(αi|Cj) P(Cj|x)

Choose αk such that

R(αk|x) = min_i R(αi|x)

Page 44: Intelligent Data Mining

Function Approximation (Scoring)

Page 45: Intelligent Data Mining

Regression

y^t = f(x^t) + ε, where ε is noise.

In linear regression,

f(x^t | w, w0) = w x^t + w0

Find w, w0 that minimize

E(w, w0) = Σ_t [y^t − (w x^t + w0)]²

by setting ∂E/∂w = 0 and ∂E/∂w0 = 0 and solving.
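A minimal least-squares sketch with NumPy; the toy data is an illustrative assumption.

```python
import numpy as np

# Toy data roughly following y = 2x + 1 plus noise (illustrative).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# np.polyfit solves the least-squares problem min over w, w0 of sum (y - (w*x + w0))^2.
w, w0 = np.polyfit(x, y, deg=1)
print(w, w0)                 # slope and intercept
print(w * 2.5 + w0)          # prediction for a new x
```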

Page 46: Intelligent Data Mining

Linear Regression

Page 47: Intelligent Data Mining

Polynomial Regression

• E.g., quadratic:

f(x^t | w2, w1, w0) = w2 (x^t)² + w1 x^t + w0

E(w2, w1, w0) = Σ_t [y^t − (w2 (x^t)² + w1 x^t + w0)]²
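The same least-squares sketch extends to the quadratic case; again the data is an illustrative assumption.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 1.8, 5.1, 10.2, 17.0])   # roughly y = x^2 + 1 (illustrative)

# Degree-2 fit: returns w2, w1, w0 minimizing the squared error above.
w2, w1, w0 = np.polyfit(x, y, deg=2)
print(w2, w1, w0)
```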

Page 48: Intelligent Data Mining

Polynomial Regression

Page 49: Intelligent Data Mining

Multiple Linear Regression

• d inputs:

f(x1, ..., xd | w0, w1, ..., wd) = w0 + w1 x1 + w2 x2 + ... + wd xd = w^T x

E(w0, w1, ..., wd) = Σ_t [y^t − f(x1^t, ..., xd^t | w0, w1, ..., wd)]²

Page 50: Intelligent Data Mining

Feature Selection

• Subset selection: forward and backward methods

• Linear projection: Principal Components Analysis (PCA), Linear Discriminant Analysis (LDA)

Page 51: Intelligent Data Mining

Sequential Feature Selection

Forward selection starts from single features and extends the current subset by one feature at each step:

(x1) (x2) (x3) (x4)  →  (x1 x3) (x2 x3) (x3 x4)  →  (x1 x2 x3) (x2 x3 x4)  →  ...

Backward selection starts from the full set and removes one feature at each step:

(x1 x2 x3 x4)  →  (x1 x2 x3) (x1 x2 x4) (x1 x3 x4) (x2 x3 x4)  →  (x2 x4) (x1 x4) (x1 x2)  →  ...
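A minimal sketch of greedy forward selection, assuming scikit-learn for a base model and cross-validated scoring; the synthetic data and the choice of a linear model are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))                                  # four candidate features
y = 2 * X[:, 0] - 3 * X[:, 2] + rng.normal(size=100) * 0.1    # only two features matter

selected, remaining = [], list(range(X.shape[1]))
best_score = -np.inf
while remaining:
    # Score each one-feature extension of the current subset.
    scores = {f: cross_val_score(LinearRegression(), X[:, selected + [f]], y, cv=5).mean()
              for f in remaining}
    f_best = max(scores, key=scores.get)
    if scores[f_best] <= best_score:       # stop when no candidate improves
        break
    best_score = scores[f_best]
    selected.append(f_best)
    remaining.remove(f_best)

print(selected, best_score)                # expected to pick the two informative features
```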

Page 52: Intelligent Data Mining

Principal Components Analysis (PCA)

[Diagram: data in the (x1, x2) plane projected onto principal components z1 and z2; a whitening transform rescales the components]

Page 53: Intelligent Data Mining

Linear Discriminant Analysis (LDA)

[Diagram: two classes in the (x1, x2) plane projected onto the discriminant direction z1]

Page 54: Intelligent Data Mining

Memory-based Methods

• Case-based reasoning

• Nearest-neighbor algorithms

• Keep a list of known instances and interpolate the response from those

Page 55: Intelligent Data Mining

Nearest Neighbor

[Plot: stored instances in the (x1, x2) plane]

Page 56: Intelligent Data Mining

Local Regression

[Plot: a local regression fit of y on x]

Mixture of Experts

Page 57: Intelligent Data Mining

Missing Data

• Ignore cases with missing data

• Mean imputation

• Imputation by regression
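A minimal sketch of mean imputation with pandas; the DataFrame and column names are illustrative assumptions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [25000, 18000, np.nan, 32000],
                   "savings": [4000, np.nan, 1500, 8000]})

# Mean imputation: replace each missing value by the mean of its column.
df_imputed = df.fillna(df.mean())
print(df_imputed)
```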

Page 58: Intelligent Data Mining

Training Decision Trees

x1 > 1?
├─ no  → y = 0
└─ yes → x2 > 2?
          ├─ no  → y = 0
          └─ yes → y = 1

[Plot: the (x1, x2) plane partitioned by the thresholds x1 = 1 and x2 = 2]

Page 59: Intelligent Data Mining

Measuring Disorder

[Figure: two candidate splits, one on x1 and one on x2, each producing branches with different class counts; the split that produces purer (less disordered) branches is preferred]

Page 60: Intelligent Data Mining

Entropy

e = − (n_left / n) log(n_left / n) − (n_right / n) log(n_right / n)
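A minimal sketch of this entropy computation in Python; the slide does not fix the log base, so base 2 is assumed here, and the counts in the example calls are illustrative.

```python
import math

def entropy(n_left, n_right):
    """Entropy of a two-way split with n_left and n_right instances."""
    n = n_left + n_right
    e = 0.0
    for n_i in (n_left, n_right):
        p = n_i / n
        if p > 0:                      # 0 * log 0 is taken as 0
            e -= p * math.log2(p)      # base-2 log is an assumption
    return e

print(entropy(7, 0))   # 0.0 -> perfectly ordered
print(entropy(5, 5))   # 1.0 -> maximally disordered
```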

Page 61: Intelligent Data Mining

Artificial Neural Networks

[Diagram: a single unit with inputs x1, x2, ..., xd, a bias input x0 = +1, weights w0, w1, w2, ..., wd, and output y through the activation g]

y = g(w1 x1 + w2 x2 + ... + wd xd + w0) = g(w^T x)

Regression: g is the identity. Classification: g is the sigmoid (0/1).

Page 62: Intelligent Data Mining

Training a Neural Network

• d inputs:

o = g( Σ_{i=0}^{d} w_i x_i ) = g(w^T x)

Training set: X = {x^t, y^t}_t

E(w | X) = Σ_{t ∈ X} (y^t − o^t)² = Σ_{t ∈ X} [y^t − g(Σ_i w_i x_i^t)]²

Find w that minimizes E on X.

Page 63: Intelligent Data Mining

Nonlinear Optimization

Δw_i = −η ∂E/∂w_i

Gradient descent: iterative learning, starting from a random w; η is the learning factor.

Page 64: Intelligent Data Mining

Neural Networks for Classification

K outputs o_j, j = 1, ..., K. Each o_j estimates P(Cj|x):

o_j = sigmoid(w_j^T x) = 1 / (1 + exp(−w_j^T x))

Page 65: Intelligent Data Mining

Multiple Outputs

[Diagram: inputs x1, x2, ..., xd and bias x0 = +1 feeding K output units o1, o2, ..., oK through weights w_ji (e.g., w_Kd)]

o_j^t = g(w_j^T x^t) = g( Σ_{i=0}^{d} w_ji x_i^t )

Page 66: Intelligent Data Mining

Iterative Training

ti

tj

tj

ji

j

jjiji

tTtj

t j

tj

tj

xgoyw

o

oE

wE

w

go

oyXE

j

)('

)(

)|(2

xw

w

ttX yx ,

i

tj

tj

tj

tjji

itj

tjji

xoooyw

xoyw

)1(

LinearNonlinear
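A minimal NumPy sketch of these update rules for a single sigmoid output trained by gradient descent; the toy data, learning factor, and iteration count are illustrative assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
X = np.array([[0.2, 0.1], [0.9, 0.8], [0.8, 0.9], [0.1, 0.3]])   # toy inputs
y = np.array([0.0, 1.0, 1.0, 0.0])                               # toy targets

X1 = np.hstack([np.ones((X.shape[0], 1)), X])    # x0 = +1 bias column
w = rng.normal(scale=0.1, size=X1.shape[1])      # start from a random w
eta = 0.5                                        # learning factor

for _ in range(1000):
    o = sigmoid(X1 @ w)                          # o^t = sigmoid(w^T x^t)
    # Nonlinear update: dw_i = eta * sum_t (y^t - o^t) o^t (1 - o^t) x_i^t
    w += eta * ((y - o) * o * (1 - o)) @ X1

print(np.round(sigmoid(X1 @ w), 2))              # outputs approach the targets
```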

Page 67: Intelligent Data Mining

Nonlinear classification

[Figure: left, a linearly separable problem; right, a problem that is NOT linearly separable and requires a nonlinear discriminant]

Page 68: Intelligent Data Mining

Multi-Layer Networks

[Diagram: inputs x1, ..., xd (plus bias x0 = +1) feed H hidden units h1, ..., hH (plus bias h0 = +1), which feed K output units o1, o2, ..., oK; first-layer weights w_pi, second-layer weights t_jp (e.g., t_KH)]

h_p^t = sigmoid( Σ_{i=0}^{d} w_pi x_i^t )

o_j^t = g( Σ_{p=0}^{H} t_jp h_p^t )
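A minimal NumPy sketch of this forward pass with one hidden layer of sigmoid units; the layer sizes and random weights are illustrative assumptions, and g is taken as the identity here.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
d, H, K = 3, 4, 2                       # inputs, hidden units, outputs (illustrative)
W = rng.normal(size=(H, d + 1))         # first-layer weights w_pi (incl. bias column)
T = rng.normal(size=(K, H + 1))         # second-layer weights t_jp (incl. bias column)

def forward(x):
    x1 = np.concatenate(([1.0], x))     # x0 = +1
    h = sigmoid(W @ x1)                 # h_p = sigmoid(sum_i w_pi x_i)
    h1 = np.concatenate(([1.0], h))     # h0 = +1
    o = T @ h1                          # o_j = g(sum_p t_jp h_p), g = identity here
    return o

print(forward(np.array([0.5, -1.0, 2.0])))
```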

Page 69: Intelligent Data Mining

Probabilistic Networks

[Diagram: a probabilistic (belief) network; nodes carry probabilities and conditional probabilities, e.g. p(·) = 0.1, p(·|·) = 0.05, p(·|·) = 0.1, ...]

Page 70: Intelligent Data Mining

Evaluating Learners

1. Given a model M, how can we assess its performance on real (future) data?

2. Given M1, M2, ..., ML which one is the best?

Page 71: Intelligent Data Mining

Cross-validation

[Diagram: the data is split into k folds 1, 2, 3, ..., k−1, k; in each round one fold is held out for testing and the remaining k−1 folds are used for training]

Repeat k times and average
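A minimal sketch of k-fold cross-validation with scikit-learn; the model choice and synthetic data are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)          # synthetic labels

# k = 5 folds: train on 4 folds, test on the held-out fold, repeat, average.
scores = cross_val_score(DecisionTreeClassifier(max_depth=3), X, y, cv=5)
print(scores, scores.mean())
```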

Page 72: Intelligent Data Mining

Combining Learners: Why?

[Flow diagram: from the initial standard form, build train and validation sets; train Predictor 1, Predictor 2, ..., Predictor L on the train set; evaluate them on the validation set and choose the best predictor]

Page 73: Intelligent Data Mining

Combining Learners: How?

[Flow diagram: as above, but instead of choosing a single best predictor, the predictions of Predictor 1, Predictor 2, ..., Predictor L are combined by voting]
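A minimal sketch of combining learners by majority voting, assuming scikit-learn; the base models and synthetic data are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)          # synthetic labels

# Predictors 1, 2, 3 combined by majority (hard) voting.
ensemble = VotingClassifier([
    ("tree", DecisionTreeClassifier(max_depth=3)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
    ("logreg", LogisticRegression()),
], voting="hard")
ensemble.fit(X, y)
print(ensemble.predict(X[:5]))
```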

Page 74: Intelligent Data Mining

Conclusions:The Importance of Data

• Extract valuable information from large amounts of raw data

• A large amount of reliable data is a must. The quality of the solution depends heavily on the quality of the data.

• Data mining is not alchemy; we cannot turn stone into gold

Page 75: Intelligent Data Mining

Conclusions: The Importance of the Domain Expert

• Joint effort of human experts and computers

• Any information regarding the application (symmetries, constraints, etc.) should be used to help the learning system

• Results should be checked for consistency by domain experts

Page 76: Intelligent Data Mining

Conclusions: The Importance of Being Patient

• Data mining is not straightforward; repeated trials are needed before the system is fine-tuned.

• Mining may be lengthy and costly. Large expectations lead to large disappointments!

Page 77: Intelligent Data Mining

Once again: Important Requirements for Mining

• A large amount of high-quality data

• Devoted and knowledgeable experts on:

1. The application domain
2. Databases (data warehouse)
3. Statistics and machine learning

• Time and patience

Page 78: Intelligent Data Mining

That’s all folks!