1 1 CSE 4705 Artificial Intelligence Jinbo Bi Department of Computer Science & Engineering http://www.engr.uconn.edu/~jinbo


CSE 4705 Artificial Intelligence

Jinbo Bi
Department of Computer Science & Engineering

http://www.engr.uconn.edu/~jinbo


Tasks in Machine Learning/Data Mining

Prediction tasks (supervised learning problems)
– Classification, regression, ranking
– Use some variables to predict unknown or future values of other variables.

Description tasks (unsupervised learning problems)
– Cluster analysis, novelty detection
– Find human-interpretable patterns that describe the data.

From [Fayyad et al.], Advances in Knowledge Discovery and Data Mining, 1996


Classification: Definition

Given a collection of examples (training set)
– Each example contains a set of attributes; one of the attributes is the class.

Find a model for the class attribute as a function of the values of the other attributes.

Goal: previously unseen examples should be assigned a class as accurately as possible.
– A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
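The train/test workflow described above can be sketched as follows. This is only an illustration under stated assumptions: the function names (`holdout_split`, `fit_majority_class`, `accuracy`), the 2/3 training fraction, the toy data, and the trivial majority-class "model" are all hypothetical choices made to keep the example short, not the method used in the course.

```python
import random
from collections import Counter

def holdout_split(examples, train_fraction=2 / 3, seed=0):
    """Shuffle a copy of the examples and split it into (train, test)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]

def fit_majority_class(train):
    """Assumed toy 'model': the most frequent class label in the training set."""
    labels = [label for _, label in train]
    return Counter(labels).most_common(1)[0][0]

def accuracy(predicted_label, test):
    """Fraction of test examples whose true class equals the predicted label."""
    correct = sum(1 for _, label in test if label == predicted_label)
    return correct / len(test)

# Each example is (attributes, class); the values are made up for illustration.
data = [((i,), "No") for i in range(7)] + [((i,), "Yes") for i in range(7, 10)]
train, test = holdout_split(data)
model = fit_majority_class(train)
print("predicted class:", model, "test accuracy:", accuracy(model, test))
```

Any real classifier can replace `fit_majority_class`; the point is that the model is built from the training set only and scored on the held-out test set.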


Classification Example

Training set (Refund: categorical, Marital Status: categorical, Taxable Income: continuous, Cheat: class):

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Test set:

Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

[Figure: the training set is used to learn a classifier (the model), which is then applied to the test set.]


Classification: Application 1

High-Risk Patient Detection
– Goal: Predict whether a patient will suffer a major complication after a surgical procedure
– Approach:
  • Use patients' vital signs before and after the surgical operation (heart rate, respiratory rate, etc.).
  • Have expert medical professionals monitor patients and label which patients have complications and which do not.
  • Learn a model for the class of the after-surgery risk.
  • Use this model to detect potential high-risk patients for a particular surgical procedure.


Classification: Application 2

Face recognition

– Goal: Predict the identity of a face image

– Approach:
  • Align all images to derive the features
  • Model the class (identity) based on these features


Classification: Application 3

Cancer Detection

– Goal: To predict class (cancer or normal) of a sample (person), based on the microarray gene expression data

– Approach:
  • Use expression levels of all genes as the features
  • Label each example as cancer or normal
  • Learn a model for the class of all samples


Classification: Application 4

Alzheimer's Disease Detection

– Goal: To predict class (AD or normal) of a sample (person), based on neuroimaging data such as MRI and PET

– Approach:
  • Extract features from neuroimages
  • Label each example as AD or normal
  • Learn a model for the class of all samples

[Figure: Reduced gray matter volume (colored areas) detected by MRI voxel-based morphometry in AD patients compared to normal healthy controls.]


Regression

Predict a value of a real-valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency.

Extensively studied in statistics and in the neural network field.

Find a model to predict the dependent variable as a function of the values of the independent variables.

Goal: previously unseen examples should be predicted as accurately as possible.
– A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
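As a concrete instance of "find a model to predict the dependent variable", the sketch below fits a straight line to one independent variable by ordinary least squares and then scores it on separate test points. The single-variable linear form, the function names (`fit_line`, `predict`) and the toy numbers are assumptions for illustration; the slides do not prescribe a particular regression model here.

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(model, x):
    """Apply a fitted (slope, intercept) model to a new input."""
    slope, intercept = model
    return slope * x + intercept

# Toy training set generated from y = 2x + 1 (made-up values).
train_x, train_y = [1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0]
model = fit_line(train_x, train_y)

# Held-out test set: previously unseen examples.
test_x, test_y = [5.0, 6.0], [11.0, 13.0]
test_sse = sum((predict(model, x) - y) ** 2 for x, y in zip(test_x, test_y))
print("slope, intercept:", model, "test sum of squares:", test_sse)
```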


Regression application 1

Goal: Predict the possible loss from a customer.

Training set (past transaction records, labeled with the loss). Refund: categorical, Marital Status: categorical, Taxable Income: continuous, Loss: continuous target.

Tid  Refund  Marital Status  Taxable Income  Loss
1    Yes     Single          125K            100
2    No      Married         100K            120
3    No      Single          70K             -200
4    Yes     Married         120K            -300
5    No      Divorced        95K             -400
6    No      Married         60K             -500
7    Yes     Divorced        220K            -190
8    No      Single          85K             300
9    No      Married         75K             -240
10   No      Single          90K             90

Test set (current data, for which the model is used to predict the loss):

Refund  Marital Status  Taxable Income  Loss
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

[Figure: the training set is used to learn a regressor (the model), which is then applied to the test set.]


Regression applications

Examples:
– Predicting sales amounts of a new product based on advertising expenditure.
– Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
– Time series prediction of stock market indices.


Clustering Definition

Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that
– Data points in one cluster are more similar to one another.
– Data points in separate clusters are less similar to one another.

Similarity Measures:
– Euclidean distance if attributes are continuous.
– Other problem-specific measures
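A minimal sketch of the Euclidean distance and of how it can drive cluster membership: each point is assigned to the closest of a set of given cluster centers. The helper names (`euclidean`, `assign_to_nearest`), the fixed centers and the toy points are assumptions for illustration; a full clustering algorithm would also have to choose and update the centers.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points with continuous attributes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def assign_to_nearest(points, centers):
    """For each point, return the index of the closest center."""
    return [
        min(range(len(centers)), key=lambda j: euclidean(p, centers[j]))
        for p in points
    ]

# Made-up 2-D points forming two visibly separate groups.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
centers = [(0.0, 0.5), (5.5, 5.0)]  # assumed to be given
labels = assign_to_nearest(points, centers)
print(labels)
```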


Illustrating Clustering

[Figure: Euclidean distance based clustering in 3-D space. Intracluster distances are minimized; intercluster distances are maximized.]


Clustering: Application 1

High-Risk Patient Detection
– Goal: Predict whether a patient will suffer a major complication after a surgical procedure
– Approach:
  • Use patients' vital signs before and after the surgical operation (heart rate, respiratory rate, etc.).
  • Find patients whose symptoms are dissimilar from those of most other patients.


Clustering: Application 2

Document Clustering:
– Goal: To find groups of documents that are similar to each other based on the important terms appearing in them.
– Approach: To identify frequently occurring terms in each document. Form a similarity measure based on the frequencies of different terms. Use it to cluster.
– Gain: Information Retrieval can utilize the clusters to relate a new document or search term to clustered documents.
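One possible similarity measure built from term frequencies is the cosine similarity between two documents' term-count vectors, sketched below. The choice of cosine similarity, the whitespace tokenizer and the function names (`term_frequencies`, `cosine_similarity`) are assumptions; the slides only say that the measure is based on term frequencies.

```python
import math
from collections import Counter

def term_frequencies(text):
    """Count how often each lower-cased, whitespace-separated term occurs."""
    return Counter(text.lower().split())

def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two term-frequency vectors (0 if either is empty)."""
    shared = tf_a.keys() & tf_b.keys()
    dot = sum(tf_a[t] * tf_b[t] for t in shared)
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Made-up example documents.
doc1 = term_frequencies("stocks fell as markets closed")
doc2 = term_frequencies("markets rose and stocks rallied")
doc3 = term_frequencies("the home team won")
print(cosine_similarity(doc1, doc2), cosine_similarity(doc1, doc3))
```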


Illustrating Document Clustering

Clustering points: 3204 articles of the Los Angeles Times.
Similarity measure: how many words are common in these documents (after some word filtering).

Category       Total Articles  Correctly Placed
Financial      555             364
Foreign        341             260
National       273             36
Metro          943             746
Sports         738             573
Entertainment  354             278


Algorithms to solve these problems


Classification algorithms

K-Nearest-Neighbor classifiers
Naïve Bayes classifier
Neural Networks
Linear Discriminant Analysis (LDA)
Support Vector Machines (SVM)
Decision Trees
Logistic Regression
Graphical models


Regression methods

Linear Regression
Ridge Regression
LASSO (Least Absolute Shrinkage and Selection Operator)
Neural Networks


Clustering algorithms

K-Means
Hierarchical clustering
Graph-based clustering (Spectral clustering)
Semi-supervised clustering
Others


Challenges of Big Data

Scalability
Dimensionality
Complex and Heterogeneous Data
Data Quality
Data Ownership and Distribution
Privacy Preservation


Our Focus

Supervised learning
– Classification (support vector machine)
– Regression (backpropagation neural networks)

Before talking about the techniques, let us first understand how a learning model is evaluated.


Model Evaluation

Metrics for Performance Evaluation
– How to evaluate the performance of a model?

Methods for Performance Evaluation
– How to obtain reliable estimates?

Methods for Model Comparison
– How to compare the relative performance among competing models?


Metrics for Performance Evaluation

Regression
– Sum of squares
– Sum of deviation
– Exponential function of the deviation
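The first two regression metrics can be sketched as below. The formulas are not given in the text, so this assumes the usual readings: the sum of squared differences between true and predicted values, and the sum of absolute differences for "sum of deviation". The exponential variant is left out because its exact form is not stated. Function names and numbers are hypothetical.

```python
def sum_of_squares(y_true, y_pred):
    """Sum of squared deviations between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred))

def sum_of_absolute_deviations(y_true, y_pred):
    """Sum of absolute deviations (assumed meaning of 'sum of deviation')."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred))

# Made-up true and predicted values.
y_true = [100, 120, -200]
y_pred = [90, 130, -200]
print(sum_of_squares(y_true, y_pred), sum_of_absolute_deviations(y_true, y_pred))
```

Squaring penalizes large errors more heavily than the absolute deviation does, which is the main practical difference between the two.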


Metrics for Performance Evaluation

Focus on the predictive capability of a model
– Rather than how fast it takes to classify or build models, scalability, etc.

Confusion Matrix:

                   Predicted Class=Yes  Predicted Class=No
Actual Class=Yes   a                    b
Actual Class=No    c                    d

a: TP (true positive)
b: FN (false negative)
c: FP (false positive)
d: TN (true negative)


Metrics for Performance Evaluation…

Most widely-used metric:

                   Predicted Class=Yes  Predicted Class=No
Actual Class=Yes   a (TP)               b (FN)
Actual Class=No    c (FP)               d (TN)

\[ \text{Accuracy} = \frac{a + d}{a + b + c + d} = \frac{TP + TN}{TP + TN + FP + FN} \]


Limitation of Accuracy

Consider a 2-class problem
– Number of Class 0 examples = 9990
– Number of Class 1 examples = 10

If the model predicts everything to be class 0, accuracy is 9990/10000 = 99.9%
– Accuracy is misleading because the model does not detect any class 1 example
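The numbers above can be reproduced with a short sketch. The helper `confusion_counts` is a hypothetical name, and class 1 is treated as the positive class (an assumption made for illustration).

```python
def confusion_counts(actual, predicted, positive=1):
    """Return (TP, FN, FP, TN) with `positive` treated as the positive class."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1
        elif a == positive:
            fn += 1
        elif p == positive:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn

actual = [0] * 9990 + [1] * 10   # 9990 class 0 examples, 10 class 1 examples
predicted = [0] * 10000          # model predicts everything to be class 0

tp, fn, fp, tn = confusion_counts(actual, predicted)
acc = (tp + tn) / (tp + fn + fp + tn)
detected = tp / (tp + fn)        # fraction of class 1 examples that were found
print(acc, detected)
```

Accuracy comes out at 99.9% even though none of the ten class 1 examples is detected.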


Cost Matrix

C(i|j)             Predicted Class=Yes  Predicted Class=No
Actual Class=Yes   C(Yes|Yes)           C(No|Yes)
Actual Class=No    C(Yes|No)            C(No|No)

C(i|j): cost of classifying a class j example as class i (a misclassification cost when i ≠ j)


Computing Cost of Classification

Cost matrix:

C(i|j)     Predicted +  Predicted -
Actual +   -1           100
Actual -   1            0

Model M1 (Accuracy = 80%, Cost = 3910):

           Predicted +  Predicted -
Actual +   150          40
Actual -   60           250

Model M2 (Accuracy = 90%, Cost = 4255):

           Predicted +  Predicted -
Actual +   250          45
Actual -   5            200
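The accuracy and cost figures for M1 and M2 can be checked by multiplying each confusion-matrix count by the matching cost entry and summing. In this sketch the dictionaries are keyed by (actual, predicted); that layout and the function names are assumptions made for illustration.

```python
# Cost matrix and confusion matrices, keyed by (actual class, predicted class).
COST = {("+", "+"): -1, ("+", "-"): 100, ("-", "+"): 1, ("-", "-"): 0}
M1 = {("+", "+"): 150, ("+", "-"): 40, ("-", "+"): 60, ("-", "-"): 250}
M2 = {("+", "+"): 250, ("+", "-"): 45, ("-", "+"): 5, ("-", "-"): 200}

def total_cost(counts, cost):
    """Sum of count * cost over all (actual, predicted) cells."""
    return sum(counts[cell] * cost[cell] for cell in counts)

def accuracy(counts):
    """Correctly classified examples divided by all examples."""
    correct = counts[("+", "+")] + counts[("-", "-")]
    return correct / sum(counts.values())

for name, counts in (("M1", M1), ("M2", M2)):
    print(name, "accuracy:", accuracy(counts), "cost:", total_cost(counts, COST))
```

M2 has the higher accuracy but also the higher cost, because its 45 false negatives are each charged 100.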


Cost vs Accuracy

Count matrix:

                   Predicted Class=Yes  Predicted Class=No
Actual Class=Yes   a                    b
Actual Class=No    c                    d

Cost matrix:

                   Predicted Class=Yes  Predicted Class=No
Actual Class=Yes   p                    q
Actual Class=No    q                    p

N = a + b + c + d

Accuracy = (a + d)/N

Cost = p (a + d) + q (b + c)
     = p (a + d) + q (N – a – d)
     = q N – (q – p)(a + d)
     = N [q – (q – p) × Accuracy]

Cost is a linear function of accuracy if
1. C(Yes|No) = C(No|Yes) = q
2. C(Yes|Yes) = C(No|No) = p


Cost-Sensitive Measures

\[ \text{Precision } (p) = \frac{a}{a + c} \]

\[ \text{Recall } (r) = \frac{a}{a + b} \]

Precision is biased towards C(Yes|Yes) & C(Yes|No)
Recall is biased towards C(Yes|Yes) & C(No|Yes)

Count matrix:

                   Predicted Class=Yes  Predicted Class=No
Actual Class=Yes   a                    b
Actual Class=No    c                    d

– A model that declares every record to be the positive class: b = d = 0, so recall is high
– A model that assigns a positive class only to the (sure) test records: c is small, so precision is high


Cost-Sensitive Measures (Cont’d)

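\[ \text{Precision } (p) = \frac{a}{a + c} \]

\[ \text{Recall } (r) = \frac{a}{a + b} \]

\[ \text{F-measure } (F) = \frac{2rp}{r + p} = \frac{2a}{2a + b + c} \]

F-measure is biased towards all except C(No|No)

\[ \text{Weighted Accuracy} = \frac{w_1 a + w_4 d}{w_1 a + w_2 b + w_3 c + w_4 d} \]

Count matrix:

                   Predicted Class=Yes  Predicted Class=No
Actual Class=Yes   a                    b
Actual Class=No    c                    d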
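These measures can be computed directly from the four counts a, b, c, d of the count matrix. The sketch below uses hypothetical function names, and the example counts (150, 40, 60, 250) are only illustrative input.

```python
def precision(a, b, c, d):
    """p = a / (a + c): fraction of predicted positives that are truly positive."""
    return a / (a + c)

def recall(a, b, c, d):
    """r = a / (a + b): fraction of actual positives that are predicted positive."""
    return a / (a + b)

def f_measure(a, b, c, d):
    """F = 2rp / (r + p), the harmonic mean of precision and recall."""
    p, r = precision(a, b, c, d), recall(a, b, c, d)
    return 2 * r * p / (r + p)

def weighted_accuracy(a, b, c, d, w1, w2, w3, w4):
    """(w1*a + w4*d) / (w1*a + w2*b + w3*c + w4*d)."""
    return (w1 * a + w4 * d) / (w1 * a + w2 * b + w3 * c + w4 * d)

a, b, c, d = 150, 40, 60, 250  # illustrative counts: TP, FN, FP, TN
print(precision(a, b, c, d), recall(a, b, c, d), f_measure(a, b, c, d))
print(weighted_accuracy(a, b, c, d, 1, 1, 1, 1))  # equal weights give plain accuracy
```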


Model Evaluation

Metrics for Performance Evaluation
– How to evaluate the performance of a model?

Methods for Performance Evaluation
– How to obtain reliable estimates?

Methods for Model Comparison
– How to compare the relative performance among competing models?


Methods for Performance Evaluation

How to obtain a reliable estimate of performance?

Performance of a model may depend on other factors besides the learning algorithm:
– Class distribution
– Cost of misclassification
– Size of training and test sets


Learning Curve

Learning curve shows how accuracy changes with varying sample size

Requires a sampling schedule for creating the learning curve:
– Arithmetic sampling (Langley et al.)
– Geometric sampling (Provost et al.)

Effect of small sample size:
– Bias in the estimate
– Variance of the estimate


Methods of Estimation

Holdout
– Reserve 2/3 for training and 1/3 for testing

Random subsampling
– Repeated holdout

Cross validation
– Partition data into k disjoint subsets
– k-fold: train on k-1 partitions, test on the remaining one
– Leave-one-out: k = n

Stratified sampling
– Oversampling vs undersampling

Bootstrap
– Sampling with replacement


A Useful Link

http://dlib.net/ml_guide.svg


Methods of Estimation (Cont’d)

Holdout method

– Given data is randomly partitioned into two independent sets
  • Training set (e.g., 2/3) for model construction
  • Test set (e.g., 1/3) for accuracy estimation
– Random subsampling: a variation of holdout
  • Repeat holdout k times; accuracy = average of the accuracies obtained

Cross-validation (k-fold, where k = 10 is most popular)
– Randomly partition the data into k mutually exclusive subsets D1, ..., Dk, each of approximately equal size
– At the i-th iteration, use Di as the test set and the others as the training set
– Leave-one-out: k folds where k = number of tuples; for small data sets
– Stratified cross-validation: folds are stratified so that the class distribution in each fold is approximately the same as that in the initial data
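The k-fold procedure can be sketched as an index generator: shuffle the example indices, deal them into k disjoint folds, and let each fold serve once as the test set. The function names and the shuffling seed are assumptions for illustration, and this plain version does not stratify by class.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Randomly partition indices 0..n-1 into k disjoint, near-equal folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validation_splits(n, k, seed=0):
    """Yield (train_indices, test_indices); fold i is the test set at iteration i."""
    folds = k_fold_indices(n, k, seed)
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

for train_idx, test_idx in cross_validation_splits(n=10, k=5):
    print("train:", sorted(train_idx), "test:", sorted(test_idx))
```

Calling `cross_validation_splits(n, n)` gives leave-one-out, since every fold then holds a single example.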


Methods of Estimation (Cont’d)

Bootstrap

– Works well with small data sets

– Samples the given training tuples uniformly with replacement
  • i.e., each time a tuple is selected, it is equally likely to be selected again and re-added to the training set

There are several bootstrap methods; a common one is the .632 bootstrap

– Suppose we are given a data set of d examples. The data set is sampled d times, with replacement, resulting in a training set of d samples. The data points that did not make it into the training set end up forming the test set. About 63.2% of the original data will end up in the bootstrap, and the remaining 36.8% will form the test set (since \((1 - 1/d)^d \approx e^{-1} \approx 0.368\))

– Repeat the sampling procedure k times; the overall accuracy of the model is:

\[ acc(M) = \sum_{i=1}^{k} \left( 0.632 \times acc(M_i)_{test\_set} + 0.368 \times acc(M_i)_{train\_set} \right) \]
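The sampling step of the .632 bootstrap can be simulated to check the 63.2% / 36.8% split empirically. The function names and the seed are hypothetical. One deliberate assumption: `combine_632` divides the sum over the k rounds by k so that the result stays between 0 and 1; the formula as printed on the slide shows only the sum.

```python
import random

def bootstrap_split(d, rng):
    """Sample d indices with replacement; unsampled indices form the test set."""
    train_idx = [rng.randrange(d) for _ in range(d)]
    chosen = set(train_idx)
    test_idx = [i for i in range(d) if i not in chosen]
    return train_idx, test_idx

def combine_632(test_accs, train_accs):
    """Average of 0.632 * acc_test + 0.368 * acc_train over k rounds.

    Dividing by k is an assumption made here, not taken from the slide.
    """
    k = len(test_accs)
    total = sum(0.632 * t + 0.368 * s for t, s in zip(test_accs, train_accs))
    return total / k

rng = random.Random(0)
d = 10000
train_idx, test_idx = bootstrap_split(d, rng)
oob_fraction = len(test_idx) / d  # expected to be close to e**-1, about 0.368
print(oob_fraction)
```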


Model Evaluation

Metrics for Performance Evaluation
– How to evaluate the performance of a model?

Methods for Performance Evaluation
– How to obtain reliable estimates?

Methods for Model Comparison
– How to compare the relative performance among competing models?


ROC (Receiver Operating Characteristic)

Developed in the 1950s for signal detection theory to analyze noisy signals
– Characterizes the trade-off between positive hits and false alarms

ROC curve plots TPR (on the y-axis) against FPR (on the x-axis)

Performance of each classifier is represented as a point on the ROC curve

If the classifier returns a real-valued prediction,
– changing the threshold of the algorithm, the sample distribution or the cost matrix changes the location of the point


ROC Curve

At threshold t: TP = 50, FN = 50, FP = 12, TN = 88

                   Predicted Class=Yes  Predicted Class=No
Actual Class=Yes   a (TP)               b (FN)
Actual Class=No    c (FP)               d (TN)

TPR = TP/(TP+FN)
FPR = FP/(FP+TN)


ROC Curve

                   Predicted Class=Yes  Predicted Class=No
Actual Class=Yes   a (TP)               b (FN)
Actual Class=No    c (FP)               d (TN)

TPR = TP/(TP+FN)
FPR = FP/(FP+TN)

(TPR, FPR):
– (0,0): declare everything to be negative class (TP = 0, FP = 0)
– (1,1): declare everything to be positive class (FN = 0, TN = 0)
– (1,0): ideal (FN = 0, FP = 0)


ROC Curve

(TPR, FPR):
– (0,0): declare everything to be negative class
– (1,1): declare everything to be positive class
– (1,0): ideal

Diagonal line:
– Random guessing
– Below diagonal line: prediction is opposite of the true class


How to Construct an ROC curve

Instance P(+|A) True Class

1 0.95 +

2 0.93 +

3 0.87 -

4 0.85 -

5 0.85 -

6 0.85 +

7 0.76 -

8 0.53 +

9 0.43 -

10 0.25 +

• Use classifier that produces posterior probability for each test instance P(+|A)

• Sort the instances according to P(+|A) in decreasing order

• Apply threshold at each unique value of P(+|A)

• Count the number of TP, FP, TN, FN at each threshold

• TP rate, TPR = TP/(TP+FN)

• FP rate, FPR = FP/(FP + TN)


How to Construct an ROC curve

Instance P(+|A) True Class

1 0.95 +

2 0.93 +

3 0.87 -

4 0.85 -

5 0.85 -

6 0.85 +

7 0.76 -

8 0.53 +

9 0.43 -

10 0.25 +

• Use classifier that produces posterior probability for each test instance P(+|A)

• Sort the instances according to P(+|A) in decreasing order

• Pick a threshold 0.85

• p >= 0.85: predicted as positive (P)

• p < 0.85: predicted as negative (N)

• TP = 3, FP=3, TN=2, FN=2

• TP rate, TPR = 3/5=60%

• FP rate, FPR = 3/5=60%


How to construct an ROC curve

Class          +     -     +     -     -     -     +     -     +     +
Threshold >=   0.25  0.43  0.53  0.76  0.85  0.85  0.85  0.87  0.93  0.95  1.00
TP             5     4     4     3     3     3     3     2     2     1     0
FP             5     5     4     4     3     2     1     1     0     0     0
TN             0     0     1     1     2     3     4     4     5     5     5
FN             0     1     1     2     2     2     2     3     3     4     5
TPR            1     0.8   0.8   0.6   0.6   0.6   0.6   0.4   0.4   0.2   0
FPR            1     1     0.8   0.8   0.6   0.4   0.2   0.2   0     0     0

[Figure: ROC curve plotted from the TPR and FPR rows above.]
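The construction steps can be sketched on the ten instances listed earlier (P(+|A) and true class). This sketch applies a threshold at each unique score, so the three instances tied at 0.85 give a single point (TP = 3, FP = 3, TN = 2, FN = 2), whereas the table lists tied instances in separate columns. The function names are hypothetical, and the extra threshold 1.00 assumes every score is below 1.

```python
scores = [0.95, 0.93, 0.87, 0.85, 0.85, 0.85, 0.76, 0.53, 0.43, 0.25]
labels = ["+", "+", "-", "-", "-", "+", "-", "+", "-", "+"]

def counts_at(threshold, scores, labels):
    """Return (TP, FP, TN, FN) when instances with score >= threshold are predicted +."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == "+")
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == "-")
    tn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == "-")
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == "+")
    return tp, fp, tn, fn

def roc_points(scores, labels):
    """Return a list of (threshold, TPR, FPR), one entry per unique threshold."""
    thresholds = sorted(set(scores)) + [1.00]  # 1.00 yields the (0, 0) end point
    points = []
    for t in thresholds:
        tp, fp, tn, fn = counts_at(t, scores, labels)
        points.append((t, tp / (tp + fn), fp / (fp + tn)))
    return points

for threshold, tpr, fpr in roc_points(scores, labels):
    print(f"threshold >= {threshold:.2f}: TPR = {tpr:.1f}, FPR = {fpr:.1f}")
```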


Using ROC for Model Comparison

No model consistently outperforms the other
– M1 is better for small FPR
– M2 is better for large FPR

Area Under the ROC curve (AUC)
– Ideal: Area = 1
– Random guess: Area = 0.5
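Given a set of ROC points, the area under the curve can be approximated with the trapezoidal rule. This is a sketch under assumptions: the function name `auc` is hypothetical, and the points are taken to be (FPR, TPR) pairs that include the end points (0, 0) and (1, 1).

```python
def auc(points):
    """Area under an ROC curve given (FPR, TPR) points, by the trapezoidal rule."""
    pts = sorted(points)  # order by FPR, then TPR
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

ideal = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)]   # passes through TPR = 1, FPR = 0
random_guess = [(0.0, 0.0), (1.0, 1.0)]        # the diagonal line
print(auc(ideal), auc(random_guess))
```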


Data normalization

Example-wise normalization
– Each example is normalized and mapped to the unit sphere

Feature-wise normalization
– [0,1]-normalization: rescale each feature to the range [0, 1]
– Standard normalization: normalize each feature to have mean 0 and standard deviation 1
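The three normalizations can be sketched as below. The function names are hypothetical; the sketch assumes the Euclidean norm for the unit sphere and the population standard deviation (dividing by n) for standard normalization, neither of which is specified in the text.

```python
import math

def unit_norm(example):
    """Example-wise: scale one example so that its Euclidean norm is 1."""
    norm = math.sqrt(sum(v * v for v in example))
    return [v / norm for v in example] if norm > 0 else list(example)

def min_max(column):
    """Feature-wise [0,1]-normalization of one feature across all examples."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]

def standardize(column):
    """Feature-wise standard normalization: mean 0, standard deviation 1."""
    n = len(column)
    mean = sum(column) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / n)
    if std == 0:
        return [0.0 for _ in column]
    return [(v - mean) / std for v in column]

incomes = [60.0, 125.0, 220.0]  # made-up feature values
print(unit_norm([3.0, 4.0]), min_max(incomes), standardize(incomes))
```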
