Upload
steven-wood
View
216
Download
0
Tags:
Embed Size (px)
Citation preview
1 1
CSE 4705Artificial Intelligence
Jinbo BiDepartment of Computer Science & Engineering
http://www.engr.uconn.edu/~jinbo
2
Tasks may be in Machine Learning/Data Mining
Prediction tasks (supervised learning problem)– Classification, regression, ranking – Use some variables to predict unknown or
future values of other variables.
Description tasks (unsupervised learning problem)– Cluster analysis, novelty detection, – Find human-interpretable patterns that
describe the data.From [Fayyad, et.al.] Advances in Knowledge Discovery and Data Mining, 1996
3
Classification: Definition
Given a collection of examples (training set )– Each example contains a set of attributes, one of
the attributes is the class. Find a model for class attribute as a function
of the values of other attributes. Goal: previously unseen examples should be
assigned a class as accurately as possible.– A test set is used to determine the accuracy of the
model. Usually, the given data set is divided into training and test sets, with training set used to build the model and test set used to validate it.
4
Classification Example
Tid Refund MaritalStatus
TaxableIncome Cheat
1 Yes Single 125K No
2 No Married 100K No
3 No Single 70K No
4 Yes Married 120K No
5 No Divorced 95K Yes
6 No Married 60K No
7 Yes Divorced 220K No
8 No Single 85K Yes
9 No Married 75K No
10 No Single 90K Yes10
categoric
al
categoric
al
continuous
class
Refund MaritalStatus
TaxableIncome Cheat
No Single 75K ?
Yes Married 50K ?
No Married 150K ?
Yes Divorced 90K ?
No Single 40K ?
No Married 80K ?10
TestSet
Training Set
ModelLearn
Classifier
5
Classification: Application 1
High Risky Patient Detection– Goal: Predict if a patient will suffer major complication
after a surgery procedure– Approach:
Use patients vital signs before and after surgical operation.– Heart Rate, Respiratory Rate, etc.
Monitor patients by expert medical professionals to label which patient has complication, which has not.
Learn a model for the class of the after-surgery risk. Use this model to detect potential high-risk patients for a
particular surgical procedure
6
Classification: Application 2
Face recognition
– Goal: Predict the identity of a face image
– Approach: Align all images to derive the features Model the class (identity) based on these features
7
Classification: Application 3
Cancer Detection
– Goal: To predict class (cancer or normal) of a sample (person), based on the microarray gene expression data
– Approach: Use expression levels of all
genes as the features Label each example as cancer
or normal Learn a model for the class of
all samples
8
Classification: Application 4
Alzheimer's Disease Detection
– Goal: To predict class (AD or normal) of a sample (person), based on neuroimaging data such as MRI and PET
– Approach: Extract features from
neuroimages Label each example as AD or
normal Learn a model for the class of
all samples
Reduced gray matter volume (colored areas) detected by MRI voxel-basedmorphometry in AD patients compared to normal healthy controls.
9
Regression
Predict a value of a real-valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency.
Extensively studied in statistics, neural network fields. Find a model to predict the dependent variable
as a function of the values of independent variables.
Goal: previously unseen examples should be predicted as accurately as possible.– A test set is used to determine the accuracy of the
model. Usually, the given data set is divided into training and test sets, with training set used to build the model and test set used to validate it.
10
Regression application 1
categoric
al
categoric
al
continuous
Continuous ta
rget
Refund Marital Status
Taxable Income Loss
No Single 75K ?
Yes Married 50K ?
No Married 150K ?
Yes Divorced 90K ?
No Single 40K ?
No Married 80K ? 10
TestSet
Training Set
ModelLearn
RegressorPast transaction records, label them
Current data, want to use the model to predict
goals: Predict the possible loss from a customer
Tid Refund MaritalStatus
TaxableIncome Loss
1 Yes Single 125K 100
2 No Married 100K 120
3 No Single 70K -200
4 Yes Married 120K -300
5 No Divorced 95K -400
6 No Married 60K -500
7 Yes Divorced 220K -190
8 No Single 85K 300
9 No Married 75K -240
10 No Single 90K 9010
11
Regression applications
Examples:– Predicting sales amounts of new product
based on advertising expenditure.– Predicting wind velocities as a function of
temperature, humidity, air pressure, etc.– Time series prediction of stock market indices.
12
Clustering Definition
Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that– Data points in one cluster are more similar to
one another.– Data points in separate clusters are less
similar to one another. Similarity Measures:
– Euclidean Distance if attributes are continuous.
– Other Problem-specific Measures
13
Illustrating Clustering
Euclidean Distance Based Clustering in 3-D space.
Intracluster distancesare minimized
Intracluster distancesare minimized
Intercluster distancesare maximized
Intercluster distancesare maximized
14
Clustering: Application 1
High Risky Patient Detection– Goal: Predict if a patient will suffer major complication
after a surgery procedure– Approach:
Use patients vital signs before and after surgical operation.– Heart Rate, Respiratory Rate, etc.
Find patients whose symptoms are dissimilar from most of other patients.
15
Clustering: Application 2
Document Clustering:– Goal: To find groups of documents that are
similar to each other based on the important terms appearing in them.
– Approach: To identify frequently occurring terms in each document. Form a similarity measure based on the frequencies of different terms. Use it to cluster.
– Gain: Information Retrieval can utilize the clusters to relate a new document or search term to clustered documents.
16
Illustrating Document Clustering
Clustering Points: 3204 Articles of Los Angeles Times. Similarity Measure: How many words are common in
these documents (after some word filtering).
Category TotalArticles
CorrectlyPlaced
Financial 555 364
Foreign 341 260
National 273 36
Metro 943 746
Sports 738 573
Entertainment 354 278
17
Algorithms to solve these problems
18
Classification algorithms
K-Nearest-Neighbor classifiers Naïve Bayes classifier Neural Networks Linear Discriminant Analysis (LDA) Support Vector Machines (SVM) Decision Trees Logistic Regression Graphical models
19
Regression methods
Linear Regression Ridge Regression LASSO – Least Absolute Shrinkage and
Selection Operator Neural Networks
20
Clustering algorithms
K-Means Hierarchical clustering Graph-based clustering (Spectral
clustering) Semi-supervised clustering Others
21
Challenges of Big Data
Scalability Dimensionality Complex and Heterogeneous Data Data Quality Data Ownership and Distribution Privacy Preservation
22
Our Focus
Supervised learning– Classification (support vector machine)– Regression (backpropagation neural
networks)
Before talking about the techniques, let us first understand how a learning model is evaluated.
23
Model Evaluation
Metrics for Performance Evaluation– How to evaluate the performance of a model?
Methods for Performance Evaluation– How to obtain reliable estimates?
Methods for Model Comparison– How to compare the relative performance
among competing models?
24
Metrics for Performance Evaluation
Regression– Sum of squares
– Sum of deviation
– Exponential function of the deviation
25
Metrics for Performance Evaluation
Focus on the predictive capability of a model– Rather than how fast it takes to classify or
build models, scalability, etc. Confusion Matrix:
PREDICTED CLASS
ACTUALCLASS
Class=Yes Class=No
Class=Yes a b
Class=No c d
a: TP (true positive)
b: FN (false negative)
c: FP (false positive)
d: TN (true negative)
26
Metrics for Performance Evaluation…
Most widely-used metric:
PREDICTED CLASS
ACTUALCLASS
Class=Yes Class=No
Class=Yes a(TP)
b(FN)
Class=No c(FP)
d(TN)
FNFPTNTPTNTP
dcbada
Accuracy
27
Limitation of Accuracy
Consider a 2-class problem– Number of Class 0 examples = 9990– Number of Class 1 examples = 10
If model predicts everything to be class 0, accuracy is 9990/10000 = 99.9 %– Accuracy is misleading because model does
not detect any class 1 example
28
Cost Matrix
PREDICTED CLASS
ACTUALCLASS
C(i|j) Class=Yes Class=No
Class=Yes C(Yes|Yes) C(No|Yes)
Class=No C(Yes|No) C(No|No)
C(i|j): Cost of misclassifying class j example as class i
29
Computing Cost of Classification
Cost Matrix
PREDICTED CLASS
ACTUALCLASS
C(i|j) + -
+ -1 100
- 1 0
Model M1 PREDICTED CLASS
ACTUALCLASS
+ -
+ 150 40
- 60 250
Model M2 PREDICTED CLASS
ACTUALCLASS
+ -
+ 250 45
- 5 200
Accuracy = 80%
Cost = 3910
Accuracy = 90%
Cost = 4255
30
Cost vs Accuracy
Count PREDICTED CLASS
ACTUALCLASS
Class=Yes Class=No
Class=Yes a b
Class=No c d
Cost PREDICTED CLASS
ACTUALCLASS
Class=Yes Class=No
Class=Yes p q
Class=No q p
N = a + b + c + d
Accuracy = (a + d)/N
Cost = p (a + d) + q (b + c)
= p (a + d) + q (N – a – d)
= q N – (q – p)(a + d)
= N [q – (q-p) Accuracy]
Accuracy is proportional to cost if1. C(Yes|No)=C(No|Yes) = q 2. C(Yes|Yes)=C(No|No) = p
31
Cost-Sensitive Measures
ba
aca
a
(r) Recall
(p)Precision
Precision is biased towards C(Yes|Yes) & C(Yes|No) Recall is biased towards C(Yes|Yes) & C(No|Yes)
Count PREDICTED CLASS
ACTUALCLASS
Class=Yes
Class=No
Class=Yes
a b
Class=No
c d
A model that declares every record to be the positive class: b = d = 0
A model that assigns a positive class to the (sure) test record: c is small
Recall is high
Precision is high
32
Cost-Sensitive Measures (Cont’d)
cbaa
prrp
baa
caa
222
(F) measure-F
(r) Recall
(p)Precision
F-measure is biased towards all except C(No|No)
dwcwbwawdwaw
4321
41Accuracy Weighted
Count PREDICTED CLASS
ACTUALCLASS
Class=Yes
Class=No
Class=Yes
a b
Class=No
c d
33
Model Evaluation
Metrics for Performance Evaluation– How to evaluate the performance of a model?
Methods for Performance Evaluation– How to obtain reliable estimates?
Methods for Model Comparison– How to compare the relative performance
among competing models?
34
Methods for Performance Evaluation
How to obtain a reliable estimate of performance?
Performance of a model may depend on other factors besides the learning algorithm:– Class distribution– Cost of misclassification– Size of training and test sets
35
Learning Curve
Learning curve shows how accuracy changes with varying sample size
Requires a sampling schedule for creating learning curve:
Arithmetic sampling(Langley, et al)
Geometric sampling(Provost et al)
Effect of small sample size:- Bias in the estimate- Variance of estimate
36
Methods of Estimation
Holdout– Reserve 2/3 for training and 1/3 for testing
Random subsampling– Repeated holdout
Cross validation– Partition data into k disjoint subsets– k-fold: train on k-1 partitions, test on the remaining one– Leave-one-out: k=n
Stratified sampling – oversampling vs undersampling
Bootstrap– Sampling with replacement
38
Methods of Estimation (Cont’d)
Holdout method
– Given data is randomly partitioned into two independent sets Training set (e.g., 2/3) for model construction Test set (e.g., 1/3) for accuracy estimation
– Random sampling: a variation of holdout Repeat holdout k times, accuracy = avg. of the accuracies
obtained Cross-validation (k-fold, where k = 10 is most popular)
– Randomly partition the data into k mutually exclusive subsets, each approximately equal size
– At i-th iteration, use Di as test set and others as training set
– Leave-one-out: k folds where k = # of tuples, for small sized data
– Stratified cross-validation: folds are stratified so that class dist. in each fold is approx. the same as that in the initial data
39
Methods of Estimation (Cont’d)
Bootstrap
– Works well with small data sets
– Samples the given training tuples uniformly with replacement i.e., each time a tuple is selected, it is equally likely to be selected
again and re-added to the training set
Several boostrap methods, and a common one is .632 boostrap
– Suppose we are given a data set of d examples. The data set is sampled d times, with replacement, resulting in a training set of d samples. The data points that did not make it into the training set end up forming the test set. About 63.2% of the original data will end up in the bootstrap, and the remaining 36.8% will form the test set (since (1 – 1/d)d ≈ e-1 = 0.368)
– Repeat the sampling procedure k times, overall accuracy of the model:
))(368.0)(632.0()( _1
_ settraini
k
isettesti MaccMaccMacc
40
Model Evaluation
Metrics for Performance Evaluation– How to evaluate the performance of a model?
Methods for Performance Evaluation– How to obtain reliable estimates?
Methods for Model Comparison– How to compare the relative performance
among competing models?
41
ROC (Receiver Operating Characteristic)
Developed in 1950s for signal detection theory to analyze noisy signals – Characterize the trade-off between positive
hits and false alarms ROC curve plots TPR (on the y-axis) against FPR
(on the x-axis) Performance of each classifier represented as a
point on the ROC curve If the classifier returns a real-valued prediction,
– changing the threshold of algorithm, sample distribution or cost matrix changes the location of the point
42
ROC Curve
At threshold t:
TP=50, FN=50, FP=12, TN=88
PREDICTED CLASS
ACTUALCLASS
Class=Yes
Class=No
Class=Yes
a(TP)
b(FN)
Class=No
c(FP)
d(TN)
TPR = TP/(TP+FN)FPR = FP/(FP+TN)
43
ROC Curve
PREDICTED CLASS
ACTUALCLASS
Class=Yes
Class=No
Class=Yes
a(TP)
b(FN)
Class=No
c(FP)
d(TN)
TPR = TP/(TP+FN)FPR = FP/(FP+TN)
(TPR,FPR): (0,0): declare everything
to be negative class
– TP=0, FP = 0
(1,1): declare everything to be positive class
– FN = 0, TN = 0
(1,0): ideal
– FN = 0, FP = 0
44
ROC Curve
(TPR,FPR): (0,0): declare everything
to be negative class (1,1): declare everything
to be positive class (1,0): ideal
Diagonal line:
– Random guessing
– Below diagonal line: prediction is opposite of the
true class
45
How to Construct an ROC curve
Instance P(+|A) True Class
1 0.95 +
2 0.93 +
3 0.87 -
4 0.85 -
5 0.85 -
6 0.85 +
7 0.76 -
8 0.53 +
9 0.43 -
10 0.25 +
• Use classifier that produces posterior probability for each test instance P(+|A)
• Sort the instances according to P(+|A) in decreasing order
• Apply threshold at each unique value of P(+|A)
• Count the number of TP, FP,
TN, FN at each threshold
• TP rate, TPR = TP/(TP+FN)
• FP rate, FPR = FP/(FP + TN)
46
How to Construct an ROC curve
Instance P(+|A) True Class
1 0.95 +
2 0.93 +
3 0.87 -
4 0.85 -
5 0.85 -
6 0.85 +
7 0.76 -
8 0.53 +
9 0.43 -
10 0.25 +
• Use classifier that produces posterior probability for each test instance P(+|A)
• Sort the instances according to P(+|A) in decreasing order
• Pick a threshold 0.85
• p>= 0.85, predicted to P
• p< 0.85, predicted to N
• TP = 3, FP=3, TN=2, FN=2
• TP rate, TPR = 3/5=60%
• FP rate, FPR = 3/5=60%
47
How to construct an ROC curve
Class + - + - - - + - + +
P 0.25 0.43 0.53 0.76 0.85 0.85 0.85 0.87 0.93 0.95 1.00
TP 5 4 4 3 3 3 3 2 2 1 0
FP 5 5 4 4 3 2 1 1 0 0 0
TN 0 0 1 1 2 3 4 4 5 5 5
FN 0 1 1 2 2 2 2 3 3 4 5
TPR 1 0.8 0.8 0.6 0.6 0.6 0.6 0.4 0.4 0.2 0
FPR 1 1 0.8 0.8 0.6 0.4 0.2 0.2 0 0 0
Threshold >=
ROC Curve:
48
Using ROC for Model Comparison
No model consistently outperforms the other M1 is better for
small FPR M2 is better for
large FPR
Area Under the ROC curve (AUC)
Ideal: Area = 1
Random guess: Area = 0.5
49
Data normalization
Example-wise normalization– Each example is normalized
and mapped to unit sphere Feature-wise normalization
– [0,1]-normalization: normalize each feature into a unit space
– Standard normalization: normalize each feature to have mean 0 and standard deviation 1
1
1
1
1