CSC314 / CSC763 Introduction to Machine Learning
COMSATS Institute of Information Technology
Dr. Adeel Nawab
More on Evaluating Hypotheses/Learning Algorithms
Lecture Outline:
• Review of Confidence Intervals for Discrete-Valued Hypotheses
• General Approach to Deriving Confidence Intervals
• Difference in Error of Two Hypotheses
• Comparing Learning Algorithms: k-fold cross-validation
• Disadvantages of Accuracy as a Measure: confusion matrices and ROC graphs
Reading:
Chapter 5 of Mitchell
Chapter 5 of Witten and Frank, 2nd ed.
Reference: M. H. DeGroot, Probability and Statistics, 2nd Ed., Addison-Wesley, 1986.
Review of Confidence Intervals for Discrete-Valued Hypotheses
• Last lecture we examined the question of how we might evaluate a hypothesis. More precisely:
1. Given a hypothesis h and a data sample S of n instances drawn at random according to D, what is the best estimate of the accuracy of h over future instances drawn from D?
2. What is the possible error in this accuracy estimate?
Review of Confidence Intervals for Discrete-Valued Hypotheses (cont...)
In answer, we saw that, assuming:
– the n instances in S are drawn
∗ independently of one another
∗ independently of h
∗ according to probability distribution D
– n ≥ 30
then:
1. the most probable value of errorD(h) is errorS(h)
2. with approximately N% probability, errorD(h) lies in the interval

errorS(h) ± zN √( errorS(h)(1 − errorS(h)) / n )    (1)
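As a concrete illustration, here is a minimal Python sketch of Equation (1). The helper name is ours, and the zN constants are the standard two-sided values (e.g. z95 ≈ 1.96); this is an illustrative sketch, not code from the lecture.

import math

# Two-sided z_N constants for common confidence levels
Z_N = {0.90: 1.64, 0.95: 1.96, 0.98: 2.33, 0.99: 2.58}

def error_confidence_interval(error_s, n, confidence=0.95):
    """N% confidence interval for errorD(h), given the sample error
    errorS(h) measured on n >= 30 independently drawn test instances."""
    assert n >= 30, "the Normal approximation assumes n >= 30"
    sigma = math.sqrt(error_s * (1 - error_s) / n)
    z = Z_N[confidence]
    return (error_s - z * sigma, error_s + z * sigma)

# Example: h misclassifies 12 of 40 test instances, so errorS(h) = 0.30
print(error_confidence_interval(0.30, 40))  # approx (0.158, 0.442)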
Review of Confidence Intervals for Discrete-Valued Hypotheses (cont...)
• Equation (1) was derived by observing:
– errorS(h) follows a Binomial distribution with
∗ mean value errorD(h) and
∗ standard deviation approximated by

σerrorS(h) ≈ √( errorS(h)(1 − errorS(h)) / n )
Review (cont)
– the N% confidence interval for estimating the mean of a Normal distribution of a random variable Y with observed value y can be calculated by noting that µ falls into y ± zN σ N% of the time.
Two-Sided and One-Sided Bounds
• Confidence intervals discussed so far offer two-sided bounds – above and below.
• May only be interested in a one-sided bound.
• E.g. may only care about an upper bound on the error – the answer to the question:
What is the probability that errorD(h) is at most U?
and not mind if the error is lower than our estimate.
General Approach to Deriving Confidence Intervals – Central Limit Theorem
• Consider a set of independent, identically distributed random variables Y1 . . . Yn, all governed by an arbitrary probability distribution with mean µ and finite variance σ2. Define the sample mean

Ȳ = (1/n) Σi=1..n Yi
General Approach to Deriving Confidence Intervals – Central Limit Theorem (cont...)
• Central Limit Theorem: As n → ∞, the distribution governing Ȳ approaches a Normal distribution with mean µ and variance σ2/n.
• Significance: we know the form of the distribution of the sample mean even if we do not know the distribution of the underlying Yi that are being observed.
• Useful because whenever we pick an estimator that is the mean of some sample (e.g. errorS(h)), the distribution governing the estimator can be approximated by the Normal distribution for suitably large n (typically n ≥ 30).
– e.g. use the Normal distribution to approximate the Binomial distribution that more accurately describes errorS(h).
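A quick simulation, using only the Python standard library, illustrates the theorem: sample means of a decidedly non-Normal distribution behave like draws from a Normal with mean µ and standard deviation σ/√n. All names here are illustrative.

import random
import statistics

# Underlying Yi: a skewed, non-Normal distribution
# (exponential with mean mu = 1.0 and variance sigma^2 = 1.0).
def draw_y():
    return random.expovariate(1.0)

def sample_mean(n):
    return sum(draw_y() for _ in range(n)) / n

# Distribution of the sample mean for n = 30: by the CLT it should be
# approximately Normal with mean ~1.0 and std dev ~1/sqrt(30) ≈ 0.183.
means = [sample_mean(30) for _ in range(10_000)]
print(statistics.mean(means))   # close to 1.0
print(statistics.stdev(means))  # close to 0.183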
General Approach to Deriving Confidence Intervals
• Now have a general approach to deriving confidence intervals for many estimation problems:
1. Pick the parameter p to be estimated
– e.g. errorD(h)
2. Choose an estimator
– e.g. errorS(h)
3. Determine the probability distribution that governs the estimator
– e.g. errorS(h) is governed by the Binomial distribution, approximated by the Normal when n ≥ 30
General Approach to Deriving Confidence Intervals (cont...)
4. Find the interval (Lower, Upper) such that N% of the probability mass falls in the interval
– e.g. use a table of zN values
• Things are made easier if we pick an estimator that is the mean of some sample
– then (by the Central Limit Theorem) we can ignore the probability distribution underlying the sample and approximate the distribution governing the estimator by the Normal distribution.
Example: Difference in Error of Two Hypotheses
• Suppose
– we have two hypotheses h1 and h2 for a discrete-valued target function
– h1 is tested on sample S1, h2 on S2, with S1 and S2 independently drawn from the same distribution
• Wish to estimate the difference d in true error between h1 and h2:

d ≡ errorD(h1) − errorD(h2)

• Use the 4-step generic procedure to derive a confidence interval for d:
Example: Difference in Error of Two Hypotheses (cont...)
• The natural estimator is the difference in sample errors:

d̂ ≡ errorS1(h1) − errorS2(h2)

• For large n1 and n2, d̂ is approximately Normally distributed with mean d; summing the variances of the two sample errors gives the approximate N% confidence interval

d̂ ± zN √( errorS1(h1)(1 − errorS1(h1))/n1 + errorS2(h2)(1 − errorS2(h2))/n2 )
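A hedged sketch of this interval in code; the function name and the example numbers are ours, not from the lecture.

import math

Z_N = {0.90: 1.64, 0.95: 1.96, 0.98: 2.33, 0.99: 2.58}

def error_difference_interval(e1, n1, e2, n2, confidence=0.95):
    """N% confidence interval for d = errorD(h1) - errorD(h2), given
    sample errors e1 = errorS1(h1) on n1 instances and e2 = errorS2(h2)
    on n2 instances."""
    d_hat = e1 - e2
    sigma = math.sqrt(e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2)
    z = Z_N[confidence]
    return (d_hat - z * sigma, d_hat + z * sigma)

# Example: errorS1(h1) = 0.30 on 100 instances, errorS2(h2) = 0.20 on 100
print(error_difference_interval(0.30, 100, 0.20, 100))  # approx (-0.019, 0.219)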
Comparing Learning Algorithms
• Suppose we want to compare two learning algorithms rather than two specific hypotheses.
• There is not complete agreement in the machine learning community about the best way to do this.
• One way to do this is to determine whether learning algorithm LA is better on average for learning a target function f than learning algorithm LB.
Comparing Learning Algorithms (cont...)
• By "better on average" here we mean relative performance across all training sets of size n drawn from instance distribution D.
• I.e. we want to estimate:

ES⊂D[ errorD(LA(S)) − errorD(LB(S)) ]

where L(S) is the hypothesis output by learner L using training set S,
i.e., the expected difference in true error between hypotheses output by learners LA and LB, when trained using randomly selected training sets S drawn according to distribution D.
Comparing Learning Algorithms (cont...)
• But, given limited data D0, what is a good estimator?
– could partition D0 into training set S0 and test set T0, and measure

errorT0(LA(S0)) − errorT0(LB(S0))

– even better, repeat this many times and average the results
Comparing Learning Algorithms (cont...)
• Rather than divide the limited training/testing data just once, do so multiple times and average the results – called k-fold cross-validation (a code sketch follows the steps):
1. Partition the data D0 into k disjoint test sets T1, T2, . . . , Tk of equal size, where this size is at least 30.
2. For i from 1 to k: train on Si = D0 − Ti to obtain hypotheses hA = LA(Si) and hB = LB(Si), and measure δi ≡ errorTi(hA) − errorTi(hB).
3. Return the average δ̄ ≡ (1/k) Σi δi as the estimate of the expected difference.
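A minimal sketch of the procedure. Here learn_a and learn_b are hypothetical stand-ins: each takes a training set (a list of (x, y) pairs) and returns a hypothesis h, where h(x) predicts a label.

def kfold_compare(data, learn_a, learn_b, k=10):
    """Estimate delta_bar = (1/k) sum_i [errorTi(LA(Si)) - errorTi(LB(Si))]."""
    fold_size = len(data) // k
    deltas = []
    for i in range(k):
        test = data[i * fold_size:(i + 1) * fold_size]             # Ti
        train = data[:i * fold_size] + data[(i + 1) * fold_size:]  # Si = D0 - Ti
        h_a, h_b = learn_a(train), learn_b(train)
        err_a = sum(h_a(x) != y for x, y in test) / len(test)
        err_b = sum(h_b(x) != y for x, y in test) / len(test)
        deltas.append(err_a - err_b)                               # delta_i
    return sum(deltas) / len(deltas), deltas                       # delta_bar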
Comparing Learning Algorithms – Further Considerations
• Can determine approximate N% confidence intervals for the estimator δ̄ using a statistical test called a paired t test.
• A paired test is one where hypotheses are compared over identical samples (unlike the discussion of comparing hypotheses above).
• The t test uses the t distribution (instead of the Normal distribution).
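For instance, assuming SciPy is available, a paired t test over illustrative per-fold error rates might look like the following (the numbers are made up for the example):

from scipy import stats

errors_a = [0.12, 0.15, 0.11, 0.14, 0.13, 0.16, 0.12, 0.15, 0.14, 0.13]  # LA per fold
errors_b = [0.10, 0.14, 0.12, 0.11, 0.12, 0.13, 0.11, 0.12, 0.13, 0.11]  # LB per fold

# ttest_rel pairs the two lists fold by fold (identical test samples),
# which is exactly the "paired" requirement discussed above.
t_statistic, p_value = stats.ttest_rel(errors_a, errors_b)
print(t_statistic, p_value)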
Comparing Learning Algorithms – Further Considerations (cont...)
• Another paired test which is increasingly used is the Wilcoxon signed rank test.
• It has the advantage that, unlike the t test, it does not assume any particular distribution underlying the error (i.e. it is a non-parametric test) – see the sketch below.
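The Wilcoxon test can be run the same way (again assuming SciPy; same illustrative per-fold errors as above):

from scipy import stats

errors_a = [0.12, 0.15, 0.11, 0.14, 0.13, 0.16, 0.12, 0.15, 0.14, 0.13]
errors_b = [0.10, 0.14, 0.12, 0.11, 0.12, 0.13, 0.11, 0.12, 0.13, 0.11]

# Non-parametric test on the paired per-fold differences
statistic, p_value = stats.wilcoxon(errors_a, errors_b)
print(statistic, p_value)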
• Rather than partitioning the available data D0 into k disjoint equal-sized partitions, can repeatedly randomly select a test set of n ≥ 30 examples from D0 and use the rest for training.
Comparing Learning Algorithms – Further Considerations (cont...)
• Can do this indefinitely many times, to shrink confidence intervals to arbitrary width.
• However, test sets are then no longer independently drawn from the underlying instance distribution D, since instances will recur in separate test sets.
• In k-fold cross-validation each instance is included in only one test set.
Disadvantages of Accuracy as a Measure (I)
• Accuracy is not always a good measure. Consider a two-class classification problem where 995 of 1000 instances in a test sample are negative and 5 positive:
– a classifier that always predicts negative will have an accuracy of 99.5% even though it never correctly predicts positive examples.
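The example can be reproduced in a few lines of Python:

# The 995-negative / 5-positive test sample from above
labels = ["neg"] * 995 + ["pos"] * 5
predictions = ["neg"] * 1000  # a classifier that always predicts negative

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
true_positives = sum(p == y == "pos" for p, y in zip(predictions, labels))
print(accuracy)        # 0.995 - looks excellent
print(true_positives)  # 0 - yet no positive instance is ever found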
Confusion Matrices
• Can get deeper insights into classifier behaviour by using a confusion matrix, which tabulates true positives (TP), false positives (FP), false negatives (FN) and true negatives (TN); a sketch follows.
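A minimal sketch of such a matrix for a two-class problem (the helper and the count convention are ours; it reuses the imbalanced example above):

labels = ["neg"] * 995 + ["pos"] * 5
predictions = ["neg"] * 1000  # the always-negative classifier

def confusion_matrix(predictions, labels, positive="pos"):
    """Return the counts (TP, FP, FN, TN)."""
    tp = sum(p == positive and y == positive for p, y in zip(predictions, labels))
    fp = sum(p == positive and y != positive for p, y in zip(predictions, labels))
    fn = sum(p != positive and y == positive for p, y in zip(predictions, labels))
    tn = sum(p != positive and y != positive for p, y in zip(predictions, labels))
    return tp, fp, fn, tn

tp, fp, fn, tn = confusion_matrix(predictions, labels)
print(tp, fp, fn, tn)  # 0 0 5 995: all five positives are missed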
Disadvantages of Accuracy as a Measure (II)
• Accuracy ignores the possibility of different misclassification costs
– incorrectly predicting +ve may be more or less costly than incorrectly predicting -ve
∗ not treating an ill patient vs. treating a healthy one
∗ refusing credit to a credit-worthy client vs. granting credit to a client who defaults
Disadvantages of Accuracy as a Measure (II) (cont...)
• To address this, many classifiers have parameters that can be adjusted to allow increased TPR at the cost of increased FPR, or decreased FPR at the cost of decreased TPR.
• For each such parameter setting a (TPR, FPR) pair results, and the results may be plotted on a ROC graph (ROC = "receiver operating characteristic"); a sketch follows.
Disadvantages of Accuracy as a Measure (II) (cont...)
• Provides a graphical summary of trade-offs between sensitivity and specificity.
• The term originated in signal detection theory – e.g. identifying radar signals of enemy aircraft in noisy environments.
• See Witten and Frank, Chapter 5.7 and http://en.wikipedia.org/wiki/Receiver-operating-characteristic for more.
ROC Graphs – Example (figure not reproduced)
Summary
• Confidence intervals give us a way of assessing how likely the true error of a hypothesis is to fall within an interval around the error observed over a sample.
• For many practical purposes we will be interested in a one-sided confidence interval only.
• The approach to confidence intervals for sample error may be generalized to apply to any estimator which is the mean of some sample.
– E.g. may use this approach to derive a confidence interval for the estimated difference in true error between two hypotheses.
Summary (cont...)
• Differences between learning algorithms, as opposed to hypotheses, are typically assessed by k-fold cross-validation.
• Accuracy (the complement of error) has a number of disadvantages as the sole measure of a learning algorithm.
• Deeper insight may be obtained using a confusion matrix, which allows us to distinguish numbers of false positives/negatives from true positives/negatives.
• Costs of different classification errors may be taken into account using ROC graphs.