Nonparametric Density Estimation October 1, 2018

Nonparametric Density Estimation - Georgetown University · people.cs.georgetown.edu/~maloof/cosc575.f18/kde-slides.pdf · 2018. 10. 1.


Page 1:

Nonparametric Density Estimation

October 1, 2018

Page 2:

Introduction

- If we can't fit a distribution to our data, then we use nonparametric density estimation.

- Start with a histogram.

- But there are problems with using histograms for density estimation.

- A better method is kernel density estimation.

- Let's consider an example in which we predict whether someone has diabetes based on their glucose concentration.

- We can also use kernel density estimation with naive Bayes or other probabilistic learners.

Page 3:

Introduction

- Plot of plasma glucose concentration (GLU) for a population of women who were at least 21 years old, of Pima Indian heritage and living near Phoenix, Arizona, with no evidence of diabetes:

[Histogram: Counts vs. GLU, "No Diabetes"]

Page 4:

Introduction

- Assume we want to determine if a person's GLU is abnormal.

- The population was tested for diabetes according to World Health Organization criteria.

- The data were collected by the US National Institute of Diabetes and Digestive and Kidney Diseases.

- First, are these data distributed normally?

- No, according to a χ² test of goodness of fit.

Page 5:

Histograms

- A histogram is a first (and rough) approximation to an unknown probability density function.

- We have a sample of n observations, X_1, ..., X_i, ..., X_n.

- An important parameter is the bin width, h.

- Effectively, it determines the width of each bar.

- We can have thick bars or thin bars, obviously.

- h determines how much we smooth the data.

- Another parameter is the origin, x_0.

- x_0 determines where we start binning data.

- This obviously affects the number of points in each bin.

- We can plot a histogram as
  - the number of items in each bin or
  - the proportion of the total for each bin

Page 6:

Histograms

- We define the bins, or intervals, as

  [x_0 + mh, x_0 + (m + 1)h] for m ∈ Z

  (i.e., the integers).

- But for our purposes, it's best to plot the relative frequency

  f(x) = (1 / (nh)) × (number of X_i in the same bin as x)

- Notice that this is the density estimate for x.
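The bin-counting estimate above can be sketched directly in code. This is a minimal illustration, not from the slides; the class and method names are hypothetical:

```java
public class HistogramDensity {
    // Histogram density estimate, following the definition above:
    //   f(x) = (1 / (nh)) * (number of X_i in the same bin as x),
    // with bins [x0 + m*h, x0 + (m+1)*h) for integer m.
    public static double estimate(double x, double[] sample, double x0, double h) {
        int binOfX = (int) Math.floor((x - x0) / h);
        int count = 0;
        for (double xi : sample) {
            if ((int) Math.floor((xi - x0) / h) == binOfX) {
                count++;
            }
        }
        return count / (sample.length * h);
    }
}
```

For example, with the sample {0.5, 1.5, 1.6, 2.5}, origin 0, and h = 1, the bin [1, 2) holds two of the four points, so the estimate at x = 1.5 is 2 / (4 × 1) = 0.5.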

Page 7:

Problems with Histograms

- One problem with using histograms as an estimate of the PDF is that there can be discontinuities.

- For example, if we have a bin with no counts, then its probability is zero.

- This is also a problem "at the tails" of the distribution, the left and right sides of the histogram.

- First off, with real PDFs, there are no impossible events (i.e., events with probability zero).

- There are only events with extremely small probabilities.

- The histogram is discrete, rather than continuous, so depending on the smoothing factor, there could be large jumps in the density with very small changes in x.

- And depending on the bin width, the density may not change at all with reasonably large changes to x.

Page 8:

Kernel Density Estimator: Motivation

- Research has shown that a kernel density estimator for continuous attributes improves the performance of naive Bayes over Gaussian distributions [John and Langley, 1995].

- KDE is more expensive in time and space than a Gaussian estimator, and the result is somewhat intuitive: if the data do not follow the distributional assumptions of your model, then performance can suffer.

- With KDE, we start with a histogram, but when we estimate the density of a value, we smooth the histogram using a kernel function.

- Again, start with the histogram.

- A generalization of the histogram method is to use a function to smooth the histogram.

- We get rid of discontinuities.

- If we do it right, we get a continuous estimate of the PDF.

Page 9:

Kernel Density Estimator [McLachlan, 1992, Silverman, 1998]

- Given the sample X_i and the observation x,

  f(x) = (1 / (nh)) Σ_{i=1}^{n} K( (x − X_i) / h ),

  where h is the window width, smoothing parameter, or bandwidth.

- K is a kernel function, such that

  ∫_{−∞}^{∞} K(x) dx = 1

- One popular choice for K is the Gaussian kernel

  K(t) = (1 / √(2π)) e^{−t²/2}.

- One of the most important decisions is the bandwidth (h).

- We can just pick a number based on what looks good.
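The requirement that K integrate to 1 can be checked numerically for the Gaussian kernel. A small sketch, where the class name, the trapezoid rule, and the [−8, 8] cutoff are my choices, not the slides':

```java
public class GaussianKernel {
    // Gaussian kernel K(t) = (1 / sqrt(2*pi)) * exp(-t^2 / 2)
    public static double k(double t) {
        return Math.exp(-0.5 * t * t) / Math.sqrt(2.0 * Math.PI);
    }

    // Trapezoid-rule approximation of the integral of K over [-8, 8].
    // The tails beyond +/-8 standard deviations carry negligible mass,
    // so the result should be very close to 1.
    public static double integral() {
        double a = -8.0, b = 8.0;
        int steps = 10000;
        double dt = (b - a) / steps;
        double sum = 0.5 * (k(a) + k(b));
        for (int i = 1; i < steps; i++) {
            sum += k(a + i * dt);
        }
        return sum * dt;
    }
}
```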

Page 10:

Kernel Density Estimator

Source: https://en.wikipedia.org/wiki/Kernel_density_estimation

Page 11:

Algorithm for KDE

- Representation: The sample X_i for i = 1, ..., n.

- Learning: Add a new sample to the collection.

- Performance:

  f(x) = (1 / (nh)) Σ_{i=1}^{n} K( (x − X_i) / h ),

  where h is the window width, smoothing parameter, or bandwidth, and K is a kernel function, such as the Gaussian kernel

  K(t) = (1 / √(2π)) e^{−t²/2}.

Page 12:

Kernel Density Estimator

public double getProbability( Number x ) {

    int n = this.X.size();

    double sum = 0.0;

    for ( int i = 0; i < n; i++ ) {

        // Accumulate the kernel evaluated at the scaled distance to sample X_i,
        // matching f(x) = (1 / (nh)) * sum_i K((x - X_i) / h)
        sum += Gaussian.pdf( ( x.doubleValue() - X.get(i).doubleValue() ) / this.h );

    } // for

    return sum / ( n * this.h );

} // KDE::getProbability

Page 13:

Automatic Bandwidth Selection

- Ideally, we'd like to set h based on the data.

- This is called automatic bandwidth selection.

- Silverman's [1998] rule-of-thumb method estimates h as

  h_0 = ( 4σ⁵ / (3n) )^{1/5} ≈ 1.06 σ n^{−1/5},

  where σ is the sample standard deviation and n is the number of samples.

- Silverman's rule of thumb assumes that the kernel is Gaussian and that the underlying distribution is normal.

- This latter assumption may not be true, but we get a simple expression that evaluates in constant time, and it seems to perform well.

- Evaluating in constant time doesn't include the time it takes to compute σ, but we can compute σ as we read the samples.
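Silverman's rule is a one-liner once σ and n are known. A sketch with hypothetical names; it assumes the sample (n − 1) standard deviation, which the slides don't specify:

```java
public class SilvermanBandwidth {
    // Sample standard deviation (n - 1 in the denominator); an assumption,
    // since the slides only say "sample standard deviation".
    public static double stdDev(double[] sample) {
        double mean = 0.0;
        for (double x : sample) mean += x;
        mean /= sample.length;
        double ss = 0.0;
        for (double x : sample) ss += (x - mean) * (x - mean);
        return Math.sqrt(ss / (sample.length - 1));
    }

    // Silverman's rule of thumb: h0 = (4*sigma^5 / (3n))^(1/5),
    // approximately 1.06 * sigma * n^(-1/5).
    public static double h0(double[] sample) {
        double sigma = stdDev(sample);
        return Math.pow(4.0 * Math.pow(sigma, 5) / (3.0 * sample.length), 0.2);
    }
}
```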

Page 14:

Automatic Bandwidth Selection

- Sheather and Jones' [1991] solve-the-equation plug-in method is a bit more complicated.

- It's O(n²), and we have to solve a set of equations numerically, which could fail.

- It is regarded, both theoretically and empirically, as the best method we have.

Page 15:

Simple KDE Example

- Determine if a person's GLU is abnormal.

[Histogram: Counts vs. GLU, "No Diabetes"]

Page 16:

Simple KDE Example

- Green line: Fixed value, h = 1

- Magenta line: Sheather and Jones' method, h = 1.5

- Blue line: Silverman's method, h = 7.95

[Plot: Est. Density vs. GLU, "No Diabetes"; observations with the h = 1, Sheather (h = 1.5), and Silverman (h = 7.95) curves]

Page 17:

Simple KDE Example

- Assume h = 7.95

- f(100) = 0.018

- f(250) = 3.3 × 10⁻¹⁴

- P(0 ≤ x ≤ 100) = ∫_0^100 f(x) dx

- P(0 ≤ x ≤ 100) ≈ Σ_{x=0}^{100} f(x), approximating the integral with a sum over unit-width steps

- P(0 ≤ x ≤ 100) ≈ 0.393
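The unit-step sum can be sketched as code. A hypothetical helper that evaluates a Gaussian-kernel KDE and sums it over a range; the sample values in the test are made up for illustration, not the Pima data:

```java
public class KdeIntegral {
    // Gaussian-kernel KDE: f(x) = (1 / (nh)) * sum_i K((x - X_i) / h)
    public static double f(double x, double[] sample, double h) {
        double sum = 0.0;
        for (double xi : sample) {
            double t = (x - xi) / h;
            sum += Math.exp(-0.5 * t * t) / Math.sqrt(2.0 * Math.PI);
        }
        return sum / (sample.length * h);
    }

    // Approximate P(a <= x <= b) by summing f over unit-width steps,
    // as on the slide. Summing over a range wide enough to cover all
    // of the density should give a result close to 1.
    public static double probability(double a, double b, double[] sample, double h) {
        double p = 0.0;
        for (double x = a; x <= b; x += 1.0) {
            p += f(x, sample, h);
        }
        return p;
    }
}
```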

Page 18:

Naive Bayes with KDEs

- Assume we have GLU measurements for women with and without diabetes.

- Plot of women with diabetes:

[Histogram: Counts vs. GLU, "Diabetes"]

Page 19:

Naive Bayes with KDEs

- Plot of women without:

[Histogram: Counts vs. GLU, "No Diabetes"]

Page 20:

Naive Bayes with KDEs

- The task is to determine, given a woman's GLU measurement, if it is more likely that she has diabetes (or vice versa).

- For this, we can use Bayes' rule.

- Like before, we build a kernel density estimator for both sets of data.

Page 21:

Naive Bayes with KDEs

- Without diabetes:

[Plot: Est. Density vs. GLU, "No Diabetes"; observations with the h = 1, Sheather (h = 1.5), and Silverman (h = 7.95) curves]

- Silverman's rule of thumb gives h_0 = 7.95

Page 22:

Naive Bayes with KDEs

- With diabetes:

[Plot: Est. Density vs. GLU, "Diabetes"; observations with the h = 1, Sheather (h = 1.5), and Silverman (h = 11.77) curves]

- Silverman's rule of thumb gives h_1 = 11.77

Page 23:

Naive Bayes with KDEs

- All together:

[Plot: Est. Density vs. GLU, estimated densities for both classes]

Page 24:

Naive Bayes with KDEs

- Now that we've built these kernel density estimators, they give us P(GLU|Diabetes = true) and P(GLU|Diabetes = false).

Page 25:

Naive Bayes with KDEs

- We now need to calculate the base rate or the prior probability of each class.

- There are 355 samples of women without diabetes, and 177 samples of women with diabetes.

- Therefore,

  P(Diabetes = true) = 177 / (177 + 355) = .332

- And,

  P(Diabetes = false) = 355 / (177 + 355) = .668

- Or,

  P(Diabetes = false) = 1 − P(Diabetes = true) = 1 − .332 = .668

Page 26:

Naive Bayes with KDEs

- Bayes' rule:

  P(D|GLU) = P(D) P(GLU|D) / [ P(D) P(GLU|D) + P(¬D) P(GLU|¬D) ]

Page 27:

Naive Bayes with KDEs

- Plot of the posterior distribution:

[Plot: Probability vs. GLU, "Posterior Distribution"]

Page 28:

Naive Bayes with KDEs

- P(D|GLU = 50)?

  P(D|GLU = 50) = (.332)(2.73 × 10⁻⁵) / [ (.332)(2.73 × 10⁻⁵) + (.668)(3.39 × 10⁻⁴) ] = .0385

- P(D|GLU = 175)?

  P(D|GLU = 175) = (.332)(.009) / [ (.332)(.009) + (.668)(7.65 × 10⁻⁴) ] = .854
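These two calculations can be checked mechanically with the two-class form of Bayes' rule above. The helper class below is hypothetical; the density values are the ones read off the KDEs on this slide:

```java
public class PosteriorCheck {
    // Two-class Bayes' rule:
    //   P(D|x) = P(D) f(x|D) / (P(D) f(x|D) + P(~D) f(x|~D)),
    // where the class-conditional densities f come from the KDEs.
    public static double posterior(double priorD, double likD, double likNotD) {
        double num = priorD * likD;
        return num / (num + (1.0 - priorD) * likNotD);
    }
}
```

Plugging in the slide's numbers, posterior(.332, 2.73e-5, 3.39e-4) gives about .0385 and posterior(.332, .009, 7.65e-4) gives about .854, matching the slide.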

Page 29:

References

G. H. John and P. Langley. Estimating continuous distributions in Bayesian classifiers. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, pages 338–345, San Francisco, CA, 1995. Morgan Kaufmann.

G. J. McLachlan. Discriminant Analysis and Statistical Pattern Recognition. John Wiley & Sons, New York, NY, 1992.

S. J. Sheather and M. C. Jones. A reliable data-based bandwidth selection method for kernel density estimation. Journal of the Royal Statistical Society, Series B (Methodological), 53(3):683–690, 1991.

B. W. Silverman. Density Estimation for Statistics and Data Analysis, volume 26 of Monographs on Statistics and Applied Probability. Chapman & Hall/CRC, Boca Raton, FL, 1998.