Upload
basilio-guitierrez
View
118
Download
2
Tags:
Embed Size (px)
Citation preview
Copyright © Andrew W. Moore Slide 1
Probability Densities in Data
MiningAndrew W. Moore
ProfessorSchool of Computer ScienceCarnegie Mellon University
www.cs.cmu.edu/[email protected]
412-268-7599
Copyright © Andrew W. Moore Slide 2
Contenido• Porque son importantes.• Notacion y fundamentos de PDF
continuas.• PDFs multivariadas continuas.• Combinando variables aleatorias
discretas y continuas.
Copyright © Andrew W. Moore Slide 3
Porque son importantes?• Real Numbers occur in at least 50% of
database records• Can’t always quantize them• So need to understand how to describe
where they come from• A great way of saying what’s a
reasonable range of values• A great way of saying how multiple
attributes should reasonably co-occur
Copyright © Andrew W. Moore Slide 4
Porque son importantes?• Can immediately get us Bayes
Classifiers that are sensible with real-valued data
• You’ll need to intimately understand PDFs in order to do kernel methods, clustering with Mixture Models, analysis of variance, time series and many other things
• Will introduce us to linear and non-linear regression
Copyright © Andrew W. Moore Slide 5
A PDF of American Ages in 2000
Copyright © Andrew W. Moore Slide 6
Poblacion de PR por grupo de edad
group freq midpoint freq.rela0-4 284593 2.5 0.07375165-9 301424 7.5 0.078113310-14 305025 12.5 0.079046515-19 305577 17.5 0.079189520-24 299362 22.5 0.077578925-29 277415 27.5 0.071891430-34 262959 32.5 0.068145235-39 265154 37.5 0.068714040-44 258211 42.5 0.066914745-49 239965 47.5 0.062186350-54 233597 52.5 0.060536155-59 206552 57.5 0.053527460-64 169796 62.5 0.044002265-69 141869 67.5 0.036765070-74 112416 72.5 0.029132375-79 85137 77.5 0.022063080-84 57953 82.5 0.015018485+ 51801 87.5 0.0134241
Copyright © Andrew W. Moore Slide 7
0 20 40 60 80
0.02
0.04
0.06
0.08
pobpr$midpoint
pobp
r$fre
q.re
lapdf de la edad poblacional en PR en 2000
Copyright © Andrew W. Moore Slide 8
A PDF of American Ages in 2000
Let X be a continuous random variable.
If p(x) is a Probability Density Function for X then…
b
ax
dxxpbXaP )(
50
30age
age)age(50Age30 dpP
= 0.36
Copyright © Andrew W. Moore Slide 9
Properties of PDFs
That means…
h
hxX
hxP
xp
22)( lim
0h
b
ax
dxxpbXaP )(
)(xpxXPx
Copyright © Andrew W. Moore Slide 10
)()()]2/(2/[)()22
(2/
2/
whpwphxhxdttph
xXh
xPhx
hx
Donde x-h/2<w<x+h/2). Luego,
)()2/2/(
wph
hxXhxP
Asi p(w) tiende a p(x) cuando h tiende a cero
Copyright © Andrew W. Moore Slide 11
x
dttpxXP )()(
h
dttp
h
xXPhxXP
xd
xXdPhx
x
h
)()()(lim
)(
)(0
hxwxwph
whph
),(lim)(
lim0
Notar que p(w) tiende a p(x) cuando h tiende a 0
Funcion de distribucion acumulativa. Esta es una funcion No decreciente
Se ha mostrado que la derivada de la funcion de distribucion da la funcion de densidad
Copyright © Andrew W. Moore Slide 12
Properties of PDFs
b
ax
dxxpbXaP )(
)(xpxXPx
Therefore…
Therefore…
1)(
x
dxxp
0)(: xpx
La dcerivada de una fucnion no dcecreciente es mayor o igual que cero.
Copyright © Andrew W. Moore Slide 13
• Cual es el significado de p(x)?Si
p(5.31) = 0.06 and p(5.92) = 0.03
Entonces cuando un valor de X es muestreado de la distribucion, es dos veces mas probable que X este mas cerca a 5.31 que a 5.92.
Copyright © Andrew W. Moore Slide 14
Yet another way to view a PDF
A recipe for sampling a random age.
1. Generate a random dot from the rectangle surrounding the PDF curve. Call the dot (age,d)
2. If d < p(age) stop and return age
3. Else try again: go to Step 1.
Copyright © Andrew W. Moore Slide 15
Test your understanding• True or False:
1)(: xpx
0)(: xXPx
Copyright © Andrew W. Moore Slide 16
ExpectationsE[X] = the expected value of random variable X
= the average value we’d see if we took a very large number of random samples of X
x
dxxpx )(
Copyright © Andrew W. Moore Slide 17
ExpectationsE[X] = the expected value of random variable X
= the average value we’d see if we took a very large number of random samples of X
x
dxxpx )(
= the first moment of the shape formed by the axes and the blue curve
= the best value to choose if you must guess an unknown person’s age and you’ll be fined the square of your error
E[age]=35.897
Copyright © Andrew W. Moore Slide 18
Expectation of a function=E[f(X)] = the expected value of f(x) where x is drawn from X’s distribution.
= the average value we’d see if we took a very large number of random samples of f(X)
x
dxxpxf )()(
Note that in general:
])[()]([ XEfxfE
64.1786]age[ 2 E
62.1288])age[( 2 E
Copyright © Andrew W. Moore Slide 19
Variance2 = Var[X] = the expected squared difference between x and E[X]
x
dxxpx )()( 22
= amount you’d expect to lose if you must guess an unknown person’s age and you’ll be fined the square of your error, and assuming you play optimally
02.498]age[Var
Copyright © Andrew W. Moore Slide 20
Standard Deviation2 = Var[X] = the expected squared difference between x and E[X]
x
dxxpx )()( 22
= amount you’d expect to lose if you must guess an unknown person’s age and you’ll be fined the square of your error, and assuming you play optimally
= Standard Deviation = “typical” deviation of X from its mean
02.498]age[Var
][Var X
32.22
222 )]([)( XEXE
Copyright © Andrew W. Moore Slide 21
Estadisticas para PR• E(edad)=35.17• Var(edad)=501.16• Desv.Est(edad)=22.38
Copyright © Andrew W. Moore Slide 22
Funciones de densidad mas conocidas
Copyright © Andrew W. Moore Slide 23
Funciones de densidad mas conocidas
• La densidad uniforme o rectangular• La densidad triangular• La densidad exponencial• La densidad Gamma y la Chi-square• La densidad Beta• La densidad Normal o Gaussiana• Las densidades t y F.
Copyright © Andrew W. Moore Slide 24
La distribucion rectangular
-w/2 0 w/2
1/w
0][ XE12
]Var[2w
X
2
w|x|if0
2
w|x|if
1
)( wxp
Copyright © Andrew W. Moore Slide 25
La distribucion triangular
0
w|x|
w|x|w
xwxp
if0
if||
)( 2
6]Var[
2wX
0][ XE
w
1
w
w
Copyright © Andrew W. Moore Slide 26
The Exponential distribution
otherwise
xexpx
if0
0if1
)(/
2]Var[ X
][XE
0 20 40 60 80 100
0.0
00
.02
0.0
40
.06
0.0
80
.10
Densidad exponencial,=.1
x
0.1
* e
xp
(-0
.1 *
x)
Copyright © Andrew W. Moore Slide 27
La distribucion Normal
Estandar
2exp
2
1)(
2xxp
1]Var[ X
0][ XE
Copyright © Andrew W. Moore Slide 28
La distribucion Normal General
2
2
2
)(exp
2
1)(
x
xp
2]Var[ X
μXE ][
=100
=15
Copyright © Andrew W. Moore Slide 29
General Gaussian
2
2
2
)(exp
2
1)(
x
xp
2]Var[ X
μXE ][
=100
=15
Shorthand: We say X ~ N(,2) to mean “X is distributed as a Gaussian with parameters and 2”.
In the above figure, X ~ N(100,152)
Also known as the normal
distribution or Bell-shaped curve
Copyright © Andrew W. Moore Slide 30
The Error FunctionAssume X ~ N(0,1)
Define ERF(x) = P(X<x) = Cumulative Distribution of X
x
z
dzzpxERF )()(
x
z
dzz
2exp
2
1 2
Copyright © Andrew W. Moore Slide 31
Using The Error FunctionAssume X ~ N(,2)
P(X<x| ,2) = )(2x
ERF
Copyright © Andrew W. Moore Slide 32
The Central Limit Theorem• If (X1,X2, … Xn) are i.i.d. continuous
random variables• Then define
• As n-->infinity, p(z)--->Gaussian with mean E[Xi] and variance Var[Xi]
Somewhat of a justification for assuming Gaussian noise is common
n
iin x
nxxxfz
121
1),...,(
Copyright © Andrew W. Moore Slide 33
Estimadores de funcion de densidad
• Histograms
• K-nearest neighbors:
• Kernel density estimators
knd
kf
2)(ˆ x
nh
kxf )(ˆ h ancho de clase
dk es la distancia hasta el k-esimo vecino
Copyright © Andrew W. Moore Slide 34
Estimacion de funcion de densidad-histograma
x
Den
sity
0 2 4 6 8
0.00
0.05
0.10
0.15
0.20
0.25
> x=c( 7.3, 6.8, 7.1, 2.5, 7.9, 6.5, 4.2, 0.5, 5.6, 5.9)> hist(x,freq=F,main="Estimacion de funcion de densidad-histograma")> rug(x,col=2)
Copyright © Andrew W. Moore Slide 35
Estimacion de densidad por knn en 20 pts con k=1,3,5,7
0 2 4 6 8
01
23
4
x
fest
0 2 4 6 8
0.1
0.3
x
fest
0 2 4 6 8
0.05
0.20
x
fest
0 2 4 6 8
0.05
0.15
0.25
x
fest
Copyright © Andrew W. Moore Slide 36
Estimación por kernels de una función de densidad univariada.
En el caso univariado, el estimador por kernels de la función de densidad f(x) se obtiene de la siguiente manera. Consideremos que x1,…xn es una variable aleatoria X con función de densidad f(x), definamos la función de distribución empirica por
el cual es un estimador de la función de distribución acumulada F(x) de X. Considerando que la función de densidad f(x) es la derivada de la función de distribución F y usando aproximación para derivada se tiene que
n
xobsxFn
#)(
Copyright © Andrew W. Moore Slide 37
donde h es un valor positivo cercano a cero. Lo anterior es equivalente a la proporción de puntos en el intervalo (x-h, x+h) dividido por 2h. La ecuación anterior puede ser escrita como:
donde la función peso K está definida por 0 si |z|>1 K(z)=
1/2 si |z| 1
h
hxFhxFxf nn
2
)()()(ˆ
)(1
)(ˆ1
n
i
i
h
xxK
nhxf
Copyright © Andrew W. Moore Slide 38
Muestra: 6, 8, 9 12, 20, 25,18, 31
hhhdepuntosenproporcionf 2/)15,15()15(ˆ
32/164/28/)19,11()15(ˆ depuntosenporporcionf
]02/1002/1000)[4*8/(1)15(ˆ f
Copyright © Andrew W. Moore Slide 39
este es llamado el kernel uniforme y h es llamado el ancho de banda el cual es un parámetro de suavización que indica cuanto contribuye cada punto muestral al estimado en el punto x. En general, K y h deben satisfacer ciertas condiciones de regularidad, tales como:
K(z) debe ser acotado y absolutamente
integrable en (-,) Usualmente, pero no siempre, K(z)0 y
simétrico, luego cualquier función de densidad simétrica puede usarse como kernel.
1)( dzzK
0)(lim
nhn
Copyright © Andrew W. Moore Slide 40
Eleccion del ancho de banda h
2.006.1 snh
Donde n es el numero de datos y s la desviacion estandar de la muestra.
2.0
13 )(79.0 nQQh
Copyright © Andrew W. Moore Slide 41
EL KERNEL GAUSSIANO
En este caso el kernel representa una función peso más suave donde todos los puntos contribuyen al estimado de f(x) en x. Es decir,
)2
1exp(
2
1)( 2zzK
n
i
h
xx i
e
nhxf
1
2
)(
2
1)(ˆ
2
2
Copyright © Andrew W. Moore Slide 42
EL KERNEL TRIANGULAR K(z)=1- |z| para |z|<1, 0 en otro caso.
EL KERNEL "BIWEIGHT"
15/16(1-z2)2 para |z|<1
K(z)=
0 en otro caso
Copyright © Andrew W. Moore Slide 43
EL KERNEL EPANECHNIKOV para |z|< K(z)= 0 en otro caso
5
)5
1(54
3 2z
Copyright © Andrew W. Moore Slide 44
Estimacion de densidad en 20 pts usando kernel gaussiano con h=.5,”opt1”,”opt2”, 4
0 2 4 6 8
0.00
0.15
0.30
x
fest
0 2 4 6 8
0.02
0.08
0.14
x
fest
0 2 4 6 8
0.05
0.15
x
fest
0 2 4 6 8
0.04
0.07
x
fest
Copyright © Andrew W. Moore Slide 45
Variables aleatorias bidimensionales
p(x,y) = probability density of random
variables (X,Y) at location (x,y)
Copyright © Andrew W. Moore Slide 46
Estimadores de funcion de densidad bi-dimensionales
• Histogramas
• K-nearest neighbors:
• Kernel density estimators
knA
kxf )(ˆ
nA
kxf )(ˆ A area de la clase
Ak es el area ncluyendo hasta el k-esimo vecino
Copyright © Andrew W. Moore Slide 47
Estimacion de kernel bivariado
)||||
(1
)(ˆ1
2
n
i
i
hK
nhf
xtt
))]()'[((1
)(ˆ 2/11
121
ii
n
iHK
hnhf xtxtt
Sean xi=(x,y) los valores observados y t=(t1,t2) un punto del plano donde se desea estimar la densidad conjunta
2
2
2
1
0
0
h
hH
Si h1=h2=h
Copyright © Andrew W. Moore Slide 48
• (a1, a2)H-12
2
2
2
2
2
1 ||||
2
1
h
a
h
aa
a
a
2
2
1
/10
0/1
h
hH
donde
Copyright © Andrew W. Moore Slide 49
Estimacion de densidad-Kernel Gaussiano bivariado
n
i
h
ty
h
tx
e
hnhf
1
2
)(
2
)(
21 2
1)(ˆ
22
22
21
21
t
Copyright © Andrew W. Moore Slide 50
10
20
30
40
2000
3000
4000
50000 e+00
1 e-05
2 e-05
3 e-05
densidad conjunta estimada por metodo kernel
f1= kde2d(autompg1$V1, autompg1$V5,n=100)persp(f1$x,f1$y,f1$z)
Copyright © Andrew W. Moore Slide 51
mpg
wei
ght
10 20 30 40
1500
2500
3500
4500
grafica de contorno de la densidad estimada
contour(f1, levels=c(8e-6,2e-5, 2.8e-5),col=c(2,3,4), xlab="mpg",ylab="weight")
Copyright © Andrew W. Moore Slide 52
In 2 dimensions
Let X,Y be a pair of continuous random variables, and let R be some region of (X,Y) space…
Ryx
dydxyxpRYXP),(
),()),((
Copyright © Andrew W. Moore Slide 53
In 2 dimensions
Let X,Y be a pair of continuous random variables, and let R be some region of (X,Y) space…
Ryx
dydxyxpRYXP),(
),()),((
P( 20<mpg<30 and 2500<weight<3000) =
volumen under the 2-d surface within the red rectangle
Copyright © Andrew W. Moore Slide 54
In 2 dimensions
Let X,Y be a pair of continuous random variables, and let R be some region of (X,Y) space…
Ryx
dydxyxpRYXP),(
),()),((
P( [(mpg-25)/10]2 + [(weight-3300)/1500]2
< 1 ) =
volumen under the 2-d surface within the red oval
Copyright © Andrew W. Moore Slide 55
In 2 dimensions
Let X,Y be a pair of continuous random variables, and let R be some region of (X,Y) space…
Ryx
dydxyxpRYXP),(
),()),((
Take the special case of region R = “everywhere”.
Remember that with probability 1, (X,Y) will be drawn from “somewhere”.
So..
x y
dydxyxp 1),(
Copyright © Andrew W. Moore Slide 56
In 2 dimensions
Let X,Y be a pair of continuous random variables, and let R be some region of (X,Y) space…
Ryx
dydxyxpRYXP),(
),()),((
20h
2222lim h
hyY
hy
hxX
hxP
),( yxp
Copyright © Andrew W. Moore Slide 57
In m dimensions
Let (X1,X2,…Xm) be an n-tuple of continuous random variables, and let R be some region of Rm …
)),...,,(( 21 RXXXP m
Rxxx
mm
m
dxdxdxxxxp),...,,(
1221
21
,,...,),...,,(...
Copyright © Andrew W. Moore Slide 58
Independence
If X and Y are independent then knowing the value of X does not help predict the
value of Y
)()(),( :yx, iff ypxpyxpYX
mpg,weight NOT independent
Copyright © Andrew W. Moore Slide 59
Independence
If X and Y are independent then knowing the value of X does not help predict the
value of Y
)()(),( :yx, iff ypxpyxpYX
the contours say that acceleration and weight
are independent
Copyright © Andrew W. Moore Slide 60
Multivariate Expectation
xxxXμX dpE )(][
E[mpg,weight] =(24.5,2600)
The centroid of the cloud
Copyright © Andrew W. Moore Slide 61
Multivariate Expectation> f1= kde2d(autompg1$mpg, autompg1$weight,n=100)> dx=f1$x[2]-f1$x[1]> dy=f1$y[2]-f1$y[1]> dx[1] 0.379798> dy[1] 35.62626> meanmpg=sum(f1$x*f1$z)*dx*dy[1] 22.48855> meanweight=sum(f1$y*f1$z)*dx*dy[1] 2848.638>#estimated mean> mean(autompg1$weight)[1] 2977.584> mean(autompg1$mpg)[1] 23.44592
Copyright © Andrew W. Moore Slide 62
Multivariate Expectation
xxxX dpffE )()()]([
Copyright © Andrew W. Moore Slide 63
Test your understanding? ][][][ does ever) (if When :Question YEXEYXE
•All the time? Siempre
•Only when X and Y are independent?
•It can fail even if X and Y are independent?
Copyright © Andrew W. Moore Slide 64
Bivariate Expectation
dydxyxpxXE ),(][
dydxyxpyxfyxfE ),(),()],([
dydxyxpyYE ),(][
dydxyxpyxYXE ),()(][
][][][ YEXEYXE
Copyright © Andrew W. Moore Slide 65
Bivariate Covariance)])([(],Cov[ yxxy YXEYX
])[(][],Cov[ 22xxxx XEXVarXX
])[(][],Cov[ 22yyyy YEYVarYY
Copyright © Andrew W. Moore Slide 66
Bivariate Covariance)])([(],Cov[ yxxy YXEYX
])[(][],Cov[ 22xxxx XEXVarXX
])[(][],Cov[ 22yyyy YEYVarYY
then, Write
Y
X X
yxy
xyxTxx ))((E 2
2
][] [
ΣμXμXXCov
Copyright © Andrew W. Moore Slide 67
Covarianza y desviacion estandar estimadas entre mpg y weight
> cov(autompg1[,c(1,5)]) mpg weightmpg 60.91814 -5517.441weight -5517.44070 721484.709> sd(autompg1$mpg)[1] 7.805007> sd(autompg1$weight)[1] 849.4026
Copyright © Andrew W. Moore Slide 68
Covariance Intuition
E[mpg,weight] =(24.5,2600)
8mpg 8mpg
700weight
700weight
Copyright © Andrew W. Moore Slide 69
Covariance Intuition
E[mpg,weight] =(24.5,2600)
8mpg 8mpg
700weight
700weight
PrincipalEigenvector
of
Copyright © Andrew W. Moore Slide 70
Regression Line
)()/(2 x
X
xy
y xxXYE
Notice that the regression line pass trough (x,y)
Copyright © Andrew W. Moore Slide 71
Regression Line>l1=lm(weight~mpg,data=autompg1)> l1
Call:lm(formula = weight ~ mpg, data = autompg1)
Coefficients:(Intercept) mpg 5101.11 -90.57
>#slope of regression line>slope= -5517.44/60.918[1] -90.571
Copyright © Andrew W. Moore Slide 72
Primer Principal component> a=cov(autompg1[,c(1,5)])> eigen(a)$values[1] 721526.90386 18.72329
$vectors [,1] [,2][1,] -0.007647317 0.999970759[2,] 0.999970759 0.007647317
#slope of primer principal component
> .99997/-.00764[1] –130.8861
Copyright © Andrew W. Moore Slide 73
Covariance Fun Facts
yxy
xyxTxx ))((E 2
2
][] [
ΣμXμXXCov
•True or False: If xy = 0 then X and Y are independent. False
•True or False: If X and Y are independent then xy = 0. True
•True or False: If xy = x y then X and Y are deterministically related. True
•True or False: If X and Y are deterministically related then xy = x y.
false
How could you prove or disprove these?
Copyright © Andrew W. Moore Slide 74
Test your understanding? ][][][ does ever) (if When :Question YVarXVarYXVar
•All the time?
•Only when X and Y are independent? Cierto
•It can fail even if X and Y are independent?
Copyright © Andrew W. Moore Slide 75
Marginal Distributions
y
dyyxpxp ),()(
Copyright © Andrew W. Moore Slide 76
Conditional Distributions
yYX
yxp
when of p.d.f.
)|(
)4600weight|mpg( p
)3200weight|mpg( p
)2000weight|mpg( p
Copyright © Andrew W. Moore Slide 77
Conditional Distributions
yYX
yxp
when of p.d.f.
)|(
)4600weight|mpg( p
)(
),()|(
yp
yxpyxp
Why?
Copyright © Andrew W. Moore Slide 78
Independence Revisited
It’s easy to prove that these statements are equivalent…
)()(),( :yx, iff ypxpyxpYX
)()|( :yx,
)()|( :yx,
)()(),( :yx,
ypxyp
xpyxp
ypxpyxp
Copyright © Andrew W. Moore Slide 79
More useful stuff
BayesRule
(These can all be proved from definitions on previous slides)
1)|(
x
dxyxp
)|(
)|,(),|(
zyp
zyxpzyxp
)(
)()|()|(
yp
xpxypyxp
Copyright © Andrew W. Moore Slide 80
Mixing discrete and continuous variables
h
vAh
xXh
xPvAxp
22),( lim
0h
1),(1
An
v x
dxvAxp
BayesRule
BayesRule)(
)()|()|(
AP
xpxAPAxp
)(
)()|()|(
xp
APAxpxAP
Copyright © Andrew W. Moore Slide 81
0 5 10 15
0.00
0.05
0.10
0.15
0.20
0.25
x
fest
clase 1
clase 2
Estimacion de funcion de dendidad conjunta mixta
P(educacion,salario>50k)
Copyright © Andrew W. Moore Slide 82
5 10 15
0.4
0.6
0.8
1.0
1:16
b[, 1
]
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
class 1
class 2
conditional density estimation de educacion por clase
Estimation of the posterior P(Class/Education)
Copyright © Andrew W. Moore Slide 83
Mixing discrete and continuous variables
P(EduYears,Wealthy)
Copyright © Andrew W. Moore Slide 84
Mixing discrete and continuous variables
P(EduYears,Wealthy)
P(Wealthy| EduYears)
Copyright © Andrew W. Moore Slide 85
Mixing discrete and continuous variables
Renorm
aliz
ed
Axes
P(EduYears,Wealthy)
P(Wealthy| EduYears)
P(EduYears|Wealthy)
Copyright © Andrew W. Moore Slide 86
Ejercicios
• Suppose X and Y are independent real-valued random variables distributed between 0 and 1:• What is p[min(X,Y)]? • What is E[min(X,Y)]?
• Prove that E[X] is the value u that minimizes E[(X-u)2]
• What is the value u that minimizes E[|X-u|]?