Presentation given in the section “Data Mining and Optimization” of the XVI computer scientists' conference “Kompiuterininkų dienos – 2013” (Computer Days – 2013), Šiauliai, 2013-09-21
MULTIDIMENSIONAL RARE EVENT PROBABILITY
ESTIMATION ALGORITHM
Ingrida Vaičiulytė
Vilnius University, Mathematics and Informatics Institute
COMPUTER DAYS – 2013, Šiauliai
Introduction
This work describes the empirical Bayesian approach applied to the estimation of multidimensional frequencies. It also introduces the Monte-Carlo Markov Chain (MCMC) procedure, which is designed for Bayesian computation. The discrete variable, the number of occurrences of rare events, is modeled with two statistical models: a normal distribution with unknown parameters (mean and variance) and a Poisson distribution.
Introduction
Let us consider a set of $K$ populations $\{A_1, A_2, \dots, A_K\}$, where each population $A_j$ consists of $N_j$ individuals, $j = 1, \dots, K$. Assume that some event (e.g., death due to some disease, an insured event) can occur in the populations under observation.
The aim
Our aim is to estimate the unknown probabilities of events $P_j^m$ when the numbers of events $Y_j^m$ in the populations are observed, $j = 1, \dots, K$, $m = 1, \dots, M$. Since the simple estimate of relative risk, $Y_j^m / N_j$, cannot be used in many cases due to great differences in the population sizes $N_j$, the empirical Bayesian approach is applied.
Poisson-Gaussian model
It is often justified to assume that the numbers of cases $Y_j^m$ follow the Poisson distribution with the parameters $\lambda_j^m = N_j P_j^m$, whose density is as follows:

$$f(Y_j^m, \lambda_j^m) = e^{-\lambda_j^m} \frac{(\lambda_j^m)^{Y_j^m}}{Y_j^m!}, \quad j = 1, \dots, K.$$
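As an illustration of the Poisson assumption, the density can be evaluated as in the minimal sketch below. The function names `poisson_pmf` and `expected_count` and the numbers in the usage lines are hypothetical and are not taken from the presentation.

```python
import math


def expected_count(n_individuals: int, probability: float) -> float:
    """Poisson parameter lambda = N * P for one population and one event type."""
    return n_individuals * probability


def poisson_pmf(y: int, lam: float) -> float:
    """Poisson probability f(y, lam) = exp(-lam) * lam**y / y!."""
    if lam == 0.0:
        # Degenerate case: with lam = 0 only y = 0 has positive probability.
        return 1.0 if y == 0 else 0.0
    # Evaluate in logarithms so that y! never has to be formed explicitly.
    return math.exp(-lam + y * math.log(lam) - math.lgamma(y + 1))


# Hypothetical example: 10 000 individuals, event probability 0.0003.
lam = expected_count(10_000, 0.0003)
print(lam, poisson_pmf(2, lam))
```

Working with `math.lgamma` instead of a factorial keeps the evaluation stable for large counts.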
Poisson-Gaussian model
The empirical Bayesian method is a two-stage procedure that depends on the prior distribution introduced in the second stage. It is of interest to consider a model in which the logits

$$\eta = \ln \frac{P}{1 - P}$$

are normally distributed with the parameters $\mu, \Sigma$.
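The logit transform and its inverse can be sketched as follows. The function names are hypothetical, and the branch on the sign of the argument is only one common way to avoid overflow in the exponential.

```python
import math


def logit(p: float) -> float:
    """Logit eta = ln(P / (1 - P)) of a probability 0 < P < 1."""
    return math.log(p / (1.0 - p))


def inv_logit(eta: float) -> float:
    """Inverse transform P = 1 / (1 + exp(-eta)), written to avoid overflow."""
    if eta >= 0.0:
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)  # eta < 0, so exp(eta) cannot overflow
    return z / (1.0 + z)


print(logit(0.003), inv_logit(logit(0.003)))
```

Rare events correspond to strongly negative logits, which is why the prior is placed on the logit scale rather than directly on the probabilities.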
Poisson-Gaussian model
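Thus the density of the logit is

$$g(\eta, \mu, \Sigma) = \frac{\exp\left(-\frac{1}{2} (\eta - \mu)^T \Sigma^{-1} (\eta - \mu)\right)}{\sqrt{(2\pi)^M |\Sigma|}}.$$

Then the rates are evaluated as a posteriori means for given $\mu, \Sigma$:

$$P_j^m = \frac{1}{D_j(\mu, \Sigma)} \int_{\mathbb{R}^M} \frac{1}{1 + e^{-\eta_m}} \prod_{l=1}^{M} f\left(Y_j^l, \frac{N_j}{1 + e^{-\eta_l}}\right) g(\eta, \mu, \Sigma)\, d\eta,$$

where

$$D_j(\mu, \Sigma) = \int_{\mathbb{R}^M} \prod_{m=1}^{M} f\left(Y_j^m, \frac{N_j}{1 + e^{-\eta_m}}\right) g(\eta, \mu, \Sigma)\, d\eta, \quad j = 1, \dots, K, \quad m = 1, \dots, M.$$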
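The a posteriori mean is a ratio of two integrals over the prior of the logits, so it can be approximated by sampling logits from the prior and weighting each draw by its Poisson likelihood. The sketch below is only an illustration under a simplifying assumption that is not in the presentation: the covariance matrix is taken to be diagonal, so the components are drawn independently. All names and the example numbers are hypothetical.

```python
import math
import random


def log_poisson(y, lam):
    """Logarithm of the Poisson probability f(y, lam)."""
    return -lam + y * math.log(lam) - math.lgamma(y + 1)


def posterior_means(y, n, mu, sd, n_samples=20000, seed=0):
    """Monte-Carlo estimate of the posterior mean probabilities of one population.

    y  : list of M observed counts, n : population size,
    mu : list of M prior means of the logits,
    sd : list of M prior standard deviations (diagonal covariance, an assumption).
    """
    rng = random.Random(seed)
    denominator = 0.0           # estimate of D_j up to the factor 1 / n_samples
    numerator = [0.0] * len(y)  # estimate of the integral in the numerator
    for _ in range(n_samples):
        eta = [rng.gauss(m, s) for m, s in zip(mu, sd)]
        p = [1.0 / (1.0 + math.exp(-e)) for e in eta]
        weight = math.exp(sum(log_poisson(ym, n * pm) for ym, pm in zip(y, p)))
        denominator += weight
        for i, pm in enumerate(p):
            numerator[i] += pm * weight
    return [value / denominator for value in numerator]


# Hypothetical population: 2 events among 1000 individuals, prior logit mean -5.
print(posterior_means([2], 1000, [-5.0], [0.5]))
```

In this example the estimate lies between the raw relative frequency 0.002 and the prior level, which is the shrinkage effect the empirical Bayesian approach relies on.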
Maximum likelihood method
In statistics, Bayesian analysis is often related to the minimization of a certain function expressed as an integral of the a posteriori density. Thus, in the empirical Bayesian approach, the unknown parameters $\mu, \Sigma$ are estimated by the maximum likelihood method. After some manipulation, we get the logarithmic likelihood function

$$L(\mu, \Sigma) = -\sum_{j=1}^{K} \ln D_j(\mu, \Sigma) = -\sum_{j=1}^{K} \ln \int_{\mathbb{R}^M} \prod_{m=1}^{M} f\left(Y_j^m, \frac{N_j}{1 + e^{-\eta_m}}\right) g(\eta, \mu, \Sigma)\, d\eta,$$

which has to be minimized to get estimates of the parameters.
Derivatives of the maximum likelihood function
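The likelihood function is differentiable many times with respect to the parameters, and its first derivatives are as follows:

$$\frac{\partial L(\mu, \Sigma)}{\partial \mu} = -\Sigma^{-1} \sum_{j=1}^{K} \frac{1}{D_j(\mu, \Sigma)} \int_{\mathbb{R}^M} (\eta - \mu) \prod_{m=1}^{M} f\left(Y_j^m, \frac{N_j}{1 + e^{-\eta_m}}\right) g(\eta, \mu, \Sigma)\, d\eta,$$

$$\frac{\partial L(\mu, \Sigma)}{\partial \Sigma} = \frac{1}{2} \sum_{j=1}^{K} \frac{1}{D_j(\mu, \Sigma)} \int_{\mathbb{R}^M} \left[\Sigma^{-1} - \Sigma^{-1} (\eta - \mu)(\eta - \mu)^T \Sigma^{-1}\right] \prod_{m=1}^{M} f\left(Y_j^m, \frac{N_j}{1 + e^{-\eta_m}}\right) g(\eta, \mu, \Sigma)\, d\eta.$$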
Poisson-Gaussian model estimates
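The maximum likelihood estimates of the parameters of the Poisson-Gaussian model are found by solving the equations obtained by setting the first derivatives equal to zero:

$$\mu = \frac{1}{K} \sum_{j=1}^{K} \frac{1}{D_j(\mu, \Sigma)} \int_{\mathbb{R}^M} \eta \prod_{m=1}^{M} f\left(Y_j^m, \frac{N_j}{1 + e^{-\eta_m}}\right) g(\eta, \mu, \Sigma)\, d\eta,$$

$$\Sigma = \frac{1}{K} \sum_{j=1}^{K} \frac{1}{D_j(\mu, \Sigma)} \int_{\mathbb{R}^M} (\eta - \mu)(\eta - \mu)^T \prod_{m=1}^{M} f\left(Y_j^m, \frac{N_j}{1 + e^{-\eta_m}}\right) g(\eta, \mu, \Sigma)\, d\eta.$$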
Poisson-Gaussian model estimates
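For instance, the “fixed point iteration” method is useful for solving these equations in order to get the maximum likelihood estimates of $\mu, \Sigma$:

$$\mu^{t+1} = \frac{1}{K} \sum_{j=1}^{K} \frac{1}{D_j(\mu^t, \Sigma^t)} \int_{\mathbb{R}^M} \eta \prod_{m=1}^{M} f\left(Y_j^m, \frac{N_j}{1 + e^{-\eta_m}}\right) g(\eta, \mu^t, \Sigma^t)\, d\eta,$$

$$\Sigma^{t+1} = \frac{1}{K} \sum_{j=1}^{K} \frac{1}{D_j(\mu^t, \Sigma^t)} \int_{\mathbb{R}^M} (\eta - \mu^t)(\eta - \mu^t)^T \prod_{m=1}^{M} f\left(Y_j^m, \frac{N_j}{1 + e^{-\eta_m}}\right) g(\eta, \mu^t, \Sigma^t)\, d\eta.$$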
MCMC algorithm
The “fixed point iteration” method can be realized by the Monte-Carlo Markov chain approach. Let $t$ chains be generated; in each chain we generate multivariate Gaussian vectors

$$\eta_{j,k} \sim N(\mu^t, \Sigma^t), \quad k = 1, \dots, N^t,$$

where $N^t$ is the Monte-Carlo sample size at the $t^{th}$ step.
MCMC algorithm
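In order to avoid computational problems when the intermediate results are very small, we have introduced the auxiliary function, in which $\hat\eta_j$ denotes a reference point of the logits,

$$r_j(\eta) = \prod_{m=1}^{M} f\left(Y_j^m, \frac{N_j}{1 + e^{-\eta_m}}\right) \Bigg/ \prod_{m=1}^{M} f\left(Y_j^m, \frac{N_j}{1 + e^{-\hat\eta_{j,m}}}\right)$$

or

$$\ln r_j(\eta) = \sum_{m=1}^{M} \left[ N_j \frac{e^{-\eta_m} - e^{-\hat\eta_{j,m}}}{\left(1 + e^{-\eta_m}\right)\left(1 + e^{-\hat\eta_{j,m}}\right)} + Y_j^m \ln \frac{1 + e^{-\hat\eta_{j,m}}}{1 + e^{-\eta_m}} \right].$$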
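A ratio of Poisson likelihoods can be evaluated directly on the logarithmic scale, where the factorials cancel and no very small product has to be formed. The sketch below illustrates this for the ratio of the likelihood at a logit vector to its value at a reference point. The function name `log_r`, the choice of reference point, and the example numbers are hypothetical.

```python
import math


def log_r(eta, eta_ref, y, n):
    """Log-ratio of Poisson likelihoods at logits eta and at a reference eta_ref.

    eta, eta_ref : lists of M logits, y : list of M counts, n : population size.
    """
    total = 0.0
    for e, e_ref, y_m in zip(eta, eta_ref, y):
        a, b = math.exp(-e), math.exp(-e_ref)
        # Difference of the Poisson parameters: n/(1+b) - n/(1+a).
        total += n * (a - b) / ((1.0 + a) * (1.0 + b))
        # Logarithm of the ratio of the Poisson parameters, times the count.
        total += y_m * math.log((1.0 + b) / (1.0 + a))
    return total


# Hypothetical counts in a population of 1000 individuals.
reference = [-3.0, -4.0, -5.0]
print(log_r([-2.8, -4.3, -5.1], reference, [47, 18, 7], 1000))
```

At the reference point itself the log-ratio is zero, so the weights derived from it stay on a convenient scale.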
MCMC algorithm
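We then get the estimates of the parameters

$$\mu^{t+1} = \frac{1}{K} \sum_{j=1}^{K} \frac{\tilde m_j^t}{\tilde D_j^t}, \qquad \Sigma^{t+1} = \frac{1}{K} \sum_{j=1}^{K} \frac{\tilde S_j^t}{\tilde D_j^t},$$

where the Monte-Carlo estimators are as follows:

$$\tilde D_j^t = \sum_{k=1}^{N^t} r_j(\eta_{j,k}), \qquad \widetilde{D2}_j^t = \sum_{k=1}^{N^t} \left( r_j(\eta_{j,k}) - \frac{\tilde D_j^t}{N^t} \right)^2,$$

$$\tilde m_j^t = \sum_{k=1}^{N^t} \eta_{j,k}\, r_j(\eta_{j,k}), \qquad \tilde S_j^t = \sum_{k=1}^{N^t} \left(\eta_{j,k} - \tilde m_j^t\right)\left(\eta_{j,k} - \tilde m_j^t\right)^T r_j(\eta_{j,k}),$$

$$\tilde p_{j,m}^t = \sum_{k=1}^{N^t} \frac{r_j(\eta_{j,k})}{1 + e^{-\eta_{j,k,m}}}.$$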
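One Monte-Carlo step of a fixed point update of the mean vector and covariance matrix can be sketched as below. This is an illustration under stated assumptions, not the author's implementation: the weights are rescaled by their maximum rather than by a reference point, the covariance is centered at the current mean, and the names `cholesky`, `log_weight` and `fixed_point_step` as well as the data in the usage lines are hypothetical.

```python
import math
import random


def cholesky(a):
    """Lower-triangular L with L L^T = a for a small positive-definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low


def log_weight(eta, y, n):
    """Poisson log-likelihood of counts y, up to the constant -sum(ln y!)."""
    total = 0.0
    for e, y_m in zip(eta, y):
        lam = n / (1.0 + math.exp(-e))
        total += -lam + y_m * math.log(lam)
    return total


def fixed_point_step(counts, sizes, mu, sigma, n_samples, rng):
    """Return updated (mu, sigma) from one weighted Monte-Carlo sweep."""
    m_dim, k_pop = len(mu), len(counts)
    low = cholesky(sigma)
    new_mu = [0.0] * m_dim
    new_sigma = [[0.0] * m_dim for _ in range(m_dim)]
    for y, n in zip(counts, sizes):
        draws, logw = [], []
        for _ in range(n_samples):
            z = [rng.gauss(0.0, 1.0) for _ in range(m_dim)]
            # eta = mu + L z is Gaussian with mean mu and covariance sigma.
            eta = [mu[i] + sum(low[i][k] * z[k] for k in range(i + 1))
                   for i in range(m_dim)]
            draws.append(eta)
            logw.append(log_weight(eta, y, n))
        top = max(logw)
        w = [math.exp(v - top) for v in logw]  # rescaled to avoid underflow
        d = sum(w)
        for eta, w_k in zip(draws, w):
            share = w_k / (d * k_pop)
            for a in range(m_dim):
                new_mu[a] += share * eta[a]
                for b in range(m_dim):
                    new_sigma[a][b] += share * (eta[a] - mu[a]) * (eta[b] - mu[b])
    return new_mu, new_sigma


# Hypothetical data: two populations of 1000 individuals, three event types.
rng = random.Random(1)
mu0 = [-3.0, -4.0, -5.0]
sigma0 = [[0.25, 0.0, 0.0], [0.0, 0.25, 0.0], [0.0, 0.0, 0.25]]
print(fixed_point_step([[47, 18, 7], [50, 17, 6]], [1000, 1000], mu0, sigma0, 2000, rng))
```

Because every update divides a weighted sum by the sum of the weights, any constant rescaling of the weights cancels, which is what makes a normalized auxiliary function safe to use.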
MCMC algorithm
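Next, the estimate of the log-likelihood function is obtained using the Monte-Carlo estimate

$$\tilde L_t = -\sum_{j=1}^{K} \ln \tilde D_j^t,$$

together with its sample variance estimate, which is computed from $\widetilde{D2}_j^t$, $\tilde D_j^t$ and $N^t$, and the estimate of the event probabilities in the populations:

$$P_{j,m}^t = \frac{\tilde p_{j,m}^t}{\tilde D_j^t}.$$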
MCMC algorithm
The Monte-Carlo chain can be terminated at the $t^{th}$ step if the estimates of two successive steps differ insignificantly. Thus, the hypothesis on the termination condition is rejected if the test statistic $H^t$, computed from the estimates of the two current steps, exceeds the Fisher quantile.
MCMC algorithm
The following rule of sample size regulation is implemented, so that large samples are taken only at the moment of making the decision on termination of the Monte-Carlo Markov chain:

$$N^{t+1} = N^t \cdot \frac{F_{v,\alpha}}{H^t},$$

where $F_{v,\alpha}$ is Fisher's quantile and $\alpha$ is the significance level.
MCMC algorithm
Applying this rule allows the sample sizes in the Monte-Carlo Markov chain to be selected rationally so as to ensure the convergence of the maximum likelihood function.
Computer simulation
Next, we used known data to construct and estimate this statistical model. To explore the approach developed, a random sample of $K = 10$ populations $\{A_1, A_2, \dots, A_K\}$ has been simulated, in which $M = 3$ events can occur. The logits of the probabilities are normally distributed with these parameters:

$$\mu = \begin{pmatrix} -3 \\ -4 \\ -5 \end{pmatrix}; \qquad \Sigma = \begin{pmatrix} 0.25 & 0 & 0 \\ 0 & 0.25 & 0 \\ 0 & 0 & 0.25 \end{pmatrix}.$$
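A data set of this kind could be simulated as in the sketch below. It assumes independent logit components with standard deviation 0.5 (a diagonal covariance with 0.25 on the diagonal) and logit means -3, -4 and -5. The population size of 1000 is a hypothetical value, since the sizes used in the experiment are not stated, and the function names are illustrative.

```python
import math
import random


def sample_poisson(lam, rng):
    """Poisson draw by Knuth's multiplication method (adequate for moderate lam)."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def simulate(k_pop=10, mu=(-3.0, -4.0, -5.0), sd=0.5, size=1000, seed=0):
    """Simulate event counts for k_pop populations and len(mu) event types."""
    rng = random.Random(seed)
    data = []
    for _ in range(k_pop):
        eta = [rng.gauss(m, sd) for m in mu]                # logits of the probabilities
        probs = [1.0 / (1.0 + math.exp(-e)) for e in eta]   # event probabilities
        data.append([sample_poisson(size * p, rng) for p in probs])
    return data


for row in simulate():
    print(row)
```

Fixing the seed makes the simulated sample reproducible, which is convenient when comparing estimates across runs.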
Computer simulation
Next, we computed the Monte-Carlo Markov chain of estimators, $t = 100$. To avoid very small or very large sample sizes, the following limits were applied: $500 \le N^k \le 17000$. The termination conditions started to be valid after $t = 6$ iterations. We obtained the following means of the parameters:
Estimates of parameters

Iteration   µ1      µ2      µ3      Log-likelihood function   Confidence interval   Sample size   Statistical hypothesis
1           -2.96   -4.29   -5.52   -62.90                    5.57                  500           9.55
2           -2.89   -4.04   -5.27   -396.58                   4.81                  500           6.18
3           -2.91   -4.03   -5.19   -420.42                   2.97                  500           3.86
4           -2.90   -4.04   -5.16   -424.87                   3.2                   500           0.35
5           -2.91   -4.04   -5.13   -428.05                   1.57                  2963          1.41
6           -2.90   -4.04   -5.14   -427.57                   1.32                  4383          0.32
7           -2.91   -4.04   -5.13   -425.54                   0.75                  13986         0.40
8           -2.91   -4.04   -5.14   -425.33                   0.75                  14345         0.40
9           -2.91   -4.04   -5.13   -425.71                   0.75                  13525         0.84
10          -2.91   -4.04   -5.13   -426.47                   0.75                  15135         0.22
Conclusions
The empirical Bayesian approach applied to the estimation of multidimensional frequencies has been described in this work. In this paper we:
• presented the iterative “fixed point iteration” method to compute the estimates;
• introduced the Monte-Carlo Markov Chain procedure with adaptive regulation of the sample size and treatment of the simulation error in a statistical manner;
• computed the empirical Bayesian estimates of the unknown parameters and of the probabilities of the events.
The approach developed can be applied in the analysis of social and medical data.
Thank you!