Estimators, Bias and Variance / MLE / Bayesian Statistics
Statistics for Deep Learning
Sung-Yub Kim
Dept of IE, Seoul National University
January 14, 2017
Sung-Yub Kim Statistics for Deep Learning
Point Estimation / Bias / Variance and Standard Error / Consistency
Point Estimation
Point estimation is the attempt to provide the single 'best' prediction of some quantity of interest. In general, the quantity of interest can be a single parameter or a vector of parameters in some parametric model. To distinguish estimates of parameters from their true value, our convention is to denote a point estimate of a parameter θ by θ̂. A good estimator is a function whose output is close to the true underlying θ that generated the training data.
Function Estimation
Sometimes we are interested in performing function approximation: here we try to predict a variable y given an input x. Therefore, we may assume that
y = f(x) + ε (1)
Bias
The bias of an estimator is defined as
bias(θ̂m) = E[θ̂m] − θ (2)
Unbiased
An estimator θ̂m is said to be unbiased if
bias(θ̂m) = 0, (3)
which implies that E[θ̂m] = θ.
Asymptotically Unbiased
An estimator θ̂m is said to be asymptotically unbiased if
lim_{m→∞} bias(θ̂m) = 0, (4)
which implies that lim_{m→∞} E[θ̂m] = θ.
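As a quick numerical illustration (a minimal NumPy sketch; the Gaussian data, true mean, and sample sizes are arbitrary choices, not from the slides), the sample mean is an unbiased estimator: averaging it over many independently drawn training sets recovers the true θ.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 2.0          # true mean of the (assumed) data-generating distribution
m, trials = 50, 20000

# Draw many independent training sets of size m and compute the
# sample mean (the point estimate theta_hat) on each of them.
estimates = rng.normal(loc=theta, scale=1.0, size=(trials, m)).mean(axis=1)

# bias(theta_hat) = E[theta_hat] - theta; for the sample mean this is ~0.
bias = estimates.mean() - theta
print(f"estimated bias of the sample mean: {bias:.4f}")
```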
Variance
The variance of an estimator is simply the variance
Var(θ̂), (5)
where the random variable is the training set. The square root of the variance is called the standard error, denoted SE(θ̂). The variance of an estimator provides a measure of how we would expect the estimate we compute from data to vary as we independently resample the dataset from the underlying data-generating process.
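This resampling view can be checked directly. In the sketch below (the distribution and sizes are illustrative assumptions), the empirical spread of the sample mean across resampled datasets matches the analytic standard error SE = σ/√m.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, m, trials = 2.0, 100, 10000

# Independently resample the training set many times and watch
# how the estimate (the sample mean) varies across datasets.
means = rng.normal(0.0, sigma, size=(trials, m)).mean(axis=1)

empirical_se = means.std()
analytic_se = sigma / np.sqrt(m)   # standard error of the sample mean
print(empirical_se, analytic_se)
```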
Bias-Variance Tradeoff
The relation between bias and variance can be shown in terms of the mean squared error (MSE) of the estimates:
MSE = E[(θ̂m − θ)²] = Bias(θ̂m)² + Var(θ̂m) (6)
Desirable estimators are those with small MSE: these are estimators that manage to keep both their bias and variance somewhat in check. Usually, generalization error is measured by the MSE.
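The decomposition in Eq. (6) can be verified numerically. The sketch below uses a deliberately biased shrinkage estimator c·x̄ (all constants are illustrative assumptions) and shows that MSE = Bias² + Var holds.

```python
import numpy as np

rng = np.random.default_rng(2)
theta, m, trials, c = 1.0, 20, 200000, 0.8

# A deliberately biased estimator: shrink the sample mean toward zero.
est = c * rng.normal(theta, 1.0, size=(trials, m)).mean(axis=1)

mse = np.mean((est - theta) ** 2)
bias_sq = (est.mean() - theta) ** 2
var = est.var()
print(mse, bias_sq + var)   # the two quantities agree
```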
Weak Consistency
Usually, we are also concerned with the behavior of an estimator as the amount of training data grows. We wish that, as the number of data points m in our dataset increases, our point estimates converge to the true values of the corresponding parameters. More formally, an estimator θ̂m of a parameter θ is weakly consistent if
plim_{m→∞} θ̂m = θ (7)
The symbol plim indicates convergence in probability, meaning that for any ε > 0, P[|θ̂m − θ| > ε] → 0 as m → ∞.
Strong Consistency
We can also define strong consistency:
P[lim_{m→∞} θ̂m = θ] = 1 (8)
This is almost-sure convergence, which can equivalently be written as P[lim inf_{m→∞} {ω ∈ Ω : |θ̂m(ω) − θ| < ε}] = 1 for all ε > 0.
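Weak consistency can be visualized by estimating P[|θ̂m − θ| > ε] for growing m. In this illustrative sketch (sample mean of Gaussian data, arbitrary choice of ε), the probability shrinks toward zero as m increases.

```python
import numpy as np

rng = np.random.default_rng(3)
theta, eps, trials = 0.0, 0.1, 5000

# Empirical estimate of P[|theta_hat_m - theta| > eps] for growing m.
probs = []
for m in (10, 100, 1000, 10000):
    est = rng.normal(theta, 1.0, size=(trials, m)).mean(axis=1)
    probs.append(np.mean(np.abs(est - theta) > eps))
print(probs)   # shrinks toward 0 as m grows
```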
Consistency and Asymptotically Unbiased
Consistency ensures that the bias induced by the estimator diminishes as the number of data examples grows. But the reverse is not true: an asymptotically unbiased estimator need not be consistent, since its estimates can remain spread out around the true value, so that P[|θ̂m − θ| > ε] stays above some positive number for a small enough ε.
MLE / Properties of MLE
Maximum Likelihood Estimation
Let pmodel(x; θ) be a parametric family of probability distributions over the same space, indexed by θ. The Maximum Likelihood Estimator (MLE) for θ is defined as
θ̂_MLE = arg max_θ pmodel(X; θ) = arg max_θ ∏_{i=1}^{m} pmodel(x^{(i)}; θ) (9)
and this maximization problem is equivalent to
arg max_θ ∑_{i=1}^{m} log pmodel(x^{(i)}; θ) = arg max_θ E_{x∼p̂data}[log pmodel(x; θ)] (10)
where p̂data is the empirical distribution. This can be interpreted as minimizing the KL divergence,
arg min_θ DKL(p̂data ‖ pmodel) = arg min_θ E_{x∼p̂data}[log p̂data(x) − log pmodel(x; θ)] (11)
since the term log p̂data(x) does not depend on θ, so this minimization problem is equivalent to
arg min_θ −E_{x∼p̂data}[log pmodel(x; θ)] (12)
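A minimal sketch of MLE (assuming a Gaussian model with known σ = 1; the grid search merely stands in for a proper optimizer): minimizing the average negative log-likelihood of Eq. (12) over candidate values of μ recovers the sample mean, which is the closed-form Gaussian MLE.

```python
import numpy as np

rng = np.random.default_rng(4)
data = rng.normal(3.0, 1.0, size=1000)

def neg_log_lik(mu, x, sigma=1.0):
    # -(1/m) * sum_i log N(x_i; mu, sigma^2)
    return 0.5 * np.mean((x - mu) ** 2) / sigma**2 + 0.5 * np.log(2 * np.pi * sigma**2)

# Grid search over candidate mu: the NLL is minimized at the sample mean.
grid = np.linspace(2.0, 4.0, 2001)
mu_mle = grid[np.argmin([neg_log_lik(mu, data) for mu in grid])]
print(mu_mle, data.mean())
```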
Minimizing the Cross Entropy is all we do
Any loss consisting of a negative log-likelihood is a cross-entropy between the empirical distribution defined by the training set and the probability distribution defined by the model. For example, MSE is the cross-entropy between the empirical distribution and a Gaussian model.
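This correspondence can be checked directly: under a Gaussian model with fixed σ = 1, the average negative log-likelihood equals ½·MSE plus a constant, so the two losses rank predictors identically (a small sketch with arbitrary synthetic data).

```python
import numpy as np

rng = np.random.default_rng(5)
y = rng.normal(size=100)
pred_a, pred_b = y + 0.1, y + 0.5   # two candidate predictors

def gaussian_nll(y, pred, sigma=1.0):
    # average negative log-likelihood under N(pred, sigma^2)
    return np.mean(0.5 * ((y - pred) / sigma) ** 2 + 0.5 * np.log(2 * np.pi * sigma**2))

def mse(y, pred):
    return np.mean((y - pred) ** 2)

# Ranking by Gaussian NLL and ranking by MSE always agree.
print(gaussian_nll(y, pred_a) < gaussian_nll(y, pred_b), mse(y, pred_a) < mse(y, pred_b))
```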
MLE is a consistent estimator.
If the true distribution pdata lies within the model family pmodel(·; θ), and pdata corresponds to exactly one value of θ, then the MLE converges in probability to that value as the number of examples grows.
Consistent estimators differ in statistical efficiency.
One consistent estimator may obtain lower generalization error for a fixed number of samples, or equivalently may require fewer examples to obtain a fixed level of generalization error.
Cramér-Rao Bound
The Cramér-Rao lower bound shows that no consistent estimator has a lower asymptotic MSE than the MLE.
Bayesian Statistics / MAP
Differences between Frequentists and Bayesians
In frequentist statistics, we make no assumptions about θ except that it is fixed. But in Bayesian statistics, we assume that θ is not fixed and instead has a distribution. We call this distribution the prior. In ML, we typically give this prior high entropy to reflect a high degree of uncertainty in the value of θ before observing any data.
Differences in estimation
Since estimation is a process of minimizing a cross-entropy, in MLE we only need to consider θ in a finite-dimensional vector space. But because Bayesian statistics assumes that θ has a distribution, we need to consider distributions over θ, which live in an infinite-dimensional vector space (a function space).
Since we can parameterize the prior and posterior distributions, we can make a single-point estimate. This is the so-called Maximum A Posteriori (MAP) estimation:
θ̂_MAP = arg max_θ p(θ|x) = arg max_θ [log p(x|θ) + log p(θ)] (13)
which can be interpreted as maximizing the sum of the standard log-likelihood and the log-prior. With a Gaussian prior on the weights, the log-prior term corresponds to weight decay.
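For a linear model with Gaussian likelihood and a Gaussian prior N(0, τ²I) on the weights, the MAP estimate has the ridge-regression closed form, which makes the weight-decay connection concrete. The sketch below uses synthetic data and an arbitrary λ; these are illustrative assumptions, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=50)

lam = 1.0  # lambda = sigma^2 / tau^2 for a N(0, tau^2 I) prior on w

# MAP estimate in closed form: (X^T X + lam I)^{-1} X^T y (ridge regression).
# The log-prior contributes the lam * I term, i.e. weight decay.
w_map = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
print(w_map)
```

The prior shrinks the weights toward zero relative to the maximum-likelihood (ordinary least squares) solution, which is exactly the effect of a weight-decay penalty.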