
Statistics for Deep Learning

Sung-Yub Kim

Dept of IE, Seoul National University

January 14, 2017


Point Estimation

Point estimation is the attempt to provide the single 'best' prediction of some quantity of interest. In general, the quantity of interest can be a single parameter or a vector of parameters in some parametric model. To distinguish estimates of parameters from their true values, our convention is to denote a point estimate of a parameter θ by θ̂. A good estimator is a function whose output is close to the true underlying θ that generated the training data.

Function Estimation

Sometimes we are interested in performing function approximation. Here we try to predict a variable y given x. Therefore, we may assume that

y = f(x) + ε (1)
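
As a minimal sketch (assuming a scalar linear model f(x) = w·x and Gaussian noise, both illustrative choices not fixed by the slides), the following NumPy snippet produces a point estimate ŵ of the parameter that generated noisy training data:

import numpy as np

rng = np.random.default_rng(0)
w_true = 2.5                                  # true underlying parameter theta
x = rng.uniform(-1.0, 1.0, 100)               # inputs
y = w_true * x + 0.1 * rng.normal(size=100)   # y = f(x) + eps

# Least-squares point estimate of w (a function of the training data)
w_hat = np.sum(x * y) / np.sum(x * x)
print(w_hat)                                  # close to 2.5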


Bias

The bias of an estimator is defined as

bias(θ̂_m) = E[θ̂_m] − θ (2)

Unbiased

An estimator θ̂_m is said to be unbiased if

bias(θ̂_m) = 0 (3)

which implies that E[θ̂_m] = θ.

Asymptotically Unbiased

An estimator θ̂_m is said to be asymptotically unbiased if

lim_{m→∞} bias(θ̂_m) = 0 (4)

which implies that lim_{m→∞} E[θ̂_m] = θ.
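
As an illustration (a standard example, not taken from the slides): the 1/m sample-variance estimator of a Gaussian variance is biased, while the 1/(m−1) version is unbiased. A quick simulation makes the difference visible:

import numpy as np

rng = np.random.default_rng(0)
m, sigma2 = 10, 4.0
# Average each estimator over many independent training sets of size m
samples = rng.normal(0.0, np.sqrt(sigma2), size=(100000, m))
var_biased   = samples.var(axis=1, ddof=0).mean()  # approx sigma2 * (m-1)/m
var_unbiased = samples.var(axis=1, ddof=1).mean()  # approx sigma2
print(var_biased, var_unbiased)                    # ~3.6 vs ~4.0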


Variance

The variance of an estimator is simply the variance

Var(θ̂) (5)

where the random variable is the training set. Alternatively, the square root of the variance is called the standard error, denoted SE(θ̂). The variance of an estimator provides a measure of how we would expect the estimate we compute from data to vary as we independently resample the dataset from the underlying data-generating process.
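
A small sketch (illustrative, assuming a Gaussian data-generating process): repeatedly redrawing the training set and recomputing the sample mean shows its spread, which matches the usual standard-error formula σ/√m:

import numpy as np

rng = np.random.default_rng(0)
m, sigma = 25, 2.0
# Resample the training set many times and recompute the estimator each time
means = rng.normal(0.0, sigma, size=(50000, m)).mean(axis=1)
print(means.std())         # empirical SE of the sample mean
print(sigma / np.sqrt(m))  # theoretical SE = sigma / sqrt(m), ~0.4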


Bias-Variance Tradeoff

The relation between bias and variance can be expressed in terms of the mean squared error (MSE) of the estimates:

MSE = E[(θ̂_m − θ)²] = Bias(θ̂_m)² + Var(θ̂_m) (6)

Desirable estimators are those with small MSE; these are estimators that manage to keep both their bias and variance somewhat in check. Usually, generalization error is measured by the MSE.
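
The decomposition in (6) can be checked numerically. The sketch below (using the biased 1/m variance estimator as an arbitrary illustrative example) verifies MSE ≈ Bias² + Var:

import numpy as np

rng = np.random.default_rng(0)
m, sigma2 = 10, 4.0
samples = rng.normal(0.0, np.sqrt(sigma2), size=(200000, m))
est = samples.var(axis=1, ddof=0)     # biased variance estimator

mse  = np.mean((est - sigma2) ** 2)   # E[(theta_hat - theta)^2]
bias = est.mean() - sigma2
var  = est.var()
print(mse, bias ** 2 + var)           # the two numbers agree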


Weak Consistency

Usually, we are also concerned with the behavior of an estimator as the amount of training data grows. We wish that, as the number of data points m in our dataset increases, our point estimates converge to the true values of the corresponding parameters. More formally, an estimator of a parameter θ is weakly consistent if

plim_{m→∞} θ̂_m = θ (7)

The symbol plim indicates convergence in probability, meaning that for any ε > 0, P[|θ̂_m − θ| > ε] → 0 as m → ∞.

Strong Consistency

We can also define strong consistency as

P[lim_{m→∞} θ̂_m = θ] = 1 (8)

This equation expresses almost sure convergence: P[lim inf_{m→∞} {ω ∈ Ω : |θ̂_m(ω) − θ| < ε}] = 1 for all ε > 0.
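
A quick sketch (illustrative, using the sample mean of a unit-variance Gaussian) of weak consistency: as m grows, the estimate concentrates around the true parameter, so P[|θ̂_m − θ| > ε] shrinks toward zero:

import numpy as np

rng = np.random.default_rng(0)
theta, eps = 1.0, 0.1
for m in [10, 100, 1000, 10000]:
    # 1000 independent training sets of size m; theta_hat_m is the sample mean
    est = rng.normal(theta, 1.0, size=(1000, m)).mean(axis=1)
    print(m, np.mean(np.abs(est - theta) > eps))  # probability decreases with m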


Consistency and Asymptotic Unbiasedness

Consistency ensures that the bias induced by the estimator diminishes as the number of data examples grows. The reverse is not true, however: an asymptotically unbiased estimator need not be consistent, since its variance may not vanish, in which case for some ε > 0 the probability P[|θ̂_m − θ| > ε] stays bounded away from zero no matter how many examples we have.


Maximum Likelihood Estimation

Let pmodel(x; θ) be a parametric family of probability distributions over the same space, indexed by θ. The maximum likelihood estimator (MLE) for θ is defined as

θ̂_MLE = arg max_θ pmodel(X; θ) = arg max_θ ∏_{i=1}^m pmodel(x^(i); θ) (9)

and this maximization problem is equivalent to

arg max_θ ∑_{i=1}^m log pmodel(x^(i); θ) = arg max_θ E_{x∼p̂_data}[log pmodel(x; θ)] (10)

where p̂_data is the empirical distribution. This can in turn be interpreted as

arg min_θ DKL(p̂_data ‖ pmodel) = arg min_θ E_{x∼p̂_data}[log p̂_data(x) − log pmodel(x; θ)] (11)

and, since log p̂_data(x) does not depend on θ, this minimization problem is equivalent to

arg min_θ − E_{x∼p̂_data}[log pmodel(x; θ)] (12)
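
A minimal sketch of (9)-(10) (assuming a Gaussian model family, chosen only for illustration): maximizing the log-likelihood, i.e. minimizing the negative log-likelihood, recovers the familiar closed-form estimates:

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.normal(3.0, 2.0, size=500)           # data drawn from p_data

def nll(params):
    # Negative log-likelihood of a Gaussian model (illustrative family)
    mu, log_sigma = params
    sigma = np.exp(log_sigma)                # parameterize sigma > 0
    return np.sum(0.5 * ((x - mu) / sigma) ** 2 + log_sigma + 0.5 * np.log(2 * np.pi))

mu_hat, log_sigma_hat = minimize(nll, x0=[0.0, 0.0]).x
print(mu_hat, np.exp(log_sigma_hat))         # ~ x.mean(), ~ x.std()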


Minimizing the Cross Entropy is all we do

Any loss consisting of a negative log-likelihood is a cross-entropy between the empirical distribution defined by the training set and the probability distribution defined by the model. For example, MSE is the cross-entropy between the empirical distribution and a Gaussian model.
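
A quick sketch of this equivalence (assuming a fixed-variance Gaussian model, an illustrative choice): the Gaussian negative log-likelihood of the residuals equals the MSE up to an affine transformation, so both are minimized by the same model:

import numpy as np

rng = np.random.default_rng(0)
y_true = rng.normal(size=100)
y_pred = y_true + 0.3 * rng.normal(size=100)     # some model's predictions

mse = np.mean((y_true - y_pred) ** 2)
# Negative log-likelihood under y ~ N(y_pred, 1), i.e. sigma fixed to 1
nll = np.mean(0.5 * (y_true - y_pred) ** 2 + 0.5 * np.log(2 * np.pi))
print(nll, 0.5 * mse + 0.5 * np.log(2 * np.pi))  # identical values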


MLE is a consistent estimator.

The MLE is consistent provided that the true distribution pdata lies within the model family pmodel(·; θ) and corresponds to exactly one value of θ.

Consistent estimators differ in statistical efficiency.

This means that one consistent estimator may obtain lower generalization error for a fixed number of samples, or equivalently may require fewer examples to obtain a fixed level of generalization error.

Cramér-Rao Bound

The Cramér-Rao lower bound shows that, asymptotically, no consistent estimator has a lower MSE than the MLE.


Differences between Frequentists and Bayesians

In frequentist statistics, we make no assumptions about θ other than that it is fixed. In Bayesian statistics, by contrast, θ is not fixed but is itself random with a distribution, called the prior. In machine learning, we choose this prior to have high entropy, reflecting a high degree of uncertainty about the value of θ before observing any data.

Differences in estimation

Since estimation is a process of minimizing cross-entropy, in MLE we only need to consider values of θ in a finite-dimensional vector space. Because Bayesian statistics treats θ as having a distribution, we instead need to consider distributions over θ, which live in an infinite-dimensional vector space (a function space).


MAP Estimation

Since we can parameterize the prior and posterior distributions, we can make a single-point estimate; this is the so-called maximum a posteriori (MAP) estimate:

θ̂_MAP = arg max_θ p(θ|x) = arg max_θ [log p(x|θ) + log p(θ)] (13)

which can be interpreted as maximizing the sum of the standard log-likelihood and the log-prior (equivalently, minimizing the negative log-likelihood plus a penalty term). With a Gaussian prior on the weights, this penalty corresponds to weight decay.
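
A minimal sketch of (13) for linear regression (illustrative; the zero-mean Gaussian prior on the weights with precision lam is an assumption): the log-prior term becomes an L2 penalty, i.e. weight decay, and the MAP estimate coincides with ridge regression:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
w_true = rng.normal(size=5)
y = X @ w_true + 0.5 * rng.normal(size=200)

lam = 1.0  # precision of the Gaussian prior on w, i.e. the weight-decay strength
# arg max_w [log p(y|X, w) + log p(w)]  =  arg min_w ||Xw - y||^2 + lam * ||w||^2
w_map = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)
print(w_map)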
