Statistical Inference and Regression Analysis: Stat-GB.3302.30, Stat-UB.0015.01
Professor William Greene
Stern School of Business, IOMS Department and Department of Economics




Part 3: Estimation Theory

Immediate Reaction to the WHR Health System Performance Report, New York Times, June 21, 2000


A Model of the Best a Country Could Do vs. What It Actually Does

The following was taken from http://www.msnbc.msn.com/id/27339545/:

An msnbc.com guide to presidential polls
Why results, samples and methodology vary from survey to survey

WASHINGTON - A poll is a small sample of some larger number, an estimate of something about that larger number. For instance, what percentage of people report that they will cast their ballots for a particular candidate in an election? A sample reflects the larger number from which it is drawn. Let's say you had a perfectly mixed barrel of 1,000 tennis balls, of which 700 are white and 300 orange. You do your sample by scooping up just 50 of those tennis balls. If your barrel was perfectly mixed, you wouldn't need to count all 1,000 tennis balls; your sample would tell you that 30 percent of the balls were orange.
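To make the arithmetic concrete, here is a minimal simulation of the tennis-ball example (the barrel size, colors, and sample size follow the quote; the code is only an illustrative sketch):

    import random

    random.seed(1)
    barrel = ["orange"] * 300 + ["white"] * 700      # perfectly mixed barrel of 1,000 balls
    sample = random.sample(barrel, 50)               # scoop up 50 balls
    print(sample.count("orange") / len(sample))      # close to 0.30 in most runs, not exactly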


Use random samples and basic descriptive statistics. What is the breach rate in a pool of tens of thousands of mortgages? (Breach = an improperly underwritten or serviced or otherwise faulty mortgage.)


The forensic analysis was an examination of statistics from a random sample of 1,500 loans.


Estimation

Nonparametric population features:
  Mean - income
  Correlation - disease incidence and smoking
  Ratio - income per household member
  Proportion - proportion of ASCAP music played that is produced by Dave Matthews
  Distribution - histogram and density estimation

Parameters:
  Fitting distributions - mean and variance of the lognormal distribution of income
  Parametric models of populations - relationship of loan rates to attributes of minorities and others in the Bank of America settlement on mortgage bias

Measurements as Observations

Population Measurement Theory: Characteristics, Behavior Patterns, Choices. The theory argues that there are meaningful quantities to be statistically analyzed.

Application: Health and Income

German Health Care Usage Data: 7,293 households, observed 1984-1995.

Data downloaded from the Journal of Applied Econometrics Archive. Some variables in the file are:
  DOCVIS  = number of visits to the doctor in the observation period
  HOSPVIS = number of visits to a hospital in the observation period
  HHNINC  = household nominal monthly net income in German marks / 10,000 (4 observations with income = 0 were dropped)
  HHKIDS  = 1 if children under age 16 live in the household; 0 otherwise
  EDUC    = years of schooling
  AGE     = age in years
  PUBLIC  = decision to buy public health insurance
  HSAT    = self-assessed health status (0, 1, ..., 10)

Observed Data

Inference About the Population

Population Measurement: Characteristics, Behavior Patterns, Choices.

Classical Inference

Imprecise inference about the entire population, via sampling theory and asymptotics. The population is all 40 million German households (or all households in the entire world). The sample is the 7,293 German households observed in 1984-1995.

Bayesian Inference

Sharp, exact inference about only the sample; the posterior density is posterior to the data.

Estimation of Population Features

Estimators and Estimates
  Estimator = strategy for use of the data
  Estimate = outcome of that strategy
Sampling Distribution
  Qualities of the estimator
  Uncertainty due to random sampling

Estimation

Point Estimator: Provides a single estimate of the feature in question based on prior and sample information.

Interval Estimator: Provides a range of values that incorporates both the point estimator and the uncertainty about its ability to locate the population feature exactly.

Repeated Sampling - A Sampling Distribution

The true mean is 500. Sample means vary around 500, some quite far off. The sample mean has a sampling mean and a sampling variance. The sample mean also has a probability distribution, which looks like a normal distribution. This is a histogram for 1,000 means of samples of 20 observations from Normal[500, 100²].
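A minimal sketch of this repeated-sampling experiment (the sample size, number of replications, and the Normal[500, 100²] population follow the slide; the code is illustrative, not the original software):

    import numpy as np

    rng = np.random.default_rng(0)
    means = rng.normal(loc=500, scale=100, size=(1000, 20)).mean(axis=1)  # 1,000 samples of 20
    print(means.mean())   # sampling mean of the sample mean, close to 500
    print(means.var())    # sampling variance, close to 100**2 / 20 = 500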

Application: Credit Modeling

1992 American Express analysis of:
  Application process: acceptance or rejection; X = 0 (reject) or 1 (accept)
  Cardholder behavior:
    Loan default (D = 0 or 1)
    Average monthly expenditure (E = $/month)
    General credit usage/behavior (Y = number of charges)
13,444 applications in November, 1992

0.7809 is the true proportion (acceptance rate) in the population of 13,444 applications we are sampling from.

Estimation Concepts

Random Sampling
  Finite populations
  i.i.d. samples from an infinite population
Information
  Prior
  Sample

Properties of Estimators

Unbiasedness

The sample mean of the 100 sample estimates is 0.7844. The population mean (true proportion) is 0.7809.


Consistency

[Sampling distributions of the estimated proportion for N = 144, N = 1024, and N = 4900, each plotted on the same scale, .7 to .88.]


Competing Estimators of a Parameter

Bank costs are normally distributed with mean μ. Which is a better estimator of μ, the sample mean (11.46) or the sample median (11.27)?

Interval Estimates of the Acceptance Rate

Based on the 100 samples of 144 observations.

Methods of Estimation

Information about the source population. Approaches:
  Method of Moments
  Maximum Likelihood
  Bayesian

The Method of Moments

Estimating a Parameter

Mean of the Poisson: p(y) = exp(-λ) λ^y / y!, y = 0, 1, ...; λ > 0
  E[y] = λ, and E[(1/N) Σi yi] = λ. The sample mean is the estimator.

Mean of the Exponential: f(y) = λ exp(-λy), y > 0; λ > 0
  E[y] = 1/λ, and E[(1/N) Σi yi] = 1/λ, so 1/{(1/N) Σi yi} is the estimator of λ.
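A small numerical check of these two method-of-moments estimators (the true parameter values, 3.0 and 0.5, are arbitrary choices for this sketch):

    import numpy as np

    rng = np.random.default_rng(0)
    y_pois = rng.poisson(lam=3.0, size=10_000)             # Poisson with mean lambda = 3.0
    y_exp = rng.exponential(scale=1 / 0.5, size=10_000)    # Exponential with rate lambda = 0.5
    print(y_pois.mean())        # close to 3.0: the sample mean estimates the Poisson mean
    print(1.0 / y_exp.mean())   # close to 0.5, since E[y] = 1/lambda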

Mean and Variance of a Normal Distribution

Proportion for a Bernoulli

In the AmEx data, the true population acceptance rate is 0.7809 = θ. Y = 1 if the application is accepted, 0 if not.
E[y] = E[(1/N) Σi yi] = p(accept) = θ. The sample proportion is the estimator.


Gamma Distribution

Method of Moments

Ψ(P) = Γ'(P)/Γ(P) = d log Γ(P)/dP

Estimate One Parameter

Assume λ is known to be 0.1. Estimate P.
  E[y] = P/λ = P/.1 = 10P
  m1 = mean of y = 31.278
  The estimate of P is 31.278/10 = 3.1278. One equation in one unknown.


Application

Method of Moments Solutions

create   ; y1 = y ; y2 = log(y) ; ysq = y*y $
calc     ; m1 = xbr(y1) ; mlog = xbr(y2) ; m2 = xbr(ysq) $

Minimize ; start = 2.0, .06 ; labels = p,l
         ; fcn = (m1 - p/l)^2 + (mlog - (psi(p) - log(l)))^2 $
       P|   2.41074
       L|    .07707

Minimize ; start = 2.0, .06 ; labels = p,l
         ; fcn = (m1 - p/l)^2 + (m2 - p*(p+1)/l^2)^2 $
       P|   2.06182
       L|    .06589
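A rough Python equivalent of the first moment-matching problem above, matching E[y] = P/λ and E[log y] = Ψ(P) - log λ. The function and variable names are mine, not the course software's, and it assumes a positive data vector y; with the same income data it should reproduce the slide's solution, P ≈ 2.41 and λ ≈ .077:

    import numpy as np
    from scipy.optimize import minimize
    from scipy.special import psi   # digamma function

    def gamma_mom(y):
        m1, mlog = np.mean(y), np.mean(np.log(y))
        # Match the theoretical moments P/l and psi(P) - log(l) to the sample moments.
        obj = lambda v: (m1 - v[0] / v[1]) ** 2 + (mlog - (psi(v[0]) - np.log(v[1]))) ** 2
        return minimize(obj, x0=[2.0, 0.06], method="Nelder-Mead").x   # returns [P, lambda]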

Properties of the MoM Estimator

Unbiased? Sometimes, e.g., normal, Bernoulli and Poisson means.
Consistent? Yes, by virtue of the Slutsky theorem.
  Assumes parameters can vary continuously.
  Assumes moment functions are continuous and smooth.
Efficient? Maybe; remains to be seen. (Which pair of moments should be used for the gamma distribution?)
Sampling distribution? Generally normal, by virtue of the Lindeberg-Levy central limit theorem and the Slutsky theorem.

Estimating Sampling Variance

Exact sampling results: Poisson mean, normal mean and variance
Approximation based on linearization
Bootstrapping: discussed later with the maximum likelihood estimator

Exact Variance of MoM

Estimate a normal or Poisson mean. The estimator is the sample mean = (1/N) Σi Yi. The exact variance of the sample mean is (1/N) times the population variance.

Linearization Approach - 1 Parameter

Linearization Approach - 1 Parameter (continued)

Linearization Approach - General

Exercise: Gamma Parameters

m1 = (1/N) Σi yi   estimates  E[y]  = P/λ
m2 = (1/N) Σi yi²  estimates  E[y²] = P(P+1)/λ²

1. What is the Jacobian? (Derivatives of the moments with respect to P and λ.)
2. How do we compute the variance of m1, the variance of m2, and the covariance of m1 and m2? (The variance of m1 is 1/N times the variance of y; the variance of m2 is 1/N times the variance of y²; the covariance is 1/N times the covariance of y and y².)
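A sketch of the linearization (delta-method) computation this exercise asks for. The helper solve_mom, which maps (m1, m2) to the parameter estimates, is hypothetical here (a solver like the gamma_mom sketch above, rewritten to take the moments directly, could play that role), and the Jacobian is taken numerically rather than analytically:

    import numpy as np

    def mom_delta_variance(y, solve_mom):
        # Delta-method covariance matrix for a two-moment estimator theta = solve_mom(m1, m2).
        N = len(y)
        m = np.array([np.mean(y), np.mean(y ** 2)])
        V_m = np.cov(np.vstack([y, y ** 2])) / N     # Var(m1), Var(m2), Cov(m1, m2)
        eps = 1e-5
        J = np.empty((2, 2))                          # numerical Jacobian d(theta)/d(m)
        for j in range(2):
            m_hi, m_lo = m.copy(), m.copy()
            m_hi[j] += eps
            m_lo[j] -= eps
            J[:, j] = (solve_mom(*m_hi) - solve_mom(*m_lo)) / (2 * eps)
        return J @ V_m @ J.T                          # asymptotic covariance of the estimates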

Sufficient Statistics

Gamma Density

Rao-Blackwell Theorem

The mean squared error of an estimator based on sufficient statistics is no larger than that of one not based on sufficient statistics.

We deal in consistent estimators, so a large-sample (approximate) version of the theorem is that estimators based on sufficient statistics are more efficient than those that are not.

Maximum Likelihood

An estimation criterion, comparable to the method of moments. Several virtues: broadly, it uses all the sample and nonsample information available, and it is efficient (better than MoM in many cases).

Setting Up the MLE

The distribution of the observed random variable is written as a function of the parameter(s) to be estimated:
  P(yi|θ) = probability density of data given parameters
  L(θ|yi) = likelihood of parameter given data
The likelihood function is constructed from the density. Construction: the joint probability density function of the observed sample of data, generally the product of the densities when the data are a random sample. The estimator is chosen to maximize the likelihood of the data (essentially the probability of observing the sample in hand).

Regularity Conditions

Why? Regular MLEs have known, good properties. Nonregular estimators usually do not have known properties (good or bad).
What they are:
  1. log f(.) has three continuous derivatives with respect to the parameters.
  2. Conditions needed to obtain expectations of derivatives are met. (E.g., the range of the variable is not a function of the parameters.)
  3. The third derivative has finite expectation.
What they mean: moment conditions and convergence. We need to obtain expectations of derivatives; we need to be able to truncate Taylor series; we will use central limit theorems. The MLE exists for nonregular densities (see text), but with questionable statistical properties.

Regular Exponential Density

Exponential density: f(yi|θ) = (1/θ) exp(-yi/θ). Average time until failure, θ, of light bulbs; yi = observed life until failure.

Regularity:
(1) The range of y is 0 to ∞, free of θ.
(2) log f(yi|θ) = -log θ - yi/θ; ∂ log f(yi|θ)/∂θ = -1/θ + yi/θ². E[yi] = θ, so E[∂ log f/∂θ] = 0.
(3) ∂² log f(yi|θ)/∂θ² = 1/θ² - 2yi/θ³ has finite expectation, equal to -1/θ².
(4) ∂³ log f(yi|θ)/∂θ³ = -2/θ³ + 6yi/θ⁴ has finite expectation, equal to 4/θ³.
(5) All derivatives are continuous functions of θ.

Likelihood Function

L(θ) = Πi f(yi|θ)
MLE = the value of θ that maximizes the likelihood function.
It is generally easier to maximize the log of L; the same θ maximizes log L. In random sampling, log L = Σi log f(yi|θ).

Poisson Likelihood

log and ln both mean natural log throughout this course

The MLE

The log-likelihood function:

log L(θ|data) = Σi log f(yi|θ)

The likelihood equation(s) = first derivative(s): the first derivatives of log L equal zero at the MLE.
∂[Σi log f(yi|θ)]/∂θ evaluated at the MLE = 0. (Interchanging summation and differentiation:) Σi [∂ log f(yi|θ)/∂θ] evaluated at the MLE = 0.

Applications

Bernoulli, Exponential, Poisson, Normal, Gamma

Bernoulli

Exponential

Estimating the average time until failure, θ, of light bulbs; yi = observed life until failure.
  f(yi|θ) = (1/θ) exp(-yi/θ)
  L(θ) = Πi f(yi|θ) = θ^(-N) exp(-Σi yi/θ)
  log L(θ) = -N log(θ) - Σi yi/θ
  Likelihood equation: ∂ log L(θ)/∂θ = -N/θ + Σi yi/θ² = 0
  Solution (multiply both sides of the equation by θ²): θ̂ = Σi yi/N (the sample average estimates the population average)
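A minimal numerical check of this result (the true θ = 4 and the sample size are arbitrary; the sketch maximizes the log-likelihood numerically and compares the answer with the closed-form solution, the sample mean):

    import numpy as np
    from scipy.optimize import minimize_scalar

    rng = np.random.default_rng(0)
    y = rng.exponential(scale=4.0, size=500)                  # lifetimes with true theta = 4

    neg_loglik = lambda t: len(y) * np.log(t) + y.sum() / t   # -log L(theta)
    theta_hat = minimize_scalar(neg_loglik, bounds=(0.01, 100), method="bounded").x
    print(theta_hat, y.mean())                                # the MLE equals the sample mean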

Poisson Distribution

Normal Distribution

Gamma Distribution

Ψ(P) = Γ'(P)/Γ(P) = d log Γ(P)/dP

Gamma Application

Gamma (Loglinear) Regression Model
Dependent variable: Y
Log-likelihood function: -85.37567

        Y|  Coefficient  Std.Error      z   Prob|z|>Z*   95% Confidence Interval
  Parameters in conditional mean function
   LAMBDA|   .07707***     .02544    3.03       .0024     .02722    .12692
  Scale parameter for gamma model
  P_scale|  2.41074***     .71584    3.37       .0008    1.00757   3.81363

Same solution as the method of moments using m1 and mlog:
create   ; y1 = y ; y2 = log(y) $
calc     ; m1 = xbr(y1) ; mlog = xbr(y2) $
Minimize ; start = 2.0, .06 ; labels = p,l
         ; fcn = (m1 - p/l)^2 + (mlog - (psi(p) - log(l)))^2 $
       P|   2.41074
       L|    .07707
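A sketch of the same gamma MLE in Python, assuming a positive data vector y and the parameterization implied by the moment equations above, f(y|P, λ) = λ^P y^(P-1) exp(-λy)/Γ(P); the names and starting values are mine, not the course software's:

    import numpy as np
    from scipy.optimize import minimize
    from scipy.special import gammaln

    def gamma_mle(y):
        # Negative log-likelihood for f(y|P, lam) = lam^P y^(P-1) exp(-lam*y) / Gamma(P)
        def nll(v):
            P, lam = v
            return -np.sum(P * np.log(lam) + (P - 1) * np.log(y) - lam * y - gammaln(P))
        return minimize(nll, x0=[2.0, 0.06], method="Nelder-Mead").x   # returns [P, lambda]

Because Σ yi and Σ log yi are the sufficient statistics here, the MLE first-order conditions are exactly the m1 and mlog moment equations, which is why the two sets of estimates on this slide coincide.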

Properties of the MLE

Estimator
Regularity
Finite-sample vs. asymptotic properties
Properties of the estimator
Information used in estimation

Properties of the MLE

Sometimes unbiased, usually not
Always consistent (under regularity)
Large-sample normal distribution
Efficient
Invariant
Sufficient (uses sufficient statistics when they exist)

Unbiasedness

Usually when estimating a parameter that is the mean of the random variable:
  Normal mean
  Poisson mean
  Bernoulli probability (which is the mean)
Does not make degrees-of-freedom corrections. Almost no other cases.

Consistency

Under regularity, the MLE is consistent. Without regularity, it may be consistent, but this usually cannot be proved.
In almost all cases it is mean square consistent:
  The expectation converges to the parameter.
  The variance converges to zero.
(Proof sketched in the Rice text, pp. 275-276.)

Large Sample Distribution

The Information Equality

Deduce the Variance of the MLE

Computing the Variance of the MLE

Application: GSOEP Income

Descriptive statistics for 1 variable
Variable |     Mean   Std.Dev.   Minimum   Maximum   Cases   Missing
  HHNINC |  .355564    .166561   .030000       2.0    2698         0

Variance of the MLE

Bootstrapping

Given the sample, i = 1, ..., N:
  Sample N observations with replacement; some get picked more than once, some do not get picked. Recompute the estimate of θ.
  Repeat R times, obtaining R new estimates of θ.
  Estimate the variance with the sample variance of the R new estimates.
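A minimal sketch of this bootstrap procedure (the estimator shown is the sample mean and R = 1000 is an arbitrary choice; any estimator of θ could be substituted):

    import numpy as np

    def bootstrap_variance(y, estimator=np.mean, R=1000, seed=0):
        rng = np.random.default_rng(seed)
        N = len(y)
        # R resamples of size N drawn with replacement; re-estimate theta each time
        estimates = [estimator(rng.choice(y, size=N, replace=True)) for _ in range(R)]
        return np.var(estimates, ddof=1)    # sample variance of the R bootstrap estimates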

Bootstrap Results

Estimated Variance = .003112.

Sufficiency

If sufficient statistics exist, the MLE will be a function of them. Therefore, the MLE satisfies the Rao-Blackwell theorem (in large samples).

Efficiency

Cramer-Rao Lower Bound: the variance of a consistent, asymptotically normally distributed estimator is ≥ -1/{N E[Hi(θ)]}, where Hi(θ) is the second derivative of log f(yi|θ). The MLE achieves the C-R lower bound, so it is efficient. Implication: for normal sampling, the mean is a better estimator than the median.
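A quick simulation of that implication (the sample size and replication count are arbitrary): for normal data, the sampling variance of the median is roughly π/2 ≈ 1.57 times that of the mean, so the mean is the more efficient estimator.

    import numpy as np

    rng = np.random.default_rng(0)
    samples = rng.normal(loc=0, scale=1, size=(20_000, 100))    # 20,000 samples of N = 100
    var_mean = samples.mean(axis=1).var()
    var_median = np.median(samples, axis=1).var()
    print(var_median / var_mean)    # ratio close to pi/2, about 1.57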

Invariance

Bayesian Estimation

Philosophical underpinnings
How to combine information contained in the sample

Estimation: Assembling Information

Prior information = out-of-sample, literally prior or outside, information.
Sample information is embodied in the likelihood.
Result of the analysis: posterior belief = a blend of prior and likelihood.

Using Conditional Probabilities: Bayes Theorem

Typical application: We know P(B|A), we want P(A|B)

In drug testing: We know P(find evidence of drug use | usage) < 1. We need P(usage | find evidence of drug use).

The problem is false positives: P(find evidence of drug use | not usage) > 0.

This implies that P(usage | find evidence of drug use) ≠ 1.


Bayes Theorem

Disease Testing

Notation:
  + = test indicates disease, - = test indicates no disease
  D = presence of disease, N = absence of disease

Known data:
  P(Disease) = P(D) = .005 (fairly rare) (incidence)
  P(Test correctly indicates disease) = P(+|D) = .98 (sensitivity: correct detection of the disease)
  P(Test correctly indicates absence) = P(-|N) = .95 (specificity: correct failure to detect the disease)

Objectives: deduce these probabilities
  P(D|+) (probability the disease really is present given a positive test)
  P(N|-) (probability the disease really is absent given a negative test)

Note, P(D|+) = the probability that a patient actually has the disease when the test says they do.

More Information

Deduce: Since P(+|D) = .98, we know P(-|D) = .02, because P(-|D) + P(+|D) = 1.

[P(-|D) is the probability of a false negative.]

Deduce: Since P(-|N) = .95, we know P(+|N) = .05, because P(-|N) + P(+|N) = 1.

[P(+|N) is the probability of a false positive.]

Deduce: Since P(D)=.005, we know P(N)=.995 because P(D)+P(N)=1.

Now, Use Bayes Theorem
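The calculation this slide calls for, worked through in a short Python check; the inputs are exactly the probabilities given on the preceding slides:

    P_D, P_N = 0.005, 0.995
    P_pos_D, P_pos_N = 0.98, 0.05      # sensitivity and false-positive rate
    P_neg_N, P_neg_D = 0.95, 0.02      # specificity and false-negative rate

    P_D_pos = P_pos_D * P_D / (P_pos_D * P_D + P_pos_N * P_N)   # Bayes theorem
    P_N_neg = P_neg_N * P_N / (P_neg_N * P_N + P_neg_D * P_D)
    print(P_D_pos)   # about 0.0897: even after a positive test, the disease is unlikely
    print(P_N_neg)   # about 0.9999

Because the disease is rare, most positive tests are false positives, so P(D|+) is far below the test's .98 sensitivity.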

Bayesian Investigation

No fixed parameters: θ is a random variable.
Data are realizations of random variables; there is a marginal distribution p(data).
Parameters are part of the random state of nature: p(θ) = the distribution of θ independently of (prior to) the data.
The investigation combines sample information with prior information.
The outcome is a revision of the prior based on the observed information (the data).


Symmetrical Treatment

The likelihood is p(data|θ).
The prior distribution summarizes the nonsample information about θ in p(θ).
The joint distribution is p(data, θ): p(data, θ) = p(data|θ) p(θ) = Likelihood x Prior.
Use Bayes theorem to get p(θ|data) = the posterior distribution.

The Posterior Distribution

Priors: Where Do They Come From?

What does the prior contain?
  Informative priors: real prior information
  Noninformative priors
Mathematical complications:
  Diffuse
  Uniform
  Normal with huge variance
  Improper priors
  Conjugate priors

Application

Consider estimation of the probability θ that a production process will produce a defective product. In case 1, suppose the sampling design is to choose N = 25 items from the production line and count the number of defectives. If the probability that any item is defective is a constant θ between zero and one, then the likelihood for the sample of data is

L(θ|data) = θ^D (1 - θ)^(25-D),
where D is the number of defectives, say, 8. The maximum likelihood estimator of θ will be q = D/25 = 0.32, and the asymptotic variance of the maximum likelihood estimator is estimated by q(1 - q)/25 = 0.008704.
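A brief sketch of the Bayesian side of this example, assuming the uniform prior on θ that the following slides appear to use: with D = 8 and N = 25 the posterior is Beta(D+1, N-D+1) = Beta(9, 18), whose mean is 9/27 = 1/3, matching the posterior mean of .333333 reported a few slides below.

    from scipy import stats

    D, N = 8, 25
    posterior = stats.beta(D + 1, N - D + 1)    # flat prior => Beta(9, 18) posterior
    print(posterior.mean())                     # 0.3333...
    print(posterior.var())                      # about .0079, the posterior variance
    print(posterior.interval(0.95))             # central 95% posterior interval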

Application: Posterior Density

Posterior Moments

Mixing Prior and Sample Information

Modern Bayesian Analysis

Bayesian Estimate of Theta
Observations = 5000 (posterior mean was .333333)
  Mean               = .334017      Standard Deviation  = .086336
  Posterior Variance = .007936      Sample Variance     = .007454
  Skewness           = .248077      Kurtosis-3 (excess) = -.161478
  Minimum            = .066214      Maximum             = .653625
  .025 Percentile    = .177090      .975 Percentile     = .510028

Modern Bayesian Analysis

Multiple parameter settings: derivation of the exact form of expectations and variances for p(θ1, θ2, ..., θK | data) is hopelessly complicated even if the density is tractable.
Strategy: sample joint observations (θ1, θ2, ..., θK) from the posterior population and use marginal means, variances, quantiles, etc.
How to sample the joint observations? (Still hopelessly complicated.)

Magic: The Gibbs Sampler

Objective: sample joint observations on θ1, θ2, ..., θK from p(θ1, θ2, ..., θK | data). (Let K = 3.)
Strategy: Gibbs sampling. Derive p(θ1|θ2, θ3, data), p(θ2|θ1, θ3, data), p(θ3|θ1, θ2, data).
Gibbs cycles produce joint observations:
  0. Start θ1, θ2, θ3 at some reasonable values.
  1. Sample a draw from p(θ1|θ2, θ3, data) using the values of θ2, θ3 in hand.
  2. Sample a draw from p(θ2|θ1, θ3, data) using the draw at step 1 for θ1.
  3. Sample a draw from p(θ3|θ1, θ2, data) using the draws at steps 1 and 2.
  4. Return to step 1.
After a burn-in period (a few thousand cycles), start collecting the draws. The set of draws ultimately gives a sample from the joint distribution.
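A minimal illustration of these Gibbs cycles for a case where both conditionals are known in closed form: two parameters whose joint posterior is taken to be a bivariate normal with correlation ρ = 0.6 (the target and all settings here are assumptions chosen only to keep the sketch self-contained):

    import numpy as np

    def gibbs_bivariate_normal(rho=0.6, n_draws=5000, burn_in=2000, seed=0):
        rng = np.random.default_rng(seed)
        t1, t2 = 0.0, 0.0                 # step 0: start at some reasonable values
        draws = []
        for i in range(burn_in + n_draws):
            # step 1: draw theta1 | theta2, data ~ N(rho*theta2, 1 - rho^2)
            t1 = rng.normal(rho * t2, np.sqrt(1 - rho ** 2))
            # step 2: draw theta2 | theta1, data ~ N(rho*theta1, 1 - rho^2)
            t2 = rng.normal(rho * t1, np.sqrt(1 - rho ** 2))
            if i >= burn_in:              # collect draws only after the burn-in period
                draws.append((t1, t2))
        return np.array(draws)            # an approximate sample from the joint posterior

    draws = gibbs_bivariate_normal()
    print(draws.mean(axis=0), np.corrcoef(draws.T)[0, 1])   # means near 0, correlation near 0.6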

Methodological Issues

Priors: schizophrenia
  Uninformative priors are disingenuous.
  Informative priors are not objective.
  Using existing information?
Bernstein-von Mises theorem and likelihood estimation:
  In large samples, the likelihood dominates.
  The posterior mean will be the same as the MLE.
