Lesson 1
Chapter 1: Basic Statistical Concepts
Michael Akritas
Department of Statistics
The Pennsylvania State University
Outline
1 Populations, Samples, and Census
2 Some Sampling Concepts
3 Random Variables and Statistical Populations
4 Basic Graphics for Data Visualization
5 Proportions, Averages, Variances and Percentiles
   Proportions: Population and Sample
   Averages: Population and Sample
   Variance: Population and Sample
   Sample Percentiles and the Box Plot
Introduction to R
R is a GNU project. The GNU (recursive acronym for "GNU's Not Unix") project, sponsored by the Free Software Foundation, was launched in 1984 to develop a complete Unix-like operating system which is free software.
To find out about R go to http://www.R-project.org/.
See also the NY Times article http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html?pagewanted=all
To download R go to http://cran.r-project.org/.
You can start using R as a calculator: 2*4; 2**3; sqrt(16); sin(pi); cos(2*pi); log(exp(1)); log(10,base=10)
Try some simple commands: 1:10, seq(1,10), seq(1,10,1), seq(2,10,2). Also, rep(1,5), rep("a",5), rep(seq(1,4),2) or rep(1:4,2), c(rep(0,5),rep(1,7)).
You can store the numbers in "objects": x=c(rep(0,5),rep(1,7)). x=seq(2,10,2); sum(x); mean(x). Try also x/2; x**2; sqrt(x)
Define functions: f=function(x){x**2}. Try f(2); f(c(2,3))
Integrate a function: integrate(f, 0, 3). Try also g=function(x){x**(-2)}; integrate(g, 1, Inf)
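The commands on this slide can be collected into one short script. This is only a sketch: the object names x, f and g follow the slide, int_f and int_g are names introduced here for illustration, and ^ is used for exponentiation (R accepts ** as a synonym).

```r
# Arithmetic and built-in functions
2 * 4               # 8
2^3                 # 8
sqrt(16)            # 4
log(exp(1))         # 1
log(10, base = 10)  # 1

# Sequences and repetitions
seq(2, 10, 2)       # 2 4 6 8 10
rep(1:4, 2)         # 1 2 3 4 1 2 3 4

# Store values in an object and summarize them
x <- seq(2, 10, 2)
sum(x)              # 30
mean(x)             # 6

# Define functions and integrate them numerically
f <- function(x) x^2
g <- function(x) x^(-2)
int_f <- integrate(f, 0, 3)    # integral of x^2 over [0, 3]
int_g <- integrate(g, 1, Inf)  # integral of x^(-2) over [1, Inf)
int_f$value         # approximately 9
int_g$value         # approximately 1
```

integrate() returns a list; the numerical value of the integral is stored in its value component.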
Why Statistics?
Example (Examples of Engineering/Scientific Studies)
Comparing the compressive strength of two or more cement mixtures.
Comparing the effectiveness of three cleaning products in removing four different types of stains.
Predicting failure time on the basis of stress applied.
Assessing the effectiveness of a new traffic regulatory measure in reducing the weekly rate of accidents.
Testing a manufacturer's claim regarding a product's quality.
Studying the relation between salary increases and employee productivity in a large corporation.
These studies require Statistics due to the intrinsic variability:
The compressive strength of different preparations of the same cement mixture will differ. The figure in http://sites.stat.psu.edu/~mga/401/fig/HistComprStrCement.pdf shows 32 compressive strength measurements (in megapascals) of test cylinders (6 in. diameter, 12 in. high), using a water/cement ratio of 0.4, measured on the 28th day after they are made.
Under the same stress, two beams fail at different times.
The proportion of defective items of a certain product will differ from batch to batch.
Intrinsic variability renders the objectives of the case studies, as stated, ambiguous.
The objectives of the case studies can be made precise if stated in terms of averages or means.
Comparing the average hardness of two different cement mixtures.
Predicting the average failure time on the basis of stress applied.
Estimation of the average coefficient of thermal expansion.
Estimation of the average proportion of defective items.
Moreover, because of variability, the words "average" and "mean" have a technical meaning which can be made clear through the concepts of population and sample.
Definition
A population is a well-defined collection of objects or subjects, of relevance to a particular study, which are exposed to the same treatment or method.
Population members are called units.
The objective of a study is to investigate certain characteristic(s) of the units of the population(s) of interest.
Example (Populations and Unit Characteristics)
All water samples taken from a lake. Characteristics: Mercury concentration; Concentration of other pollutants.
All items of a certain manufactured product (that have been, or will be, produced). Characteristic: Proportion of defective items.
All students enrolled in Big Ten universities during the 2013-14 academic year. Characteristics: Favorite type of music; Political affiliation.
Two types of cleaning products. Characteristic: Cleaning effectiveness.
Populations that consist of the same type of units but differ in the treatment, or method, applied to them are called treatment populations.
Example (Treatment Populations)
The concentration of pollutants in water samples is analyzed by two different labs. Water samples sent to Lab 1 constitute population 1, and those sent to Lab 2 constitute population 2.
The time to failure of beams is studied under different stress conditions. The beams subjected to each stress condition constitute different populations.
Full (i.e. population-level) understanding of a characteristic requires the examination of all population units, i.e. a census.
For example, full understanding of the relation between salary and productivity of a corporation's employees requires obtaining these two characteristics from all employees.
However, taking a census can be time consuming and expensive: The 2000 U.S. Census cost $6.5 billion, while the 2010 Census cost $13 billion.
Moreover, a census is not feasible if the population is hypothetical or conceptual, i.e. not all members are available for examination.
Because of the above, we typically settle for examining all units in a sample, which is a subset of the population.
Due to the intrinsic variability, the sample properties/attributes of the characteristic of interest will differ from those of the population. For example:
The average mercury concentration in 25 water samples will differ from the overall mercury concentration in the lake.
The proportion in a sample of 100 PSU students who favor the use of solar energy will differ from the corresponding proportion of all PSU students.
The relation between a bear's chest girth and weight in a sample of 10 bears will differ from the corresponding relation in the entire population of 50 bears in a forested region.
The GOOD NEWS is that, if the sample is suitably drawn, then sample properties approximate the population properties.
[Figure: Population and sample relationships between chest girth and weight of black bears. Axes: Chest Girth (ticks 20 to 55) and Weight (ticks 100 to 400).]
Sampling Variability
Sample properties of the characteristic of interest also differ from sample to sample. For example:
1 The number of US citizens, in a sample of size 20, who favor expanding solar energy will (most likely) be different from the corresponding number in a different sample of 20 US citizens.
2 The average mercury concentration in two sets of 25 water samples drawn from a lake will differ.
The term sampling variability is used to describe such differences in the characteristic of interest from sample to sample.
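Sampling variability can be seen in a small simulation. The sketch below is illustrative only: the "lake" is a made-up statistical population of 10,000 mercury readings generated with rgamma(), and the seed, the distribution and the object names are assumptions introduced here, not part of the lecture.

```r
set.seed(1)  # arbitrary seed, only so the run is reproducible

# Hypothetical statistical population: 10,000 made-up mercury readings
population <- rgamma(10000, shape = 4, rate = 8)

# Two simple random samples of 25 readings each
sample_a <- sample(population, 25)
sample_b <- sample(population, 25)

mean(population)  # the population average (a parameter)
mean(sample_a)    # sample average from the first sample
mean(sample_b)    # sample average from the second sample
```

The two sample averages differ from each other and from the population average, which is the point of the second example above.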
[Figure: Illustration of Sampling Variability. Axes: Chest Girth (ticks 20 to 55) and Weight (ticks 100 to 400).]
Population-level properties/attributes of characteristic(s) of interest are called (population) parameters.
Examples of parameters include averages, proportions, percentiles, and correlation coefficients.
The corresponding sample properties/attributes of characteristics are called statistics.
Sample statistics approximate the corresponding population parameters but are not equal to them.
Statistical inference deals with the uncertainty issues which arise in approximating parameters by statistics.
The tools of statistical inference include point and interval estimation, hypothesis testing and prediction.
Example (Examples of Estimation, Hypothesis Testing and Prediction)
Estimation (point and interval) would be used in the task of estimating the coefficient of thermal expansion of a metal, or the air pollution level.
Hypothesis testing would be used for deciding whether to take corrective action to bring the air pollution level down, or whether a manufacturer's claim regarding the quality of a product is false.
Prediction arises in cases where we would like to predict the failure time on the basis of the stress applied, or the age of a tree on the basis of its trunk diameter.
For valid statistical inference the sample must be representative of the population. For example, a sample of PSU basketball players is not representative of PSU students if the characteristic of interest is height.
Typically it is hard to tell whether a sample is representative of the population. So, we define a sample to be representative if it allows for valid statistical inference (a circular definition!).
The only guarantee for that comes from the method used to select the sample (the sampling method).
The good news is that there are several sampling methods that guarantee representativeness.
Definition
A sample of size n is a simple random sample (s.r.s.) if the selection process ensures that every sample of size n has an equal chance of being selected.
In simple random sampling every member of the population has the same chance of being included in the sample. The reverse, however, is not true.
Example
To select a sample of 2 students from a population of 20 male and 20 female students, one selects at random one male and one female student. Is this a s.r.s.? (Does every student have the same chance of being included in the sample?)
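One way to work through the question in this example is to count the possible samples under each scheme. The sketch below does that counting in R; all object names are introduced here for illustration.

```r
n_male   <- 20
n_female <- 20
N <- n_male + n_female

# Simple random sampling of 2 from 40: every pair is a possible sample
srs_samples   <- choose(N, 2)   # 780 equally likely pairs
srs_inclusion <- 2 / N          # chance that a given student is included

# One male and one female chosen at random: only mixed pairs are possible
mf_samples   <- n_male * n_female  # 400 equally likely pairs
mf_inclusion <- 1 / n_male         # chance that a given student is included

c(srs_samples = srs_samples, mf_samples = mf_samples)
c(srs_inclusion = srs_inclusion, mf_inclusion = mf_inclusion)
```

Under both schemes each student has the same chance (1 in 20) of being included, yet the 380 same-gender pairs can never be selected under the second scheme. Not every sample of size 2 is equally likely there, so it is not a s.r.s.; it illustrates why the reverse statement above fails.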
Another sampling method for obtaining a representative sample is called stratified sampling.
Definition
A stratified sample consists of simple random samples from each of a number of groups (which are non-overlapping and make up the entire population) called strata.
Examples of strata include: ethnic groups, age groups, and production facilities.
If the units in the different strata differ in terms of the characteristic under study, stratified sampling is preferable to s.r.s. For example, if different production facilities differ in terms of the proportion of defective products, a stratified sample is preferable.
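The definition can be sketched in R by drawing a simple random sample within each stratum. Everything specific below is assumed for illustration: the data frame items, the three facilities and their sizes are made up, and taking 10% from each stratum (proportional allocation) is just one possible choice that the lecture does not prescribe.

```r
set.seed(2)  # arbitrary seed for reproducibility

# Hypothetical population: 1000 items from three production facilities
items <- data.frame(
  id       = 1:1000,
  facility = rep(c("A", "B", "C"), times = c(500, 300, 200))
)

# Draw a simple random sample of 10% of the items within each stratum
strata <- split(items, items$facility)
stratified <- do.call(rbind, lapply(strata, function(s) {
  s[sample(nrow(s), size = nrow(s) / 10), ]
}))

table(stratified$facility)  # 50 from A, 30 from B, 20 from C
```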
How do we select a s.r.s. of size n from a population of N units?
STEP 1: Assign to each unit a number from 1 to N.
STEP 2: Write each number on a slip of paper, place the N slips of paper in an urn, and shuffle them.
STEP 3: Select n slips of paper at random, one at a time.
Alternatively, the entire process can be performed in software like R. We will see this in the next lab session.
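As a preview, the three steps map onto R's sample() function. This is a minimal sketch; the values of N and n, the seed, and the object names are illustrative assumptions.

```r
set.seed(3)  # arbitrary seed for reproducibility

N <- 50   # population size (illustrative value)
n <- 10   # sample size (illustrative value)

# STEP 1: label the units 1, ..., N
unit_labels <- 1:N

# STEPS 2 and 3: sample() plays the role of the urn, drawing n labels
# without replacement (replace = FALSE is the default)
srs <- sample(unit_labels, size = n)
srs
```

The units whose labels appear in srs make up the simple random sample.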
Sampling without replacement simply means that a population unit can be included in a sample at most once. For example, a simple random sample is obtained by sampling without replacement: Once a unit's slip of paper is drawn, it is not placed back into the urn.
Sampling with replacement means that after a unit's slip of paper is chosen, it is put back in the urn. Thus a population unit could be included in the sample anywhere between 0 and n times. Rolling a die can be thought of as sampling with replacement from the numbers 1, 2, ..., 6.
Though conceptually undesirable, sampling with replacement is easier to work with from a mathematical point of view.
When a population is very large, sampling with and without replacement are practically equivalent.
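The two schemes differ only in the replace argument of sample(). A small sketch, with an arbitrary seed and object names chosen here for illustration:

```r
set.seed(4)  # arbitrary seed for reproducibility

# Without replacement: each of the 6 values can appear at most once
without <- sample(1:6, size = 6, replace = FALSE)

# With replacement: 10 rolls of a die, so values may repeat
rolls <- sample(1:6, size = 10, replace = TRUE)

anyDuplicated(without) == 0  # TRUE: no unit is drawn twice
table(rolls)                 # how many times each face appeared
```

With 10 rolls and only 6 faces, at least one face must repeat, which cannot happen when sampling without replacement.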
Non-representative samples arise whenever the sampling plan is such that a part, or parts, of the population of interest are either excluded from, or systematically under-represented in, the sample. This is called selection bias.
Two examples of non-representative samples are self-selected and convenience samples.
A self-selected sample often occurs when people are asked to send in their opinions in surveys or questionnaires. For example, in a political survey, often those who feel that things are running smoothly or who support an incumbent will (apathetically) not respond, whereas those activists who strongly desire change will voice their opinions.
Convenience samples are made up of the most easily accessed units. For example, randomly selecting students from your classes will not result in a sample that is representative of all PSU students, since your classes are composed mostly of students with the same major as you.
Example (The Literary Digest poll of 1936)
The magazine had been extremely successful in predicting the results of US presidential elections, but in 1936 it predicted a 3-to-2 victory for Republican Alf Landon over the Democratic incumbent Franklin Delano Roosevelt. Worth noting is that this prediction was based on 2.3 million responses (out of 10 million questionnaires sent). On the other hand, Gallup correctly predicted the outcome of that election by surveying only 50,000 people.
Variable = a Numerical Characteristic
If the characteristic of interest can be expressed as a number, e.g. thermal expansion of a metal, hardness of cement, mercury concentration, or number of accidents, it is called quantitative.
Examples of non-quantitative characteristics are gender, make of car, eye color, strength category, political affiliation. Such characteristics are called categorical or qualitative.
Because statistical procedures are applied to numerical data sets, the categories of a categorical characteristic are labeled with arbitrarily chosen numbers (e.g. 'male' = −1, 'female' = +1).
A characteristic expressed as a number is called a variable.
Types of Variables
Qualitative variables are a particular kind of discrete variable. Quantitative variables can also be discrete.
All variables expressing counts, such as the number of earthquakes, the number of fish caught, etc., are discrete.
Quantitative variables expressing measurements on a continuous scale are examples of continuous variables.
Measurements of length, strength, weight, or time to failure are examples of continuous variables.
When two or more characteristics are measured on each population unit, we have bivariate or multivariate variables.
Example of bivariate: Salary increase and productivity.
Example of multivariate: Age, income, education level.
Random Variables
When a unit is randomly sampled from a population, the value of its variable will be denoted by X (or Y, or Z, etc.).
Because of the intrinsic variability, X is not known a priori and thus it is called a random variable (r.v.).
The population from which a random variable is drawn is called the underlying population of the r.v.
The collection of the variable values of all population units is called the statistical population.
The statistical population of a r.v. is NOT the same as the set of values a variable can take.
Example
1 A list of the weight of every PSU student is the statistical population of the r.v. weight.
2 A list of 1s and 0s representing every student's opinion on whether solar energy should be expanded is the statistical population of the r.v. expressing opinion on solar energy.
Sampling from the Statistical Population
It should be intuitively clear that taking a sample of n units from some population and recording the variable of each sampled unit is equivalent to taking a sample of size n from the statistical population of the random variable.
Henceforth, the word sample will also be used to denote a sample from the statistical population. Such a sample
1 Consists of units of the statistical population, i.e. numbers.
2 The numbers are not known a priori, so they are r.v.'s.
3 A sample of size n will be denoted by $X_1, X_2, \ldots, X_n$.
Histograms and Stem and Leaf Plots
In histograms the range of the data is divided into bins, and a box is constructed above each bin.
The height of each box is the bin's frequency. Alternatively, the heights can be adjusted so the histogram's area is one.
R will automatically choose the number of bins but it also allows user-specified intervals. Moreover, R offers the option of constructing a smooth histogram.
In stem and leaf plots each observation gets split into its stem, which is the beginning digit(s), and its leaf, which is the first of the remaining digits.
They retain more information about the original data but do not offer as much flexibility in selecting the bins.
The R data set faithful
The histogram, with superimposed smooth histogram, for a sample of 272 eruption durations from the Old Faithful geyser is shown in http://stat.psu.edu/~mga/401/fig/HistOldFaith1.pdf
The stem and leaf plot for the same data set is shown in http://stat.psu.edu/~mga/401/fig/StemLeaf.pdf
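Plots of this kind can be produced with a few lines of R, since the faithful data frame ships with R. The sketch below is not necessarily how the linked figures were made: in particular, using a kernel density estimate from density() as the "smooth histogram" is an assumption, and the titles and labels are chosen here for illustration.

```r
eruptions <- faithful$eruptions  # 272 eruption durations, in minutes

# Histogram scaled to total area one, so a density curve can be overlaid
hist(eruptions, freq = FALSE,
     main = "Old Faithful eruption durations",
     xlab = "Duration (minutes)")
lines(density(eruptions))  # kernel density estimate as a smooth histogram

# Stem and leaf plot of the same data
stem(eruptions)
```

Bins can be set by the user through the breaks argument of hist(), for example breaks = seq(1.5, 5.5, by = 0.25).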
Scatterplots
The basic scatterplot is useful for exploring the relationship between two variables. An enhanced version identifies subclasses of data. See http://stat.psu.edu/~mga/401/fig/BearsChG_W_by_S.pdf
A scatterplot matrix is a matrix of scatterplots for all pairs of variables in a data set. See http://stat.psu.edu/~mga/401/fig/BearMeas_by_S.pdf. It helps identify the best single predictor of weight.
Scatterplots Continued
Scatterplots with marginal histograms show histograms of the two variables in the margins of the scatterplot. See http://stat.psu.edu/~mga/401/fig/BearMeas_with_MarginalHist.pdf
3D scatterplots are useful for exploring the relationship between three variables. For example, http://stat.psu.edu/~mga/401/fig/TempProdElect2.pdf gives a three-dimensional view of the joint effect of temperature and production volume on electricity consumed.
Pie Charts and Bar Graphs
Pie charts and bar graphs are used with count data to display the proportion of each category in a sample.
The pie chart is popular in the mass media and is one of the most widely used statistical charts in the business world.
It is a circular chart, where the circle is divided into sections whose areas represent proportions.
The pie chart in http://www.stat.psu.edu/~mga/401/fig/LvMsPie.pdf displays information on the November 2011 light vehicle market share of car companies (source: http://wardsauto.com/keydata/USSalesSummary0702.xls).
According to Stevens' power law, bar lengths are better than section areas for comparing the different proportions.
Bar graphs resemble histograms, with the heights of the bars equal to the proportion of each category. The bar graph display of the November 2011 light vehicle market share data is shown in http://stat.psu.edu/~mga/401/fig/LvMsBar2.pdf.
Remark: When the bars are arranged in decreasing order of height, the bar graph is also called a Pareto chart. The Pareto chart is one of the key tools used in quality control, where it is often used to represent the most common sources of defects in a manufacturing process, or the most frequent reasons for customer complaints, etc. [Google Pareto principle]
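A pie chart and a Pareto-style bar graph can be sketched with base R's pie() and barplot(). The complaint counts below are made up purely for illustration (they are not the market share data from the linked figures), and the object names are assumptions introduced here.

```r
# Hypothetical counts of customer complaints by reason (made-up data)
complaints <- c(Late_delivery = 42, Damaged = 25, Wrong_item = 18,
                Billing = 10, Other = 5)

# Convert counts to proportions and sort them in decreasing order
proportions <- sort(complaints / sum(complaints), decreasing = TRUE)

pie(proportions, main = "Complaints by reason")
barplot(proportions, ylab = "Proportion",
        main = "Pareto chart of complaints")
```

Sorting before calling barplot() is what turns the ordinary bar graph into a Pareto chart.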
The Most Common Parameters
For a univariate statistical population these are:
The proportion. For example, the proportion of Honda Accords that will require warranty repair work in 36,000 miles.
The average. For example, the average failure time at a given stress level.
The variance and standard deviation. These parameters quantify the intrinsic variability.
The median and other percentiles. Can be used to quantify both location and variability.
Proportions are relevant whenever the variable of interest is categorical, or has been categorized.
Definition
1 If the population has $N$ units, and $N_i$ units are in category $i$, then the population proportion for category $i$ is
\[
p_i = \frac{\#\{\text{population units of category } i\}}{\#\{\text{population units}\}} = \frac{N_i}{N}.
\]
2 If a sample of size $n$ is taken, and $n_i$ sample units are in category $i$, then the sample proportion for category $i$ is
\[
\hat{p}_i = \frac{\#\{\text{sample units of category } i\}}{\#\{\text{sample units}\}} = \frac{n_i}{n}.
\]
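In R, the sample proportions of the definition can be obtained by tabulating the categories and dividing by the sample size. The data below are a made-up sample of 10 eye colors, and the object names are introduced here for illustration.

```r
# Hypothetical sample of n = 10 categorical observations (made-up data)
eye_color <- c("brown", "blue", "brown", "green", "brown",
               "blue", "brown", "brown", "blue", "green")

n      <- length(eye_color)
counts <- table(eye_color)  # the counts n_i for each category i
p_hat  <- counts / n        # the sample proportions n_i / n
p_hat                       # blue 0.3, brown 0.5, green 0.2
```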
Example
1 In a sample of 1000 adults, 72% favor tougher penalties for drunk driving. Is the correct notation for 0.72 $p$ or $\hat{p}$?
2 In a population of 80 engineering majors taking a required statistics class, 40 are enthusiastic about having computer labs. In a s.r. sample of 20 of these students, 8 are enthusiastic. What is the correct notation for 40/80 = 0.5 and for 8/20 = 2/5?
Always remember that, under s.r. sampling, $\hat{p}$ approximates, but in general is different from, $p$.
Consider a population of $N$ units, and let $v_1, v_2, \ldots, v_N$ denote the statistical population corresponding to some variable. Then the population average or population mean, denoted by $\mu$, is the arithmetic average of all values in the statistical population. Thus,
\[
\mu = \frac{1}{N} \sum_{i=1}^{N} v_i .
\]
If the random variable $X$ denotes the value of the variable of a randomly selected population unit, then a synonymous terminology for the population mean is expected value of $X$, or mean value of $X$, and is denoted by $\mu_X$ or $E(X)$.
Example
In a population of 500 tin plates, the number of plates with 0, 1 and 2 scratches is $N_0 = 190$, $N_1 = 160$ and $N_2 = 150$. Thus, in the statistical population $v_1, \ldots, v_{500}$, 190 of the $v_i$ equal 0, 160 equal 1, and 150 equal 2. The population mean is

$$\mu = \frac{1}{500} \sum_{i=1}^{500} v_i = \frac{0 \times N_0}{500} + \frac{1 \times N_1}{500} + \frac{2 \times N_2}{500} = 0.92 .$$

If a tin plate is selected at random and X is the rv denoting the number of scratches, the mean value of X is 0.92 and we write $\mu_X = 0.92$, or $E(X) = 0.92$.
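The population mean of the tin plate example can be checked in R. The following is a minimal sketch that builds the statistical population with rep(); the object names v and mu are chosen here for illustration only.

```r
# Statistical population for the 500 tin plates:
# 190 plates with 0 scratches, 160 with 1, and 150 with 2.
v <- c(rep(0, 190), rep(1, 160), rep(2, 150))

length(v)        # number of population units, N = 500
mu <- mean(v)    # arithmetic average of all population values
mu               # 0.92

# Equivalent weighted form: sum of (value x count), divided by N
sum(c(0, 1, 2) * c(190, 160, 150)) / 500
```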
If a sample of size n is taken, and $x_1, x_2, \ldots, x_n$ denote the variable values of the sample units, then the sample average or sample mean, denoted by $\bar{x}$, is

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i .$$

Under s.r. sampling, a sample mean approximates, but in general is different from, the population mean.

Example
If a s.r. sample of n = 100 is taken from the 500 tin plates, it could be that there are $n_0 = 40$, $n_1 = 34$ and $n_2 = 26$ plates with 0, 1 and 2 scratches. In this case, $\bar{x} = (0 \times 40 + 1 \times 34 + 2 \times 26)/100 = 0.86$.
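A short R sketch of the sample mean calculation, assuming the sample counts from the example. The last two lines draw a fresh simple random sample with sample(), whose mean will vary from draw to draw; the object names are illustrative.

```r
# The sample from the example: 40 zeros, 34 ones, 26 twos (n = 100)
x <- c(rep(0, 40), rep(1, 34), rep(2, 26))
xbar <- mean(x)
xbar             # 0.86

# Drawing a new s.r. sample of size 100 (without replacement) from the
# population of 500 plates; its mean is close to, but usually not equal to, 0.92
v <- c(rep(0, 190), rep(1, 160), rep(2, 150))
mean(sample(v, 100))
```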
Proportions are Averages!
A proportion is a special case of a mean. To see this:
Consider the example with the tin plates, where $N_1 = 160$ out of N = 500 have one scratch, and let the variable X take the value 1 if a tin plate has one scratch and the value 0 otherwise.
Note that for the statistical population, $v_1, \ldots, v_{500}$, of this variable, 160 of the $v_i$ are equal to 1 and 340 are equal to 0. Thus,

$$\mu_X = \frac{160}{500} = 0.32, \quad \text{which equals } p = \frac{N_1}{N} .$$
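In R the same point can be seen by averaging a 0/1 indicator. This is a sketch using the tin plate counts; the name ind is an illustrative choice.

```r
# Scratch counts for the 500 tin plates
v <- c(rep(0, 190), rep(1, 160), rep(2, 150))

# Indicator variable: 1 if the plate has exactly one scratch, 0 otherwise
ind <- as.numeric(v == 1)

sum(ind)         # 160 plates with one scratch
p <- mean(ind)   # the mean of the indicator is the proportion N1/N
p                # 0.32
```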
Let $v_1, v_2, \ldots, v_N$ be a statistical population with mean $\mu$.

Definition
The population variance, $\sigma^2$, is defined as

$$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (v_i - \mu)^2 .$$

The standard deviation is the positive square root of the variance: $\sigma = \sqrt{\sigma^2}$.

If the rv X denotes a randomly selected value from the statistical population, then a synonymous terminology for the population variance is variance of X, and is denoted by $\sigma_X^2$, or $\mathrm{Var}(X)$. The standard deviation of X is $\sigma_X = \sqrt{\sigma_X^2}$.
A simpler computational formula for the variance is

$$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} v_i^2 - \mu^2 .$$

Example
Consider the tin plate example, so the statistical population $v_1, \ldots, v_{500}$ has 190 of the $v_i$ equal to 0, 160 equal to 1, 150 equal to 2, and $\mu = 0.92$. Then,

$$\sigma^2 = \frac{0 \times 190}{500} + \frac{1 \times 160}{500} + \frac{4 \times 150}{500} - 0.92^2 = 0.6736 .$$
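Both forms of the population variance can be checked in R. Note that R's built-in var() divides by N − 1, not N, so it does not return the population variance directly; the sketch below rescales it for comparison. Object names are illustrative.

```r
v  <- c(rep(0, 190), rep(1, 160), rep(2, 150))
N  <- length(v)
mu <- mean(v)

sigma2.def  <- sum((v - mu)^2) / N   # definition
sigma2.comp <- sum(v^2) / N - mu^2   # computational formula
sigma2.def                           # 0.6736
sigma2.comp                          # 0.6736

sqrt(sigma2.def)                     # population standard deviation

# var() uses the divisor N - 1; multiplying by (N - 1)/N recovers sigma^2
var(v) * (N - 1) / N
```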
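If $x_1, x_2, \ldots, x_n$ denotes a sample from the statistical population, the sample variance and its computational formula are:

$$S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n-1} \left[ \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2 \right] .$$

The sample standard deviation is $S = \sqrt{S^2}$. Under s.r. sampling, $S^2$ approximates, but in general is different from, $\sigma^2$.

Example
Consider the s.r. sample of n = 100 tin plates, which has 40, 34 and 26 plates with 0, 1 and 2 scratches. Then $\sum x_i^2 = 34 + 4 \times 26 = 138$ and $\frac{1}{n} \left( \sum x_i \right)^2 = 86^2/100 = 73.96$, so

$$S^2 = \frac{1}{99} [138 - 73.96] = 0.647 .$$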
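The sample variance of the tin plate sample can be reproduced in R, both from the computational formula and with the built-in var(), which uses the divisor n − 1. This is a sketch with illustrative object names.

```r
x <- c(rep(0, 40), rep(1, 34), rep(2, 26))
n <- length(x)

sum(x^2)        # 138
sum(x)^2 / n    # 73.96

# Computational formula for the sample variance
S2 <- (sum(x^2) - sum(x)^2 / n) / (n - 1)
S2              # approximately 0.647

var(x)          # built-in sample variance, same value
sd(x)           # sample standard deviation, sqrt(S2)
```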
Why Divide by n − 1?
Because this ensures that the average of the sample variances resulting from all possible samples is equal to the population variance.
Example
The variance of the population {0,1}, which corresponds to tossing a fair coin, is 0.25 (why?). The possible samples of size two, taken with replacement, are {0,0}, {0,1}, {1,0}, {1,1}. Verify that the four sample variances average to 0.25.
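The verification asked for in this example can be carried out in R by enumerating the four samples. The sketch below uses expand.grid() to list them; for contrast, it also shows what happens with the divisor n. Object names are illustrative.

```r
pop <- c(0, 1)

# Population variance (divisor N = 2)
pop.var <- mean((pop - mean(pop))^2)
pop.var                      # 0.25

# All ordered samples of size two, taken with replacement
samples <- expand.grid(first = pop, second = pop)

# Sample variance of each row; var() divides by n - 1 = 1
s2 <- apply(samples, 1, var)
mean(s2)                     # 0.25, equal to the population variance

# With the divisor n = 2 instead, the average falls short of 0.25
s2.n <- s2 * (2 - 1) / 2
mean(s2.n)                   # 0.125
```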
Roughly speaking, the (1 − α)100th sample percentile separates the part having the (1 − α)100% smaller values from that which has the α100% larger values. Thus:

The 90th sample percentile separates the largest 10% from the lower 90% of the values in the data set.
The 50th sample percentile is also called the sample median. The 25th, the 50th and the 75th sample percentiles are also called sample quartiles. The 25th and 75th percentiles are the lower quartile and upper quartile, respectively.
The distance between the lower and upper quartiles is called the interquartile range or IQR.
Order Statistics as Sample Percentiles
Let $X_1, \ldots, X_n$ be a s.r. sample from a continuous distribution. The ordered sample values are denoted

$$X_{(1)}, X_{(2)}, \ldots, X_{(n)} .$$

Thus, $X_{(1)} < X_{(2)} < \cdots < X_{(n)}$.
$X_{(i)}$, the i-th smallest sample value, is defined to be the $\left[ 100 \, \frac{i - 0.5}{n} \right]$-th sample percentile.
Example
A s.r.s. of 10 black bears' weights is: 154 158 356 446 40 154 90 94 150 142. Give the order statistics, and state the population percentiles they estimate.
Solution: The R command

sort( c(154, 158, 356, 446, 40, 154, 90, 94, 150, 142) )

returns the order statistics: 40, 90, 94, 142, 150, 154, 154, 158, 356, 446. These order statistics estimate the 5th, 15th, 25th, 35th, 45th, 55th, 65th, 75th, 85th and 95th population percentiles, respectively. For example, $X_{(3)} = 94$ is the 100(3 − 0.5)/10 = 25th sample percentile and estimates the corresponding population percentile. [In R the percentiles are obtained with: 100*(1:10 - 0.5)/10.]
In the above example none of the order statistics corresponds to the median or the 90th percentile. In general, if n is even, none of the order statistics corresponds to the median. For example,

If n = 5 then $X_{(3)}$, the 3rd smallest value, is the $100 \cdot \frac{2.5}{5} = 50$th sample percentile or median.
If n = 4 then $X_{(2)}$ is the $100 \cdot \frac{1.5}{4} = 37.5$th sample percentile, while $X_{(3)}$ is the $100 \cdot \frac{2.5}{4} = 62.5$th sample percentile. Thus, none of the ordered values is the median.

Depending on n, the above definition may not identify other percentiles of interest. In such cases, we use interpolation.
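The percentiles attached to the order statistics for a given n can be listed in R, which makes it easy to see whether the median is among them. The helper name pct below is hypothetical, chosen for this sketch.

```r
# Percentile assigned to each order statistic X(i): 100 * (i - 0.5) / n
pct <- function(n) 100 * (1:n - 0.5) / n

pct(5)            # 10 30 50 70 90
pct(4)            # 12.5 37.5 62.5 87.5

50 %in% pct(5)    # TRUE:  X(3) is the median when n = 5
50 %in% pct(4)    # FALSE: no order statistic is the median when n = 4
```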
Percentiles in R
R uses a different interpolation algorithm for evaluating sample percentiles from a given data set. With the data set in the object x, the following R commands for percentiles

median(x)
quantile(x,0.25)
quantile(x,c(0.3,0.7,0.9))
summary(x)

give, respectively, the median, the 25th percentile, the 30th, 70th and 90th percentiles, and a five number summary of the data consisting of $x_{(1)}$, $q_1$, $\tilde{x}$, $q_3$, and $x_{(n)}$ (summary(x) also prints the sample mean).
Example
Using the previous sample of 10 black bear weights, estimate the population median, 70th, 80th and 90th percentiles.
Solution: With the sample values in the object w, i.e.
w=c(154, 158, 356, 446, 40, 154, 90, 94, 150, 142)
the R command
quantile(w,c(0.5, 0.7, 0.8, 0.9))
returns 152.0, 155.2, 197.6, 365.0 for the sample median, 70th, 80th and 90th percentiles, respectively.
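To see where a value such as 155.2 comes from: by default quantile() uses the rule its help page calls type = 7, which places the 100p-th percentile at position h = (n − 1)p + 1 in the ordered sample and interpolates linearly between the neighbouring order statistics. The following is a minimal sketch of that rule for a single p; the helper name q7 is hypothetical and is not an R function.

```r
w <- c(154, 158, 356, 446, 40, 154, 90, 94, 150, 142)

# Linear interpolation at position h = (n - 1) * p + 1 (for a single p)
q7 <- function(x, p) {
  xs <- sort(x)
  n  <- length(xs)
  h  <- (n - 1) * p + 1
  lo <- floor(h)            # order statistic just below position h
  hi <- min(lo + 1, n)      # order statistic just above (capped at n)
  xs[lo] + (h - lo) * (xs[hi] - xs[lo])
}

q7(w, 0.7)                  # 155.2: position 7.3, between X(7) = 154 and X(8) = 158
unname(quantile(w, 0.7))    # same value from the built-in function
```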
The five number summary of the data given by the summary(x) command in R is the basis for the boxplot.
A boxplot displays the central 50% of the data with a box:
the lower and upper edges are at $q_1$ and $q_3$, respectively;
a line inside the box represents the median.
The lower 25% and upper 25% of the data are represented by lines (or whiskers) which extend from each edge of the box.
The lower (upper) whisker extends from $q_1$ ($q_3$) until the smallest (largest) observation within 1.5 interquartile ranges from $q_1$ ($q_3$).
Observations further from the box than the whisker ends (i.e. smaller than $q_1 - 1.5 \times IQR$ or larger than $q_3 + 1.5 \times IQR$) are called outliers, and are plotted individually.
See http://sites.stat.psu.edu/~mga/401/fig/BoxplotOzoneR.pdf
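The whisker and outlier rule can be applied by hand in R. The sketch below uses the black bear weights from the earlier example and takes $q_1$ and $q_3$ from quantile(); this is an assumption made for illustration, since boxplot() itself computes the box edges with fivenum(), which can give slightly different quartiles. Object names are illustrative.

```r
w <- c(154, 158, 356, 446, 40, 154, 90, 94, 150, 142)

q   <- quantile(w, c(0.25, 0.75))
q1  <- q[[1]]
q3  <- q[[2]]
iqr <- q3 - q1

# Limits beyond which observations count as outliers
lower.limit <- q1 - 1.5 * iqr
upper.limit <- q3 + 1.5 * iqr

outliers <- sort(w[w < lower.limit | w > upper.limit])
whiskers <- range(w[w >= lower.limit & w <= upper.limit])

outliers    # 356 446
whiskers    # 40 158
```

The command boxplot(w, col="grey") draws the corresponding plot; for this sample it flags the same two observations as outliers.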
Example
Scientists have been monitoring the ozone hole since 1980 (see the images shown in http://ozonewatch.gsfc.nasa.gov/). The 14 ozone measurements (Dobson units) given in http://stat.psu.edu/~mga/401/Data/OzoneData.txt are taken in 2002 from the lower stratosphere, between 9 and 12 miles altitude. Give the five number summary of this data and construct the box plot.
Solution: Read the data into the R object oz using
oz=read.table("http://stat.psu.edu/~mga/401/Data/OzoneData.txt", header=TRUE)

Then, use the command

summary(oz) (or summary(oz$OzoneData)) to get the five number summary of this data. For the boxplot use

boxplot(oz, col="grey"), or boxplot(oz$OzoneData, col="grey").
Hand Calculation of Sample Median
Definition
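Let $X_{(1)}, X_{(2)}, \ldots, X_{(n)}$ denote the ordered sample values in a sample of size n. The sample median is defined as

$$\tilde{X} = \begin{cases} X_{\left(\frac{n+1}{2}\right)}, & \text{if } n \text{ is odd} \\[4pt] \dfrac{X_{\left(\frac{n}{2}\right)} + X_{\left(\frac{n}{2}+1\right)}}{2}, & \text{if } n \text{ is even} \end{cases}$$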
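The definition translates directly into a few lines of R. The function name sample.median is hypothetical, written here only to mirror the two cases; in practice the built-in median() gives the same result.

```r
# Sample median from the ordered values, following the two-case definition
sample.median <- function(x) {
  xs <- sort(x)
  n  <- length(xs)
  if (n %% 2 == 1) {
    xs[(n + 1) / 2]                    # n odd: the middle order statistic
  } else {
    (xs[n / 2] + xs[n / 2 + 1]) / 2    # n even: average of the two middle ones
  }
}

# Black bear weights from the earlier example (n = 10, even)
w <- c(154, 158, 356, 446, 40, 154, 90, 94, 150, 142)
sample.median(w)        # 152
median(w)               # 152

# Dropping the first observation gives n = 9 (odd)
sample.median(w[-1])    # 150
```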
Example (Relation Between $\tilde{X}$ and $\bar{X}$)
Find the sample median of $X_1 = 2.3$, $X_2 = 3.2$, $X_3 = 1.8$, $X_4 = 2.5$, $X_5 = 2.7$.
Solution. We first order the values from smallest to largest:

$X_{(1)} = 1.8$, $X_{(2)} = 2.3$, $X_{(3)} = 2.5$, $X_{(4)} = 2.7$, $X_{(5)} = 3.2$.

Since the sample size is odd, $\tilde{X} = X_{\left(\frac{n+1}{2}\right)} = X_{(3)} = 2.5$.

For this data, $\bar{X} = \tilde{X} = 2.5$.
If $X_{(5)}$ is changed to 4.2, then $\bar{X} = 2.7$ but $\tilde{X} = 2.5$. Thus $\bar{X}$ is affected by outliers, whereas $\tilde{X}$ is not.
In general, if the histogram of the data is positively skewed then $\bar{X} > \tilde{X}$, and if it is negatively skewed then $\bar{X} < \tilde{X}$.
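The effect of changing the largest value can be checked in R with mean() and median(). This is a sketch; the object names x and y are illustrative.

```r
x <- c(2.3, 3.2, 1.8, 2.5, 2.7)
mean(x)      # 2.5
median(x)    # 2.5

# Replace the largest value, 3.2, by 4.2
y <- replace(x, which.max(x), 4.2)
mean(y)      # 2.7: the mean is pulled up by the large value
median(y)    # 2.5: the median is unchanged
```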
Hand Calculation of Sample Quartiles and Sample IQR

Definition
The sample lower quartile or SLQ is defined as
the median of the smallest n/2 values, if n is even
the median of the smallest (n + 1)/2 values, if n is odd

The sample upper quartile or SUQ is defined as
the median of the largest n/2 values, if n is even
the median of the largest (n + 1)/2 values, if n is odd

The sample interquartile range, or SIQR, is defined as

SIQR = SUQ − SLQ
Example
Find the lower and upper quartiles of the n = 9 observations 9.39, 7.04, 7.17, 13.28, 9.00, 7.46, 21.06, 15.19, 7.50.
Solution. Since n is odd, the SLQ is the median of the

smallest 5 (= (n + 1)/2) values: 7.04, 7.17, 7.46, 7.50, 9.00

and the SUQ is the median of the

largest 5 (= (n + 1)/2) values: 9.00, 9.39, 13.28, 15.19, 21.06.

Thus SLQ = 7.46, and SUQ = 13.28.
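The hand calculation can be mirrored in R. The function name hand.quartiles is hypothetical and follows the definition of SLQ, SUQ and SIQR given above; quantile() uses its own interpolation rule, so in general its quartiles need not agree with these.

```r
# SLQ, SUQ and SIQR following the hand-calculation definition
hand.quartiles <- function(x) {
  xs <- sort(x)
  n  <- length(xs)
  # number of values in each half: n/2 if n is even, (n + 1)/2 if n is odd
  k  <- if (n %% 2 == 0) n / 2 else (n + 1) / 2
  slq <- median(xs[1:k])              # median of the smallest k values
  suq <- median(xs[(n - k + 1):n])    # median of the largest k values
  c(SLQ = slq, SUQ = suq, SIQR = suq - slq)
}

obs <- c(9.39, 7.04, 7.17, 13.28, 9.00, 7.46, 21.06, 15.19, 7.50)
res <- hand.quartiles(obs)
res    # SLQ = 7.46, SUQ = 13.28, SIQR = 5.82
```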
Go to next lesson: http://stat.psu.edu/~mga/401/course.info/lesson2.pdf
Go to the Stat 401 home page: http://stat.psu.edu/~mga/401/course.info/