Unit 1: Introduction to data
Lecture 2: Exploratory data analysis
Statistics 101
Nicole Dalzell
May 14, 2015
Warm-Up and Data Basics
Announcements
Statistics 101 (Nicole Dalzell) U1 - L2: EDA May 14, 2015 2 / 1
Review
Example Study: A researcher is interested in whether or not cats will choose to sleep less if they have toys to entertain themselves. She divides 250 cats (adults and kittens) into two rooms, with adult cats in one room and baby kittens in the other room. Within each room she erects a fence, randomly placing half the cats (or kittens) on each side of the fence. On one side of the fence she scatters a variety of cat toys. For 1 day, the researcher records the number of hours each cat spends sleeping.
What is the research question?
What are the explanatory and response variables?
Is this an Experimental or Observational study?
What are the controls and treatments?
Is blocking employed in this study?
Types of Variables Example
Still our cat example:
Cat   Age        Toys   # of Naps   Weight (lbs)
1     adult      1      3           8
2     juvenile   1      5           9
3     adult      0      2           10.5
4     adult      1      8           12.25
...   ...        ...    ...         ...
250   adult      0      5           11.67
What types of variables are these:
Age?
Toys?
# of Naps?
Weight?
Warm-Up and Data Basics Sampling Methods
Population to sample
It is usually not feasible to collect information on the entire population due to high costs of data collection, so statisticians instead work with samples that are (hopefully) representative of the populations they come from.
[Figure: a sample drawn from a population]
We try to understand certain features of the population as a whole using summary statistics and graphs based on these samples.
Obtaining good samples
Almost all statistical methods are based on the notion of implied randomness.
If observational data are not collected in a random framework from a population, these statistical methods (the estimates and the errors associated with the estimates) are not reliable.
The most commonly used random sampling techniques are simple, stratified, and cluster sampling.
Simple random sample
Randomly select cases from the population so that each case is equally likely to be selected.
[Figure: sampling illustration with three panels: simple random sample, stratified sample (Stratum 1 to Stratum 6), and cluster sample (Cluster 1 to Cluster 9)]
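A simple random sample can be sketched in a few lines of Python. This is only an illustration, not part of the lecture: the function name `simple_random_sample`, the seed, and the sample size of 10 are assumptions, and the cats are simply numbered 1 to 250 as in the warm-up example.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n cases without replacement; every case is equally likely."""
    rng = random.Random(seed)          # seeded only so the sketch is repeatable
    return rng.sample(population, n)

# Hypothetical sampling frame: cats numbered 1..250
cat_ids = list(range(1, 251))
srs = simple_random_sample(cat_ids, 10, seed=1)
print(srs)
```

Because `random.sample` draws without replacement, no cat can appear twice in the sample.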
Stratified sample
Strata are homogeneous; take a simple random sample from each stratum.
[Figure: sampling illustration showing cases grouped into Stratum 1 to Stratum 6 and Cluster 1 to Cluster 9]
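Stratified sampling can be sketched the same way: group the cases into strata, then take a simple random sample inside each one. The function name `stratified_sample` and the 200/50 split between adults and juveniles are made up for illustration; the lecture does not give the actual split.

```python
import random

def stratified_sample(strata, n_per_stratum, seed=None):
    """Take a simple random sample of n_per_stratum cases from every stratum."""
    rng = random.Random(seed)
    return {name: rng.sample(cases, n_per_stratum)
            for name, cases in strata.items()}

# Hypothetical strata: the 200/50 split is an assumption, not from the slides
strata = {
    "adult": list(range(1, 201)),
    "juvenile": list(range(201, 251)),
}
sample = stratified_sample(strata, n_per_stratum=5, seed=2)
print(sample)
```

Every stratum is guaranteed to be represented, which a simple random sample of the whole population does not promise.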
Cluster sample
Clusters are not necessarily homogeneous; take a simple random sample from each of a random sample of clusters. Usually preferred for economic reasons.
[Figure: sampling illustration showing cases grouped into Stratum 1 to Stratum 6 and Cluster 1 to Cluster 9]
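The two-stage idea behind cluster sampling (first sample clusters, then sample cases within the chosen clusters) might look like the sketch below. The function name `cluster_sample`, the household labels, and the sizes (9 clusters of 20 households, 3 clusters chosen, 5 households each) are all hypothetical.

```python
import random

def cluster_sample(clusters, n_clusters, n_per_cluster, seed=None):
    """Stage 1: randomly pick clusters. Stage 2: sample cases within each."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: rng.sample(clusters[name], n_per_cluster) for name in chosen}

# Hypothetical population: 9 clusters with 20 households each
clusters = {
    f"Cluster {k}": [f"c{k}-h{i}" for i in range(1, 21)]
    for k in range(1, 10)
}
picked = cluster_sample(clusters, n_clusters=3, n_per_cluster=5, seed=3)
print(picked)
```

Only the chosen clusters need to be visited, which is where the cost saving comes from.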
Participation question
A city council has requested a household survey be conducted in a suburban area of their city. The area is broken into many distinct and unique neighborhoods, some including large homes, some with only apartments. Which approach would likely be the least effective?
(a) Simple random sampling
(b) Cluster sampling
(c) Stratified sampling
Warm-Up and Data Basics Exploratory Data Analysis
Explore the Data
When you taste a spoonful of chili and decide it doesn’t taste spicy enough, that’s exploratory analysis.
For data analysis, we perform exploratory data analysis, or EDA, to determine trends in features that may be present in the data.
The distribution of a variable is a list of possible values the variable can take and how often it takes each of those values.
Distributions are critical to assessing the probability of events.
Plots are almost always useful for visualizing relationships and distributions in the data.
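As a small illustration of what "distribution" means here, the sketch below tabulates how often each value occurs. It uses only the five "# of Naps" values visible in the cat table from the warm-up; the variable names are assumptions.

```python
from collections import Counter

# The five "# of Naps" values shown in the cat example table
naps = [3, 5, 2, 8, 5]

# Distribution: each observed value and how often it occurs
distribution = Counter(naps)

# Relative frequencies, which sum to 1
relative = {value: count / len(naps)
            for value, count in sorted(distribution.items())}
print(relative)
```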
Visualizing numerical variables
Intensity map: Useful for displaying the spatial distribution.
Dot plot: Useful when individual values are of interest.
Histogram: Provides a view of the data density and is especially convenient for describing the shape of the data distribution.
Box plot: Especially useful for displaying the median, quartiles, unusual observations, as well as the IQR.
Why visualize?
Describe the spatial distribution of race/ethnicity in the US.
http://demographics.coopercenter.org/DotMap/index.html
Why visualize?
And let’s take a closer look at Durham.
Scatterplot
Scatterplots are useful for visualizing the relationship between two numerical variables.
Do life expectancy and total fertility appear to be associated or independent?
Was the relationship the same throughout the years, or did it change?
http://www.gapminder.org/world
Cars: ... vs. weight
From the cars data:
[Figure: two scatterplots from the cars data: miles per gallon (city rating) vs. weight (pounds), and price ($1000s) vs. weight (pounds)]
What do these scatterplots reveal about the data? How might they be useful?
Numerical Variables Basic Plots
World Bank Data
This is public-use data available for download from http://data.worldbank.org/topic/energy-and-mining.
What does the distribution of energy use per capita look like across different countries?
Is energy use fairly uniform across different countries?
If not, can we distinguish groups of countries that use more than others?
      Country.Name           X2011
37    Afghanistan
50    Angola                 672.74
63    Albania                689.03
76    Arab World             1806.90
89    United Arab Emirates   7407.01
102   Argentina              1966.97
115   Armenia                916.26
128   American Samoa
141   Antigua and Barbuda
154   Australia              5500.79
167   Austria                3927.92
180   Azerbaijan             1369.32
193   Burundi
206   Belgium                5348.97
219   Benin                  384.56
232   Burkina Faso
245   Bangladesh             204.72
258   Bulgaria               2615.04
Stacked Dot Plot
Higher bars represent areas where there are more observations, which makes it a little easier to judge the center and the shape of the distribution.
[Figure: stacked dot plot of gpa, axis from 3.0 to 4.0]
Dot Plot: Why visualize?
[Figure: dot plot of weight, in ounces, axis from 0 to 4000]
Do you see anything out of the ordinary?
Why visualize?
What type of variable is average number of hours of sleep per night? Is this reflected in the dot plot below? If not, what might be the reason?
[Figure: dot plot of average number of hours of sleep per night, axis from 4 to 9]
Dot Plot: World Bank Data
[Figure: Energy Data Dot Plot; x-axis: energy per capita, 0 to 15000]
Histogram
[Figure: histogram "Energy Use in 2011 (World Bank Data)"; x-axis: Energy Use (kg oil equivalent per capita), 0 to 15000; y-axis: Number of Countries, 0 to 80]
Country.Name           X2011
Afghanistan
Angola                 672.74
Albania                689.03
Arab World             1806.90
United Arab Emirates   7407.01
Argentina              1966.97
Armenia                916.26

Bins    0-2000   2001-4000   4001-6000   6001-8000   ...
Count   92       38          18          10          ...
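The bin counts behind a histogram can be reproduced by assigning each value to a bin and tallying. The sketch below uses only the six non-missing values listed above, so its counts will not match the full-data counts on the slide; the function name `bin_label` and the handling of values between bin labels (for example 2000.5) are assumptions.

```python
import math
from collections import Counter

def bin_label(value, width=2000):
    """Return a slide-style bin label (0-2000, 2001-4000, ...) for a value >= 0."""
    index = 0 if value <= width else math.ceil(value / width) - 1
    low = 0 if index == 0 else index * width + 1
    return f"{low}-{(index + 1) * width}"

# The six non-missing 2011 values shown on the slide
energy_2011 = [672.74, 689.03, 1806.90, 7407.01, 1966.97, 916.26]

counts = Counter(bin_label(v) for v in energy_2011)
print(counts)
```

Changing `width` changes how many observations fall in each bin, which is exactly the bin-width choice a histogram forces on you.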
Histogram: Bin Width
[Figure: three histograms of "Energy Use in 2011 (World Bank Data)" drawn with different bin widths; x-axis: Energy Use (kg oil equivalent per capita); y-axis: Number of Countries]
Bin Width
Which one(s) of these histograms are useful? Which reveal too much about the data? Which hide too much?
[Figure: four histograms of extracurricular hrs / week with different bin widths; y-axis: frequency]
Histogram
[Figure: histogram "Energy Use in 2011 (World Bank Data)"; x-axis: Energy Use (kg oil equivalent per capita), 0 to 15000; y-axis: Number of Countries, 0 to 80]
Provides a view of the data density.
Very useful for examining the shape of a distribution.
Numerical Variables Distribution Shapes
Histogram
[Figure: histogram "Energy Use in 2011 (World Bank Data)"; x-axis: Energy Use (kg oil equivalent per capita); y-axis: Number of Countries]
This distribution is right skewed and unimodal.
Shape: Skewness
We describe histograms as right skewed, left skewed, or symmetric.
[Figure: three example histograms illustrating skewness]
Histograms are said to be skewed to the side of the long tail.
Shape: Modality
The mode is defined as the most frequent observation in the data set.
Does the histogram have a single prominent peak (unimodal), several prominent peaks (bimodal/multimodal), or no apparent peaks (uniform)?
[Figure: four example histograms illustrating modality]
In order to determine modality, it’s easiest to step back and imagine a density curve over the histogram. Use the limp spaghetti method.
Shape and Skew
How would you describe this distribution?
[Figure: histogram of average number of hours spent on school work per day, axis from 2 to 10]
Shape: Why does this matter?
[Figure: two panels, "Symmetric Distribution" and "Bimodal Distribution" (x-axis: Value; y-axis: Height), plus a plot with x-axis 0 to 9 and y-axis 0 to 1000]
Participation question
Which of these variables do you expect to be uniformly distributed?
(a) weights of adult females
(b) salaries of a random sample of people from North Carolina
(c) house prices
(d) birthdays of classmates (day of the month)
Commonly observed shapes of distributions
modality: unimodal, bimodal, multimodal, uniform
skewness: right skew, left skew, symmetric
Density Curves
A Density Curve is a smoothed density histogram where the area under the curve is 1.
To draw a density curve from a histogram, simply connect the peaks of the histogram with a smooth line, and normalize the values of the y-axis such that the area under the curve is 1.
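The normalization step can be illustrated on the histogram itself: divide each count by (total count × bin width) so the bar areas add up to 1. This sketch rescales histogram bars rather than drawing a smooth curve, and it uses only the first four bin counts from the energy histogram slide (the remaining bins are elided there), so the numbers are illustrative only. The function name `density_heights` is an assumption.

```python
def density_heights(counts, bin_width):
    """Rescale bin counts so that the total area of the bars equals 1."""
    total = sum(counts)
    return [c / (total * bin_width) for c in counts]

# First four bin counts from the energy histogram (bin width 2000)
counts = [92, 38, 18, 10]
bin_width = 2000

heights = density_heights(counts, bin_width)
area = sum(h * bin_width for h in heights)   # total bar area after rescaling
print(heights, area)
```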
Unusual Observations
Are there any unusual observations or potential outliers?
[Figure: two example histograms to inspect for unusual observations]
Application exercise: Shapes of distributions
Describing Your Pictures
Bell Shaped: Data is bell shaped if the majority of the data is clustered around the center value (mean) with very few data points lying either way above or way below this value.
Right Skewed: Data is positively skewed if you have several unusually large data points creating a long tail to the right.
Left Skewed: Data is negatively skewed if you have several unusually small data points creating a long tail to the left.
Bimodal: Data is bimodal if it has two large clusters of data points.
Symmetric: Data is symmetric if it looks like a mirror image around a central point.
Uniformly Distributed: Data is evenly spread across all possible values.
Descriptive Statistics Center
Mean
The sample mean, denoted as x̄, can be calculated as
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\text{Sum of Data Points}}{\text{Number of Data Points}},$$
where $x_1, x_2, \cdots, x_n$ represent the n observed values.
The population mean is a parameter computed the same way but is denoted as µ. It is often not possible to calculate µ since population data is rarely available.
x̄ is an estimate of µ based on the observed data.
The sample mean is a sample statistic, or a point estimate of the population mean. This estimate may not be perfect, but if the sample is good (representative of the population) it is usually a good guess.
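The formula translates directly into code. The sketch below applies it to the five cat weights visible in the warm-up table (reading the fourth weight as 12.25); the function name `sample_mean` is an assumption.

```python
def sample_mean(values):
    """Sum of the data points divided by the number of data points."""
    if not values:
        raise ValueError("need at least one observation")
    return sum(values) / len(values)

# Weights (lbs) of the five cats shown in the example table
weights = [8, 9, 10.5, 12.25, 11.67]
x_bar = sample_mean(weights)
print(round(x_bar, 3))
```

Here x̄ is computed from 5 of the 250 cats, so it is only a point estimate of the mean weight of all of them.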
Median
The median is the value that splits the data in half when ordered in ascending order.
0, 1, 2, 3, 4
If there is an even number of observations, then the median is the average of the two values in the middle.
0, 1, 2, 3, 4, 5 → $\frac{2 + 3}{2} = 2.5$
Since the median is the midpoint of the data, 50% of the values are below it. Hence, it is also the 50th percentile.
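A minimal sketch of the odd/even rule, checked against the two small examples above. The function name `median` is chosen for illustration (Python's standard library offers `statistics.median` for real use).

```python
def median(values):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    if not values:
        raise ValueError("need at least one observation")
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([0, 1, 2, 3, 4]))      # odd number of observations
print(median([0, 1, 2, 3, 4, 5]))   # even number of observations
```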
Mean vs. Median
If the distribution is symmetric, the center is typically described by the mean.
Symmetric: mean ≈ median
If the distribution is skewed or has outliers, the center is typically described by the median.
Right-skewed: mean > median
Left-skewed: mean < median
[Figure: three density sketches (right-skewed, left-skewed, symmetric) with the mean and median marked]
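To see the right-skew pattern numerically, the sketch below compares the mean and median of the six non-missing 2011 energy values listed on the histogram slide. Six values are far too few to describe the real distribution; they are used only because they appear in the lecture.

```python
import statistics

# Six non-missing 2011 energy-use values shown earlier in the lecture
energy = [672.74, 689.03, 1806.90, 7407.01, 1966.97, 916.26]

mean = statistics.mean(energy)
median = statistics.median(energy)

# One very large value (7407.01) pulls the mean above the median
print(round(mean, 2), round(median, 2), mean > median)
```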
Back to our Energy Data
[Figure: histogram "Energy Use in 2011 (World Bank Data)"; x-axis: Energy Use (kg oil equivalent per capita), 0 to 15000; y-axis: Number of Countries, 0 to 80]
Mean: 2532.631
Median: 1593.7
Measures of Center
The Mean of a dataset is what we commonly refer to as the average.
The Median of a dataset is the middle value of your data. You find the median of your data by ordering from smallest to largest, then finding the value that has 50% of your data below it and 50% above it.
The Trimmed Mean is the calculation of the mean after removing a few of the very large and very small observations.
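One possible way to compute a trimmed mean is to drop the k smallest and k largest observations before averaging. The function name `trimmed_mean`, the choice of k, and the small `hours` data set are hypothetical; the lecture does not specify how many observations to remove.

```python
def trimmed_mean(values, k):
    """Mean after dropping the k smallest and k largest observations."""
    if k < 0 or 2 * k >= len(values):
        raise ValueError("k must leave at least one observation")
    ordered = sorted(values)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

# Hypothetical data with one unusually large value
hours = [3, 10, 15, 20, 40]
print(sum(hours) / len(hours))   # ordinary mean
print(trimmed_mean(hours, 1))    # mean of 10, 15, 20
```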
Are you typical?
http://www.youtube.com/watch?v=4B2xOvKFFz4
How useful are centers alone for conveying the true characteristics of a distribution?
Describing distributions of numerical variables
When describing distributions of numerical variables, always mention:
Shape: skewness, modality
Center: an estimate of a typical observation in the distribution (mean, median, mode, etc.)
Unusual observations: observations that stand out from the rest of the data that may be suspected outliers
Spread: measure of variability in the distribution (SD, IQR, range, etc.)
[Figure: three example distributions, each on an axis from −3 to 3]
Descriptive Statistics Spread
Measures of Spread
The population Variance, σ², is the average squared deviation of the observations from the mean.
The population Standard Deviation, σ, is the square root of the variance.
The Interquartile Range (IQR) measures the spread of the middle 50% of your data, and is visually depicted in box plots.
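A short sketch of the population variance and standard deviation, treating the data as the whole population (so the divisor is n). The function name and the eight data values are made up for illustration.

```python
import math

def population_variance(values):
    """Average squared deviation from the mean (divides by n, not n - 1)."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)

# Hypothetical population of eight observations
data = [2, 4, 4, 4, 5, 5, 7, 9]

variance = population_variance(data)
sd = math.sqrt(variance)
print(variance, sd)
```

Squaring the deviations keeps positive and negative deviations from cancelling; taking the square root returns the measure to the original units.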
Box Plot
The box in a box plot represents the middle 50% of the data, and the thick line in the box is the median.
[Figure: horizontal box plot of # of study hours / week, axis from 10 to 40]
Anatomy of a Box Plot
[Figure: annotated box plot of # of study hours / week (axis from 0 to 40), with labels: lower whisker, Q1 (first quartile), median, Q3 (third quartile), upper whisker, max whisker reach, suspected outliers]
Measures of Location
The 25th percentile is also called the first quartile, Q1.
The 50th percentile is also called the median.
The 75th percentile is also called the third quartile, Q3.
summary(d$study_hours)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
   3.00   10.00   15.00   17.42   20.00   40.00   13.00
Between Q1 and Q3 is the middle 50% of the data. The range these data span is called the interquartile range, or the IQR.
IQR = 20 − 10 = 10
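Quartiles and the IQR can be computed with Python's standard library. The five-value data set below is a toy example built from the summary values above (3, 10, 15, 20, 40), not the actual study-hours data, and different software uses slightly different quartile conventions, so results on real data may differ from R's `summary`.

```python
import statistics

# Toy data built from the summary values above (not the real study-hours data)
study_hours = [3, 10, 15, 20, 40]

# method="inclusive" treats the data as including the min and max of the population
q1, q2, q3 = statistics.quantiles(study_hours, n=4, method="inclusive")
iqr = q3 - q1
print(q1, q2, q3, iqr)
```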
Whiskers and Outliers
Whiskers of a box plot can extend up to 1.5 × IQR away from the quartiles.
max upper whisker reach: Q3 + 1.5 × IQR = 20 + 1.5 × 10 = 35
max lower whisker reach: Q1 − 1.5 × IQR = 10 − 1.5 × 10 = −5
An outlier is defined as an observation beyond the maximum reach of the whiskers. It is an observation that appears extreme relative to the rest of the data.
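The 1.5 × IQR rule can be sketched as two small functions. The names `whisker_reach` and `suspected_outliers` and the four test values are assumptions; Q1 = 10 and Q3 = 20 come from the study-hours summary.

```python
def whisker_reach(q1, q3, k=1.5):
    """Return the (lower, upper) maximum whisker reach: Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def suspected_outliers(values, q1, q3):
    """Observations strictly beyond the maximum reach of the whiskers."""
    low, high = whisker_reach(q1, q3)
    return [v for v in values if v < low or v > high]

low, high = whisker_reach(10, 20)
flagged = suspected_outliers([3, 15, 35, 40], 10, 20)
print(low, high, flagged)
```

A value exactly at the reach (35 here) is not flagged, since it is not beyond the whisker's maximum extent.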
Outliers (cont.)
Why is it important to look for outliers?
Identify extreme skew in the distribution.
Identify data collection and entry errors.
Provide insight into interesting features of the data.
Why visualize?
What does a response of 0 mean in this distribution?
[Figure: box plot of the number of drinks it takes students to get drunk, axis from 0 to 12]
Example: Visualizing
What does our Energy Data look like?
[Figure: Energy Use Data Boxplot; y-axis: Energy Usage, 0 to 15000]
Who uses the most energy?
     Country.Name           X2011
1    Iceland                17964.44
2    Qatar                  17418.69
3    Trinidad and Tobago    15691.29
4    Kuwait                 10408.28
5    Brunei Darussalam      9427.09
6    Oman                   8356.29
7    Luxembourg             8045.90
8    United Arab Emirates   7407.01
9    Bahrain                7353.16
10   Canada                 7333.28
11   North America          7062.22
12   United States          7032.35
13   Saudi Arabia           6738.42
14   Singapore              6452.33
15   Finland                6449.04
Participation question
Which of the following is false about the distribution of average number of hours students study daily?
[Figure: box plot of average number of hours students study daily, axis from 2 to 10]
Min.    1st Qu.   Median   Mean    3rd Qu.   Max.
1.000   3.000     4.000    3.821   5.000     10.000
(a) There are no students who don’t study at all.
(b) 75% of the students study more than 5 hours daily, on average.
(c) 25% of the students study less than 3 hours, on average.
(d) IQR is 2 hours.