
Unit 1: Introduction to data

Lecture 2: Exploratory data analysis

Statistics 101

Nicole Dalzell

May 14, 2015

Warm-Up and Data Basics

Announcements

Statistics 101 (Nicole Dalzell) U1 - L2: EDA May 14, 2015 2 / 1


Review

Example Study: A researcher is interested in whether or not cats will choose to sleep less if they have toys to entertain themselves. She divides 250 cats (adults and kittens) into two rooms, with adult cats in one room and baby kittens in the other room. Within each room she erects a fence, randomly placing half the cats (or kittens) on each side of the fence. On one side of the fence she scatters a variety of cat toys. For 1 day, the researcher records the number of hours each cat spends sleeping.

What is the research question?

What are the explanatory and response variables?

Is this an Experimental or Observational study?

What are the controls and treatments?

Is blocking employed in this study?


Types of Variables Example

Still our cat example:

Cat   Age        Toys   # of Naps   Weight (lbs)
1     adult      1      3           8
2     juvenile   1      5           9
3     adult      0      2           10.5
4     adult      1      8           12.25
...   ...        ...    ...         ...
250   adult      0      5           11.67

What types of variables are these:

Age?

Toys?

# of Naps?

Weight?


Warm-Up and Data Basics Sampling Methods

Population to sample

It is usually not feasible to collect information on the entire population due to high costs of data collection, so statisticians instead work with samples that are (hopefully) representative of the populations they come from.

[Figure: a sample drawn from a population]

We try to understand certain features of the population as a whole using summary statistics and graphs based on these samples.


Obtaining good samples

Almost all statistical methods are based on the notion of implied randomness.

If observational data are not collected in a random framework from a population, these statistical methods (the estimates and the errors associated with the estimates) are not reliable.

The most commonly used random sampling techniques are simple, stratified, and cluster sampling.


Simple random sample

Randomly select cases from the population; each case is equally likely to be selected.

[Figure: dot diagram of a population illustrating sampling; groups are labeled Stratum 1 to Stratum 6 and Cluster 1 to Cluster 9]


Stratified sample

Strata are homogeneous; take a simple random sample from each stratum.

[Figure: dot diagram of a population illustrating sampling; groups are labeled Stratum 1 to Stratum 6 and Cluster 1 to Cluster 9]


Cluster sample

Clusters are not necessarily homogeneous; take a simple random sample from a random sample of clusters. Usually preferred for economic reasons.
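The three sampling schemes can be compared side by side in a few lines of Python. This is only a sketch under stated assumptions: the population of 60 households in 6 neighborhoods, the function names, and the seed are all hypothetical and are not taken from the lecture.

```python
import random

def simple_random_sample(population, k, rng):
    """Draw k cases so that every case is equally likely to be selected."""
    return rng.sample(population, k)

def stratified_sample(strata, k_per_stratum, rng):
    """Draw a simple random sample of size k_per_stratum from every stratum."""
    return {name: rng.sample(members, k_per_stratum)
            for name, members in strata.items()}

def cluster_sample(clusters, n_clusters, k_per_cluster, rng):
    """Randomly pick n_clusters clusters, then sample k_per_cluster cases in each."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: rng.sample(clusters[name], k_per_cluster) for name in chosen}

# Hypothetical population: 6 neighborhoods with 10 households each.
rng = random.Random(101)  # fixed seed so the sketch is reproducible
groups = {f"N{g}": [f"N{g}-H{h}" for h in range(10)] for g in range(1, 7)}
population = [house for members in groups.values() for house in members]

srs = simple_random_sample(population, 12, rng)   # 12 households overall
strat = stratified_sample(groups, 2, rng)         # 2 from every neighborhood
clus = cluster_sample(groups, 3, 4, rng)          # 4 from each of 3 neighborhoods

print(srs)
print(strat)
print(clus)
```

Note that all three designs return 12 households here; they differ only in which groups are guaranteed to be represented. The stratified sample covers every neighborhood, while the cluster sample visits only three of them.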

[Figure: dot diagram of a population illustrating sampling; groups are labeled Stratum 1 to Stratum 6 and Cluster 1 to Cluster 9]


Participation question

A city council has requested a household survey be conducted in a suburban area of their city. The area is broken into many distinct and unique neighborhoods, some including large homes, some with only apartments. Which approach would likely be the least effective?

(a) Simple random sampling

(b) Cluster sampling

(c) Stratified sampling


Warm-Up and Data Basics Exploratory Data Analysis

Explore the Data

When you taste a spoonful of chili and decide it doesn’t taste spicy enough, that’s exploratory analysis.

For data analysis, we perform exploratory data analysis, or EDA, to determine trends in features that may be present in the data.

The distribution of a variable is a list of possible values the variable can take and how often it takes each of those values.

Distributions are critical to assessing the probability of events.

Plots are almost always useful for visualizing relationships and distributions in the data.
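As a small illustration of the definition of a distribution, the sketch below tabulates the "# of Naps" values for the five cats that are visible in the cat example table (3, 5, 2, 8, 5). The variable names are illustrative only, and five rows are far too few to say anything about all 250 cats.

```python
from collections import Counter

# "# of Naps" for the five cats shown in the example table.
naps = [3, 5, 2, 8, 5]

counts = Counter(naps)   # how often each value occurs
n = len(naps)

# The distribution: each observed value with its relative frequency.
distribution = {value: counts[value] / n for value in sorted(counts)}

for value, share in distribution.items():
    print(value, counts[value], share)
```

The relative frequencies add up to 1, which is what lets a distribution be read as "how often" each value occurs.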


Visualizing numerical variables

Intensity map: Useful for displaying the spatial distribution.

Dot plot: Useful when individual values are of interest.

Histogram: Provides a view of the data density, and is especially convenient for describing the shape of the data distribution.

Box plot: Especially useful for displaying the median, quartiles, unusual observations, as well as the IQR.


Why visualize?

Describe the spatial distribution of race/ethnicity in the US.

http://demographics.coopercenter.org/DotMap/index.html


Why visualize?

And let’s take a closer look at Durham.


Scatterplot

Scatterplots are useful for visualizing the relationship between two numerical variables.

Do life expectancy and total fertility appear to be associated or independent?

Was the relationship the same throughout the years, or did it change?

http://www.gapminder.org/world


Cars: ... vs. weight

From the cars data:

[Figure: two scatterplots from the cars data: miles per gallon (city rating) vs. weight (pounds), and price ($1000s) vs. weight (pounds)]

What do these scatterplots reveal about the data? How might they be useful?


Numerical Variables Basic Plots

World Bank Data

This is public-use data available for download from http://data.worldbank.org/topic/energy-and-mining.

What does the distribution of energy use per capita look like across different countries?

Is energy use fairly uniform across different countries?

If not, can we distinguish groups of countries that use more than others?

      Country.Name            X2011
37    Afghanistan
50    Angola                  672.74
63    Albania                 689.03
76    Arab World              1806.90
89    United Arab Emirates    7407.01
102   Argentina               1966.97
115   Armenia                 916.26
128   American Samoa
141   Antigua and Barbuda
154   Australia               5500.79
167   Austria                 3927.92
180   Azerbaijan              1369.32
193   Burundi
206   Belgium                 5348.97
219   Benin                   384.56
232   Burkina Faso
245   Bangladesh              204.72
258   Bulgaria                2615.04


Visualizing numerical variables

Intensity map: Useful for displaying the spatial distribution.

Dot plot: Useful when individual values are of interest.

Histogram: Provides a view of the data density, and is especially convenient for describing the shape of the data distribution.

Box plot: Especially useful for displaying the median, quartiles, unusual observations, as well as the IQR.


Stacked Dot Plot

Higher bars represent areas where there are more observations, which makes it a little easier to judge the center and the shape of the distribution.

[Figure: stacked dot plot of gpa; axis from 3.0 to 4.0]


Dot Plot: Why visualize?

[Figure: dot plot of weight, in ounces; axis from 0 to 4000]

Do you see anything out of the ordinary?




Why visualize?

What type of variable is average number of hours of sleep per night? Is this reflected in the dot plot below? If not, what might be the reason?

[Figure: dot plot of average number of hours of sleep per night; axis from 4 to 9]


Dot Plot: World Bank Data

[Figure: "Energy Data Dot Plot"; x-axis: Energy per capita, 0 to 15000]


Histogram

[Figure: histogram, "Energy Use in 2011 (World Bank Data)"; x-axis: Energy Use (kg oil equivalent per capita), 0 to 15000; y-axis: Number of Countries, 0 to 80]

Country.Name            X2011
Afghanistan
Angola                  672.74
Albania                 689.03
Arab World              1806.90
United Arab Emirates    7407.01
Argentina               1966.97
Armenia                 916.26

Bins    0 - 2000   2001 - 4000   4001 - 6000   6001 - 8000   ...
Count   92         38            18            10            ...
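The bin counts behind a histogram come from sorting each observation into an interval and counting. The sketch below does this for the six non-missing values visible in the table excerpt. It assumes half-open bins of the form [lower, upper), which is a simplification of the "0 - 2000, 2001 - 4000" labels on the slide, and the helper name bin_counts is hypothetical.

```python
from collections import Counter

def bin_counts(values, width):
    """Count how many values fall in each bin [k*width, (k+1)*width)."""
    counts = Counter(int(v // width) for v in values)
    return {(k * width, (k + 1) * width): counts[k] for k in sorted(counts)}

# The six non-missing values visible in the table excerpt.
energy = [672.74, 689.03, 1806.90, 7407.01, 1966.97, 916.26]

wide = bin_counts(energy, 2000)    # same bin width as the slide
narrow = bin_counts(energy, 500)   # a narrower width, for comparison

print(wide)
print(narrow)
```

Changing the width changes how much detail the counts reveal, which is the point of comparing bin widths: the same data can look smooth or jagged depending on this one choice.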


Histogram: Bin Width

[Figure: three histograms of "Energy Use in 2011 (World Bank Data)" drawn with different bin widths; x-axis: Energy Use (kg oil equivalent per capita); y-axis: Number of Countries]


Bin Width

Which one(s) of these histograms are useful? Which reveal too much about the data? Which hide too much?

[Figure: four histograms of extracurricular hrs / week drawn with different bin widths; y-axis: frequency]


Histogram

[Figure: histogram, "Energy Use in 2011 (World Bank Data)"; x-axis: Energy Use (kg oil equivalent per capita), 0 to 15000; y-axis: Number of Countries, 0 to 80]

Provides a view of the data density.

Very useful for examining the shape of a distribution.


Numerical Variables Distribution Shapes

Histogram

[Figure: histogram, "Energy Use in 2011 (World Bank Data)"; x-axis: Energy Use (kg oil equivalent per capita); y-axis: Number of Countries]

This distribution is right skewed and unimodal.


Shape: Skewness

We describe histograms as right skewed, left skewed, or symmetric.

[Figure: three example histograms illustrating skewness]

Histograms are said to be skewed to the side of the long tail.


Shape: Modality

The mode is defined as the most frequent observation in the data set.

Does the histogram have a single prominent peak (unimodal), several prominent peaks (bimodal/multimodal), or no apparent peaks (uniform)?

[Figure: four example histograms illustrating modality]

In order to determine modality, it’s easiest to step back and imagine a density curve over the histogram. Use the limp spaghetti method.


Shape and Skew

How would you describe this distribution?

[Figure: histogram of average number of hours spent on school work per day; x-axis from 2 to 10; y-axis from 0 to 30]


Shape: Why does this matter?

[Figure: two panels, "Symmetric Distribution" and "Bimodal Distribution"; x-axis: Value; y-axis: Height]


Participation question

Which of these variables do you expect to be uniformly distributed?

(a) weights of adult females

(b) salaries of a random sample of people from North Carolina

(c) house prices

(d) birthdays of classmates (day of the month)


Commonly observed shapes of distributions

modality: unimodal, bimodal, multimodal, uniform

skewness: right skew, left skew, symmetric


Density Curves

A Density Curve is a smoothed density histogram where the area under the curve is 1.

To draw a density curve from a histogram, simply connect the peaks of the histogram with a smooth line, and normalize the values of the y-axis such that the area under the curve is 1.
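The normalization step can be sketched numerically. Assuming equal-width bins, dividing each count by (number of observations × bin width) rescales the bars so their total area is 1. The counts and bin width below are hypothetical, and the function name density_heights is illustrative only; the smoothing step is not shown.

```python
def density_heights(counts, width):
    """Rescale bin counts so that the histogram bars have total area 1."""
    n = sum(counts)
    return [c / (n * width) for c in counts]

counts = [5, 12, 8, 3, 2]   # hypothetical bin counts
width = 2.0                 # hypothetical common bin width

heights = density_heights(counts, width)
area = sum(h * width for h in heights)   # bar area = height * width

print(heights)
print(area)
```

Only the y-axis changes under this rescaling; the shape of the histogram is the same before and after.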


Unusual Observations

Are there any unusual observations or potential outliers?

[Figure: two example histograms to inspect for unusual observations]


Application exercise: Shapes of distributions


Describing Your Pictures

Bell Shaped: Data is bell shaped if the majority of the data is clustered around the center value (mean), with very few data points lying either far above or far below this value.

Right Skewed: Data is positively skewed if you have several unusually large data points creating a long tail to the right.

Left Skewed: Data is negatively skewed if you have several unusually small data points creating a long tail to the left.

Bimodal: Data is bimodal if it has two large clusters of data points.

Symmetric: Data is symmetric if it looks like a mirror image around a central point.

Uniformly Distributed: Data is evenly spread across all possible values.


Descriptive Statistics Center

Mean

The sample mean, denoted as x̄, can be calculated as

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\text{Sum of Data Points}}{\text{Number of Data Points}},$$

where $x_1, x_2, \cdots, x_n$ represent the $n$ observed values.

The population mean is a parameter computed the same way but is denoted as µ. It is often not possible to calculate µ since population data is rarely available.

x̄ is an estimate of µ based on the observed data.

The sample mean is a sample statistic, or a point estimate of the population mean. This estimate may not be perfect, but if the sample is good (representative of the population) it is usually a good guess.


Median

The median is the value that splits the data in half when ordered in ascending order.

$$0, 1, 2, 3, 4 \rightarrow 2$$

If there are an even number of observations, then the median is the average of the two values in the middle.

$$0, 1, 2, 3, 4, 5 \rightarrow \frac{2 + 3}{2} = 2.5$$

Since the median is the midpoint of the data, 50% of the values are below it. Hence, it is also the 50th percentile.
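The two worked examples can be checked with Python's standard statistics module. This is a minimal sketch; the list names odd and even are illustrative only.

```python
from statistics import mean, median

odd = [0, 1, 2, 3, 4]       # odd number of observations
even = [0, 1, 2, 3, 4, 5]   # even number of observations

# With an odd count, the median is the single middle value.
print(mean(odd), median(odd))

# With an even count, the median averages the two middle values (2 and 3).
print(mean(even), median(even))
```

In both of these symmetric examples the mean and the median coincide.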


Mean vs. Median

If the distribution is symmetric, the center is best described by the mean.

Symmetric: mean ≈ median

If the distribution is skewed or has outliers, the center is best described by the median.

Right-skewed: mean > median

Left-skewed: mean < median

[Figure: three panels (right-skewed, left-skewed, symmetric), each with the mean and median marked]
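The right-skew pattern can be seen in the thirteen non-missing energy values listed in the World Bank table excerpt earlier in the lecture. This is a sketch on that small excerpt only, so the numbers differ from the full-data mean and median reported for the histogram.

```python
from statistics import mean, median

# Non-missing X2011 values from the World Bank table excerpt.
energy = [672.74, 689.03, 1806.90, 7407.01, 1966.97, 916.26, 5500.79,
          3927.92, 1369.32, 5348.97, 384.56, 204.72, 2615.04]

m = mean(energy)
med = median(energy)

# A few large values pull the mean above the median.
print(round(m, 2), med)
print(m > med)
```

The median sits at the seventh of the thirteen ordered values, while the mean is dragged upward by the few high-use countries in the tail.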


Back to our Energy Data

[Figure: histogram, "Energy Use in 2011 (World Bank Data)"; x-axis: Energy Use (kg oil equivalent per capita), 0 to 15000; y-axis: Number of Countries, 0 to 80]

Mean: 2532.631

Median: 1593.7


Measures of Center

The Mean of a dataset is what we commonly refer to as the average.

The Median of a dataset is the middle value of your data. You find the median by ordering your data from smallest to largest, then finding the value with 50% of your data below it and 50% above it.

The Trimmed Mean is the mean calculated after removing a few of the very large and very small observations.
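A trimmed mean can be sketched as follows. The slide does not say how many observations to remove, so the parameter k (the number dropped from each end) is an assumption of this sketch, and the data are hypothetical.

```python
def trimmed_mean(values, k):
    """Mean after dropping the k smallest and the k largest observations."""
    if k < 0 or 2 * k >= len(values):
        raise ValueError("k must leave at least one observation")
    kept = sorted(values)[k:len(values) - k]
    return sum(kept) / len(kept)

data = [1, 2, 3, 4, 100]   # hypothetical data with one very large value

plain = trimmed_mean(data, 0)     # k = 0 is the ordinary mean
trimmed = trimmed_mean(data, 1)   # drop the smallest and the largest value

print(plain, trimmed)
```

Trimming one value from each end removes the influence of the single extreme observation, which is why the trimmed mean lands near the middle of the remaining data.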


Are you typical?

http://www.youtube.com/watch?v=4B2xOvKFFz4

How useful are centers alone for conveying the true characteristics of a distribution?


Describing distributions of numerical variables

When describing distributions of numerical variables always mention

Shape: skewness, modality

Center: an estimate of a typical observation in the distribution (mean, median, mode, etc.)

Unusual observations: observations that stand out from the rest of the data that may be suspected outliers

Spread: measure of variability in the distribution (SD, IQR, range, etc.)

[Figure: three example distributions, each on an axis from −3 to 3]


Descriptive Statistics Spread

Measures of Spread

The population Variance, σ², is the average squared deviation of the observations from the mean.

The population Standard Deviation, σ, is the square root of the variance.

The Interquartile Range (IQR) measures the spread of the middle 50% of your data, and is visually depicted in Boxplots.
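The population variance and standard deviation can be computed directly from their definitions. The sketch below treats a small hypothetical data set as the entire population (so it divides by the number of observations); the function name is illustrative only.

```python
from math import sqrt

def population_variance(values):
    """Average squared deviation of the values from their mean."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical population, mean 5

variance = population_variance(data)   # squared deviations sum to 32
sd = sqrt(variance)                    # back on the original scale

print(variance, sd)
```

Taking the square root returns the measure to the units of the data, which is why the standard deviation is usually easier to interpret than the variance.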


Box Plot

The box in a box plot represents the middle 50% of the data, and the thick line in the box is the median.

[Figure: box plot of # of study hours / week; axis from 10 to 40]


Anatomy of a Box Plot

[Figure: annotated box plot of # of study hours / week (axis from 0 to 40), labeling the lower whisker, Q1 (first quartile), median, Q3 (third quartile), upper whisker, max whisker reach, and suspected outliers]


Measures of Location

The 25th percentile is also called the first quartile, Q1.

The 50th percentile is also called the median.

The 75th percentile is also called the third quartile, Q3.

summary(d$study_hours)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
   3.00   10.00   15.00   17.42   20.00   40.00   13.00

Between Q1 and Q3 is the middle 50% of the data. The range these data span is called the interquartile range, or the IQR.

IQR = 20 − 10 = 10


Whiskers and Outliers

Whiskers of a box plot can extend up to 1.5 * IQR away from the quartiles.

max upper whisker reach: Q3 + 1.5 * IQR = 20 + 1.5 * 10 = 35

max lower whisker reach: Q1 − 1.5 * IQR = 10 − 1.5 * 10 = −5

An outlier is defined as an observation beyond the maximum reach of the whiskers. It is an observation that appears extreme relative to the rest of the data.
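The whisker arithmetic can be reproduced in a short sketch. The seven study-hour values below are hypothetical, chosen only so that the quartiles match the summary output (Q1 = 10, median = 15, Q3 = 20); they are not the class data. Quartile conventions differ between tools, and this sketch assumes the "inclusive" method of Python's statistics.quantiles.

```python
from statistics import quantiles

# Hypothetical data chosen to give Q1 = 10, median = 15, Q3 = 20.
study_hours = [3, 10, 10, 15, 20, 20, 40]

q1, med, q3 = quantiles(study_hours, n=4, method="inclusive")
iqr = q3 - q1

# Maximum whisker reach on each side of the box.
lower_reach = q1 - 1.5 * iqr
upper_reach = q3 + 1.5 * iqr

# Observations beyond the reach are flagged as suspected outliers.
outliers = [x for x in study_hours if x < lower_reach or x > upper_reach]

print(q1, med, q3, iqr)
print(lower_reach, upper_reach, outliers)
```

The lower reach is negative even though study hours cannot be, so on that side the whisker simply stops at the smallest observation.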


Outliers (cont.)

Why is it important to look for outliers?

Identify extreme skew in the distribution.

Identify data collection and entry errors.

Provide insight into interesting features of the data.


Why visualize?

What does a response of 0 mean in this distribution?

[Figure: distribution of the number of drinks it takes students to get drunk; axis from 0 to 12]


Example: Visualizing

What does our Energy Data look like?

[Figure: "Energy Use Data Boxplot"; y-axis: Energy Usage, 0 to 15000]


Who uses the most energy?

     Country.Name            X2011
1    Iceland                 17964.44
2    Qatar                   17418.69
3    Trinidad and Tobago     15691.29
4    Kuwait                  10408.28
5    Brunei Darussalam       9427.09
6    Oman                    8356.29
7    Luxembourg              8045.90
8    United Arab Emirates    7407.01
9    Bahrain                 7353.16
10   Canada                  7333.28
11   North America           7062.22
12   United States           7032.35
13   Saudi Arabia            6738.42
14   Singapore               6452.33
15   Finland                 6449.04


Participation question

Which of the following is false about the distribution of average number of hours students study daily?

[Figure: plot of average number of hours students study daily; axis from 2 to 10]

Min.    1st Qu.   Median   Mean    3rd Qu.   Max.
1.000   3.000     4.000    3.821   5.000     10.000

(a) There are no students who don’t study at all.

(b) 75% of the students study more than 5 hours daily, on average.

(c) 25% of the students study less than 3 hours, on average.

(d) IQR is 2 hours.


Measures of Spread

The population Variance, σ², is the average squared deviation of the observations from the mean.

The population Standard Deviation, σ, is the square root of the variance.

The Interquartile Range (IQR) measures the spread of the middle 50% of your data, and is visually depicted in Boxplots.
