More Numerical and Graphical Summaries using Percentiles

David Gerard

2017-09-18


Learning Objectives

• Percentiles

• Five Number Summary

• Boxplots to compare distributions.

• Sections 1.6.5 and 1.6.6 in DBC.


Trump’s Tweet Length

[Figure: histogram of tweet length; x-axis: length (0 to 150), y-axis: count (0 to 300)]

• Mean = 102.7281, median = 114.5

• Standard deviation = 37.4711, MAD = 36.3237


Are these sufficient summaries?

• Tells us nothing about the left skew.

• Doesn’t tell us that a fourth of all tweets are longer than 138 characters.

• Doesn’t tell us that small tweets are quite rare.


Percentiles

percentile

The pth percentile of a distribution is the value such that p percent of the observations fall at or below it. To calculate a percentile, arrange the observations in increasing order and count up the required percent from the bottom of the list.
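A minimal sketch of the "count up from the bottom" recipe, on made-up numbers. The helper name percentile_by_counting() is hypothetical, and R's quantile() offers several conventions; type = 1 is used here because it follows a similar counting rule.

```r
# Hypothetical data for illustration (not from the slides)
x <- c(12, 45, 7, 33, 28, 19, 50, 41, 3, 25)

# Count up the required percent from the bottom of the sorted list.
# percentile_by_counting() is a made-up helper name.
percentile_by_counting <- function(x, p) {
  sorted <- sort(x)
  k <- ceiling(p * length(sorted) / 100)  # how far up the list to count
  sorted[max(k, 1)]                       # guard against p = 0
}

p30 <- percentile_by_counting(x, 30)

# quantile() with type = 1 follows a similar counting convention
p30_builtin <- unname(quantile(x, probs = 0.30, type = 1))

print(c(by_counting = p30, builtin = p30_builtin))
```

Other quantile() types interpolate between neighbouring observations, so they can return slightly different values on small samples.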


Why do we care?

• If we know a few percentiles, that gives us an idea of the

shape of a distribution.

• Knowing the same percentiles of two distributions makes it

easy to quickly compare them.

• It’s common to report the 0th (= minimum), 25th, 50th (= median), 75th, and 100th (= maximum) percentiles. Together these make up the five number summary.
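As a rough illustration, these five percentiles can be requested in one call. The data below are made up, and quantile() is used with its default settings.

```r
# Hypothetical data for illustration (not from the slides)
x <- c(3, 7, 12, 19, 25, 28, 33, 41, 45, 50, 62)

# The 0th, 25th, 50th, 75th, and 100th percentiles in one call
five <- quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1))
print(five)

# summary() reports the same percentiles plus the mean
print(summary(x))
```

The 0%, 50%, and 100% entries agree with min(x), median(x), and max(x).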


Quartiles

• The 25th and 75th percentiles have special names:

Quartiles

The first quartile Q1 is the 25th percentile. It is the median of

the lower half of the data.

The third quartile Q3 is the 75th percentile. It is the median of

the upper half of the data.
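A small sketch of the "median of each half" description, on a made-up sample with an even number of observations. fivenum() is included for comparison because its hinges use a closely related rule.

```r
# Hypothetical data with n = 8 (not from the slides)
x <- sort(c(9, 1, 15, 4, 8, 3, 12, 6))
n <- length(x)

lower_half <- x[1:(n / 2)]        # smallest half of the observations
upper_half <- x[(n / 2 + 1):n]    # largest half of the observations

q1 <- median(lower_half)          # first quartile
q3 <- median(upper_half)          # third quartile

# fivenum() returns min, lower hinge, median, upper hinge, max
hinges <- fivenum(x)[c(2, 4)]

print(c(Q1 = q1, Q3 = q3))
print(hinges)
```

quantile() interpolates by default, so its 25% and 75% values can differ slightly from these on small samples.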


Example: Trump’s Tweet Length

These ARE NOT the quartiles of Trump’s tweet length.

[Figure: histogram of trump$length; x-axis: trump$length (0 to 140), y-axis: Frequency (0 to 500)]


Example: Trump’s Tweet Length

These ARE NOT the quartiles of Trump’s tweet length.

[Figure: histogram of trump$length; x-axis: trump$length (0 to 140), y-axis: Frequency (0 to 800)]


Example: Trump’s Tweet Length

These ARE the quartiles of Trump’s tweet length.

[Figure: histogram of trump$length; x-axis: trump$length (0 to 140), y-axis: Frequency (0 to 800)]


Example: Trump’s Tweet Length

These ARE the quartiles of Trump’s tweet length.

[Figure: histogram of trump$length; x-axis: trump$length (0 to 140), y-axis: Frequency (0 to 400)]


Boxplot

• It’s very useful to plot these quantiles in what is called a

boxplot.

boxplot

A boxplot is a graph of the five number summary. A central box

spans the quartiles Q1 and Q3. A line in the box marks the

median M. Lines (the “whiskers”) extend from the box out to

the smallest and largest observations.
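A hedged sketch of what such a boxplot draws, on made-up data. With range = 0 the whiskers run to the extremes, and plot = FALSE returns the five plotted values without drawing. Note that R's boxplot() uses hinges from fivenum(), which are close to, but not always identical to, Q1 and Q3.

```r
# Hypothetical data for illustration (not from the slides)
x <- c(3, 7, 12, 19, 25, 28, 33, 41, 45, 50, 62)

# range = 0: whiskers extend to the smallest and largest observations.
# plot = FALSE: return the plotted values instead of drawing.
b <- boxplot(x, range = 0, plot = FALSE)

box_values <- as.vector(b$stats)
names(box_values) <- c("min", "lower hinge", "median", "upper hinge", "max")
print(box_values)
```

Dropping plot = FALSE draws the boxplot itself.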


Trump’s Tweets

boxplot(trump$length, range = 0)

[Figure: boxplot of trump$length (axis from 0 to 140), with the min, Q1, M, Q3, and max labelled]


Boxplots tell us about skew: trump

[Figure: boxplot and histogram of tweet length; x-axis: length, y-axis: count]


Boxplots tell us about skew: email

[Figure: boxplot and histogram of email length; x-axis: email length, y-axis: count]


Boxplots tell us about skew: satGPA

[Figure: boxplot and histogram of SAT verbal scores; x-axis: SAT Verbal Scores, y-axis: count]


Most boxplots you see will actually have more in them

boxplot(email$num_char)

[Figure: boxplot of email$num_char (axis from 0 to 150), with some observations plotted as individual points]

What are those points?


IQR

To answer that, we first need to introduce the interquartile range

(IQR).

IQR

The interquartile range IQR is the distance between the first and

third quartiles,

IQR = Q3 − Q1,

and is a measure of spread.
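The definition can be checked by hand on made-up data. This sketch assumes quantile()'s default settings, which is also what IQR() uses.

```r
# Hypothetical data for illustration (not from the slides)
x <- c(3, 7, 12, 19, 25, 28, 33, 41, 45, 50, 62)

q <- quantile(x, probs = c(0.25, 0.75))  # Q1 and Q3
iqr_by_hand <- unname(q[2] - q[1])       # IQR = Q3 - Q1
iqr_builtin <- IQR(x)

print(c(by_hand = iqr_by_hand, builtin = iqr_builtin))
```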


Like MAD, IQR is a robust measure of spread

IQR(c(1, 2, 2, 3, 3))

[1] 1

IQR(c(1, 2, 2, 3, 10))

[1] 1

IQR(c(1, 2, 2, 3, 20))

[1] 1

IQR(c(1, 2, 2, 3, 100))

[1] 1
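For contrast, a short sketch putting the standard deviation next to the IQR on two of the vectors above. The variable names are made up for illustration.

```r
clean        <- c(1, 2, 2, 3, 3)
contaminated <- c(1, 2, 2, 3, 100)  # same data with one extreme value

# One row per vector, one column per measure of spread
spread <- rbind(
  clean        = c(sd = sd(clean),        IQR = IQR(clean)),
  contaminated = c(sd = sd(contaminated), IQR = IQR(contaminated))
)
print(spread)
```

The single extreme value inflates the standard deviation many times over while leaving the IQR unchanged.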


1.5× IQR Rule

1.5 × IQR Rule

People will often call an observation a suspected outlier if it falls

more than 1.5 × IQR above the third quartile or below the first

quartile.

• In most boxplots, the upper whisker extends to the largest

observation within 1.5 × IQR of Q3.

• In most boxplots, the lower whisker extends to the smallest

observation within 1.5 × IQR of Q1.

• Points outside of [Q1 − 1.5 × IQR, Q3 + 1.5 × IQR] are labelled “suspected outliers” and are plotted individually.
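A minimal sketch of the rule on made-up data: compute the two fences, flag whatever falls outside them, and find where the whiskers end. The variable names are illustrative assumptions. boxplot.stats() applies a similar rule based on hinges rather than quantile().

```r
# Hypothetical data with one unusually large value (not from the slides)
x <- c(2, 4, 4, 5, 6, 7, 8, 9, 30)

q1  <- unname(quantile(x, 0.25))
q3  <- unname(quantile(x, 0.75))
iqr <- q3 - q1

lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Suspected outliers fall outside the fences
suspected <- x[x < lower_fence | x > upper_fence]

# Whiskers end at the most extreme observations inside the fences
whiskers <- range(x[x >= lower_fence & x <= upper_fence])

print(c(lower_fence = lower_fence, upper_fence = upper_fence))
print(suspected)
print(whiskers)
```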


Sometimes, be suspicious of this rule

[Figure: boxplot of email length; y-axis: email length (0 to 150)]

5.25 percent of all emails are “outliers”?
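A rough sketch of how such a percentage might be computed. The data here are simulated from a right-skewed distribution and are NOT the email data, so the proportion will not match the slide.

```r
set.seed(1)
x <- rexp(1000)  # simulated right-skewed data, an assumption for illustration

# Points that boxplot() would draw individually under the 1.5 x IQR rule
flagged <- boxplot.stats(x)$out

prop_flagged <- length(flagged) / length(x)
print(prop_flagged)
```

In skewed data the rule tends to flag a noticeable share of ordinary observations in the long tail, which is one reason to treat the label "outlier" with caution.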


Recall Movie Scores Dataset

Observational units: Movies that sold tickets in 2015.

Variables:

• rt: Rotten Tomatoes score, normalized to a 5-point scale.

• meta: Metacritic score, normalized to a 5-point scale.

• imdb: IMDB score, normalized to a 5-point scale.

• fan: Fandango score.


Recall Movie Scores Dataset

library(tidyverse)  # provides read_csv(), %>%, select(), and transmute()

# Read the data, keep the score columns, and give them shorter names
read_csv("../../data/movie.csv") %>%
  select(FILM, RT_norm, Metacritic_norm,
         IMDB_norm, Fandango_Stars) %>%
  transmute(film = FILM, rt = RT_norm, meta = Metacritic_norm,
            imdb = IMDB_norm, fan = Fandango_Stars) ->
  movie

head(movie)

# A tibble: 6 x 5
                            film    rt  meta  imdb   fan
                           <chr> <dbl> <dbl> <dbl> <dbl>
1 Avengers: Age of Ultron (2015)  3.70  3.30  3.90   5.0
2              Cinderella (2015)  4.25  3.35  3.55   5.0
3                 Ant-Man (2015)  4.00  3.20  3.90   5.0
4         Do You Believe? (2015)  0.90  1.10  2.70   5.0
5  Hot Tub Time Machine 2 (2015)  0.70  1.45  2.55   3.5
6       The Water Diviner (2015)  3.15  2.50  3.60   4.5


How to compare these distributions?

Side-by-side boxplots!

boxplot(movie[, 2:5])

[Figure: side-by-side boxplots of rt, meta, imdb, and fan; y-axis from 1 to 5]
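A numerical companion to side-by-side boxplots is one five number summary per column. This sketch uses only the six rows printed by head(movie), so the values describe those six films rather than the full dataset. The object names are made up for illustration.

```r
# Only the six rows shown by head(movie); NOT the full dataset
first_six <- data.frame(
  rt   = c(3.70, 4.25, 4.00, 0.90, 0.70, 3.15),
  meta = c(3.30, 3.35, 3.20, 1.10, 1.45, 2.50),
  imdb = c(3.90, 3.55, 3.90, 2.70, 2.55, 3.60),
  fan  = c(5.0, 5.0, 5.0, 5.0, 3.5, 4.5)
)

# One column of summaries per score; fivenum() returns hinges,
# which are close to Q1 and Q3
summaries <- sapply(first_six, fivenum)
rownames(summaries) <- c("min", "lower hinge", "median", "upper hinge", "max")
print(summaries)
```

Reading across a row compares the same summary for all four scores, which is what the eye does with side-by-side boxplots.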


Another Option: stacked histograms

old_parameters <- par(mfrow = c(4, 1))

hist(movie$rt, xlim = c(0, 5))

hist(movie$meta, xlim = c(0, 5))

hist(movie$imdb, xlim = c(0, 5))

hist(movie$fan, xlim = c(0, 5))

par(old_parameters)

IMPORTANT: Same x-limits for all plots when stacking vertically.


Another Option: stacked histograms

[Figure: four vertically stacked histograms of movie$rt, movie$meta, movie$imdb, and movie$fan; each x-axis runs from 0 to 5, y-axis: Frequency]
