Revision: 1-12 1 1 1 Module 2: Descriptive Statistics (and a bit about R) Statistics (OA3102) Professor Ron Fricker Naval Postgraduate School Monterey, California Reading assignment: WM&S chapter 1

Module 2: Descriptive Statistics (and a bit about R)faculty.nps.edu/rdfricke/OA3102/Descriptive Statistics.pdf ·  · 2011-12-23Module 2: Descriptive Statistics (and a bit about


Why Care About Descriptive Statistics?

• Data sets continue to grow ever bigger
  – The human mind cannot assimilate and make sense of volumes of raw data
• Descriptive statistics are useful for data reduction
  – Numeric summaries
  – Graphical plots
• Good descriptive statistics help analysts and decision makers understand what the raw data mean

Goals for this Module

• Define types of data and types of variables
• Learn how to appropriately summarize data using descriptive statistics
  – Numerical descriptive statistics
    • Measures of location: mean, median, mode
    • Measures of spread: variance, standard deviation, range, inter-quartile range, etc.
  – Graphical descriptive statistics
    • Continuous variables: histogram, boxplot
    • Categorical variables: barplots, pie charts
• R paradigms and summarizing data with R

Variables

• A characteristic that is being studied in a statistical problem is called a variable
• Types of variables:
  – Continuous: Can divide by any number and the result still makes sense
    • Examples: flight time, failure rate, detection distance
  – Categorical:
    • Ordinal: ordered categories
      – Examples: rank, magazine capacity, shirt size
    • Nominal: unordered categories
      – Examples: gender, service branch, ship type

Types of Data

[Diagram: Data divides into Qualitative and Quantitative; Quantitative divides into Discrete and Continuous. The diagram also carries the labels (nominal), (ordinal), and (continuous).]

Some Descriptive Statistics

• Numerical:
  – Location: Mean, median, mode
  – Spread: Standard deviation, variance, range, quantiles, IQR
  – Correlation
• Graphical:
  – Histograms, bar charts, dot charts, boxplots, scatter plots, etc.
• Good descriptive statistics lead to good decision making

Sample Mean ($\bar{x}$)

• Sample average or sample mean
  – Sample consists of n observations, $x_1, \ldots, x_n$
  – Often denoted by $\bar{x}$ (spoken “x-bar”)
  – $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
• To calculate
  – R: use mean() function
  – Excel: =AVERAGE(cell reference)

Sample Median ($\tilde{x}$)

• The median is the halfway point in the ordered data
• Steps to calculate the median:
  – Order the data from smallest to largest
  – If the number of data points is odd, the middle observation is the median. E.g., for
      1 3 5 6 12 12 99
    the median is 6
  – If the number is even, then the average of the two middle observations is the median. E.g., for
      1 3 5 6 12 12
    the median is (5 + 6)/2 = 5.5

Using More Formal Notation…

• Let $x_{(i)}$ denote the ith order statistic from a sample $x_1, x_2, \ldots, x_n$
  – E.g., for $x_1 = 5, x_2 = 12, x_3 = 2$, we have $x_{(1)} = 2, x_{(2)} = 5, x_{(3)} = 12$
• Then the sample median can be defined as
  – n odd: $\tilde{x} = x_{\left(\frac{n+1}{2}\right)}$
  – n even: $\tilde{x} = \frac{x_{(n/2)} + x_{(n/2+1)}}{2}$
  – Equations apply to samples and populations
• To calculate
  – R: use median() function
  – Excel: =MEDIAN(cell reference)

Mean vs. Median

• Both are measures of location or “central tendency”
  – But the median is less affected by outliers
• Example:
  – Imagine a sample of data: 0, 0, 0, 1, 1, 1, 2, 2, 2
    • Median=mean=1
  – Another sample of data: 0, 0, 0, 1, 1, 1, 2, 2, 83
    • Median still equals 1, but mean=10!
• Which to use? Depends on whether you are:
  – characterizing a “typical” observation (the median)
  – or describing the average value (the mean)

Exercise

• Calculate “by hand” the mean and median for the data: {6,1,3,7,3,6,7,4,8}

Exercise (continued)

• Now do the same for {6,1,3,7,3,6,7,4,8,100}

Now, in R:

• For {6,1,3,7,3,6,7,4,8}: [R output shown as an image on the original slide]
• For {6,1,3,7,3,6,7,4,8,100}: [R output shown as an image on the original slide]
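
The console output for this slide did not survive extraction. A minimal sketch of how the two exercises might be checked in R follows; the variable names x and y are illustrative and not taken from the slides.

```r
# Data from the two exercises
x <- c(6, 1, 3, 7, 3, 6, 7, 4, 8)
y <- c(x, 100)   # same data with one large value appended

mean(x)      # 5
median(x)    # 6
mean(y)      # 14.5
median(y)    # 6
```

Appending the single value 100 nearly triples the mean but leaves the median unchanged, which matches the point made on the Mean vs. Median slide.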

Common Measures of “Spread”

• Measures of location tell you where the “center” of the data is
• Measures of spread tell you how variable the data is around the center
• Typical measures of spread:
  – Sample variance: essentially, the average squared deviation around the mean, $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
  – Standard deviation: the square root of the variance, $s = \sqrt{s^2}$
    • The standard deviation is in the same units as the mean
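
As a rough illustration of the two definitions, the sketch below computes the sample variance and standard deviation directly from the formulas and compares them with R's built-in var() and sd(). The data vector is borrowed from the earlier mean/median exercise; the variable names are illustrative.

```r
x <- c(6, 1, 3, 7, 3, 6, 7, 4, 8)
n <- length(x)
xbar <- mean(x)

# Sample variance: sum of squared deviations divided by n - 1
s2_manual <- sum((x - xbar)^2) / (n - 1)
# Standard deviation: square root of the variance
s_manual <- sqrt(s2_manual)

# Built-in equivalents (both use the n - 1 divisor)
s2_builtin <- var(x)
s_builtin <- sd(x)

all.equal(s2_manual, s2_builtin)   # TRUE
all.equal(s_manual, s_builtin)     # TRUE
```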

Exercise

• Calculate “by hand” the sample variance and standard deviation for the data: {1,2,3,4,5}

Pictorially

[Four figure slides; the images are not reproduced in this text]

Ignore Variability at Your Peril

• Often analyses only focus on the average
• But it’s possible to be right on average and be way off in every case
  – The average high temperature in Washington DC in June is 83 degrees
    • “Oh, how balmy!”
    • No...it’s either 75° or it’s 90+ degrees!

From Flaws and Fallacies in Statistical Thinking by Stephen K. Campbell.

The Range (R)

• Range is another measure of spread
• In words, it is the largest observation in the sample minus the smallest observation
  – Example: A sample of students’ ages in the class
    • Data: 21, 23, 23, 25, 25, 26, 27, 31, 33, 33, 35, 40
    • Note that they are already ordered!
    • R = 40 - 21 = 19
  – Using previous notation: $R = x_{(n)} - x_{(1)}$
• In R: use the code diff(range())
  – range() function gives $x_{(1)}$ and $x_{(n)}$
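
The students' ages example can be reproduced with a short sketch like the one below; the vector name ages is an assumption made for illustration.

```r
# Students' ages from the slide (order does not matter to range())
ages <- c(21, 23, 23, 25, 25, 26, 27, 31, 33, 33, 35, 40)

range(ages)             # smallest and largest observation: 21 40
diff(range(ages))       # the range R: 19
max(ages) - min(ages)   # same result, written out explicitly
```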

Other Measures of Spread: Quantiles and Percentiles

• Percentiles
  – For data, the pth percentile, $0 \le p \le 100$, is the value of x such that p% of the data is less than or equal to x
• Quantiles are the same as percentiles except for scale
  – Percentiles are on a 0 to 100 scale
  – Quantiles are on a 0 to 1 scale
  – The pth quantile equals the (p × 100)th percentile

Special Percentiles and Quantiles

• Special percentiles:
  – Minimum: 0th percentile (or 0 quantile)
  – Median: 50th percentile (or 0.5 quantile)
  – Maximum: 100th percentile (or 1.0 quantile)
• Quartiles: 25th and 75th percentiles
  – Devore: “lower fourth” and “upper fourth”
• Interquartile Range (IQR): IQR = 75th percentile - 25th percentile
  – Devore calls the IQR the “fourth spread”
  – In R: IQR()

Calculating Quantiles

• R function: quantile(data, probs)
  – data is a numeric vector of data
  – probs is a numeric vector of probabilities
    • Default: 0, 0.25, 0.5, 0.75 and 1.0 quantiles
• In R, the pth quantile is $x_{(p(n-1)+1)}$
  – If $p(n-1)+1$ is not an integer, interpolate between the two closest values
  – E.g., [R console example shown as an image on the original slide]
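
The slide's own example was an image and is not recoverable. As a stand-in, here is a small sketch of the interpolation rule for R's default quantile method, using the even-sized data set from the median slide. The helper variables (h, lo, q_manual) are illustrative, and the manual step as written assumes the position falls strictly inside the data, i.e. p < 1.

```r
x <- c(1, 3, 5, 6, 12, 12)
xs <- sort(x)
n <- length(xs)
p <- 0.25

# Position of the pth quantile among the order statistics
h <- p * (n - 1) + 1    # 2.25 here, so between x(2) and x(3)
lo <- floor(h)

# Linear interpolation between the two closest order statistics
q_manual <- xs[lo] + (h - lo) * (xs[lo + 1] - xs[lo])

# Built-in version with the default method
q_builtin <- quantile(x, probs = p)

q_manual             # 3.5
unname(q_builtin)    # 3.5
quantile(x)          # default probs: 0, 0.25, 0.5, 0.75, 1
```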

Hinges

• Hinges are an alternative to quartiles
  – They’re the $x_{(j)}$ and $x_{(n-j+1)}$ order statistics, for $j = \frac{\lfloor (n+1)/2 \rfloor + 1}{2}$, where if j is not an integer, interpolate
• Easier way to compute:
  – If n is even, they’re the median values of the upper and lower halves of the sorted data
  – If n is odd, they’re the median values of the upper and lower halves of the sorted data, where each half includes the median data point

Exercise

• “By hand,” calculate the five number summary for {12,2,7,5,15,4,9,18,6}
  – The five number summary is the minimum, lower hinge, median, upper hinge, maximum

Exercise (continued)

• “By hand,” calculate the five number summary for {12,2,7,5,15,4,9,18,6,10}

Results in R

[R console output shown as an image on the original slide]
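
The R output for this slide was an image. One way the two exercises might be checked is with fivenum(), which returns the minimum, lower hinge, median, upper hinge, and maximum; the variable names x9 and x10 are illustrative.

```r
x9 <- c(12, 2, 7, 5, 15, 4, 9, 18, 6)
x10 <- c(x9, 10)

fivenum(x9)    # 2 5 7 12 18
fivenum(x10)   # 2 5 8 12 18

# quantile() uses interpolated quartiles rather than hinges,
# so its 25% and 75% values can differ from the hinges
quantile(x10)
```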

The Empirical Rule

• If the distribution of measurements is approximately normal, then:
  – 68% of the data is within $\mu \pm 1\sigma$
  – 95% within $\mu \pm 2\sigma$
  – 99.7% (“almost all”) within $\mu \pm 3\sigma$

[Figure: standard normal density curve, horizontal axis Z from -4 to 4, vertical axis from 0.00 to 0.40, with the 68%, 95%, and 99.7% regions marked]
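
The three percentages can be checked against the standard normal distribution with pnorm(), the same function the homework hints mention. This is a small sketch; the names k and coverage are illustrative.

```r
# Probability that a normal observation falls within k standard
# deviations of the mean, for k = 1, 2, 3
k <- 1:3
coverage <- pnorm(k) - pnorm(-k)

round(coverage, 4)   # 0.6827 0.9545 0.9973
```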

Remember Notation Conventions

• Summation:
  – Σ notation and subscripts
• Size:
  – n denotes size of sample
  – N denotes size of population
• Knowns vs. unknowns:
  – Small letters (e.g., “x”) mean quantity is known
  – Capital letters (e.g., “X”) mean quantity is unknown (i.e., it’s a random variable)

Graphically Depicting Data

• Many different types of plots and charts
• Whatever you do, don’t fall into the trap of just using Excel plots because they’re easy
  – R is much more powerful and flexible
  – Excel does not do some important/useful plot types

[Figure: example histogram with a count axis (5 to 15) and a horizontal axis from 80 to 125 (thousands)]

A Classic Good Graphic

[Figure not reproduced in this text]

Some Types of Graphical and Tabular Summaries of Data

• Univariate discrete data: tables, barplots, dot charts, pie charts
• Univariate continuous data: stem-and-leaf plots, strip charts, histograms, boxplots
• Bivariate discrete data: two-way contingency tables
• Bivariate continuous data: scatterplots, QQ plots

Tabular Summaries of Data

• Categorical data: counts and/or percentages by category
• Continuous data: counts and/or percentages within “bins”
  – Bins: sequential intervals over the range of data
    • Generally intervals are of equal width
• Must decide how to count a data point that falls on the boundary between two bins
  – Either count them all in the left bins, or in the right bins
  – Doesn’t matter which, just be consistent

Example: Tabular Summary of Univariate Categorical Data

Manufacturer       Frequency   Relative Frequency (fraction)
Honda                     41   0.34
Yamaha                    27   0.23
Kawasaki                  20   0.17
Harley-Davidson           18   0.15
BMW                        3   0.03
Other                     11   0.08
Total                    120   1.00

• In R, use the table() function
• For the example: [R output shown as an image on the original slide]
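
The R output for the example was an image. The sketch below rebuilds a vector named manufac (the name used on the later barplot slides) from the frequency column and tabulates it. How the original data were actually stored is not shown in the slides, so the construction with rep() is an assumption.

```r
# Frequencies from the table above
counts <- c(Honda = 41, Yamaha = 27, Kawasaki = 20,
            "Harley-Davidson" = 18, BMW = 3, Other = 11)

# One entry per motorcycle, so manufac has 120 elements
manufac <- rep(names(counts), times = counts)

# Counts by category (table() orders the categories alphabetically)
freq <- table(manufac)
freq

# Relative frequencies; rounded values may differ from the slide
# in the last digit
rel_freq <- prop.table(freq)
round(rel_freq, 2)
```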

Barplots

• Barplots are also known as bar charts and bar graphs
• Plot one bar for each category
  – Bars show counts or percentage of observations in each category
• Can plot bars vertically or horizontally
• In R: barplot()
  – Option horiz=TRUE plots bars horizontally (default is FALSE)

In R

barplot(table(manufac),xlab="Manufacturer",ylab="Count")
barplot(table(manufac),ylab="Manufacturer",xlab="Count",horiz=TRUE)

Plotting Fractions

barplot(table(manufac)/length(manufac),xlab="Manufacturer",ylab="Fraction")
barplot(table(manufac)/length(manufac),ylab="Manufacturer",xlab="Fraction",horiz=TRUE)
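
The barplot calls above assume a vector manufac already exists. A self-contained sketch that builds such a vector from the frequency table and draws the horizontal fraction plot might look like the following. The rep() construction, the sorting of the bars, and the use of a null PDF device (so the code runs without opening a window) are all choices made for this illustration, not part of the original slides.

```r
# Rebuild a manufac vector from the tabulated frequencies (assumed layout)
counts <- c(Honda = 41, Yamaha = 27, Kawasaki = 20,
            "Harley-Davidson" = 18, BMW = 3, Other = 11)
manufac <- rep(names(counts), times = counts)

# Fraction of observations in each category, sorted so the
# bars appear in increasing order
fractions <- sort(table(manufac) / length(manufac))

pdf(NULL)   # null device: nothing is written to disk; omit when working interactively
mids <- barplot(fractions, ylab = "Manufacturer", xlab = "Fraction",
                horiz = TRUE)
dev.off()

mids        # bar midpoints returned by barplot(), one per category
```

Sorting the bars is optional; it tends to make a categorical barplot easier to read, since nominal categories have no natural order.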

Histograms

• A histogram is a graph of the observed frequencies in a sample or population
• Histograms show the distribution of the data
• Reading a histogram:

[Figure: frequency histogram with counts from 0 to 12 on the vertical axis and values from 170 to 260 on the horizontal axis. Annotation: “There are 10 observations greater than 215 but less than or equal to 225”]

Histograms Depict the Empirical Distribution

• Histograms help answer:
  – Where is the mean of the data (roughly) located?
  – How variable is the data?
  – What is the overall shape of the data?
    • Is the distribution symmetric? Is it skewed? If so, in what direction?
  – Are there any unusual observations?
• In R: hist() function
  – Options:
    • breaks option allows user to vary number of bars
    • freq=TRUE (default) gives counts
    • freq=FALSE gives density histogram (area sums to one)

Frequency Histogram of Challenger Data

84 49 61 40 83 67 45 66 70 69 80 58
68 60 67 72 73 70 57 63 70 78 52 67
53 67 75 61 70 81 76 79 75 76 58 31

> challenger<-c(84,49,61,40,83,67,45,66,70,69,80,58,
                68,60,67,72,73,70,57,63,70,78,52,67,
                53,67,75,61,70,81,76,79,75,76,58,31)
> hist(challenger)

Density Histogram of Challenger Data

hist(challenger,freq=FALSE)
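
To see the difference between the frequency and density versions without drawing anything, hist() can be called with plot=FALSE and its return value inspected. This is a sketch; the variable name h is illustrative.

```r
challenger <- c(84, 49, 61, 40, 83, 67, 45, 66, 70, 69, 80, 58,
                68, 60, 67, 72, 73, 70, 57, 63, 70, 78, 52, 67,
                53, 67, 75, 61, 70, 81, 76, 79, 75, 76, 58, 31)

# Compute the histogram without plotting it
h <- hist(challenger, plot = FALSE)

h$breaks    # bin boundaries chosen by R
h$counts    # bar heights when freq=TRUE
h$density   # bar heights when freq=FALSE

sum(h$counts)                     # 36, the sample size
sum(h$density * diff(h$breaks))   # 1, total area of the density histogram
```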

Dos and Don’ts for Histograms

• Do try alternate numbers of bars
  – Find best depiction of the shape (distribution) of data
  – Start with number of classes = $\sqrt{n}$ (i.e., breaks = $\sqrt{n} - 1$)
• Don’t use unequal bin widths – keep the bar widths all the same
• Don’t plot histograms by hand – use software

[Figure: four histograms of the Challenger data, produced by hist(challenger,breaks=2), hist(challenger,breaks=5), hist(challenger,breaks=9), and hist(challenger,breaks=25)]
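
When experimenting with the number of bars, it helps to know that R's documentation describes a single number passed to breaks as a suggestion only: the bin boundaries are chosen to be "pretty" values, so the number of bars drawn may not equal the number requested. The sketch below counts the bars R actually produces for the four settings in the figure; the name bars is illustrative.

```r
challenger <- c(84, 49, 61, 40, 83, 67, 45, 66, 70, 69, 80, 58,
                68, 60, 67, 72, 73, 70, 57, 63, 70, 78, 52, 67,
                53, 67, 75, 61, 70, 81, 76, 79, 75, 76, 58, 31)

sqrt(length(challenger))   # 6, the square-root starting point for n = 36

# Number of bars actually produced for each requested value of breaks
requested <- c(2, 5, 9, 25)
bars <- sapply(requested, function(b) {
  length(hist(challenger, breaks = b, plot = FALSE)$counts)
})
names(bars) <- paste0("breaks=", requested)
bars
```

Passing a vector of explicit boundaries to breaks instead (for example seq(30, 90, by = 10)) gives exact control over the bins.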

Extremes in Histograms

One extreme: A single bar for all the data – but that just shows the total, no information about the shape of the data

Another extreme: One bar for each temperature – but that’s just a bar chart. It’s hard to see the shape

$\sqrt{n}$ classes seems to be about right to show distribution of the data

[Figure: histograms of the Challenger data; horizontal axis Temperature (F), vertical axis Frequency (count) from 0 to 40; the single-bar version covers 30-89]

Differences Between Barplots and Histograms

• Barplots:
  – For categorical data
  – Often most easily read with bars plotted horizontally
  – Adjacent bars are separated from each other
• Histograms:
  – For continuous data
  – Convention to plot bars vertically (to look like a pdf)
  – Adjacent (nonzero) bars touch (since base of each bar denotes the “bin” for that bar)

Boxplots

• Boxplots show distribution in one dimension
  – Only useful for continuous variables
  – Good for comparing distributions of a continuous variable between categorical groups
  – Will not show multiple modes
• Illustration (of one variant):

[Figure: boxplot with the median, hinges, whiskers, and outliers labeled]

Exercise

• Given the following summary statistics for the Challenger data, (roughly) draw the boxplot over the “strip chart”

[Summary statistics and strip chart shown as images on the original slide]

Exercise: Result from R

• Boxplot [shown as an image on the original slide]
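
Neither the summary statistics nor the resulting plot survived extraction. A sketch of how both might be regenerated from the Challenger data follows. It assumes R's default boxplot variant (whiskers extend to the most extreme points within 1.5 hinge-spreads of the box), which may or may not be the variant shown on the slide, and it draws to a null device so it runs without opening a window.

```r
challenger <- c(84, 49, 61, 40, 83, 67, 45, 66, 70, 69, 80, 58,
                68, 60, 67, 72, 73, 70, 57, 63, 70, 78, 52, 67,
                53, 67, 75, 61, 70, 81, 76, 79, 75, 76, 58, 31)

# Minimum, lower hinge, median, upper hinge, maximum
fivenum(challenger)        # 31.0 59.0 67.5 75.0 84.0

# The numbers boxplot() draws from
bs <- boxplot.stats(challenger)
bs$stats                   # lower whisker, lower hinge, median, upper hinge, upper whisker
bs$out                     # points plotted individually as outliers

pdf(NULL)                  # null device; omit when working interactively
boxplot(challenger, horizontal = TRUE, xlab = "Temperature (F)")
stripchart(challenger, add = TRUE, pch = 1)   # overlay the raw data points
dev.off()
```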

Histograms vs. Boxplots

• Histogram shows distribution of the data in two dimensions – the boxplot is in one dimension
  – Histogram shows frequency of observations within ranges
  – Boxplot only shows summary statistics

We’ll Use Software To Do Most Calculations and Plots…

• …generally R
• Benefits of R include:
  – It’s free
  – More importantly, it’s powerful, flexible, extensible, and cutting-edge
  – In terms of extensibility, there are now thousands of libraries (aka packages) available to do custom calculations, plots, etc.

Some R Paradigms

• Command line interface
• Object-oriented programming
• Types of objects, particularly data frames
• Vector-based calculations

Command Line Interface

• Command line allows scripting/programming, which gives flexibility and extensibility
  – Point and click paradigm limits user to what has been programmed into the interface
  – Trade-off is “user friendliness,” meaning command line users must learn the underlying language and syntax
• Good news: Once you gain a working familiarity, you have access to a very powerful computing tool

All the Std Graphics Plus…

[Figure slides: Example #1 through Example #5: Flexible Graphics; images not reproduced in this text]

Object-oriented Programming

• R is an object-oriented programming language
  – Wikipedia: “Object-oriented programming (OOP) is a programming paradigm that uses "objects" … to design applications and computer programs.”
• Everything in R is an object of some type
  – Each type of object has particular properties
  – Properties control what objects can and cannot do, as well as how other objects interact with them

Types of Objects

• Important types of objects in R:
  – Vector: a one-dimensional list of numbers
  – Matrix: a two-dimensional list of numbers
  – Array: a multi-dimensional list of numbers
  – Data.frame: a two-dimensional list that can contain any type of data (numeric, string, logical, etc)
  – Function: small programs that usually take input as arguments and after running produce output
• The function class(obj) will tell you what type of object “obj” is
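
A quick way to get a feel for these object types is to create one of each and ask class() about it. The object names and contents below are made up for illustration.

```r
v  <- c(1.5, 2, 3)                       # vector
m  <- matrix(1:6, nrow = 2)              # matrix with 2 rows, 3 columns
a  <- array(1:24, dim = c(2, 3, 4))      # three-dimensional array
df <- data.frame(id   = 1:3,             # data frame with mixed column types
                 name = c("a", "b", "c"),
                 flag = c(TRUE, FALSE, TRUE))

class(v)      # "numeric"
class(m)      # includes "matrix" (recent R versions also report "array")
class(a)      # "array"
class(df)     # "data.frame"
class(mean)   # "function"
```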

More on Data Frames

• Think of them like tables
  – Columns correspond to variables (and data in columns must all be of the same type)
  – Rows correspond to observations

More on Functions

• Functions always end with parentheses
  – If there are arguments, they go here
  – Some functions don’t have or need arguments
    • Example: ls()
  – The function’s code is output when the parentheses are left off
• Can run functions of functions
  – Example: mean(seq(1:9))
• Lots of built-in functions and you can write your own

Vector-based Calculations

• R is very efficient (i.e., fast) working with vectors, much less so with loops
• Key idea: In data frames, instead of writing code that operates on the rows of a data frame (i.e., observation by observation) you write code that operates on the columns (i.e., the variables)
• Takes a while to get used to thinking in terms of vectors rather than individual observations

Simple Example

• Data frame with data on various types of travel for a set of individuals: [shown as an image on the original slide]
• Easy way to calculate total days deployed in R: [R code shown as an image on the original slide]

Simple Example, continued

• Even fancier: [R code shown as an image on the original slide]
• The hard way: [R code shown as an image on the original slide]
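
The data frame and code on these two slides were images and cannot be recovered. The sketch below is therefore entirely hypothetical: the data frame travel, its column names, and its values are invented to illustrate the contrast between a vectorized calculation and a row-by-row loop, and the "fancier" step shown here (colSums) is only one guess at what such a step could be.

```r
# Hypothetical travel data: one row per individual, days by travel type
travel <- data.frame(
  person   = c("A", "B", "C", "D"),
  deployed = c(120, 0, 45, 200),
  tdy      = c(10, 25, 0, 5),
  leave    = c(14, 30, 21, 7)
)

# Easy way: operate on the whole column at once
total_easy <- sum(travel$deployed)

# Fancier: totals for every numeric column in a single call
totals_all <- colSums(travel[, c("deployed", "tdy", "leave")])

# The hard way: loop over the rows, one observation at a time
total_loop <- 0
for (i in seq_len(nrow(travel))) {
  total_loop <- total_loop + travel$deployed[i]
}

total_easy   # 365
totals_all
total_loop   # 365, same answer with more code
```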

What We Covered in this Module

• Defined types of data and types of variables
• Learned how to appropriately summarize data using descriptive statistics
  – Numerical descriptive statistics
    • Measures of location: mean, median, mode
    • Measures of spread: variance, standard deviation, range, inter-quartile range, etc.
  – Graphical descriptive statistics
    • Continuous variables: histogram, boxplot
    • Categorical variables: barplots, pie charts
• R paradigms and summarizing data with R

Homework

• WM&S chapter 1
  – Required exercises 2, 9, 13, 17, 22, 25
  – Extra credit: 11
• Hints and instructions:
  – Do exercises 2, 13, and 25 in R as much as possible
    o The data sets are in Sakai in CSV format; read them in using the instructions from Lab #1
    o Exercise 2: Just construct a frequency histogram in R with the Mt. Washington observation left out
    o Exercises 13 and 25: The sort() function in R could be useful for counting the number that fall in each interval
  – Exercise 9: Use either Table 4 in WM&S or R to calculate. If you use R, the pnorm() function will be helpful
  – Exercise 17: Only do the approximation for Exercise 1.2