1 Class Session #2 Numerically Summarizing Data Measures of Central Tendency Measures of Dispersion Measures of Central Tendency and Dispersion from Grouped Data Measures of Position

1

Class Session #2: Numerically Summarizing Data

• Measures of Central Tendency

• Measures of Dispersion

• Measures of Central Tendency and Dispersion from Grouped Data

• Measures of Position

2

Recall the Definitions

• Parameter – a descriptive measure of a population

(p = parameter = population, usually in Greek letters)

• Statistic – a descriptive measure of a sample

(s = statistic = sample, usually in Roman letters)

3

Common “descriptions”

• ? Average ? – “typical” as described in the news reports

• Give some of today’s examples

• Data distributions’ “characteristics”

– Shape – look at a picture (histogram)

– Center – mean, mode, median

– Spread – range, variance, std. dev.

4

Central Tendency Definitions

• Arithmetic mean – the sum of all the values of the variable in the data set, divided by the number of observations

• Population arithmetic mean – computed using all the individuals in the population (μ, the Greek letter mu, pronounced “mew”; not the micro sign µ)

• Sample arithmetic mean – computed using the sample data (“x-bar”)

• Note: x̄ is a statistic, μ is a parameter

5

More Central Tendency Defs

• Median – the value that lies in the middle of the data when arranged in ascending order (think of the median strip of a highway, in the middle of the road)

• Mode – the most frequent observation of the variable in the data set (think “à la mode”: in fashion / on top)
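
As a minimal sketch of the three definitions above, the following uses Python's standard `statistics` module. The data values are hypothetical (they do not come from the slides) and were chosen so that there is a single clear mode.

```python
import statistics

# Hypothetical sample data, not taken from the slides.
data = [2, 3, 3, 5, 7, 10]

# Arithmetic mean: sum of the values divided by the number of observations.
mean = statistics.mean(data)

# Median: middle value of the sorted data; with an even number of
# observations it is the average of the two middle values.
median = statistics.median(data)

# Mode(s): the most frequent value(s); multimode returns a list because
# a data set can have more than one mode.
modes = statistics.multimode(data)

print(mean, median, modes)
```

Using `multimode` rather than `mode` is a deliberate choice here, since a data set may have several equally frequent values.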

6

Measures of Dispersion Definitions

• Range (R) – the difference between the largest data value (maximum) & the smallest data value (minimum)

• Deviation about the mean – how “spread out” the data is.

• ? For both population and sample variance, the sum of all deviations about the mean equals what ?

• ? The square of a non-zero number is ?

7

More Measures of Dispersion Definitions

• Population Variance – sum of squared deviations about the population mean, divided by the number of observations in the population, N (“sigma squared”, written as σ²)

• ? i.e. population variance is the mean of the ______ _________ ____ __ _________ ___ ?

Answer: Population variance is the mean of the squared deviations about the population mean

8

More Measures of Dispersion Definitions

• Sample Variance – sum of the squared deviations about the sample mean, divided by the number of observations minus one, n – 1 (“s squared”, written as s²)

• The “n – 1” is the degrees of freedom
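
Written out in symbols, the two verbal definitions of variance take the following standard form. The index notation is an assumption added here for illustration: the x_i are the individual observations, N is the population size, and n is the sample size.

```latex
\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}
\qquad\qquad
s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}
```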

9

More Measures of Dispersion Definitions

• Population Standard Deviation – the square root of the population variance (sigma, written as “σ”)

• Sample Standard Deviation – the square root of the sample variance (written as “s”)

BTW, later we discover “s” itself is a random variable
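
The difference between the population and sample versions can be sketched in a few lines of Python. The data set is hypothetical, and the same values are treated first as a whole population and then as a sample purely to show the effect of the denominator.

```python
import math
import statistics

# Hypothetical data, not taken from the slides.
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = sum(data) / len(data)
squared_deviations = [(x - mean) ** 2 for x in data]

# Population variance: divide by N.
pop_var = sum(squared_deviations) / len(data)
# Sample variance: divide by n - 1 (the degrees of freedom).
samp_var = sum(squared_deviations) / (len(data) - 1)

# Standard deviations are the square roots of the variances.
pop_sd = math.sqrt(pop_var)
samp_sd = math.sqrt(samp_var)

# Cross-check against the standard library's implementations.
assert math.isclose(pop_var, statistics.pvariance(data))
assert math.isclose(samp_var, statistics.variance(data))

print(pop_var, samp_var, pop_sd, samp_sd)
```

Because the sample version divides by a smaller number, it is always at least as large as the population version computed from the same values.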

10

Empirical Rule for Symmetric Data

• If the distribution is bell shaped, approximately:

– 68% of the data lie within 1 standard deviation of the mean

– 95% of the data lie within 2 standard deviations of the mean

– 99.7% of the data lie within 3 standard deviations of the mean

Rule holds for both samples & populations
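
One way to apply the rule is to turn a mean and standard deviation into the three intervals. The sketch below assumes a bell-shaped distribution; the function name and the example mean of 100 with standard deviation 15 are hypothetical choices for illustration.

```python
def empirical_rule_intervals(mean, std_dev):
    """Return the approximate 68%, 95%, and 99.7% intervals.

    Only meaningful when the distribution is roughly bell shaped.
    """
    coverage = {1: "68%", 2: "95%", 3: "99.7%"}
    return {
        label: (mean - k * std_dev, mean + k * std_dev)
        for k, label in coverage.items()
    }

# Hypothetical example: mean 100, standard deviation 15.
intervals = empirical_rule_intervals(100, 15)
for label, (low, high) in intervals.items():
    print(f"about {label} of the data lie between {low} and {high}")
```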

11

Supposing Grouped Data

• Approximate mean of a variable from a frequency distribution

– Use the midpoint of each class

– Use the frequency of each class

– Use the number of classes

• Population Mean

• Sample Mean
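
Under the usual textbook notation, the grouped-data approximations can be written as below. The symbols are assumptions added for illustration: x_i is the midpoint of the i-th class, f_i is its frequency, and k is the number of classes. Both means use the same recipe and differ only in whether the data are a population or a sample.

```latex
\mu \approx \frac{\sum_{i=1}^{k} x_i f_i}{\sum_{i=1}^{k} f_i}
\qquad\qquad
\bar{x} \approx \frac{\sum_{i=1}^{k} x_i f_i}{\sum_{i=1}^{k} f_i}
```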

12

Supposing Grouped Data

• Weighted Mean – good to use when certain data values have higher importance (or weight)

[Sum of each value of variable times its weight] / [sum of weights]

Examples: Grade Point Average (GPA) and mixed nuts pricing
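
The weighted-mean formula can be sketched as a small Python function. The GPA figures below are hypothetical (grade points weighted by credit hours) and the function name is an illustrative choice.

```python
def weighted_mean(values, weights):
    """[Sum of each value times its weight] / [sum of weights]."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("sum of weights must be non-zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight

# Hypothetical GPA example: grade points earned, weighted by credit hours.
grade_points = [4.0, 3.0, 2.0]
credit_hours = [3, 4, 3]

gpa = weighted_mean(grade_points, credit_hours)
print(gpa)
```

The same function would apply to a pricing problem by using price per pound as the values and pounds purchased as the weights.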

13

Supposing Grouped Data

• Population Variance – sum of [(midpoint – mean)² times frequency] / [sum of frequencies]

• Sample Variance – as before, except with “– 1” in the denominator, i.e. [(sum of frequencies) – 1] (the degrees of freedom thing again)

14

Supposing Grouped Data

• Population Standard Deviation – take the square root of the population variance

• Sample Standard Deviation – take the square root of the sample variance
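
The grouped-data recipe (class midpoints weighted by class frequencies) can be sketched as follows. The frequency table is hypothetical and the variable names are illustrative. The results are approximations, because each observation is treated as if it sat at its class midpoint.

```python
import math

# Hypothetical frequency distribution, not taken from the slides.
midpoints = [5, 15, 25]     # midpoint of each class
frequencies = [2, 5, 3]     # number of observations in each class

n = sum(frequencies)

# Approximate mean: sum of (midpoint * frequency) / sum of frequencies.
mean = sum(m * f for m, f in zip(midpoints, frequencies)) / n

# Sum of (midpoint - mean)^2 * frequency.
weighted_sq_dev = sum(f * (m - mean) ** 2 for m, f in zip(midpoints, frequencies))

pop_var = weighted_sq_dev / n          # population: divide by sum of frequencies
samp_var = weighted_sq_dev / (n - 1)   # sample: divide by (sum of frequencies) - 1

pop_sd = math.sqrt(pop_var)
samp_sd = math.sqrt(samp_var)

print(mean, pop_var, samp_var, pop_sd, samp_sd)
```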

15

Measures of Position Definition

• z-Score – the distance that a data value is from the mean in terms of standard deviations. Equals [(data value minus mean) divided by standard deviation]

• Population z-score: z = (x – μ) / σ

• Sample z-score: z = (x – x̄) / s

16

Measures of Position Definitions

• z-score equals [(data value minus mean) divided by standard deviation]

• Is a "unitless" measure

• Can be “normalized” to get

– Mean of zero

– Standard Deviation of one

17

Measures of Position Definitions

• z-score purpose is to provide a way to "compare apples and oranges"

• by converting variables with different centers and/or spreads

• to variables with the same center (0) and spread (1).
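
A short sketch of the "apples and oranges" comparison: two scores on different scales are converted to z-scores so they can be compared directly. The test names, scores, means, and standard deviations are all hypothetical.

```python
def z_score(value, mean, std_dev):
    """(data value minus mean) divided by standard deviation."""
    if std_dev <= 0:
        raise ValueError("standard deviation must be positive")
    return (value - mean) / std_dev

# Hypothetical scores on two tests with different centers and spreads.
z_test_a = z_score(85, mean=70, std_dev=10)     # test A scored out of 100
z_test_b = z_score(600, mean=500, std_dev=100)  # test B scored out of 800

print(z_test_a, z_test_b)

# The larger z-score is the better relative standing.
better = "A" if z_test_a > z_test_b else "B"
print(f"Relative to its own distribution, the score on test {better} is higher")
```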

18

Measures of Position Definition

• Percentiles – the kth percentile of a set of data divides the lower k% from the upper (100 – k)%

• Divide into 100 parts, so 99 percentiles exist

• “P sub k”

• Use to give relative standing of the data

19

Measures of Position Definition

• Quartiles – divide the data into four equal parts

• Four parts, so three quartiles exist

• “Q sub one, two, or three”

• Q2 is the median of the data

• Q1 is the median of the lower half

• Q3 is the median of the upper half

20

Numerical summary of data

• Five number summaries

• Interquartile range (Q3 – Q1) is resistant to extreme values

• Compute five number summary

• Min value | Q1 | M | Q3 | max value
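
The five-number summary and interquartile range can be sketched with the median-of-halves definition of Q1 and Q3. One assumption is made explicit in the code: when the number of observations is odd, the median itself is left out of both halves (some textbooks and software use other conventions). The data are hypothetical.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, M, Q3, max) using medians of the lower and upper halves."""
    data = sorted(values)
    n = len(data)
    if n < 2:
        raise ValueError("need at least two observations")
    half = n // 2
    lower = data[:half]
    # Assumption: for odd n, the median is excluded from both halves.
    upper = data[half + 1:] if n % 2 else data[half:]
    return (
        data[0],
        statistics.median(lower),
        statistics.median(data),
        statistics.median(upper),
        data[-1],
    )

# Hypothetical data with one extreme value.
data = [1, 3, 4, 6, 8, 9, 12, 15, 40]
minimum, q1, med, q3, maximum = five_number_summary(data)
iqr = q3 - q1

print(minimum, q1, med, q3, maximum, iqr)
```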

21

Building a Box Plot – part 1

• 1. Calculate interquartile range (IQR)

• 2. Compute lower & upper fence

• Lower fence = Q1 – 1.5 (IQR)

• Upper fence = Q3 + 1.5 (IQR)

• 3. Draw scale then mark Q1 and Q3

• 4. Box in Q1 to Q3 then mark M

22

Building a Box Plot – part 2

• 5. Temporarily mark fences with brackets

• 6. Draw line from Q1 to smallest value inside the lower fence and a line from Q3 to largest value inside the upper fence

• 7. Put * for all values outside of the fences

• 8. Erase brackets
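
The numerical parts of the box plot steps (fences, line endpoints, and flagged values) can be sketched without any drawing. The function name is illustrative, the data are hypothetical, and Q1 and Q3 are assumed to have been computed already.

```python
def box_plot_fences(values, q1, q3):
    """Compute fences, line (whisker) endpoints, and values outside the fences."""
    data = sorted(values)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    outside = [x for x in data if x < lower_fence or x > upper_fence]
    return {
        "lower_fence": lower_fence,
        "upper_fence": upper_fence,
        "line_from_q1_to": min(inside),  # smallest value inside the lower fence
        "line_from_q3_to": max(inside),  # largest value inside the upper fence
        "marked_with_star": outside,     # values outside of the fences
    }

# Hypothetical data; Q1 = 3.5 and Q3 = 13.5 are assumed to be known.
data = [1, 3, 4, 6, 8, 9, 12, 15, 40]
result = box_plot_fences(data, q1=3.5, q3=13.5)
print(result)
```

The fences themselves are never drawn in the finished plot, which is why the steps above mark them only temporarily.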

23

Distribution based on Boxplot

• Symmetric

– median near center of box

– horizontal lines about same length

• Skewed Right / Positive Skew

– median towards left of box

– right line much longer than left line

• Skewed Left / Negative Skew

– median towards right of box

– left line much longer than right line

24

Which measure best to report?

• Symmetric distribution

– Mean

– Standard Deviation

• Skewed distribution

– Median

– Interquartile Range

25

Self Quiz

• When can the mean and the median be about equal?

• In the 2000 census conducted by the U.S. Census Bureau, two average household incomes were reported: $41,349 and $55,263. One of these averages is the mean and the other is the median. Which is which and why?

26

Self Quiz

• The U.S. Department of Housing and Urban Development (HUD) uses the median to report the average price of a home in the United States.

• Why do they do that?

27

Self Quiz

• A histogram of a set of data indicates that the distribution of the data is skewed right.

• Which measure of central tendency will be larger, the mean or the median?

• Why?

28

Self Quiz

• If a data set contains 10,000 values arranged in increasing order, where is the median located?

• Matching: (parameter; statistic)

– _____ is a descriptive measure of a population

– _____ is a descriptive measure of a sample

29

Self Quiz

• A data set will always have exactly one mode. (true or false)

• If the number of observations, n, is odd, then the median, M, is the value calculated by the formula M = (n+1)/2

30

Self Quiz

• Find the Sample Mean: 20, 13, 4, 8, 10

• Find the Sample Mean: 83, 65, 91, 87, 84

• Find the Population Mean: 3, 6, 10, 12, 14

31

Self Quiz

• The median for the given list of six data values is 26.5.

• 7, 12, 21, ____, 41, 50

• What is the missing value?

32

Self Quiz

• The following data represent the monthly cell phone bill for six randomly selected months.

• $35.34 $42.09 $39.43

• $38.93 $43.39 $49.26

• Compute the mean, median, and mode cell phone bill.

33

Self Quiz

• Heather and Bill go to the store to purchase nuts, but cannot decide among peanuts, cashews, or almonds. They agree to create a mix. They bought 2.5 pounds of peanuts for $1.30 per pound, 4 pounds of cashews for $4.50 per pound, and 2 pounds of almonds for $3.75 per pound. Determine the price per pound of the mix.