
Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D


Page 1: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Descriptive Statistics and Inferential Statistics

Shibin Liu, SAS Beijing R&D

Page 2: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Agenda

• 0. Lesson overview
• 1. Basic Statistical Concepts
• 2. Descriptive Statistics
• 3. Picturing your data
• 4. Confidence Intervals for the mean
• 5. Hypothesis Testing
• 6. Summary

2


Page 4: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Lesson overview

4

Page 5: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Lesson overview

5

Page 6: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Lesson overview

6

Page 7: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Lesson overview

7


Page 8: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Lesson overview

8

Descriptive Statistics

Inferential Statistics

Page 9: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Lesson overview

9

Populations and Samples

Parameters and Statistics

Variable Classifications

Scales of Measurement

Page 10: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Lesson overview

10

Page 11: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Lesson overview

11

Point estimators

Distribution of sample means

Central limit theorem

Standard error of the mean

Confidence intervals

Page 12: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Lesson overview

12

Define null and alternative hypotheses

Specify significance level

Collect data

Reject or fail to reject the null hypothesis

Page 13: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Lesson overview

13

What do you want to examine?

• The location, spread, and shape of the data's distribution. Summary statistics or graphics?
– Summary statistics: SUMMARY STATISTICS (descriptive statistics)
– Both: DISTRIBUTION ANALYSIS (descriptive statistics, histogram, normal probability plots)
• The difference between groups on one or more variables. How many groups?
– Two: TTEST
– Two or more: LINEAR MODELS (analysis of variance)
• The relationship between variables. Which kind of variables?
– Continuous only: CORRELATIONS, LINEAR REGRESSION
– Categorical response variable: ONE-WAY FREQUENCIES & TABLE ANALYSIS (frequency tables, chi-square test), LOGISTIC REGRESSION

Descriptive Statistics

Inferential Statistics

Lesson 1, Lesson 2, Lesson 3 & 4, Lesson 5

Page 14: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Agenda

• 0. Lesson overview
• 1. Basic Statistical Concepts
• 2. Descriptive Statistics
• 3. Picturing your data
• 4. Confidence Intervals for the mean
• 5. Hypothesis Testing
• 6. Summary

14

Page 15: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Basic Statistical Concepts

15


Descriptive Statistics and Inferential Statistics

Populations and Samples

Parameters and Statistics

Variable Classifications

Scales of Measurement

Statistical Methods

Page 16: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Basic Statistical Concepts

16

• In this topic, you will learn to do the following:
– Distinguish between descriptive and inferential statistics
– Define populations and samples
– Distinguish between parameters and statistics
– Classify variables
– Explain other statistical concepts, including scale of measurement

Page 17: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Basic Statistical Concepts

17

Descriptive Statistics

12

36

206

314062

8

Page 18: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Basic Statistical Concepts

18

Descriptive Statistics

% percent

μ average

σ variability

Page 19: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Basic Statistical Concepts

19

Descriptive Statistics

describe

summarize??? !!!

Page 20: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Basic Statistical Concepts

20

Descriptive Statistics

Measures of location

Measures of variability

Frequency

Page 21: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Basic Statistical Concepts

21

Descriptive Statistics

Page 22: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Basic Statistical Concepts

22

Descriptive Statistics

Draw preliminary conclusions

Characterize central tendency

Inspect the spread and shape

Screen for unusual data values

Goals:

Page 23: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Basic Statistical Concepts

23

Descriptive Statistics

Extreme values?

Unusual shape?

Data value cluster?

Unique feature?

Error free?

???

Page 24: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Basic Statistical Concepts

24

Descriptive Statistics

Page 25: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Basic Statistical Concepts

25

Inferential Statistics

reliability

precision

Page 26: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Basic Statistical Concepts

26

Inferential Statistics

generalize

Page 27: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Basic Statistical Concepts

27

Inferential Statistics

Descriptive Statistics

Page 28: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Basic Statistical Concepts

28

Populations and Samples

Population= complete set of observations

Page 29: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Basic Statistical Concepts

29

Representative sample

Characteristics similar to population's characteristics

sample

Populations and Samples

Page 30: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Basic Statistical Concepts

30

Simple random sample

Each possible sample of a given size has an equal chance of selection

Populations and Samples

Page 31: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Basic Statistical Concepts

31

Convenience sampling

Populations and Samples

May not be representative. Why? For example:

a. the first n…
b. Male/female

Page 32: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Basic Statistical Concepts

32

Populations and Samples

Question 1.

A sample from a population should be

a) Random
b) Representative
c) Biased
d) All of the above

Answer: b

Page 33: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Basic Statistical Concepts

33

Populations and Samples

Question 2.

A medical researcher wants to verify whether patients at a hospital during a particular time frame showed a positive effect. From the entire set of 5000 patients admitted to the hospital during that time, the researcher selects a sample of 200 patients. Match these items between the two groups:

a) Simple random sample
b) Convenience sample
c) Population
d) Sample

All 5000 patients: c
The 200 selected patients: d
The first 200 patients admitted to the hospital during this time frame: b
200 patients whose patient ID numbers were selected at random from the 5000: a
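The difference between the two sampling schemes in Question 2 can be sketched in a few lines of Python. This is only an illustration: the patient IDs 1 to 5000 and the random seed are assumptions, not data from the slides.

```python
import random

# Hypothetical patient IDs standing in for the population of 5000 patients.
population = list(range(1, 5001))

# Convenience sample: simply the first 200 patients admitted.
convenience_sample = population[:200]

# Simple random sample: every subset of 200 IDs is equally likely.
rng = random.Random(42)  # fixed seed so the sketch is repeatable
simple_random_sample = rng.sample(population, 200)

print("convenience sample ID range:", min(convenience_sample), max(convenience_sample))
print("random sample ID range:", min(simple_random_sample), max(simple_random_sample))
```

The convenience sample only ever covers the earliest admissions, which is why it may fail to represent the whole population.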

Page 34: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Basic Statistical Concepts

34

Parameters and Statistics

Parameters (Greek letters): μ, σ, σ², ρ

Statistics (English letters): x̄, s, s², r

Parameters are numerical values that summarize characteristics of a population. They are usually unknown.

Page 35: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Basic Statistical Concepts

35

Parameters and Statistics

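                     Population Parameters   Sample Statistics
Mean                 μ                       x̄
Variance             σ²                      s²
Standard Deviation   σ                       s

Sample mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$

Sample variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$

Sample standard deviation: $s = \sqrt{s^2}$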
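As a small worked example, the three sample statistics can be computed directly from their standard definitions (variance divided by n − 1). The data values below are made up for illustration.

```python
import math

def sample_mean(xs):
    """x-bar: the sum of the values divided by n."""
    return sum(xs) / len(xs)

def sample_variance(xs):
    """s^2: squared deviations from the mean, divided by n - 1."""
    m = sample_mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def sample_std(xs):
    """s: the square root of the sample variance."""
    return math.sqrt(sample_variance(xs))

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample
print(sample_mean(data), sample_variance(data), sample_std(data))
```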

Page 36: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Basic Statistical Concepts

36

Uses of Variables: Independent and Dependent

Independent variable (also called control, explanatory, or predictor variable): its values are changed as required.

Dependent variable (also called response or outcome variable): its values change in response.

Context 1. Control: fertilizer -> plant

Context 2. The dependent variable's value is given: credit default or not

Page 37: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Basic Statistical Concepts

37

Variables Classification by Types: Quantitative and Categorical

???

Quantitative Categorical

Page 38: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Basic Statistical Concepts

38

Variables Classification by Types: Quantitative and Categorical

Quantitative

Numerical

1245 0.36 206 31 8

Arithmetic operations

Discrete

Continuous

Page 39: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Basic Statistical Concepts

39

Variables Classification by Types: Quantitative and Categorical

0 1 2 3 4 5 6

Discrete

2.34

Page 40: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Basic Statistical Concepts

40

Variables Classification by Types: Quantitative and Categorical

Continuous

200, 201, 201.13, 199.904 (ml/bottle)

Page 41: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Basic Statistical Concepts

41

Variables Classification by Types: Quantitative and Categorical

Categorical

Attribute data

No arithmetic operations

Nominal

Ordinal

Page 42: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Basic Statistical Concepts

42

Variables Classification by Types: Quantitative and Categorical

No ordering

Nominal

Gender

Beverage

2 1

3 2 1

Qualitative

Classification

Page 43: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Basic Statistical Concepts

43

Variables Classification by Types: Quantitative and Categorical

Ordinal

Degree: low, moderate, severe

Page 44: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Basic Statistical Concepts

44

Variables Classification by Types: Quantitative and Categorical

Question 3.

Match these items between the two groups:

a) Categorical (ordinal)
b) Quantitative (discrete)
c) Categorical (nominal)
d) Quantitative (continuous)

The number of students in a class: b
Hair color (black, brown, gray, red): c
Income (low, medium, high): a
The numbers on athletes' team shirts: c
The height of trees in a forest: d
Name (Adam, Becky, Christina, Dave…): c
The length of time for a light bulb to burn out: d
The number of phone calls a company receives: b

Page 45: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Basic Statistical Concepts

45

By Scale of measurement: Nominal Ordinal Interval Ratio

Scale of measurement?

Know the scale of measurement for each variable so that you can select a statistical procedure appropriate for that variable.

Page 46: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Basic Statistical Concepts

46

By Scale of measurement: Nominal Ordinal Interval Ratio

Nominal

Ordinal

Degree: low, moderate, severe

Page 47: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Basic Statistical Concepts

47

By Scale of measurement: Nominal Ordinal Interval Ratio

Continuous

Interval

Ratio

Page 48: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Basic Statistical Concepts

48

By Scale of measurement: Nominal Ordinal Interval Ratio

Interval

Low Medium High

1. Body temperature: no true 0 point

2. pH = 8 (seawater) ≠ 2 × pH = 4 (tomato juice)

Page 49: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Basic Statistical Concepts

49

By Scale of measurement: Nominal Ordinal Interval Ratio

Ratio

True 0 point

1. 2 kg of apples; 0 kg of apples

2. $0; $100 = 2 × $50

Page 50: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Basic Statistical Concepts

50

By Scale of measurement: Nominal Ordinal Interval Ratio

Question 4.

Match these items between the two groups:

a) Categorical (nominal)
b) Quantitative (interval)
c) Categorical (ordinal)
d) Quantitative (ratio)

ID: a
BodyTemp: b
Gender: a
HeartRate: d

Page 51: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Basic Statistical Concepts

51

Statistical Methods

Univariate

Bivariate

Multivariable

Multivariate

The selection of statistical methods depends on how many variables are involved.

Page 52: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Basic Statistical Concepts

52

Statistical Methods

Univariate

• Range
• Dispersion
• Central tendency
• Frequency

Page 53: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Basic Statistical Concepts

53

Statistical Methods

Bivariate

Correlation analysis

Chi-square tests of independence

Page 54: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Basic Statistical Concepts

54

Statistical Methods

Multiple linear regression

N-way ANOVA

Multivariable

Page 55: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Basic Statistical Concepts

55

Statistical Methods

Multivariate

Factor analysis

Clustering

Multivariate ANOVA

Multivariate linear regression

Page 56: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Basic Statistical Concepts

56

Scenario: exploring your data

A school in Carver County wants to check whether the average SAT score is 1200.

Gender SATScore IDNumber

Male 1170 61469897

Female 1090 33081197

Male 1240 68137597

Female 1000 37070397

Male 1210 64608797

Female 970 60714297

Male 1020 16907997

Female 1490 9589297

Male 1200 93891897

Female 1260 85859397

80 students are selected randomly, and their scores are recorded into TestScores.

TestScores

Page 57: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Basic Statistical Concepts

57

Scenario: exploring your data

Question 5.

Is the data for these 80 students representative of the entire population of magnet school students in Carver County?

a) Yes
b) No

Answer: a

Page 58: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Basic Statistical Concepts

58

Scenario: exploring your data

Question 6.

Select the correct type for the following variables: Gender, SATScore, IDNumber

a) Categorical, quantitative, categorical
b) Categorical, quantitative, quantitative
c) Quantitative, quantitative, categorical
d) Quantitative, categorical, categorical

Answer: a

Page 59: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Basic Statistical Concepts

59

Scenario: exploring your data

Question 7.

What types of statistical methods can be applied to this data?

a) Univariate
b) Bivariate
c) Multivariate
d) All of the above

Answer: d

Page 60: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Agenda

• 0. Lesson overview
• 1. Basic Statistical Concepts
• 2. Descriptive Statistics
• 3. Picturing your data
• 4. Confidence Intervals for the mean
• 5. Hypothesis Testing
• 6. Summary

60

Page 61: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Descriptive Statistics

61

In this topic, we will learn to do the following:

– Explain the basics of descriptive statistics
– Describe what distributions can tell you about your data
– Use the Summary Statistics task to produce descriptive statistics

Page 62: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Descriptive Statistics

62

Describing your data

Number of times each value appears

Outlier

Page 63: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Descriptive Statistics

63

Measures of Location

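Data (ordered): 93, 89, 88, 84, 83, 82, 79, 78, 78, 77, 74, 73, 72, 68, 68, 68, 35

Mode: the most common data value, 68 here

Median: the middle value when the data are ordered
– Odd number of values: the middle value
– Even number of values: the average of the two middle values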
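The measures of location for the 17 values on this slide can be checked with Python's standard `statistics` module. This is a quick sketch; the list is the slide's data read as two-digit values.

```python
import statistics

# The slide's data, read as 17 two-digit values.
scores = [93, 89, 88, 84, 83, 82, 79, 78, 78, 77, 74, 73, 72, 68, 68, 68, 35]

median = statistics.median(scores)  # 9th of the 17 ordered values
mode = statistics.mode(scores)      # most frequent value
mean = statistics.mean(scores)      # pulled down by the low value 35

print(median, mode, round(mean, 2))
```

The mean falls below the median here because the single low value 35 pulls it down, while the median is unaffected.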

Page 64: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Descriptive Statistics

64

Measures of Location

Data: 1, 1, 1, 2, 3, 10

Mode = 1

Median = 1.5

Mean = 3

Page 65: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Descriptive Statistics

65

Measures of Location

Question 8.

For a symmetric distribution, which of the following statistics is a good measure for the middle of the data?

a) Mean
b) Median
c) Either mean or median

Answer: c

Page 66: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Descriptive Statistics

66

Percentile

Percentile = the value of a variable below which a given percentage of observations fall.

Percentiles locate a position in your data that is larger than a given proportion of the data values.

Quartiles break your data into quarters.

[Figure: histogram of the data; horizontal axis from 63 to 93]

Page 67: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Descriptive Statistics

67

Percentile

[Figure: histogram of the data; horizontal axis from 63 to 93]

Data (ordered): 93, 89, 88, 84, 83, 82, 79, 78, 78, 77, 74, 73, 72, 68, 67, 63

25th percentile = 72.5

First quartile (Q1), also called the lower quartile

Page 68: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Descriptive Statistics

68

Percentile

[Figure: histogram of the data; horizontal axis from 63 to 93]

Data (ordered): 93, 89, 88, 84, 83, 82, 79, 78, 78, 77, 74, 73, 72, 68, 67, 63

50th percentile = 78

Median (Q2), also called the middle quartile

Page 69: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Descriptive Statistics

69

Percentile

[Figure: histogram of the data; horizontal axis from 63 to 93]

Data (ordered): 93, 89, 88, 84, 83, 82, 79, 78, 78, 77, 74, 73, 72, 68, 67, 63

75th percentile = 83.5

Third quartile (Q3), also called the upper quartile
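The three quartiles quoted on these slides can be reproduced with a short function. Several percentile conventions exist; the sketch below assumes the one that averages two neighboring values when n × p/100 is a whole number, which matches the slide values for this data. The function name `percentile` is illustrative.

```python
def percentile(values, pct):
    """Percentile with averaging at exact positions (an assumed convention).

    pct is an integer percentage between 0 and 100.
    """
    xs = sorted(values)
    n = len(xs)
    j, remainder = divmod(n * pct, 100)  # n * pct / 100 = j + fraction
    if remainder == 0:
        if j == 0:
            return xs[0]
        if j == n:
            return xs[-1]
        return (xs[j - 1] + xs[j]) / 2   # average of the j-th and (j+1)-th values
    return xs[j]                         # the (j+1)-th ordered value

scores = [93, 89, 88, 84, 83, 82, 79, 78, 78, 77, 74, 73, 72, 68, 67, 63]
q1, q2, q3 = (percentile(scores, p) for p in (25, 50, 75))
print(q1, q2, q3)
```

Other conventions, such as linear interpolation, can give slightly different quartiles for the same data.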

Page 70: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Descriptive Statistics

70

Measure of Variability

Measure: Definition

Range: The difference between the maximum and minimum data values

Interquartile Range: The difference between the 25th and 75th percentiles

Variance: A measure of dispersion of the data around the mean

Standard Deviation: A measure of dispersion expressed in the same units of measurement as your data (the square root of the variance)

Sample variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$

Coefficient of variation (C.V.): $\frac{s}{\bar{x}} \times 100$
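A minimal sketch of these measures in Python, using the 16 values from the percentile slides. The quartiles are taken from those slides (72.5 and 83.5) rather than recomputed, since quartile conventions differ between tools.

```python
import statistics

scores = [93, 89, 88, 84, 83, 82, 79, 78, 78, 77, 74, 73, 72, 68, 67, 63]

data_range = max(scores) - min(scores)        # maximum minus minimum
q1, q3 = 72.5, 83.5                           # quartiles quoted on the percentile slides
iqr = q3 - q1                                 # spread of the middle 50% of the data
variance = statistics.variance(scores)        # sample variance (divides by n - 1)
std_dev = statistics.stdev(scores)            # square root of the sample variance
cv = std_dev / statistics.mean(scores) * 100  # coefficient of variation, in percent

print(data_range, iqr, round(variance, 2), round(std_dev, 2), round(cv, 2))
```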

Page 71: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Descriptive Statistics

71

Measure of Variability

Question 9.

a) True b) False

The IQR (interquartile range) is a measure of the middle 50% of your data. It is not as likely to be affected by outliers and therefore is more robust than the overall range.

a

Page 72: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Descriptive Statistics

72

Scenario

Check the SAT scores of XXX school using descriptive statistics for 80 randomly selected students.

Is the average = 1200?

Page 73: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Agenda

• 0. Lesson overview
• 1. Basic Statistical Concepts
• 2. Descriptive Statistics
• 3. Picturing your data
• 4. Confidence Intervals for the mean
• 5. Hypothesis Testing
• 6. Summary

73

Page 74: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Picturing your data: Objectives

74

In this topic, we will learn to do the following:

– Look at the distribution of continuous variables
– Describe the normal distribution
– Use the Distribution Analysis task to generate descriptive statistics, including histograms, box plots, and normal probability plots

Page 75: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D
Page 76: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Picturing your data: Normal Distributions

76

Characteristics of the normal distribution:
– Symmetric
– Defined by μ and σ
– Bell shaped
– Mean = median = mode

Page 77: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Picturing your data: Normal Distributions

77

The bean machine, a device invented by Francis Galton, can be called the first generator of normal random variables. This machine consists of a vertical board with interleaved rows of pins. Small balls are dropped from the top and then bounce randomly left or right as they hit the pins. The balls are collected into bins at the bottom and settle down into a pattern resembling the Gaussian curve.

Galton board (French: planche de Galton)
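The bean machine is easy to simulate. The sketch below assumes each pin sends a ball left or right with equal probability; the function name `galton_board`, the number of rows, and the number of balls are illustrative choices, not values from the slides.

```python
import random
from collections import Counter

def galton_board(n_balls, n_rows, seed=0):
    """Drop n_balls through n_rows of pins and count the balls in each bin."""
    rng = random.Random(seed)
    bins = Counter()
    for _ in range(n_balls):
        # Each pin sends the ball right (1) or left (0) with equal probability;
        # the final bin is the number of moves to the right.
        position = sum(rng.randint(0, 1) for _ in range(n_rows))
        bins[position] += 1
    return [bins[k] for k in range(n_rows + 1)]

counts = galton_board(n_balls=2000, n_rows=12)
for k, c in enumerate(counts):
    print(f"{k:2d} {'#' * (c // 10)}")
```

The printed bars pile up in the middle bins and thin out toward the edges, giving the bell-shaped pattern described above.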

Page 78: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Picturing your data: Normal Distributions

78

Standard normal distribution: μ = 0, σ = 1

$f(x;\mu,\sigma^2) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$
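The density formula can be written out directly and compared with Python's built-in `statistics.NormalDist`. The function name `normal_pdf` is illustrative.

```python
import math
from statistics import NormalDist

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Normal density f(x; mu, sigma^2), written out from the formula."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

# Standard normal (mu = 0, sigma = 1) evaluated at its peak.
print(normal_pdf(0.0))
print(NormalDist(0, 1).pdf(0.0))  # library value for comparison
```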

Page 79: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Picturing your data: Assessing normality

79

For a normal distribution, SAS reports skewness = 0 and kurtosis = 0, because it subtracts 3 from the kurtosis. Some other software does not subtract 3 and reports skewness = 0 and kurtosis = 3.

Statistical summary

Page 80: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Picturing your data: Skewness

80

Skewness > 0: mean > median

Skewness < 0: mean < median

Page 81: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Picturing your data: Kurtosis

81

Kurtosis measures the tendency of your data to be distributed toward the center or toward the tails of the distribution.

For a symmetric distribution (in SAS):

– Kurtosis = 0: normal
– Kurtosis < 0: platykurtic
– Kurtosis > 0: leptokurtic
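Skewness and kurtosis can be sketched from the central moments of the data. The functions below use the simple moment-based definitions, with 3 subtracted from the kurtosis so that a normal distribution scores 0. Statistical packages often apply small-sample adjustments, so their output may differ slightly from these values. The data sets are made up.

```python
def central_moment(xs, k):
    """k-th central moment: the average of (x - mean)^k."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** k for x in xs) / len(xs)

def skewness(xs):
    """Moment-based skewness: 0 for perfectly symmetric data."""
    return central_moment(xs, 3) / central_moment(xs, 2) ** 1.5

def excess_kurtosis(xs):
    """Moment-based kurtosis minus 3, so a normal distribution gives 0."""
    return central_moment(xs, 4) / central_moment(xs, 2) ** 2 - 3

symmetric = [1, 2, 3, 4, 5]           # hypothetical symmetric data
right_skewed = [1, 1, 1, 2, 3, 10]    # hypothetical data with a long right tail

print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(right_skewed), excess_kurtosis(right_skewed))
```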

Page 82: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Picturing your data

82

12345

Question 10.

Now that you understand the many different ways your data could be distributed, match each example with its appropriate shape. The blue curve on each histogram represents the shape of the normal distribution with the mean and standard deviation estimated from the sample.

a) Heavy-tailed
b) Light-tailed
c) Right-skewed
d) Normal
e) Left-skewed

dc

ae

b

Page 83: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Picturing your data

83

Question 11.

Match each term to what it measures by selecting the correct letter.

a) Location
b) Spread
c) Shape

Mean: a
Kurtosis: c
Median: a
Standard deviation: b
Variance: b
Mode: a
Skewness: c
IQR (interquartile range): b

Page 84: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Picturing your data: Normal Probability Plot

84

A Normal Probability Plot is a visual method for determining whether your data comes from a distribution that is approximately normal.

actual data values

expected percentile from standard normal

Normal reference line

Page 85: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Picturing your data: Normal Probability Plot

85

1. Normal distribution (the observed data follow the reference line)
2. Skewed-to-the-right distribution
3. Skewed-to-the-left distribution
4. Light-tailed distribution
5. Heavy-tailed distribution

Page 86: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Picturing your data: Box Plot

86

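Outliers: points more than 1.5 × IQR from the box

Upper whisker: largest point within 1.5 × IQR of the box

Top of the box: the 75th percentile

Line inside the box: the 50th percentile

Bottom of the box: the 25th percentile

Lower whisker: smallest point within 1.5 × IQR of the box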

The Mean is denoted by a
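The pieces of a box plot can be computed by hand. The sketch below uses the 17 values from the Measures of Location slide and the "inclusive" quartile method of Python's `statistics.quantiles`, which is one of several quartile conventions and is an assumption here.

```python
import statistics

scores = [93, 89, 88, 84, 83, 82, 79, 78, 78, 77, 74, 73, 72, 68, 68, 68, 35]

# Quartiles define the box; the IQR is the height of the box.
q1, median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
iqr = q3 - q1

# Points more than 1.5 * IQR beyond the box are flagged as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in scores if x < lower_fence or x > upper_fence]

# Whiskers extend to the most extreme points that are not outliers.
inside = [x for x in scores if lower_fence <= x <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

print(q1, median, q3, iqr)
print(whisker_low, whisker_high, outliers)
```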

Page 87: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Picturing your data

87

Question 12.

Which of the following are ways to check the normality of your data?

a) Check if the mean and median are equal
b) Check if the skewness and kurtosis are close to 0
c) Create histograms and other graphical tools to visually assess the data
d) All of the above

Answer: d

Page 88: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Picturing your data

88

Scenario: Assessing the normality

Check the SAT scores of XXX school: 80 randomly selected students. Is the average = 1200?

Range, frequency, shape

Normal distribution: mean, median

Skewness, kurtosis

Outlier

Demo: Task > Describe > Distribution Analysis

Page 89: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Agenda

• 0. Lesson overview
• 1. Basic Statistical Concepts
• 2. Descriptive Statistics
• 3. Picturing your data
• 4. Confidence Intervals for the mean
• 5. Hypothesis Testing
• 6. Summary

89

Page 90: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Confidence Intervals for the mean: Objectives

90

In this topic, we will learn to do the following:

– Define the distribution of sample means and the central limit theorem

– Calculate and interpret the standard error of the mean and confidence intervals for the mean

– Use the Summary Statistics task to generate the standard error of the mean and the confidence intervals for the mean

Page 91: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Confidence Intervals for the mean

91

Point estimator

Parameters: μ, σ, σ², ρ

Statistics: x̄, s, s², r

x̄ estimates μ

s estimates σ

Page 92: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Confidence Intervals for the mean

92

Variability

x̄ = 1.0, x̄ = 1.2, x̄ = 1.3, x̄ = 0.8

How much does the estimator vary from sample to sample?

Page 93: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Confidence Intervals for the mean

93

Variability

Standard Error: variability of a sample statistic

Standard Deviation: variability of data

Page 94: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Confidence Intervals for the mean

94

Variability

Standard Error of the mean

Variability of sample means

$s_{\bar{x}} = \frac{s}{\sqrt{n}}$

where $s$ is the sample standard deviation and $n$ is the sample size.

What happens if n is larger? The larger the sample, the smaller the standard error and the more precise the sample mean.

Page 95: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Confidence Intervals for the mean

95

Variability

Distribution of sample means

The distribution of sample means is less variable than data

Page 96: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Confidence Intervals for the mean

96

Confidence Interval

95% Confidence

A 95% confidence interval represents a range of values within which you are 95% certain the true population mean exists.

• One interpretation is that if 100 different samples were drawn from the same population and 100 intervals were calculated, approximately 95 of them would contain the population mean.

μ

( )

Page 97: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Confidence Intervals for the mean

97

Confidence Interval

95% Confidence

μ

( )

A confidence interval
• Is a range of values that you believe to contain the population parameter of interest.
• Is defined by an upper and lower bound around a sample statistic.

To construct a confidence interval, a significance level must be chosen.

Question: Why not 99%?

Answer: The larger interval gives less useful information.

Page 98: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Confidence Intervals for the mean

98

Confidence Interval for the mean

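$\bar{x} \pm t \cdot s_{\bar{x}}$

or

$\left(\bar{x} - t \cdot s_{\bar{x}},\ \bar{x} + t \cdot s_{\bar{x}}\right)$

where
– $\bar{x}$ is the sample mean
– $t$ is the value corresponding to the confidence level and n − 1 degrees of freedom, where n is the sample size
– $s_{\bar{x}} = \frac{s}{\sqrt{n}}$ is the standard error of the mean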

Student’s t-distribution (or simply the t-distribution) is a family of continuous probability distributions that arises when estimating the mean of a normally distributed population in situations where the sample size is small and population standard deviation is unknown.
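The interval formula can be worked through for the ten SAT scores listed in the TestScores excerpt. This is a sketch under two assumptions: only those 10 rows are used (not the full 80, so the result will differ from the interval quoted later in the deck), and the critical value 2.262 for 9 degrees of freedom is hard-coded from a standard t table to avoid extra libraries.

```python
import math
import statistics

# The ten SAT scores shown in the TestScores excerpt.
sat = [1170, 1090, 1240, 1000, 1210, 970, 1020, 1490, 1200, 1260]

n = len(sat)
mean = statistics.mean(sat)
se = statistics.stdev(sat) / math.sqrt(n)  # standard error of the mean

# Two-sided 95% critical value of t with n - 1 = 9 degrees of freedom,
# taken from a standard t table (an assumption of this sketch).
T_CRIT_95_DF9 = 2.262

lower = mean - T_CRIT_95_DF9 * se
upper = mean + T_CRIT_95_DF9 * se
print(f"mean = {mean}, 95% CI = ({lower:.1f}, {upper:.1f})")
```

With a larger sample the critical value would be looked up for the corresponding degrees of freedom, and the interval would narrow as the standard error shrinks.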

Page 99: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Confidence Intervals for the mean

99

Confidence Interval for the mean

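$\bar{x} \pm t \cdot s_{\bar{x}}$

or

$\left(\bar{x} - t \cdot s_{\bar{x}},\ \bar{x} + t \cdot s_{\bar{x}}\right)$

where
– $\bar{x}$ is the sample mean
– $t$ is the value corresponding to the confidence level and n − 1 degrees of freedom, where n is the sample size
– $s_{\bar{x}} = \frac{s}{\sqrt{n}}$ is the standard error of the mean

Question: How can you decrease the width of the interval?

Answer:
– Way 1: increase the sample size n
– Way 2: decrease the confidence level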

Page 100: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Confidence Intervals for the mean

100

Confidence Interval Sometimes Miss


Page 101: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Confidence Intervals for the mean

101

Normality and the Central Limit Theorem

To satisfy the assumption of normality, you can do one of the following:
• Verify that the population distribution is approximately normal
• Apply the Central Limit Theorem
– The central limit theorem states that the distribution of sample means is approximately normal, regardless of the population's shape, if the sample size is large enough.
– "Large enough" is usually about 30 observations. It is more if the data are heavily skewed, and fewer if the data are symmetric.

Page 102: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Confidence Intervals for the mean

102

Normality and the Central Limit Theorem

Data from an exponential distribution 1000 samples of size 5

1000 samples of size 10 1000 samples of size 30
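The four panels on this slide can be imitated with a small simulation: draw 1000 samples of each size from an exponential distribution and look at the sample means. The rate parameter 1.0, the seed, and the function name `sample_means` are assumptions of this sketch.

```python
import random
import statistics

def sample_means(sample_size, n_samples=1000, seed=1):
    """Means of n_samples samples drawn from an exponential distribution (rate 1)."""
    rng = random.Random(seed)
    return [
        statistics.mean([rng.expovariate(1.0) for _ in range(sample_size)])
        for _ in range(n_samples)
    ]

for size in (5, 10, 30):
    means = sample_means(size)
    # The means center near 1.0 and their spread shrinks roughly like 1 / sqrt(size).
    print(size, round(statistics.mean(means), 3), round(statistics.stdev(means), 3))
```

Plotting a histogram of each list of means would show the shape moving from right-skewed toward a bell curve as the sample size grows.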

Page 103: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Confidence Intervals for the mean

103

Question 13.

A 95% confidence interval for SAT scores is (1157.90, 1223.35). What can you conclude from this information?

a) The true average SAT score is significantly different from 1200.
b) The true average SAT score is not significantly different from 1200.
c) The true average SAT score is less than 1200.
d) None of the above. You cannot determine statistical significance from a confidence interval.

Answer: b

Page 104: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Confidence Intervals for the mean: Demo

104

Task > Summary Statistics > Statistics > Additional > Confidence Intervals of the mean, 95%

Page 105: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Agenda

• 0. Lesson overview
• 1. Basic Statistical Concepts
• 2. Descriptive Statistics
• 3. Picturing your data
• 4. Confidence Intervals for the mean
• 5. Hypothesis Testing
• 6. Summary

105

Page 106: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Hypothesis Testing

106

Hypothesis Test

Parameters

Statistics

μ= μ0

?

Page 107: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Hypothesis Testing

107

In this topic, we will learn to do the following:

– Design and conduct a hypothesis test
– Use the p-value to determine statistical significance
– Use the Distribution Analysis task to perform a statistical hypothesis test
– Perform a one-sample, two-sided t-test to determine if the population mean is significantly different from a known value

Page 108: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Hypothesis Testing: Decision making process

108

Use a decision rule to make a judgment. If the evidence contradicting the null hypothesis is

– Sufficiently strong to meet the burden of proof (significance level), then reject the null hypothesis

– Not strong enough to meet the burden of proof, then fail to reject the null hypothesis. Be aware that failing to prove guilt does not mean that the defendant is proven innocent. It could mean that the prosecuting attorney did not build a strong enough case to meet the burden of proof.

Define null and alternative hypotheses

Specify significance level

Collect data

Reject or fail to reject the null hypothesis

Page 109: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Hypothesis Testing: Coin example


Define null and alternative hypotheses

Specify significance level

Collect data

Reject or fail to reject the null hypothesis

H0: null hypothesis

The coin is fair

Ha or H1: alternative hypothesis

The coin is not fair

Page 110: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Hypothesis Testing: Coin example


Define null and alternative hypotheses

Specify significance level

Collect data

Reject or fail to reject the null hypothesis

Significance level: the amount of evidence needed to reject the null hypothesis

α = 0.05

Page 111: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Hypothesis Testing: Coin example


Define null and alternative hypotheses

Specify significance level

Collect data

Reject or fail to reject the null hypothesis

Two heads, three tails

Question: is it possible to get 100 tails?

Page 112: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Hypothesis Testing: Coin example


Define null and alternative hypotheses

Specify significance level

Collect data

Reject or fail to reject the null hypothesis

H0: coin is fair

Two heads, three tails: do not reject H0

100 tails: reject H0

Page 113: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Hypothesis Testing: Type of error and Power


Decision            | ACTUAL: H0 is True | ACTUAL: H0 is False
Fail to Reject Null | Correct            | Type II Error
Reject Null         | Type I Error       | Correct

Probability of Type I Error = α (H0: the coin is fair; conclusion: the coin is not fair)

Probability of Type II Error = β (an unfair coin is regarded as fair)

Probability of Correct Rejection = 1 − β = Power
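To make α, β, and power concrete for the coin example, the following sketch computes them exactly for a hypothetical 100-flip experiment. It assumes an exact two-sided binomial test (twice the smaller tail probability) and a coin whose real probability of heads is 0.55; the function names, the 100 flips, and the 0.55 value are illustrative assumptions, not part of the slides.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k heads in n flips when P(heads) = p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def two_sided_p_value(heads, n):
    """Exact two-sided p-value under H0: P(heads) = 0.5 (twice the smaller tail)."""
    lower = sum(binom_pmf(k, n, 0.5) for k in range(heads + 1))
    upper = sum(binom_pmf(k, n, 0.5) for k in range(heads, n + 1))
    return min(1.0, 2 * min(lower, upper))


def rejection_probability(true_p, n=100, alpha=0.05):
    """Probability that the test rejects H0 when the coin's real P(heads) is true_p."""
    return sum(
        binom_pmf(k, n, true_p)
        for k in range(n + 1)
        if two_sided_p_value(k, n) < alpha
    )


type_i = rejection_probability(0.5)   # fair coin wrongly declared unfair
power = rejection_probability(0.55)   # assumed unfair coin correctly detected
type_ii = 1 - power                   # assumed unfair coin regarded as fair

print(f"Type I error rate: {type_i:.4f}")
print(f"Type II error rate: {type_ii:.4f}")
print(f"Power: {power:.4f}")
```

Because the number of heads is discrete, the actual Type I error rate of such a test can fall somewhat below the nominal α.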

Page 114: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Hypothesis Testing: Type of error and Power


Question 14.

A bank manager is concerned that the percentage of loans that contain errors has increased above the acceptable amount of 1%. A significance test is conducted to test his concern (H0: p ≤ .01, Ha: p > .01).

Match these items between left and right:

a) Probability the manager concludes that the rate is indeed above 1%, when in reality it is not.

b) Probability the manager concludes that there is no evidence that the rate is above 1%, when in reality it is above 1%.

Type I Error: a

Type II Error: b

Page 115: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Hypothesis Testing: Coin Experiment – Effect Size Influence


The effect size refers to the magnitude of the difference between the sampled population and the null hypothesis. In this example, the null hypothesis of a fair coin suggests 50% heads and 50% tails. If the coin flipped were actually weighted to give 55% heads, the effect size would be 5%.

Flip a coin 100 times and decide whether it is fair:

55 heads, 45 tails: P-value = .3682

40 heads, 60 tails: P-value = .0569

37 heads, 63 tails: P-value = .0120

15 heads, 85 tails: P-value = .0001

A P-value measures the probability, assuming the null hypothesis is true, of observing a value as extreme as or more extreme than the one observed.
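The P-values on this slide are consistent with an exact two-sided binomial test of H0: P(heads) = 0.5, which is the assumption made in this minimal sketch. The function name is hypothetical; only the head counts come from the slide.

```python
from math import comb


def two_sided_p_value(heads, flips):
    """Exact two-sided binomial p-value for H0: P(heads) = 0.5.

    Under a fair coin the distribution is symmetric, so the p-value is
    twice the probability of a count at least as far from flips/2 as
    the one observed (capped at 1).
    """
    tail = min(heads, flips - heads)
    one_tail = sum(comb(flips, k) for k in range(tail + 1)) / 2**flips
    return min(1.0, 2 * one_tail)


for heads in (55, 40, 37, 15):
    print(f"{heads} heads, {100 - heads} tails: p = {two_sided_p_value(heads, 100):.4f}")
```

The further the observed proportion of heads is from 50%, the smaller the P-value becomes.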

Page 116: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Hypothesis Testing: Coin Experiment – Sample Size Influence


When the effect size is held constant, the P-value decreases as the sample size increases.

Flip a coin and get 40% heads, then decide whether it is fair:

4 heads, 6 tails: P-value = .7539

16 heads, 24 tails: P-value = .2682

40 heads, 60 tails: P-value = .0569

160 heads, 240 tails: P-value < .0001
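One way to see why the P-value shrinks as the sample grows is the normal approximation to the binomial, used here only as a rough illustration (the P-values on the slide appear to be exact binomial values, so they differ somewhat from what this approximation gives). With observed proportion of heads \hat{p} and n flips, the standardized distance from the fair-coin value is

\[
z = \frac{\hat{p} - 0.5}{\sqrt{0.5 \cdot 0.5 / n}} = 2\,(\hat{p} - 0.5)\,\sqrt{n}
\]

For a fixed \hat{p} = 0.4 this gives z = -0.2\sqrt{n}:

\[
n = 10:\ z \approx -0.63 \qquad n = 40:\ z \approx -1.26 \qquad n = 100:\ z = -2.00 \qquad n = 400:\ z = -4.00
\]

The same 10-point gap from 50% lies further out in the tail as n grows, so the evidence against H0 becomes stronger.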

Page 117: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Hypothesis Testing: Statistical Hypothesis Test


Statistical Hypothesis Test

Define null and alternative hypotheses

Specify significance level

Collect data

Reject or fail to reject the null hypothesis

H0 : equality, H1 : difference

Set α, 5% for example

Collect data

Reject: P-value < α
Fail to reject: P-value ≥ α

The value of α, the probability of a Type I error, is specified by the experimenter before collecting data. The strength of the evidence is measured by a P-value. The P-value is calculated from the collected data.

Page 118: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Hypothesis Testing: Performing a Hypothesis Test


To test the null hypothesis H0: μ = μ0, SAS software calculates the Student's t statistic value:

$$t = \frac{\bar{x} - \mu_0}{s_{\bar{x}}}$$

The null hypothesis is rejected when the calculated value is more extreme (either positive or negative) than would be expected by chance if H0 were true.

For the test score example:

$$t = \frac{1190.625 - 1200}{16.4416} = -0.5702$$
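The arithmetic for the test score example can be checked with a short sketch. The sample mean, hypothesized mean, and standard error of the mean are the values shown on the slide; the function names are hypothetical, and `standard_error` assumes the usual definition of the standard error as the sample standard deviation divided by the square root of n.

```python
from math import sqrt


def standard_error(sample_sd, n):
    """Standard error of the mean: s / sqrt(n)."""
    return sample_sd / sqrt(n)


def one_sample_t(sample_mean, mu0, std_error):
    """Student's t statistic for H0: mu = mu0."""
    return (sample_mean - mu0) / std_error


# Values from the test score example on the slide.
t = one_sample_t(1190.625, 1200, 16.4416)
print(round(t, 4))  # -0.5702
```

A value this close to zero is not extreme, which agrees with the earlier observation that 1200 lies inside the 95% confidence interval.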

Page 119: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Hypothesis Testing: Performing a Hypothesis Test


Two-sided test of a hypothesis: each tail takes 2.5% for α = 5%

Page 120: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Hypothesis Testing: Performing a Hypothesis Test


Why a t-test?

a. Normal distribution
b. σ unknown
c. Small sample (< 30)

Page 121: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Hypothesis Testing: Performing a Hypothesis Test


T-test 1 by Distribution Analysis:

Page 122: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Hypothesis Testing: Performing a Hypothesis Test


T-test 2 by Task>ANOVA>T test:

Page 123: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Agenda

• 0. Lesson overview
• 1. Basic Statistical Concepts
• 2. Descriptive Statistics
• 3. Picturing your data
• 4. Confidence Intervals for the mean
• 5. Hypothesis Testing
• 6. Summary


Page 124: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Home Work: Exercise 1.

1.1 Calculating Basic Statistics in the Summary Statistics Task

The data in NormTemp comes from an article in the Journal of Statistics Education by Dr. Allen L. Shoemaker. The data call into question the notion that the true mean body temperature is 98.6. There are 65 males and 65 females. There is also some question about whether mean body temperatures for women are the same as for men. The variables in the data set are as follows:

ID: Identification number
BodyTemp: Body temperature (degrees Fahrenheit)
HeartRate Coded (Male, Female)

a. Open a new process and the NormTemp dataset. Use the Summary Statistics task to answer the next two questions.
b. What are the overall mean and standard deviation of BodyTemp in the sample?
c. What is the interquartile range of BodyTemp?


Page 125: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Home Work: Exercise 2.

2.1 Producing Descriptive Statistics

Use the NormTemp dataset to answer the following:

a. What are the minimum, the maximum, the mean, and the standard deviation for BodyTemp? Does the variable appear to be normally distributed?
b. Create box plots for BodyTemp. Does the average BodyTemp seem to be 98.6 degrees?

Answer table for BodyTemp: Minimum | Maximum | Mean | Standard deviation | Skewness | Kurtosis
Distribution: Normal Yes/No

2.2 Multiple Choice Poll

In the NormTemp dataset, the distribution of BodyTemp seemed:

c. Close to normal
d. Left skewed
e. Right skewed
f. To have high positive kurtosis
g. To have high negative kurtosis

Page 126: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Home Work: Exercise 3.

3.1 Producing Confidence Intervals

a. Generate the 95% confidence interval for the mean of BodyTemp in the NormTemp dataset.
b. Is the assumption of normality met to produce a confidence interval for this data?
c. What are the bounds of the confidence interval?

3.2 Multiple Answer Poll

The distribution of sample means is approximately normal if which of the following are true?

d. The population is normal.
e. The sample size is "large enough."
f. The sample standard deviation is small.


Page 127: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Home Work: Exercise 4.

4.1 Performing a One-Sample t-Test

a. Perform a one-sample t-test to determine whether the mean of body temperatures (the variable BodyTemp in the NormTemp dataset) is truly 98.6.
b. What is the value of the t statistic and the corresponding p-value?
c. Produce a confidence interval plot of BodyTemp.
d. Do you reject or fail to reject the null hypothesis at the .05 level that the average temperature is 98.6 degrees?

4.2 Multiple Choice Poll

A 95% confidence interval for SAT scores is (1157.90, 1223.35). From this, what can you conclude, at alpha = 0.05?

e. The true average SAT score is significantly different from 1200.
f. The true average SAT score is not significantly different from 1200.
g. The true average SAT score is less than 1200.
h. None of the above: you cannot determine statistical significance from confidence intervals.


Page 128: Descriptive Statistics and Inferential Statistics Shibin Liu SAS Beijing R&D

Thank you!