26
Biostatistics, statistical software III. Population, statistical sample. Probability, probability variables. Important distributions. Properties of the normal distribution, the central limit theorem, the standard error. Krisztina Boda PhD Department of Medical Informatics, University of Szeged

Biostatistics, statistical software III. Population, statistical sample. Probability, probability variables. Important distributions. Properties of the

Embed Size (px)

Citation preview

Page 1: Biostatistics, statistical software III. Population, statistical sample. Probability, probability variables. Important distributions. Properties of the

Biostatistics, statistical software III.

Population, statistical sample. Probability, probability variables. Important distributions. Properties of the normal distribution, the central limit theorem, the

standard error.

Krisztina Boda PhD

Department of Medical Informatics, University of Szeged

Page 2: Biostatistics, statistical software III. Population, statistical sample. Probability, probability variables. Important distributions. Properties of the

INTERREG 2Krisztina Boda

Population, sample

Population: the entire group of individuals that we want information about.

Sample: a part of the population that we actually examine in order to get information

A simple random sample of size n consists of n individuals chosen from the population in such a way that every set of n individuals has an equal chance to be in the sample actually selected.

Page 3: Biostatistics, statistical software III. Population, statistical sample. Probability, probability variables. Important distributions. Properties of the

INTERREG 3Krisztina Boda

Examples

Sample data set Questionnaire filled

in by a group of pharmacy students

Blood pressure of 20 healthy women

Population Pharmacy students Students

Blood pressure of women (whoever)

Page 4: Biostatistics, statistical software III. Population, statistical sample. Probability, probability variables. Important distributions. Properties of the

INTERREG 4Krisztina Boda

Sample Population

Bar chart of relative frequencies of a categorical variable

Distribution of that variable in the population

Gender

20 23.0 23.0 23.0

67 77.0 77.0 100.0

87 100.0 100.0

male

female

Total

ValidFrequency Percent Valid Percent

CumulativePercent

Gender

femalemale

Pe

rce

nt

100

80

60

40

20

0

77

23

(approximates)

Page 5: Biostatistics, statistical software III. Population, statistical sample. Probability, probability variables. Important distributions. Properties of the

INTERREG 5Krisztina Boda

Sample Population

Histogram of relative frequencies of a continuous variable

Distribution of that variable in the population

(approximates)

Body height

195.0

190.0

185.0

180.0

175.0

170.0

165.0

160.0

155.0

150.0

Body height

Fre

qu

en

cy

30

20

10

0

Std. Dev = 8.52

Mean = 170.4

N = 87.00

Body height

195.0

190.0

185.0

180.0

175.0

170.0

165.0

160.0

155.0

150.0

Body height

Fre

qu

en

cy

30

20

10

0

Std. Dev = 8.52

Mean = 170.4

N = 87.00

Page 6: Biostatistics, statistical software III. Population, statistical sample. Probability, probability variables. Important distributions. Properties of the

INTERREG 6

Krisztina Boda

Sample Population

Mean (x) Standard deviation (SD) Median

Mean (unknown) Standard deviation

(unknown) Median (unknown)

(approximates)

Body height

195.0

190.0

185.0

180.0

175.0

170.0

165.0

160.0

155.0

150.0

Body height

Fre

qu

en

cy

30

20

10

0

Std. Dev = 8.52

Mean = 170.4

N = 87.00

Page 7: Biostatistics, statistical software III. Population, statistical sample. Probability, probability variables. Important distributions. Properties of the

INTERREG 7Krisztina Boda

The concept of probability

Lets repeat an experiment n times under the same conditions. In a large number of n experiments the event A is observed to occur k times (0 k n).

k : frequency of the occurrence of the event A. k/n : relative frequency of the occurrence of the event A.

0 k/n 1

If n is large, k/n will approximate a given number. This number is called the probability of the occurrence of the event A and it is denoted by P(A).

0 P(A) 1

Page 8: Biostatistics, statistical software III. Population, statistical sample. Probability, probability variables. Important distributions. Properties of the

INTERREG 8Krisztina Boda

Example: the tossing of a (fair) coin

k: number of „head”s

n= 10 100 1000 10000 100000

k= 7 42 510 5005 49998

k/n= 0.7 0.42 0.51 0.5005 0.49998

P(„head”)=0.5

0

0.5

1

10 100 1000 10000

k/n

Page 9: Biostatistics, statistical software III. Population, statistical sample. Probability, probability variables. Important distributions. Properties of the

INTERREG 9Krisztina Boda

Probability facts

Any probability is a number between 0 and 1.

All possible outcomes together must have probability 1.

The probability of the complementary event of A is 1-P(A).

Page 10: Biostatistics, statistical software III. Population, statistical sample. Probability, probability variables. Important distributions. Properties of the

INTERREG 10Krisztina Boda

Rules of probability calculus Assumption: all elementary events are equally probable

Examples: Rolling a dice. What is the probability that the dice shows 5?

If we let X represent the value of the outcome, then P(X=5)=1/6.

What is the probability that the dice shows an odd number? P(odd)=1/2. Here F=3, T=6, so F/T=3/6=1/2.

outcomesofnumbertotaloutcomesfavoriteofnumber

TF

P(A)

Page 11: Biostatistics, statistical software III. Population, statistical sample. Probability, probability variables. Important distributions. Properties of the

INTERREG 11Krisztina Boda

The density curve The density curve is an

idealized description of the overall pattern of a distribution that smoothes out the irregularities in the actual data.

The density curve is on the above the

horizontal axis, and has area exactly 1

underneath it. The area under the curve and

above any range of values is the proportion of all observations that fall in that range.

Page 12: Biostatistics, statistical software III. Population, statistical sample. Probability, probability variables. Important distributions. Properties of the

INTERREG 12Krisztina Boda

Special distribution of a continuous variable: The normal distribution

The normal curve was developed mathematically in 1733 by DeMoivre as an approximation to the binomial distribution. His paper was not discovered until 1924 by Karl Pearson. Laplace used the normal curve in 1783 to describe the distribution of errors. Subsequently, Gauss used the normal curve to analyze astronomical data in 1809. The normal curve is often called the Gaussian distribution.

The term bell-shaped curve is often used in everyday usage.

Page 13: Biostatistics, statistical software III. Population, statistical sample. Probability, probability variables. Important distributions. Properties of the

INTERREG 13Krisztina Boda

Special distribution of a continuous variable: The normal distribution N(,)

The density curves are symmetric, single-peaked and bell-shaped The 68-95-99.7 rule . In the normal distribution with mean and

standard deviation : 68% of the observations fall within of the mean 95% of the observations fall within 2 of the mean 99.7% of the observations fall within 3 of the mean

Page 14: Biostatistics, statistical software III. Population, statistical sample. Probability, probability variables. Important distributions. Properties of the

INTERREG 14Krisztina Boda

Normal distributions N(, )

Probability Density Function

y=normal(x;1;1)

-3 -2 -1 0 1 2 30.0

0.1

0.2

0.3

0.4

0.5

0.6

-3 -2 -1 0 1 2 30.0

0.2

0.4

0.6

0.8

1.0

Probability Density Function

y=normal(x;0;2)

-3 -2 -1 0 1 2 30.0

0.1

0.2

0.3

0.4

0.5

0.6

-3 -2 -1 0 1 2 30.0

0.2

0.4

0.6

0.8

1.0

N(0,1) N(1,1)

N(0,2)

, : parameters (a parameter is a number that describes the distribution)

Page 15: Biostatistics, statistical software III. Population, statistical sample. Probability, probability variables. Important distributions. Properties of the

INTERREG 15Krisztina Boda

„Imagination” of the distribution given the sample mean and sample SD – supposing normal distribution

In the papers generally sample means and SD-s are published. We use these value to know the original distribution

For example, age in years: 55.2 15.7

70.939.5

68.26% of the data are supposed to be in this interval

95.44% of the data are supposed to be in the interval (23.8, 86.6)

Page 16: Biostatistics, statistical software III. Population, statistical sample. Probability, probability variables. Important distributions. Properties of the

INTERREG 16Krisztina Boda

Stent length per lesion (mm): 18.8 10.5 Using these values as parameters, the „normal” distribution

would be like this:

The original distribution of stent length was probably skewed, or if the distribution was normal, there were negative values in the data set.

Page 17: Biostatistics, statistical software III. Population, statistical sample. Probability, probability variables. Important distributions. Properties of the

INTERREG 17Krisztina Boda

Standard normal probabilities N(0,1)z (z): proportion of area to the left of z

-4 0.00003

-3 0.0013

-2.58 0.0049

-2.33 0.0099

-2 0.0228

-1.96 0.0250

-1.65 0.0495

-1 0.1587

0 0.5

1 0.8413

1.65 0.9505

1.96 0.975

2 0.9772

2.33 0.9901

2.58 0.9951

3 0.9987

4 0.99997

The area under the standard normal curve is tabulated

Page 18: Biostatistics, statistical software III. Population, statistical sample. Probability, probability variables. Important distributions. Properties of the

INTERREG 18Krisztina Boda

The use of the normal curve tablez (z): proportion of area to the left of z

-4 0.00003

-3 0.0013

-2.58 0.0049

-2.33 0.0099

-2 0.0228

-1.96 0.0250

-1.65 0.0495

-1 0.1587

0 0.5

1 0.8413

1.65 0.9505

1.96 0.975

2 0.9772

2.33 0.9901

2.58 0.9951

3 0.9987

4 0.99997

-1.96 1.96

0.025 0.0250.95

Find the area under the standard normal curve to the left of z=-1.96, that is, find

the probablity that z<-1.96

P(z<-0.96)=0.025

Page 19: Biostatistics, statistical software III. Population, statistical sample. Probability, probability variables. Important distributions. Properties of the

INTERREG 19Krisztina Boda

Standardization

If a random variable X has normal distribution N(,) distribution, than the variable has is standard normal distribution N(0,1).

zX

Page 20: Biostatistics, statistical software III. Population, statistical sample. Probability, probability variables. Important distributions. Properties of the

INTERREG 20Krisztina Boda

The use of the normal curve tablez (z): proportion of area to the left of

z

-4 0.00003

-3 0.0013

-2.58 0.0049

-2.33 0.0099

-2 0.0228

-1.96 0.0250

-1.65 0.0495

-1 0.1587

0 0.5

1 0.8413

1.65 0.9505

1.96 0.975

2 0.9772

2.33 0.9901

2.58 0.9951

3 0.9987

4 0.99997

0228.0)2()10( zPxP

Example. It was found that the weights of a certain population of laboratory rats are normally distributed with = 14 ounces and σ=2 ounces. In such a population, what percentage of rats would we expect to weigh below 10 ounces?

Solution.1. Standardize X

22

1410

xz

2. Find the area to the left of -2

Page 21: Biostatistics, statistical software III. Population, statistical sample. Probability, probability variables. Important distributions. Properties of the

INTERREG 21Krisztina Boda

Distribution of sample meanshttp://www.ruf.rice.edu/~lane/stat_sim/index.html

Page 22: Biostatistics, statistical software III. Population, statistical sample. Probability, probability variables. Important distributions. Properties of the

INTERREG 22Krisztina Boda

Distribution of sample means The original population is not normally distributed

Page 23: Biostatistics, statistical software III. Population, statistical sample. Probability, probability variables. Important distributions. Properties of the

INTERREG 23Krisztina Boda

The central limit theorem Let’s draw samples from a

population with mean µ and standard deviation σ. Examine the distribuiton of the sample means.

If the sample size n is large (say, at least 30), then the population of all possible sample means approximately has a normal distribution with mean µ and standard deviation σ no matter what probability describes the population sampled.

n

Page 24: Biostatistics, statistical software III. Population, statistical sample. Probability, probability variables. Important distributions. Properties of the

INTERREG 24Krisztina Boda

The standard error of mean(SE or SEM)

is called the standard error of mean

Meaning: the dispersion of the sample means around the (unknown) population mean.

Estimation of SE from the sample:

n

SD

n

n

n

SD

Page 25: Biostatistics, statistical software III. Population, statistical sample. Probability, probability variables. Important distributions. Properties of the

INTERREG 25Krisztina Boda

Review questions and exercises

Problems to be solved by hand-calculations ..\Handouts\Problems hand III.doc

Solutions ..\Handouts\Problems hand III solutions.doc

Problems to be solved using computer - none

Page 26: Biostatistics, statistical software III. Population, statistical sample. Probability, probability variables. Important distributions. Properties of the

INTERREG 26Krisztina Boda

Useful WEB pages

http://www-stat.stanford.edu/~naras/jsm http://www.ruf.rice.edu/~lane/rvls.html http://my.execpc.com/~helberg/statistics.html http://www.math.csusb.edu/faculty/stanton/m26

2/index.html