Probability Models for Distributions of Discrete Variables


Randomly select a college student. Determine x, the number of credit cards the student has.

x = number of cards
p(x) = probability of x occurring

x    p(x)
0    0.20
1    0.30
2    0.20
3    0.15
4    0.10
5    0.05

[Figure: bar chart of p(x) versus x for x = 0 to 5; vertical axis runs from 0.00 to 0.35]

Populations / Samples

A population is a collection of all units of interest.
Example: All college students.

A sample is a collection of units drawn from the population.
Example: Any subcollection of college students.

Probabilities go with populations.

Scientific studies randomly sample from the entire population. Each unit in the sample is chosen randomly. The entire sample is random as well.

Probability Models and Populations

For discrete data, a population and a sample are summarized the same way (for instance, as a table of values and accompanying relative frequencies).

A probability distribution (or model) for a discrete variable is a description of values, with each value accompanied by a probability.

Definitions of Probability

2. The probability of an event is the long-term (technically, forever) relative frequency of occurrence of the event when the experiment is performed repeatedly under identical starting conditions.

3. The probability of an event is the relative frequency of units in the population for which the event applies.

To aggregate these meanings: the probability associated with an event is its relative frequency of occurrence over all possible ways the phenomenon can take place.

Probability Models

“All models are wrong. Some are useful.”
George Box, industrial statistician

A probability distribution for a discrete variable is tabulated as a set of values, x, and probabilities, p(x).

x      p(x)
0      0.20
1      0.30
2      0.20
3      0.15
4      0.10
5      0.05
SUM    1.00

Probabilities:
Must be nonnegative.
Must sum to 1 (within rounding error).
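The two requirements above can be checked mechanically. Here is a minimal sketch in Python, assuming the distribution is stored as a dict that maps each value x to p(x). The function name `is_valid_distribution` and the tolerance of 0.001 (to allow for rounding error) are illustrative choices, not part of the slides.

```python
import math

def is_valid_distribution(dist, tol=1e-3):
    """Return True if every p(x) is nonnegative and the p(x) sum to 1 within tol."""
    nonnegative = all(p >= 0 for p in dist.values())
    sums_to_one = math.isclose(sum(dist.values()), 1.0, abs_tol=tol)
    return nonnegative and sums_to_one

# Credit card distribution from the table above.
credit_cards = {0: 0.20, 1: 0.30, 2: 0.20, 3: 0.15, 4: 0.10, 5: 0.05}

print(is_valid_distribution(credit_cards))
```

A table whose entries are negative, or whose total is clearly different from 1, would fail this check.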

The mean of a probability distribution is the mean value observed over all possible outcomes of the phenomenon.

Consider idealized data sets.

Idealized data set, n = 100:

x    p(x)    Count
0    0.20    20 0s
1    0.30    30 1s
2    0.20    20 2s
3    0.15    15 3s
4    0.10    10 4s
5    0.05     5 5s

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
4 4 4 4 4 4 4 4 4 4
5 5 5 5 5

Mean = 1.80, SD = 1.44

Idealized data set, n = 1000:

x    p(x)    Count
0    0.20    200 0s
1    0.30    300 1s
2    0.20    200 2s
3    0.15    150 3s
4    0.10    100 4s
5    0.05     50 5s

0 0 0 0 0 0 … 0 (200)
1 1 1 1 1 1 … 1 (300)
2 2 2 2 2 2 … 2 (200)
3 3 3 3 … 3 (150)
4 4 … 4 (100)
5 … 5 (50)

Mean = 1.80, SD = 1.44

Values for the mean and standard deviation don’t depend on the number of data values. They depend instead on the relative location of the data values, that is, on the distribution in relative-frequency terms.
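This can be checked numerically. The sketch below builds idealized data sets of two sizes from the credit card distribution and computes the mean and standard deviation of each. The helper name `idealized_data` is hypothetical, and the population form of the standard deviation (`statistics.pstdev`) is an assumption made here because each idealized data set stands in for the whole population.

```python
import statistics

credit_cards = {0: 0.20, 1: 0.30, 2: 0.20, 3: 0.15, 4: 0.10, 5: 0.05}

def idealized_data(dist, n):
    """Build a data set of size n in which each value x appears p(x) * n times."""
    data = []
    for x, p in dist.items():
        data.extend([x] * round(p * n))
    return data

for n in (100, 1000):
    data = idealized_data(credit_cards, n)
    mean = statistics.mean(data)
    sd = statistics.pstdev(data)  # population SD: divides by n, not n - 1
    print(f"n = {len(data)}: mean = {mean:.2f}, SD = {sd:.2f}")
```

Both data sets give the same mean and SD because the relative frequencies are identical; only the counts differ.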

Formula for the mean of a probability distribution:

$\mu = \sum x\,p(x)$

$\mu$ (the Greek letter “mu”, pronounced “myou”) is synonymous with “population mean”. $\sum$ is the sum symbol.

$\mu = \sum x\,p(x)$

x      p(x)    x p(x)
0      0.20    0 × 0.20 = 0.00
1      0.30    1 × 0.30 = 0.30
2      0.20    2 × 0.20 = 0.40
3      0.15    3 × 0.15 = 0.45
4      0.10    4 × 0.10 = 0.40
5      0.05    5 × 0.05 = 0.25
Sum    1.00    1.80

Multiply each value by its probability.
Sum the products.

Mean = 1.80

The standard deviation of a probability distribution is the standard deviation of the values observed over all possible outcomes of the phenomenon.

Formula:

$\sigma = \sqrt{\sum (x - \mu)^2\,p(x)}$

$\sigma$ (the Greek letter “sigma”) denotes “population standard deviation”.

First obtain the variance.

$\sigma^2 = \sum (x - \mu)^2\,p(x)$

x      p(x)    (x – μ)² p(x)
0      0.20    (0 – 1.8)² × 0.20 = 0.648
1      0.30    (1 – 1.8)² × 0.30 = 0.192
2      0.20    (2 – 1.8)² × 0.20 = 0.008
3      0.15    (3 – 1.8)² × 0.15 = 0.216
4      0.10    (4 – 1.8)² × 0.10 = 0.484
5      0.05    (5 – 1.8)² × 0.05 = 0.512
Sum            2.060

$\sigma^2 = 2.060$ (take the square root to obtain $\sigma$)

$\sigma = \sqrt{2.060} = 1.44$
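The mean and variance formulas translate directly into code. Here is a minimal sketch, assuming the distribution is stored as a dict from x to p(x); the function names `dist_mean` and `dist_variance` are illustrative, not from the slides.

```python
import math

def dist_mean(dist):
    """Mean of a discrete distribution: sum of x * p(x)."""
    return sum(x * p for x, p in dist.items())

def dist_variance(dist):
    """Variance of a discrete distribution: sum of (x - mu)^2 * p(x)."""
    mu = dist_mean(dist)
    return sum((x - mu) ** 2 * p for x, p in dist.items())

credit_cards = {0: 0.20, 1: 0.30, 2: 0.20, 3: 0.15, 4: 0.10, 5: 0.05}

mu = dist_mean(credit_cards)
variance = dist_variance(credit_cards)
sigma = math.sqrt(variance)  # the SD is the square root of the variance

print(f"mean = {mu:.2f}, variance = {variance:.3f}, SD = {sigma:.2f}")
```

The printed values should agree with the hand calculation in the tables: 1.80, 2.060, and 1.44.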

In the idealized data set of n = 100 values (20 0s, 30 1s, 20 2s, 15 3s, 10 4s, 5 5s):

Mean = 1.80, SD = 1.44
Mean – SD = 0.36
Mean + SD = 3.24

The values within 1 SD of the mean are the 1s, 2s, and 3s: 30 + 20 + 15 = 65 values, and 65 / 100 = 65%.

The same result comes directly from the probability distribution:

p(1) + p(2) + p(3) = 0.30 + 0.20 + 0.15 = 0.65
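The "within 1 SD of the mean" calculation can be written generally. This sketch assumes a dict from x to p(x); the function name `prob_within_k_sd` and its parameter `k` (the number of standard deviations) are hypothetical names chosen for illustration.

```python
import math

def prob_within_k_sd(dist, k=1.0):
    """Total probability of the values x with mu - k*sigma <= x <= mu + k*sigma."""
    mu = sum(x * p for x, p in dist.items())
    sigma = math.sqrt(sum((x - mu) ** 2 * p for x, p in dist.items()))
    low, high = mu - k * sigma, mu + k * sigma
    return sum(p for x, p in dist.items() if low <= x <= high)

credit_cards = {0: 0.20, 1: 0.30, 2: 0.20, 3: 0.15, 4: 0.10, 5: 0.05}

p_within_1_sd = prob_within_k_sd(credit_cards, k=1)
print(round(p_within_1_sd, 2))
```

For the credit card distribution, only x = 1, 2, and 3 fall inside the interval, so the result is 0.65.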

x = number of children in a randomly selected college student’s family.

x     p(x)
1     0.2194
2     0.2806
3     0.2329
4     0.1442
5     0.0736
6     0.0317
7     0.0124
8     0.0043
9     0.0005
10    0.0003

[Figure: bar chart of probability versus number of children (1 to 10); vertical axis runs from 0.00 to 0.30]

0.2194 = 21.94% of all college students come from a 1-child family.

Guess at the mean? Above 2 (right skew, so mean > mode).

To determine the mean, multiply values by probabilities, x p(x), and sum these.

x      p(x)      x p(x)
1      0.2194    1(0.2194) = 0.2194
2      0.2806    2(0.2806) = 0.5612
3      0.2329    3(0.2329) = 0.6987
4      0.1442    0.5768
5      0.0736    0.3680
6      0.0317    0.1902
7      0.0124    0.0868
8      0.0043    0.0344
9      0.0005    0.0045
10     0.0003    0.0030
Sum 55 1.0000    2.7430

Mean: μ = 2.7430

55/10 = 5.50 is not the mean (that averages the x values without weighting them by probability).

1.000/10 = 0.10 is not the mean (that averages the probabilities).

To determine the variance, multiply squared deviations from the mean by probabilities, $(x - \mu)^2\,p(x)$, and sum these.

x      p(x)      (x – μ)² p(x)
1      0.2194    (1 – 2.743)² × 0.2194 = 0.6665
2      0.2806    (2 – 2.743)² × 0.2806 = 0.1549
3      0.2329    (3 – 2.743)² × 0.2329 = 0.0154
4      0.1442    0.2278
5      0.0736    0.3749
6      0.0317    0.3363
7      0.0124    0.2247
8      0.0043    0.1188
9      0.0005    0.0196
10     0.0003    0.0158
Sum 55 1.0000    2.1548

Variance: σ² = 2.1548

The standard deviation is the square root of the variance.

$\sigma = \sqrt{2.1548} = 1.468$

Examining the data set consisting of the number of children in the family recorded for all students: the mean is 2.743; the standard deviation is 1.468.
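The row-by-row calculation for the number-of-children distribution can be reproduced as a printed table. This is a sketch only; the variable names and the column layout are illustrative choices. The probabilities are the rounded values from the table, so totals may differ from exact values in the last decimal place.

```python
import math

children = {1: 0.2194, 2: 0.2806, 3: 0.2329, 4: 0.1442, 5: 0.0736,
            6: 0.0317, 7: 0.0124, 8: 0.0043, 9: 0.0005, 10: 0.0003}

# Mean: sum of x * p(x).
mu = sum(x * p for x, p in children.items())

# One row per value: x, p(x), x * p(x), and (x - mu)^2 * p(x).
rows = [(x, p, x * p, (x - mu) ** 2 * p) for x, p in children.items()]

variance = sum(row[3] for row in rows)
sigma = math.sqrt(variance)

print(f"{'x':>2} {'p(x)':>7} {'x*p(x)':>8} {'(x-mu)^2*p(x)':>14}")
for x, p, xp, dev in rows:
    print(f"{x:>2} {p:>7.4f} {xp:>8.4f} {dev:>14.4f}")
print(f"mean = {mu:.4f}, variance = {variance:.4f}, SD = {sigma:.3f}")
```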

Determine the probability a student is from a family with more than 5 children.

P(x > 5) = 0.0317 + 0.0124 + 0.0043 + 0.0005 + 0.0003 = 0.0492

4.92% of all college students come from families with more than 5 children (they have 5 or more brothers and sisters).

Determine the probability a student is from a family with at most 3 children.

P(x ≤ 3) = 0.2194 + 0.2806 + 0.2329 = 0.7329

Determine the probability a student is from a family with at least 7 children.

P(x ≥ 7) = 0.0124 + 0.0043 + 0.0005 + 0.0003 = 0.0175

Good idea: Take the reciprocal of a small probability. 1/0.0175 = 57.1, so about 1 in 57 students.

Determine the probability a student is from a family with fewer than 5 children.

P(x < 5) = 0.2194 + 0.2806 + 0.2329 + 0.1442 = 0.8771

Equivalent phrasings:

at most 3                   at least 7
less than or equal to 3     greater than or equal to 7
no more than 3              no fewer/less than 7
x ≤ 3                       x ≥ 7
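Each of these event probabilities is a sum of p(x) over the values that satisfy a condition. A minimal sketch, in which the helper name `prob` and the use of lambda expressions to describe events are illustrative assumptions:

```python
children = {1: 0.2194, 2: 0.2806, 3: 0.2329, 4: 0.1442, 5: 0.0736,
            6: 0.0317, 7: 0.0124, 8: 0.0043, 9: 0.0005, 10: 0.0003}

def prob(dist, event):
    """Sum p(x) over the values x for which event(x) is true."""
    return sum(p for x, p in dist.items() if event(x))

p_more_than_5 = prob(children, lambda x: x > 5)    # more than 5
p_at_most_3 = prob(children, lambda x: x <= 3)     # at most 3
p_at_least_7 = prob(children, lambda x: x >= 7)    # at least 7
p_fewer_than_5 = prob(children, lambda x: x < 5)   # fewer than 5

for label, value in [("P(x > 5)", p_more_than_5), ("P(x <= 3)", p_at_most_3),
                     ("P(x >= 7)", p_at_least_7), ("P(x < 5)", p_fewer_than_5)]:
    print(f"{label} = {value:.4f}")

# Reciprocal of a small probability: "about 1 in this many students".
print(f"1 in {1 / p_at_least_7:.1f}")
```

Note how "at most" and "at least" map to `<=` and `>=`, while "more than" and "fewer than" map to the strict comparisons.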

Determine the probability that the number of children in a student’s family falls within 1 standard deviation of the mean.

Guess? 0.68

Mean = 2.743
SD = 1.468

1 SD below the mean: 2.743 – 1.468 = 1.275
1 SD above the mean: 2.743 + 1.468 = 4.211

Values are within 1 SD of the mean if they are between these: x = 2, 3, or 4.

The probability of being between these:

0.2806 + 0.2329 + 0.1442 = 0.6577

Determine the probability that the number of children in a student’s family falls within 2 standard deviations of the mean.

Guess? 0.95

2 SD below the mean: 1.275 – 1.468 = -0.193
2 SD above the mean: 4.211 + 1.468 = 5.679

Between -0.193 and 5.679. (Equivalent to 5 or fewer.)

We know an outcome more than 5 has probability 0.0492.

The probability of an outcome at most 5 is 1 – 0.0492 = 0.9508.
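The complement rule used here can be compared with a direct sum. The sketch below uses the rounded probabilities exactly as tabulated; because those rounded values add up to 0.9999 rather than 1, the two routes differ in the fourth decimal place, which is the kind of rounding error the sum-to-1 rule allows for. Variable names are illustrative.

```python
children = {1: 0.2194, 2: 0.2806, 3: 0.2329, 4: 0.1442, 5: 0.0736,
            6: 0.0317, 7: 0.0124, 8: 0.0043, 9: 0.0005, 10: 0.0003}

# Complement rule: P(x <= 5) = 1 - P(x > 5).
p_more_than_5 = sum(p for x, p in children.items() if x > 5)
p_at_most_5_complement = 1 - p_more_than_5

# Direct route: add p(x) for x = 1, ..., 5.
p_at_most_5_direct = sum(p for x, p in children.items() if x <= 5)

# Total of the tabulated (rounded) probabilities.
total = sum(children.values())

print(f"complement: {p_at_most_5_complement:.4f}")
print(f"direct sum: {p_at_most_5_direct:.4f}")
print(f"table total: {total:.4f}")
```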

Pollutant Particles in Streamwater

A company monitors pollutants downstream of discharge into a stream.

Data were collected on 200 days from a point 1 mile downstream of the plant on Stream A.

Data were collected on 100 days from a point 1 mile downstream of the plant on Stream B.

How do means compare? (What are the means?)

How do SDs compare? (What are the SDs?)

[Figure: frequency histograms for Stream A and Stream B; horizontal axis from 0 to 6, vertical axis Frequency from 0 to 70]

Similar means.
Similar standard deviations.
(Similar everything except ns.)

[Figure: percent histograms for Stream A and Stream B; horizontal axis from 0 to 6, vertical axis Percent from 0 to 35]

Stream A
Mean = 1.770
SD = 1.340

Stream B
Mean = 1.775
SD = 1.242

Here is the probability distribution for the number of diners seated at a table in a small café.

x    p(x)
1    0.10
2    0.20
3    ____
4    0.40

a) Fill in the blank.

The probabilities must sum to 1, so p(3) = 1 – (0.10 + 0.20 + 0.40) = 0.30.

b) Determine the mean.

Start by computing x p(x) for each row.

x    p(x)    x p(x)
1    0.10    1 × 0.10 = 0.10
2    0.20    2 × 0.20 = 0.40
3    0.30    3 × 0.30 = 0.90
4    0.40    4 × 0.40 = 1.60

Sum these.

μ = 3.00

c) Determine the standard deviation.

Start by computing (x – μ)² p(x) for each row, with μ = 3.

x    p(x)    (x – 3)² p(x)
1    0.10    (1 – 3)² × 0.10 = 0.40
2    0.20    (2 – 3)² × 0.20 = 0.20
3    0.30    (3 – 3)² × 0.30 = 0.00
4    0.40    (4 – 3)² × 0.40 = 0.40

Sum these.

Variance: σ² = 1.00

SD: σ = √1.00 = 1.00
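The whole café exercise, including the fill-in-the-blank step, can be sketched in a few lines. The variable names are illustrative; the only inputs are the three probabilities given in the table.

```python
import math

# Known probabilities; p(3) is the blank in the table.
known = {1: 0.10, 2: 0.20, 4: 0.40}

# a) Probabilities must sum to 1, so the blank is 1 minus the rest.
p3 = 1 - sum(known.values())
diners = dict(sorted({**known, 3: p3}.items()))

# b) Mean: sum of x * p(x).
mu = sum(x * p for x, p in diners.items())

# c) Variance: sum of (x - mu)^2 * p(x); SD is its square root.
variance = sum((x - mu) ** 2 * p for x, p in diners.items())
sigma = math.sqrt(variance)

print(f"p(3) = {p3:.2f}, mean = {mu:.2f}, variance = {variance:.2f}, SD = {sigma:.2f}")
```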

[Optional] Application

This framework makes it possible to obtain fairly good approximations to means and standard deviations from a histogram of continuous data.

Example

Here are waiting times between student arrivals in a class. There are 21 students (20 waits).

[Figure: histogram of Waiting Time; horizontal axis from 0 to 50, vertical axis Frequency from 0 to 10]

Approximate the mean and median. How do they compare?

Example: Mean

For each class, determine its frequency and corresponding midpoint. For the first class: Frequency = 10, Midpoint = 5.

Tabulate frequencies and midpoints, then obtain relative frequencies.

Midpoint    Frequency    Relative Frequency
5           10           10/20 = 0.50
15          5            5/20 = 0.25
25          3            3/20 = 0.15
35          1            1/20 = 0.05
45          1            1/20 = 0.05
Total       20           1.00

Proceed as a discrete population distribution, applying the formula $\mu = \sum x\,p(x)$ with the midpoints as x and the relative frequencies as p(x).

Midpoint    Rel Freq    Product
5           0.50        5(0.50) = 2.50
15          0.25        15(0.25) = 3.75
25          0.15        25(0.15) = 3.75
35          0.05        35(0.05) = 1.75
45          0.05        45(0.05) = 2.25
Total       1.00        14.00

Mean ≈ 14.00
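The midpoint method can be sketched as follows. The mean calculation mirrors the table. Applying the same relative-frequency formula to approximate the standard deviation is an assumption made here for illustration, since the slides do not show that calculation step by step.

```python
import math

# Class midpoints and frequencies read from the histogram.
midpoints = [5, 15, 25, 35, 45]
frequencies = [10, 5, 3, 1, 1]

n = sum(frequencies)
rel_freqs = [f / n for f in frequencies]

# Treat midpoints as x and relative frequencies as p(x).
approx_mean = sum(m * r for m, r in zip(midpoints, rel_freqs))
approx_var = sum((m - approx_mean) ** 2 * r for m, r in zip(midpoints, rel_freqs))
approx_sd = math.sqrt(approx_var)

print(f"n = {n}, approximate mean = {approx_mean:.2f}, approximate SD = {approx_sd:.2f}")
```

These are approximations because every observation in a class is treated as if it sat at the class midpoint.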

Example: Median

Find the value with 50% below and 50% above.

Midpoint    Rel Freq
5           0.50
15          0.25
25          0.15
35          0.05
45          0.05
Total       1.00

10 of 20 = 50% of the waits are below 10, so the median is approximately 10.

Approximations from the histogram:

Median    10.00
Mean      14.00
Range     44
S.D.      11

Example: Data / Exact Values

1.3 1.9 1.9 2.5 2.6 3.0 3.6 3.7 5.9 9.7 10.4 10.6 11.2 13.5 15.9 21.4 27.5 29.8 33.6 43.5

          Approximation    Actual Value
Median    10.0             10.05
Mean      14.0             12.68
Range     44               42.2
SD        11               12.31
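The actual values can be recomputed from the 20 waiting times with Python's standard `statistics` module. One assumption to flag: the sample form of the standard deviation (`statistics.stdev`, which divides by n – 1) is used below because it reproduces the quoted 12.31; the slides do not state which form was used.

```python
import statistics

waits = [1.3, 1.9, 1.9, 2.5, 2.6, 3.0, 3.6, 3.7, 5.9, 9.7,
         10.4, 10.6, 11.2, 13.5, 15.9, 21.4, 27.5, 29.8, 33.6, 43.5]

median = statistics.median(waits)     # average of the 10th and 11th sorted values
mean = statistics.mean(waits)
data_range = max(waits) - min(waits)
sample_sd = statistics.stdev(waits)   # sample SD (n - 1 divisor), an assumption

print(f"median = {median:.2f}, mean = {mean:.3f}, "
      f"range = {data_range:.1f}, SD = {sample_sd:.2f}")
```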
