PROBABILITY & STATISTICS Prepared by: Ms. KAREN S. TAFALLA

Probability and Statistics Review pt 1


Page 1: Probability and Statistics Review pt 1

PROBABILITY &

STATISTICS

Prepared by:

Ms. KAREN S. TAFALLA

Page 2: Probability and Statistics Review pt 1

INTRODUCTION

STATISTICS is a collection of methods for planning experiments, obtaining data, and then organizing, summarizing, analyzing, interpreting, and drawing conclusions based on the data.

DESCRIPTIVE STATISTICS consists of procedures used to summarize and describe the important characteristics of a set of measurements.

INFERENTIAL STATISTICS consists of procedures used to make inferences about population characteristics from

information contained in a sample drawn from this population.

Page 3: Probability and Statistics Review pt 1

"The theory of statistics uses probability to measure the

uncertainty associated with an inference. It enables us to

calculate the probabilities of observing specific samples,

under specific assumptions about the population. The

statistician uses these probabilities to evaluate the

uncertainties associated with sample inferences."

Definition of terms:

Data are information or facts necessary to conduct a certain

study.

A variable is a characteristic that changes or varies over time and/or for different individuals or objects under consideration.

A random variable is a variable whose numerical value is determined by the outcome of some chance experiment.

Page 4: Probability and Statistics Review pt 1

An experimental unit is the individual or object on which a

variable is measured. A single measurement or data

value results when a variable is actually measured on an

experimental unit.

The population in a statistical study is the group of objects about which conclusions are to be drawn.

A sample is a subset of measurements selected from the

population of interest.

A parameter is a numerical measurement describing some characteristic of a population, and a statistic is a numerical measurement describing some characteristic of a sample.

Page 5: Probability and Statistics Review pt 1
Page 6: Probability and Statistics Review pt 1

Univariate data result when a single variable is measured on a single experimental unit.

Bivariate data result when two variables are measured on a single experimental unit.

Multivariate data result when more than two variables are measured.

Page 7: Probability and Statistics Review pt 1

A. Types of variables:

A qualitative variable measures a quality or characteristic on each experimental unit.

Ex. - taste ranking: excellent, good, fair, poor

- color of M&M candy: brown, yellow, red, orange, green, blue

A quantitative variable measures a numerical quantity or amount on each experimental unit.

Ex. - weight of package ready to be shipped

- volume of orange juice in a glass

Page 8: Probability and Statistics Review pt 1

Types of Quantitative Data:

Discrete data result from either a finite number of possible values or a countable number of possible values (that is, the number of possible values is 0, 1, 2, and so on).

Continuous data result from infinitely many possible values that can be associated with points on a continuous scale in such a way that there are no gaps or interruptions.

Page 9: Probability and Statistics Review pt 1

B. Four Levels of Measurement:

The nominal level of measurement is characterized by the

data that consist of names, labels or categories only, and the

data cannot be arranged in an ordering scheme.

Ex. - collection of “ yes, no, undecided” responses to a

survey question.

- responses consisting of 10 nurses, 15 teachers,

16 engineers, 5 priests, 20 businessmen.

The ordinal level of measurement involves data that may

be arranged in some order, but differences between data

values either cannot be determined or are meaningless.

Ex. -In a sample of 24 car stereos, 15 were rated

“good”, 6 were rated “better”, 3 were rated “ best”

-in considering employee promotion, a manager

ranked Myrna 3rd, Al 7th, and Jena 10th

Page 10: Probability and Statistics Review pt 1

The interval level of measurement is like the ordinal level, with the additional property that meaningful amounts of differences between data can be determined. However, there is no inherent zero starting point.

Ex. -body temperatures ( in degrees Celsius )

The ratio level of measurement is the interval level modified

to include the inherent zero starting point. For values at

this level, differences and ratios are meaningful.

Ex. -heights of pine trees along Session road.

- temperature readings on the Kelvin scale, since the scale has an absolute zero

Page 11: Probability and Statistics Review pt 1

Classify the following statements as belonging to the area of descriptive statistics or statistical inference:

(a) As a result of recent cutbacks by the oil-producing nations, we can expect the price of gasoline to double in the next year.

(b) At least 5% of all fires reported last year in a certain city were deliberately set by arsonists.

(c) Of all patients who have received this particular type of drug at a local clinic, 60% later developed significant side effects.

(d) Assuming that less than 20% of the Colombian coffee beans were destroyed by frost this past winter, we should expect an increase of no more than 30 cents for a kilogram of coffee by the end of the year.

(e) As a result of a recent poll, most Americans are in favor of building additional nuclear power plants.

Page 12: Probability and Statistics Review pt 1

EXERCISES: Understanding the concepts

A. Identify the experimental units on which the following variables are measured:

1. Gender of student

2. Number of errors on a midterm exam

3. Age of a cancer patient

4. Number of flowers on an azalea plant

5. Color of a car entering the parking lot

Page 13: Probability and Statistics Review pt 1

B. Identify each variable as quantitative or qualitative:

1. Amount of time it takes to assemble a simple puzzle

2. Number of students in a first grade classroom

3. Rating of newly elected politician ( excellent, good,

fair, poor )

4. State in which a person lives.

C. Identify the following quantitative variables as discrete

or continuous:

1. Population in a particular area of the Philippines

2. Weight of newspapers recovered for recycling on a

single day.

3. Time to complete a probability exam

Page 14: Probability and Statistics Review pt 1

D. A data set consists of the ages at death for each of the 41 past presidents of the United States.

1. Is this set of measurements a population or a sample?

2. What is the variable being measured?

3. Is the variable in item 2 quantitative or qualitative?

E. Determine which of the four levels of measurement is most appropriate:

1. Weights of a sample of M&M candies

2. Instructors rated as superior, above average, average,

or poor

3. Lengths (in minutes) of movies

4. Zip codes

5. Movies listed according to their genre, such as comedy,

adventure, and romance

Page 15: Probability and Statistics Review pt 1

FREQUENCY DISTRIBUTION

When a set of data includes a large number of observed values, it becomes practical to group the data into classes or categories with the corresponding number of items falling into each class. The result is a tabular arrangement called a frequency distribution.

Definition of terms:

A frequency table lists categories (or classes) of scores, along with counts (or frequencies) of the number of scores that fall into each category.

The frequency for a particular class is the number of

original scores that fall into that class.

Page 16: Probability and Statistics Review pt 1

Lower class limits are the smallest numbers that can actually belong to the different classes.

Upper class limits are the largest numbers that can actually belong to the different classes.

Class boundaries are the numbers used to separate classes, but without the gaps created by the class limits. They are obtained by increasing the upper class limits and decreasing the lower class limits by the same amount so that there are no gaps between consecutive classes. The amount to be added or subtracted is one-half the difference between the upper limit of one class and the lower limit of the following class.

Class marks are the midpoints of the classes. They can be found by adding the lower class limit to the upper class limit and dividing the sum by 2.

Page 17: Probability and Statistics Review pt 1

Class width or Class size is the difference between two

consecutive lower class limits or two consecutive lower

class boundaries.

Relative frequency is the ratio of the class frequency to the total frequency.

Cumulative frequency is the accumulated frequency less than (<) or greater than (>) a stated value. We obtain the greater than (>) cumulative frequency if the frequencies are summed from the bottom up to find the number of observations greater than a specified lower class boundary. The less than (<) cumulative frequency is constructed if the frequencies are summed from the top down to find the number of observations less than a particular upper class boundary.

Page 18: Probability and Statistics Review pt 1

A. Steps in constructing Frequency table.

Step 1: Count the number of data points in the set of data.

Step 2: Determine the range R, for the entire data set. The

range is the smallest value in the set of data subtracted

from the largest value

Step 3: Decide on the number of class intervals. The ideal number of class intervals is somewhere between 5 and 15. To approximate the appropriate number of class intervals, we may use Herbert Sturges’ formula

K = 1 + 3.322 log n

where K stands for the number of classes suggested, n represents the total frequency, and log is the base-10 logarithm. Avoid having too many classes or too few classes. Too many classes may lead to several empty classes. Too few classes tend to lose important details of the data.

Page 19: Probability and Statistics Review pt 1

Step 4: Determine the class width by dividing the range by the number of classes. Round the result up to a convenient number. This rounding up (not off) not only is convenient, but also guarantees that all of the data will be included in the frequency table.

Class width (i) = round up of (range/number of classes)

Step 5: Select as the lower limit of the first class either the lowest score or a convenient value slightly less than the lowest score. This value serves as the starting point.

Step 6: Add the class width to the starting point to get the second lower class limit. Add the class width to the second lower class limit to get the third, and so on.

Page 20: Probability and Statistics Review pt 1

Step 7: List the lower class limits in a vertical column,

and enter the upper class limits, which can be easily

identified at this stage.

Step 8: Represent each score by a tally in the

appropriate class.

Step 9: Replace the tally marks in each class with the

total frequency count for that class.

Page 21: Probability and Statistics Review pt 1

Example: The test scores of sixty students in Statistics are recorded as follows:

78 51 61 74 68 78 62 71 88 72

66 77 82 68 68 73 56 82 66 71

58 75 67 75 86 66 70 71 64 73

85 74 62 84 66 92 91 57 61 78

63 73 58 79 61 83 88 81 75 57

68 70 54 79 62 78 59 70 66 81

Page 22: Probability and Statistics Review pt 1

1. Number of data points = 60

2. Range = 92 – 51 = 41

3. Using Sturges’ formula, K = 1 + 3.322 log 60 = 6.91, or about 7. Therefore, the suggested number of class intervals is seven.

4. The class size or width is computed as i = 41/7 = 5.86, rounded up to 6.

Instead of starting the first class at 51, choose to start at the nice round number 50.

Thus, the first class is 50-55. Adding 6 to both limits, we obtain the next interval, 56-61. Because the starting point was moved down to 50, an eighth class (92-97) is needed to include the highest score, 92.

Page 23: Probability and Statistics Review pt 1

CLASS INTERVAL    CLASS BOUNDARIES    MIDPOINT    TALLY    FREQUENCY

50 – 55           49.5 – 55.5

56 – 61           55.5 – 61.5

62 – 67           61.5 – 67.5

68 – 73           67.5 – 73.5

74 – 79           73.5 – 79.5

80 – 85           79.5 – 85.5

86 – 91           85.5 – 91.5

92 – 97           91.5 – 97.5
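The construction steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the scores are integers (so boundaries sit 0.5 beyond the limits), the starting point 50 is the convenient value chosen in the example, and the variable names are made up for this sketch.

```python
import math

# The sixty test scores from the example.
scores = [
    78, 51, 61, 74, 68, 78, 62, 71, 88, 72,
    66, 77, 82, 68, 68, 73, 56, 82, 66, 71,
    58, 75, 67, 75, 86, 66, 70, 71, 64, 73,
    85, 74, 62, 84, 66, 92, 91, 57, 61, 78,
    63, 73, 58, 79, 61, 83, 88, 81, 75, 57,
    68, 70, 54, 79, 62, 78, 59, 70, 66, 81,
]

n = len(scores)                               # Step 1: number of data points
data_range = max(scores) - min(scores)        # Step 2: range
k = round(1 + 3.322 * math.log10(n))          # Step 3: Sturges' formula
width = math.ceil(data_range / k)             # Step 4: round UP, not off
start = 50                                    # Step 5: convenient starting point

# Steps 6 to 9: build each class and count the scores that fall in it.
rows = []
lower = start
while lower <= max(scores):
    upper = lower + width - 1
    freq = sum(lower <= s <= upper for s in scores)
    # (lower limit, upper limit, lower boundary, upper boundary, midpoint, frequency)
    rows.append((lower, upper, lower - 0.5, upper + 0.5, (lower + upper) / 2, freq))
    lower += width

for lo, hi, lb, ub, mid, f in rows:
    print(f"{lo}-{hi}   {lb}-{ub}   {mid}   {f}")
```

Counting directly, as done here, replaces the tally column; the frequencies should add up to the number of data points.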

Page 24: Probability and Statistics Review pt 1

3. The number of television viewing hours per household and the prime viewing times are two factors that affect television advertising income. A random sample of 50 households in a particular viewing area produced the following estimates of viewing hours per household.

a. Starting with the lowest value as the lower class limit, construct a frequency distribution.

b. Determine the class marks, class boundaries, relative frequency, <CF, and >CF.

3.0 6.0 7.5 15.0 12.0 6.6 9.5 14.5 10.5 11.0

6.5 8.0 4.0 5.5 6.0 5.6 13.3 13.1 5.5 12.5

5.0 12.0 1.0 3.5 3.0 2.4 3.8 4.5 8.0 2.5

7.5 5.0 10.0 8.0 3.5 2.6 8.5 2.5 6.4 7.6

9.0 2.0 6.5 1.0 5.0 7.7 9.3 6.5 8.2 8.8

Page 25: Probability and Statistics Review pt 1

GRAPHICAL REPRESENTATION OF FREQUENCY

DISTRIBUTION

A histogram, or frequency histogram, is a bar graph which consists of a set of rectangles, while the frequency polygon is a line graph. Both graphs are intended to show the more salient features of the frequency distribution.

a. HISTOGRAM

The histogram is a set of vertical bars having their bases on the horizontal axis, centered on the class marks. The width of each bar corresponds to the class width and the height corresponds to the class frequency.

A histogram differs from a bar chart in that the bases of the bars are the class boundaries rather than the class limits.

Page 26: Probability and Statistics Review pt 1

b. FREQUENCY POLYGON

The frequency polygon is a modification of the histogram; it is a line graph in which the class frequencies are plotted against the class marks. To close the polygon, an extra class mark at each end must be added. The frequency polygon can also be obtained by connecting the midpoints of the tops of the rectangles in the histogram.

c. OGIVES

A line graph showing the cumulative frequency distribution is called an ogive. For the "less than" ogive, the "less than" cumulative frequencies are plotted against the upper class boundaries. For the "greater than" ogive, the "greater than" cumulative frequencies are plotted directly above the lower class boundaries. These graphs are useful in estimating the number of observations that are less than or more than a specified value.
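As a rough illustration of the quantities an ogive plots, the sketch below computes the relative, "less than" and "greater than" cumulative frequencies. The class frequencies used are the counts obtained by tallying the sixty test scores of the earlier example into the classes 50-55 through 92-97; the variable names are illustrative only.

```python
from itertools import accumulate

# Class boundaries and frequencies for the sixty test scores.
lower_boundaries = [49.5, 55.5, 61.5, 67.5, 73.5, 79.5, 85.5, 91.5]
upper_boundaries = [55.5, 61.5, 67.5, 73.5, 79.5, 85.5, 91.5, 97.5]
freqs = [2, 9, 11, 14, 12, 7, 4, 1]

total = sum(freqs)
relative = [f / total for f in freqs]

# "Less than" CF: sum from the top of the table down.
less_than_cf = list(accumulate(freqs))

# "Greater than" CF: sum from the bottom of the table up.
greater_than_cf = list(accumulate(reversed(freqs)))[::-1]

# Points for the two ogives.
less_than_points = list(zip(upper_boundaries, less_than_cf))
greater_than_points = list(zip(lower_boundaries, greater_than_cf))

for ub, cf in less_than_points:
    print(f"{cf} observations are less than {ub}")
```

Each pair in `less_than_points` is one plotted point of the "less than" ogive, and each pair in `greater_than_points` is one point of the "greater than" ogive.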

Page 27: Probability and Statistics Review pt 1

STEM AND LEAF PLOTS

Another simple way to display the distribution of a quantitative data set is the stem and leaf plot. This procedure was introduced by Tukey and is one of the primary tools of exploratory data analysis. A stem and leaf diagram consists of a series of horizontal rows of numbers. The number used to label a row is called a stem, and the remaining numbers in the row are called leaves.

Page 28: Probability and Statistics Review pt 1

Steps:

1. Divide each measurement into two parts: the stem and

the leaf.

2. List the stems in a column, with a vertical line to their right.

3. For each measurement, record the leaf portion in the same row as its corresponding stem.

4. Order the leaves from the lowest to highest in each stem.

5. Provide a key to your stem and leaf coding so that the

reader can recreate the actual measurements if

necessary.

Page 29: Probability and Statistics Review pt 1

Sometimes the available stem choices result in a plot that

contains too few stems and a large number of leaves

within each stem. In this situation, you can stretch the

stems by dividing each one into several lines, depending

on the leaf values assigned to them. Stems are usually

divided in one of two ways:

Into two lines, with leaves 0-4 in the first line and

leaves 5-9 in the second line.

Into five lines, with leaves 0-1, 2-3, 4-5, 6-7, and

8-9 in the five lines respectively.

Page 30: Probability and Statistics Review pt 1

Example:

The data below are the GPAs of 30 Adamson University freshmen, recorded at the end of the freshman year.

Construct a stem and leaf plot to display the distribution

of the data.

2.0 3.1 1.9 2.5 1.9 2.3 2.6 3.1 2.5 2.1

2.9 3.0 2.7 2.5 2.4 2.7 2.5 2.4 3.0 3.4

2.6 2.8 2.5 2.7 2.9 2.7 2.8 2.2 2.7 2.1
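One possible way to build the plot for these GPAs is sketched below. It assumes the ones digit is used as the stem and the tenths digit as the leaf, which is a natural choice for this data set but not the only one.

```python
from collections import defaultdict

gpas = [
    2.0, 3.1, 1.9, 2.5, 1.9, 2.3, 2.6, 3.1, 2.5, 2.1,
    2.9, 3.0, 2.7, 2.5, 2.4, 2.7, 2.5, 2.4, 3.0, 3.4,
    2.6, 2.8, 2.5, 2.7, 2.9, 2.7, 2.8, 2.2, 2.7, 2.1,
]

# Steps 1 to 3: split each measurement into stem (ones) and leaf (tenths).
leaves = defaultdict(list)
for g in gpas:
    tenths = round(g * 10)            # work in whole tenths to avoid float noise
    stem, leaf = divmod(tenths, 10)
    leaves[stem].append(leaf)

# Step 4: order the leaves within each stem, then print one row per stem.
for stem in sorted(leaves):
    row = " ".join(str(leaf) for leaf in sorted(leaves[stem]))
    print(f"{stem} | {row}")

# Step 5: provide a key.
print("Key: 2 | 5 means a GPA of 2.5")
```

Most leaves land on stem 2, so this is a case where stretching the stems into two or five lines, as described earlier, would give a more informative picture.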

Page 31: Probability and Statistics Review pt 1

DESCRIPTIVE STATISTICS

MEASURES OF CENTRAL TENDENCY

A measure of central tendency gives a single value that acts as a representative average of the values of all the outcomes of your experiment. Three parameters that measure the center of the distribution in some sense are of interest. These parameters are called the population mean, the population median and the population mode.

Page 32: Probability and Statistics Review pt 1

a. THE MEAN

For Ungrouped Data:

Let x1, x2, x3, …, xn be n observations of a random variable X. The sample mean, denoted by x̄ (x-bar), is the arithmetic average of these values. That is,

$$\bar{x} = \frac{x_1 + x_2 + x_3 + \cdots + x_n}{n}$$

For Grouped Data

$$\bar{x} = \frac{\sum_{i=1}^{k} f_i x_i}{\sum_{i=1}^{k} f_i}$$

Where: f_i is the frequency of class interval i

x_i is the class midpoint of class interval i

k is the number of class intervals

Page 33: Probability and Statistics Review pt 1

B. THE MEDIAN

For Ungrouped Data: Let x1, x2, x3, …, xn be sample observations arranged in order from smallest to largest. The sample median for this collection is given by the middle observation if n is odd. If n is even, the sample median is the average of the two middle observations.

For Grouped Data: When the data are grouped into a frequency distribution, the median is obtained by finding the cell that has the middle number and then interpolating within the cell.

$$\tilde{x} = Lb_i + \frac{n/2 - {<}cf_{i-1}}{f_i}\,(i) \quad \text{OR} \quad \tilde{x} = Ub_i - \frac{n/2 - {>}cf_{i-1}}{f_i}\,(i)$$

where:

Lb_i = lower class boundary of the interpolated interval

Ub_i = upper class boundary of the interpolated interval

<cf_{i-1} = less than cumulative frequency of the class before the interpolated interval

>cf_{i-1} = greater than cumulative frequency of the class before the interpolated interval when summing from the bottom up (that is, the number of observations above Ub_i)

f_i = frequency of the interpolated interval

i = class size (when written as the multiplier (i))

n = number of data points

Page 34: Probability and Statistics Review pt 1

C. THE MODE

The last measure of central tendency is the mode. For a finite population, the population mode is the value of X that occurs most often. The mode of a sample is the value that occurs most often in the sample. The drawback to this measure is that there might not be a unique mode: there might be no single number that occurs more often than any other. For this reason, the mode is not a particularly useful descriptive measure.

When the data are grouped into a frequency distribution, the midpoint of the cell with the highest frequency is the mode, since this point represents the highest point (greatest frequency).

Page 35: Probability and Statistics Review pt 1

EXAMPLES:

1. The reaction times for a random sample of 9 subjects to a stimulant were recorded as 2.5, 3.6, 3.1, 4.3, 2.9, 2.3, 2.6, 4.1 and 4.3 seconds. Calculate the mean, median and mode.

$$\text{Mean} = \frac{2.5 + 3.6 + 3.1 + 4.3 + 2.9 + 2.3 + 2.6 + 4.1 + 4.3}{9} = \frac{29.7}{9}$$

Mean = 3.3

Median: 2.3, 2.5, 2.6, 2.9, [3.1], 3.6, 4.1, 4.3, 4.3

Median = 3.1

Mode = 4.3
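The same three values can be checked with Python's standard `statistics` module. This is just a verification sketch, not part of the original solution; `multimode` is used because it returns every value tied for most frequent.

```python
import statistics

# Reaction times (seconds) for the 9 subjects.
times = [2.5, 3.6, 3.1, 4.3, 2.9, 2.3, 2.6, 4.1, 4.3]

mean = statistics.mean(times)          # arithmetic average
median = statistics.median(times)      # middle value of the sorted data (n is odd)
modes = statistics.multimode(times)    # list of the most frequent value(s)

print(f"mean   = {mean:.1f}")
print(f"median = {median}")
print(f"mode   = {modes}")
```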

Page 36: Probability and Statistics Review pt 1

2. The frequency table below represents the final examination scores for a statistics course. Find the mean, the median and the mode.

Class Interval    Frequency    Class Mark    Cumulative Frequency (<CF)

10– 19 3 14.5 3

20 – 29 2 24.5 5

30 – 39 3 34.5 8

40 – 49 4 44.5 12

50 – 59 5 54.5 17

60 – 69 11 64.5 28

70 – 79 14 74.5 42

80 – 89 14 84.5 56

90 – 99 4 94.5 60

Page 37: Probability and Statistics Review pt 1

$$\text{Mean} = \frac{\sum f_i x_i}{\sum f_i}$$

$$\text{Mean} = \frac{(3)(14.5) + (2)(24.5) + (3)(34.5) + (4)(44.5) + (5)(54.5) + (11)(64.5) + (14)(74.5) + (14)(84.5) + (4)(94.5)}{3 + 2 + 3 + 4 + 5 + 11 + 14 + 14 + 4} = \frac{3960}{60}$$

Mean = 66

$$\text{Median} = Lb + \frac{n/2 - {<}cf_{i-1}}{f_i}\,(i)$$

$$\text{Median} = 69.5 + \frac{60/2 - 28}{14}\,(10)$$

Median = 70.93

Mode = Class mark with the highest frequency

Mode = 74.5 and 84.5 (the classes 70 – 79 and 80 – 89 both have the highest frequency, 14)
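The grouped-data computation can be reproduced with a short script. It is a sketch that assumes integer class limits with a common class size of 10, as in this table; the variable names are illustrative.

```python
# Final examination frequency table.
lower_limits = [10, 20, 30, 40, 50, 60, 70, 80, 90]
freqs = [3, 2, 3, 4, 5, 11, 14, 14, 4]
width = 10                                      # class size i

marks = [lo + (width - 1) / 2 for lo in lower_limits]   # 14.5, 24.5, ...
n = sum(freqs)

# Grouped mean: sum of f*x divided by sum of f.
mean = sum(f * x for f, x in zip(freqs, marks)) / n

# Grouped median: find the class containing the n/2-th observation,
# then interpolate from its lower class boundary.
cum_before = 0                                  # "less than" CF of the previous class
for lo, f in zip(lower_limits, freqs):
    if cum_before + f >= n / 2:
        lower_boundary = lo - 0.5
        median = lower_boundary + (n / 2 - cum_before) / f * width
        break
    cum_before += f

# Grouped mode: class mark(s) of the class(es) with the highest frequency.
top = max(freqs)
modes = [x for f, x in zip(freqs, marks) if f == top]

print(f"mean = {mean}, median = {median:.2f}, mode = {modes}")
```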

Page 38: Probability and Statistics Review pt 1

MEASURES OF VARIABILITY

Refers to the extent of scatter or dispersion around the zone of central tendency.

A. RANGE

One measure of variation is the range, which has the advantage of being very easy to compute. The range, R, of a set of n measurements is defined as the difference between the largest and smallest measurements.

Formula: Range = Highest score – Lowest score, or R = H – L

B. VARIANCE and STANDARD DEVIATION

The variance of a population of N measurements is defined to be the average of the squares of the deviations of the measurements about their mean μ. The population variance is denoted by σ² and is given by the formula

$$\sigma^2 = \frac{\sum (x - \mu)^2}{N} \quad \text{for ungrouped data}$$

$$\sigma^2 = \frac{\sum f\,(x - \mu)^2}{\sum f} \quad \text{for grouped data}$$

Page 39: Probability and Statistics Review pt 1

The variance of a sample of n measurements is defined to be the sum of the squared deviations of the measurements about their mean x̄ divided by (n-1). The sample variance is denoted by s² and is given by the formula

$$s^2 = \frac{\sum (x - \bar{x})^2}{n-1} \quad \text{for ungrouped data}$$

$$s^2 = \frac{\sum f\,(x - \bar{x})^2}{\sum f - 1} \quad \text{for grouped data}$$

The standard deviation, in essence, represents the "average amount of variability" in a set of measures, using the mean as a reference point. Strictly speaking, the standard deviation is the positive square root of the average of the squared deviations about the mean, or the positive square root of the variance. The standard deviation is basically a measure of how far each score, on the average, is from the mean.

Page 40: Probability and Statistics Review pt 1

1. The reaction times for a random sample of 9 subjects to a stimulant were recorded as 2.5, 3.6, 3.1, 4.3, 2.9, 2.3, 2.6, 4.1 and 4.3 seconds. Calculate the range, variance and standard deviation.

Range = HV – LV

= 4.3 – 2.3 = 2

$$s^2 = \frac{\sum (x - \bar{x})^2}{n-1}$$

$$= \frac{(2.5-3.3)^2 + (3.6-3.3)^2 + (3.1-3.3)^2 + (4.3-3.3)^2 + (2.9-3.3)^2 + (2.3-3.3)^2 + (2.6-3.3)^2 + (4.1-3.3)^2 + (4.3-3.3)^2}{9-1} = \frac{5.06}{8}$$

= 0.6325 (sample variance)

s = sqrt(0.6325)

= 0.795298686 or 0.80 (sample standard deviation)
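As a cross-check of the arithmetic, the standard `statistics` module computes the sample variance and sample standard deviation with the same (n-1) divisor. This is offered only as a verification sketch.

```python
import statistics

times = [2.5, 3.6, 3.1, 4.3, 2.9, 2.3, 2.6, 4.1, 4.3]

data_range = max(times) - min(times)       # highest value minus lowest value
sample_var = statistics.variance(times)    # divides by n - 1
sample_sd = statistics.stdev(times)        # positive square root of the variance

print(f"range    = {data_range:.1f}")
print(f"variance = {sample_var:.4f}")
print(f"std dev  = {sample_sd:.2f}")
```

For a population rather than a sample, `statistics.pvariance` and `statistics.pstdev` divide by N instead.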

Page 41: Probability and Statistics Review pt 1

The frequency table below represents the final examination scores for a statistics course. Find the population range, population variance and population standard deviation.

Class Interval    Frequency    Class Mark    Cumulative Frequency

10– 19 3 14.5 3

20 – 29 2 24.5 5

30 – 39 3 34.5 8

40 – 49 4 44.5 12

50 – 59 5 54.5 17

60 – 69 11 64.5 28

70 – 79 14 74.5 42

80 – 89 14 84.5 56

90 – 99 4 94.5 60

Page 42: Probability and Statistics Review pt 1

Range = Highest Upper Class Boundary - Smallest Lower Class

Boundary

= 99.5 – 9.5

= 90

$$\sigma^2 = \frac{\sum f\,(x - \mu)^2}{\sum f}$$

$$\sigma^2 = \frac{3(14.5-66)^2 + 2(24.5-66)^2 + 3(34.5-66)^2 + 4(44.5-66)^2 + 5(54.5-66)^2 + 11(64.5-66)^2 + 14(74.5-66)^2 + 14(84.5-66)^2 + 4(94.5-66)^2}{60}$$

σ² = 432.75 (population variance)

σ = sqrt(432.75) = 20.80264406 or 20.80 (population standard deviation)
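The grouped population variance above can be verified with a few lines of Python. This is a minimal sketch using the class marks and frequencies from the table; the names are illustrative.

```python
import math

marks = [14.5, 24.5, 34.5, 44.5, 54.5, 64.5, 74.5, 84.5, 94.5]
freqs = [3, 2, 3, 4, 5, 11, 14, 14, 4]

n = sum(freqs)
mu = sum(f * x for f, x in zip(freqs, marks)) / n           # grouped mean

# Population variance for grouped data: sum of f*(x - mu)^2 over sum of f.
pop_var = sum(f * (x - mu) ** 2 for f, x in zip(freqs, marks)) / n
pop_sd = math.sqrt(pop_var)

# Range for grouped data: highest upper boundary minus lowest lower boundary.
grouped_range = 99.5 - 9.5

print(f"mean = {mu}, variance = {pop_var}, std dev = {pop_sd:.2f}, range = {grouped_range}")
```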

Page 43: Probability and Statistics Review pt 1

Measures of Shape

- refer to the visual characteristics of a certain distribution.

- knowledge of the shape of the distribution can help in concluding whether the distribution is normal or not.

Two (2) Principal Measures of Shape: SKEWNESS and KURTOSIS

Page 44: Probability and Statistics Review pt 1

Measures of Shape

Skewness refers to the symmetry of a distribution. A distribution which is not symmetric with respect to its mean can be termed as either positively skewed or negatively skewed.

Kurtosis refers to the flatness or peakedness of a particular distribution.

Page 45: Probability and Statistics Review pt 1

Skewness

$$SK = \frac{\sum \left[ (X_i - \mu)/\sigma \right]^3}{N}$$

where:

X_i - individual reading

σ - standard deviation

μ - mean

N - population size

SK = 0 Symmetric (Normal)

SK > 0 Positively Skewed

SK < 0 Negatively Skewed

Page 46: Probability and Statistics Review pt 1

Negative skew: The left tail is longer than the right tail. It has relatively few low values. The distribution is said to be left-skewed or "skewed to the left". Example (observations): 1, 1000, 1001, 1002, 1003.

Positive skew: The right tail is longer than the left tail. It has relatively few high values. The distribution is said to be right-skewed or "skewed to the right". Example (observations): 1, 2, 3, 4, 100.

The skewness for a normal distribution is zero, and any symmetric data should have a skewness near zero.
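To see the sign of the skewness on the two example data sets, the sketch below evaluates the population formula SK = Σ[(Xi − μ)/σ]³ / N directly. The function name `skewness` is made up for this illustration, and the third data set (1, 2, 3, 4, 5) is an added symmetric example.

```python
import math

def skewness(values):
    """Population skewness: average of the cubed standardized deviations."""
    n = len(values)
    mu = sum(values) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in values) / n)
    return sum(((x - mu) / sigma) ** 3 for x in values) / n

left_skewed = [1, 1000, 1001, 1002, 1003]    # one very low value: long left tail
right_skewed = [1, 2, 3, 4, 100]             # one very high value: long right tail
symmetric = [1, 2, 3, 4, 5]                  # added example, symmetric about 3

print(skewness(left_skewed))    # negative
print(skewness(right_skewed))   # positive
print(skewness(symmetric))      # zero
```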

Page 47: Probability and Statistics Review pt 1

Kurtosis

$$k = \frac{\sum \left[ (X_i - \mu)/\sigma \right]^4}{N}$$

where:

Xi - individual reading

σ - standard deviation

μ - mean

N - population size

k = 3 MesoKurtic (Normal)

k > 3 LeptoKurtic

k < 3 PlatyKurtic
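The classification can be tried out on small data sets by evaluating k = Σ[(Xi − μ)/σ]⁴ / N directly. The function name `kurtosis` and both data sets are hypothetical, chosen only so that the arithmetic is easy to follow by hand.

```python
import math

def kurtosis(values):
    """Population kurtosis: average of the standardized deviations to the 4th power."""
    n = len(values)
    mu = sum(values) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in values) / n)
    return sum(((x - mu) / sigma) ** 4 for x in values) / n

def classify(k):
    """Label a kurtosis value using the cutoff k = 3 of the normal distribution."""
    if math.isclose(k, 3.0):
        return "mesokurtic"
    return "leptokurtic" if k > 3 else "platykurtic"

flat = [1, 2, 3, 4, 5]                # evenly spread values
peaked = [1, 5, 5, 5, 5, 5, 9]        # most values at the mean, two far out

for data in (flat, peaked):
    k = kurtosis(data)
    print(data, round(k, 2), classify(k))
```

For the evenly spread set the value works out to 1.7 (platykurtic), and for the set concentrated at its mean it works out to 3.5 (leptokurtic).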

Page 48: Probability and Statistics Review pt 1

A platykurtic data set has a flatter peak around its mean, which causes thin tails within the distribution. The flatness results from the data being less concentrated around its mean, due to large variations within observations.

Mesokurtic is a term used in a statistical context where the kurtosis of a distribution is similar, or identical, to the kurtosis of a normally distributed data set.

Leptokurtic distributions have higher peaks around the mean compared to normal distributions, which leads to thick tails on both sides. These peaks result from the data being highly concentrated around the mean, due to lower variations within observations.

Page 49: Probability and Statistics Review pt 1

Examples

1. A technician checks the resistance value of 5 coils and

records the values in ohms: 3.35, 3.37, 3.28, 3.34 and

3.30. Determine the average.

2. Tensile tests on aluminum alloy rods are conducted at three different times, which results in three different average values in megapascals (MPa). On the first occasion, 5 tests are conducted with an average of 207 MPa; on the second occasion, 6 tests, with an average of 203 MPa; and on the last occasion, 3 tests, with an average of 206 MPa. Determine the weighted average.

Page 50: Probability and Statistics Review pt 1

3. Determine the standard deviation of the moisture content

of a roll of kraft paper. The results of six readings across

the paper web are 6.7, 6.0, 6.4, 6.4, 5.9, and 5.8%.

4. Given the frequency distribution of the life of 320

automotive tires in 1000 km as shown in table below,

determine the average and standard deviation

Boundaries Midpoint Frequency

23.5-26.5 25.0 4

26.5-29.5 28.0 36

29.5-32.5 31.0 51

32.5-35.5 34.0 63

35.5-38.5 37.0 58

38.5-41.5 40.0 52

41.5-44.5 43.0 34

44.5-47.5 46.0 16

47.5-50.5 49.0 6

Page 51: Probability and Statistics Review pt 1

PRACTICAL SIGNIFICANCE OF THE

STANDARD DEVIATION

A. TCHEBYSHEFF’S THEOREM

Tchebysheff’s theorem applies to any set of measurements and can be used to describe either a sample or a population. The idea involved in this theorem is as follows. An interval is constructed by measuring a distance kσ on either side of the mean μ. Note that the theorem is true for any number we choose for k as long as it is greater than or equal to 1. Then at least 1 – (1/k²) of the total number of n measurements lie within the constructed interval.

Page 52: Probability and Statistics Review pt 1

The theorem states that:

At least none of the measurements lie in the interval μ-σ to μ+σ (k = 1 gives 1 – 1/1 = 0).

At least 3/4 of the measurements lie in the interval μ-2σ to μ+2σ.

At least 8/9 of the measurements lie in the interval μ-3σ to μ+3σ.
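The bound 1 – 1/k² is easy to tabulate. The sketch below uses exact fractions so the results match the statements of the theorem for k = 1, 2 and 3; the function name is illustrative.

```python
from fractions import Fraction

def chebyshev_lower_bound(k):
    """Minimum fraction of measurements within k standard deviations of the mean."""
    k = Fraction(k)
    if k < 1:
        raise ValueError("the theorem requires k >= 1")
    return 1 - 1 / k ** 2

for k in (1, 2, 3):
    print(f"k = {k}: at least {chebyshev_lower_bound(k)} of the measurements")
```

Because the theorem holds for any k of at least 1, non-integer values such as `Fraction(3, 2)` can be passed as well.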

B. EMPIRICAL RULE

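Another rule helpful in interpreting a value for a standard deviation is the Empirical rule, which applies to a data set having a distribution that is approximately bell-shaped. The empirical rule is often stated in abbreviated form, sometimes called the 68-95-99.7 rule: approximately 68% of the measurements lie within one standard deviation of the mean, approximately 95% within two standard deviations, and approximately 99.7% within three standard deviations.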
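The usual statement of the empirical rule puts roughly 68%, 95% and 99.7% of a bell-shaped data set within one, two and three standard deviations of the mean. The sketch below turns a mean and standard deviation into those three intervals; the function name and the values mean = 100 and sd = 15 are hypothetical, chosen only for illustration.

```python
def empirical_intervals(mean, sd):
    """Return {k: (low, high, approx_percent)} for k = 1, 2, 3 standard deviations."""
    coverage = {1: 68.0, 2: 95.0, 3: 99.7}     # approximate percentages of the rule
    return {k: (mean - k * sd, mean + k * sd, pct) for k, pct in coverage.items()}

# Hypothetical bell-shaped data with mean 100 and standard deviation 15.
for k, (low, high, pct) in empirical_intervals(100, 15).items():
    print(f"about {pct}% of values lie between {low} and {high} (mean ± {k} sd)")
```

To answer a question about a given interval, identify how many standard deviations its endpoints are from the mean and read off the matching percentage.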

Page 53: Probability and Statistics Review pt 1

1. A sample of 3000 observations has a mean of 82 and a standard deviation of 16. Using the empirical rule, find what percentage of the observations fall in the intervals x̄ ± 2s and x̄ ± 3s.

2. The mean life of a certain brand of auto batteries is

44 months with a standard deviation of three

months. Assume that the lives of all auto batteries of

this brand have a bell-shaped distribution. Using the

empirical rule, find the percentage of auto batteries

of this brand that have a life of

a. 41 to 47 months b. 38 to 50 months c. 35 to

53 months


Page 54: Probability and Statistics Review pt 1

3. The ages of cars owned by all employees of a large company have a bell-shaped distribution with a mean of seven years and a standard deviation of 2 years.

a. Using the empirical rule, find the percentage of cars owned by these employees that are i. 5 to 9 years old ii. 1 to 13 years old.

b. Using the empirical rule, find the interval that contains the ages of the cars owned by 95% of all employees of this company.

Page 55: Probability and Statistics Review pt 1

MEASURES OF POSITION

A. PERCENTILE

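A set of n measurements on the variable x has been arranged in order of magnitude. The pth percentile is the value that separates the bottom p% of the ranked scores from the top (100-p)%.

$$\text{Any percentile} = \begin{cases} \dfrac{X_{np} + X_{np+1}}{2} & \text{if } np \text{ is an integer} \\[2ex] X_{\lceil np \rceil} \ \ (np \text{ rounded up to the next larger integer}) & \text{if } np \text{ is not an integer} \end{cases}$$

Here p in the product np is the percentile expressed as a proportion (for example, 0.25 for the 25th percentile).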
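A small sketch of the ungrouped rule follows. It takes the integer case to be the average of the two neighboring ranked values X_np and X_np+1 (the usual convention), and it takes the percentile as a whole number between 1 and 99 so that the integer test is exact. The function name and the ten-value data set are hypothetical.

```python
def percentile(values, pct):
    """pct-th percentile (pct = 1..99) of ungrouped data using the np rule."""
    if not 0 < pct < 100:
        raise ValueError("pct must be between 1 and 99")
    xs = sorted(values)
    # n*p with p = pct/100, kept in integers: quotient and remainder of n*pct / 100.
    q, r = divmod(len(xs) * pct, 100)
    if r == 0:
        # np is an integer: average X_np and X_np+1 (1-based ranks q and q+1).
        return (xs[q - 1] + xs[q]) / 2
    # np is not an integer: round up to rank q+1, which is index q (0-based).
    return xs[q]

ranked = list(range(1, 11))            # hypothetical data: 1, 2, ..., 10
print(percentile(ranked, 50))          # np = 5, so average of 5th and 6th values
print(percentile(ranked, 25))          # np = 2.5, so take the 3rd value

times = [2.5, 3.6, 3.1, 4.3, 2.9, 2.3, 2.6, 4.1, 4.3]
print(percentile(times, 50))           # np = 4.5, so take the 5th value
```

Other textbooks and software packages use slightly different interpolation conventions, so small differences from this rule are to be expected.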

Page 56: Probability and Statistics Review pt 1

For Grouped Data

$$\text{Any Percentile} = Lb + \frac{np - {<}cf_i}{f_i}\,(i)$$

OR

$$\text{Any Percentile} = Ub - \frac{n(1-p) - {>}cf_i}{f_i}\,(i)$$

where:

Lb = lower class boundary of the interpolated interval

Ub = upper class boundary of the interpolated interval

<cf_i = less than cumulative frequency of the class before the interpolated interval

>cf_i = greater than cumulative frequency of the class before the interpolated interval when summing from the bottom up (that is, the number of observations above Ub)

f_i = frequency of the interpolated interval

i = class size (when written as the multiplier (i))

n = number of data points

p = the desired proportion or percentile
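The lower-boundary form of the grouped formula can be sketched as follows, applied to the final examination frequency table used earlier. It assumes integer class limits (so each lower boundary is the lower limit minus 0.5) and a common class size; the function name is illustrative.

```python
def grouped_percentile(lower_limits, freqs, width, pct):
    """pct-th percentile of grouped data: Lb + (np - <cf) / f * i."""
    n = sum(freqs)
    target = n * pct / 100              # np, the rank being located
    cum_before = 0                      # "less than" CF of the classes already passed
    for lo, f in zip(lower_limits, freqs):
        if f > 0 and cum_before + f >= target:
            lower_boundary = lo - 0.5
            return lower_boundary + (target - cum_before) / f * width
        cum_before += f
    raise ValueError("pct must be between 0 and 100")

lower_limits = [10, 20, 30, 40, 50, 60, 70, 80, 90]
freqs = [3, 2, 3, 4, 5, 11, 14, 14, 4]

q1 = grouped_percentile(lower_limits, freqs, 10, 25)
q2 = grouped_percentile(lower_limits, freqs, 10, 50)
q3 = grouped_percentile(lower_limits, freqs, 10, 75)
print(f"Q1 = {q1:.2f}, Q2 = {q2:.2f}, Q3 = {q3:.2f}")
```

The 50th percentile obtained this way agrees with the grouped median found earlier for the same table.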

Page 57: Probability and Statistics Review pt 1

B. QUARTILES are values that divide a set of observations into 4 equal parts. These values, denoted by Q1, Q2 and Q3, are such that 25% of the data fall below Q1, 50% fall below Q2, and 75% fall below Q3.

C. DECILES are values that divide a set of observations into 10 equal parts.