Business Statistics May Module

7/28/2019 Business Statistics May Module


Business Statistics BCM 307

INTRODUCTION

Expected Learning Outcomes: At the end of the course a student should be able to:

Demonstrate an understanding of statistics and its importance for business and management

Demonstrate proficiency with qualitative and quantitative measures through the ability to organize and present data in tables, charts, graphs and polygons

Demonstrate an understanding of measures of central tendency and measures of dispersion

Demonstrate an understanding of time series and their application

Draw scatter diagrams, construct a simple linear regression equation and apply it

Demonstrate proficiency in contingency tables, probability concepts and their application in business

Demonstrate some level of understanding and application of hypothesis testing

1.Introduction

Definition of statistics; Application of statistics in business; Terminologies used in

statistics

2.Collection, Organization and Presentation of Data

Qualitative data: Summary tables, bar, pie and Pareto charts,

Quantitative data: Summary tables, histogram, graphs, polygons, ogive and

Lorenz curve

Time Series: Time series graphs, application and forecasting

3.Numerical Descriptive Measures: Discrete and Continuous Data

Measures of Central Tendency: Mode, Median, Mean

Measures of dispersion: Significance of the measures, Range, Interquartile range, Variance, Standard deviation, Coefficient of variation

4. Probability Distributions.

a) Discrete distribution

b) Normal distribution(Continuous)



Introduction

Standard Normal Distribution

Z-scores

Areas to the Left and Right of x

Calculations of Probabilities Using the Central Limit Theorem

5.Confidence Interval

Introduction

Confidence Interval

Single Population Mean, Population Standard Deviation Known

Confidence Interval, Single Population Mean, Standard Deviation

6.Linear Regression Model and Scatter Diagram (Simple and Multi-linear)

Drawing scatter diagram, Describing relationship, equation relationship between

variables

Use least square method to derive the simple regression equation

Explain the coefficients of the variables and their significance

7.Hypothesis Testing

Introduction

Definition of Hypothesis Testing

Null and Alternate

Hypothesis Testing for the Mean

Hypothesis testing for the Proportion

Support One of the Hypotheses

Decision and Conclusion

Course Texts:

1. Dean, S. and Illowsky, B.; Principles of Business Statistics

2. Berenson, M. et al.; Basic Business Statistics: Concepts and Application. 11th edition (2009)

3. T. Lucey; Quantitative Techniques. 6th edition (2002)

4. R.I. et al.; Quantitative Approaches to Management. 8th edition (1992)

INTRODUCTION

Statistics is a branch of mathematics that transforms numbers into useful information for decision making. It does this by providing a set of methods for analyzing the numbers.

Statistics is therefore the science of data: it involves collecting, classifying, summarizing, organizing, analyzing and interpreting numerical information.

Definition of Terms

The application of statistics can be divided into two broad areas:

Descriptive statistics

Inferential statistics

Descriptive statistics: It utilizes numerical and graphical methods to look for patterns

in a data set, to summarize the information revealed and to present the information in

a convenient form. This could be referred to as analysis of data.

The data is usually presented in the form of tables, charts and graphs, and analyzed using statistics such as the mean, median, mode, variance, standard deviation and coefficient of variation.

Inferential statistics: It uses the data collected from a small group (a sample) to draw conclusions about a larger group (the population). The conclusions may be decisions, predictions or other generalizations about the larger set of data.

Important applications can therefore be summarized as:

Summarizing business data

Drawing conclusions from the data

Making reliable forecasts about business activities

Improving business processes

Statistics: The word statistics has two meanings:

Numerical facts derived from analysis of sample data, for example the mean, standard deviation and proportions. Any numerical fact can also be referred to as a statistic, e.g. number of people, number of countries, marks scored in a test etc.

A field or discipline of study. It is a branch of mathematics that transforms numbers into useful information for decision making. It does this by providing a set of methods for analyzing the numbers. These methods help to find patterns in the numbers, which enables one to determine whether differences in the numbers are just due to chance. In this sense statistics can be seen as a science of data. It involves collecting, classifying, summarizing, analyzing and interpreting numerical information.

Population (target population): It is the set of units that we are interested in studying and about which we need to draw a conclusion: the whole set of elements of focus. Characteristics of a population are called parameters, e.g. population mean $\mu$, population standard deviation $\sigma$, population proportion $p$.

Sample: It is the portion or subset of the population that is selected for analysis. The sample is randomly selected so that it reflects the characteristics of the population. Characteristics of a sample are called statistics, e.g. sample mean $\bar{x}$, sample standard deviation $s$, sample proportion $p$.

Representative sample: The sample selected is representative if it exhibits the typical characteristics possessed by the population of interest (the target population). The most common way to satisfy the representative sample requirement is to use methods that allow us to select a random sample, that is, giving every element an equal chance of being selected.

Random sample: A random sample is selected from the population in such a way that every element, and every possible sample of a given size, has an equal chance of selection.

Element/member: An element of a sample or population is a specific subject or object about which information is collected, e.g. a firm, a country, a person, a university etc.

Variable: It is a characteristic under study that assumes different values for different elements. For example, scores in a test: different scores are expected for different students when they do a given test. In the case of a firm, the profit made at different times may be different; people's tastes for a given product may be different etc.

Observation/measurement: The value of a variable for an element, e.g. 120 cm as a height, a yes for an opinion, Sh. 20 000 as an income.

Data set: A collection of observations on one or more variables.
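The terms above can be illustrated with a small sketch. The income figures below are hypothetical and are not taken from the module; the function name `draw_random_sample` is likewise only an illustrative choice. The sketch draws a simple random sample and contrasts a parameter (the population mean) with a statistic (the sample mean).

```python
import random

# Hypothetical population: monthly incomes (Sh.) of 10 households.
population = [20000, 35000, 18000, 42000, 27000,
              51000, 30000, 24000, 39000, 33000]


def draw_random_sample(units, size, seed=None):
    """Return a simple random sample of `size` units, drawn without replacement."""
    rng = random.Random(seed)
    return rng.sample(units, size)


sample = draw_random_sample(population, 4, seed=1)

# A parameter describes the population; a statistic describes the sample.
population_mean = sum(population) / len(population)
sample_mean = sum(sample) / len(sample)

print("sample:", sample)
print("population mean (parameter):", population_mean)
print("sample mean (statistic):", sample_mean)
```

A different seed gives a different sample and therefore a different sample mean, while the population mean stays fixed.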

Types of variables

A variable may be classified as qualitative or quantitative.

Qualitative Variable

These are variables whose values cannot be measured on a natural numerical scale but can only be classified into groups or categories.

The data from a categorical variable are measured on one of two scales: nominal or ordinal.

Nominal scale: This classifies data into distinct categories that cannot be ranked. For example gender (female or male), preference for a product or a service (e.g. a soft drink), a yes or no response etc. This is the weakest form of measurement.

Ordinal scale: It classifies data into distinct categories that can be ranked, e.g. responses such as excellent, very good, fair, poor etc. Though it is possible to rank the categories, the scale is still weak in that the size of the difference between categories cannot be measured.

Quantitative Variable

The measurements are recorded on a naturally occurring numerical scale. The scale can be either an interval or a ratio scale. Values on these two scales can be ranked, and the difference between two values can also be calculated and interpreted.

Interval scale: Differences between values are meaningful, but ratios of values cannot be interpreted. For example, a student who scores 100% is not twice as intelligent as one who scores 50%.

Ratio scale: It refers to data whose values can also be compared as ratios. This includes data that supports all arithmetic operations (addition, subtraction, multiplication and division). For example sales of a company, income of families, returns from an investment etc.

2. COLLECTION, ORGANIZATION AND PRESENTATION OF DATA

The first step is to identify the type of data one wants to collect: quantitative or qualitative. The second step is to devise a suitable method for collecting the data. There are various methods that are used to collect data, for example surveys, designed experiments and observational studies.

Survey: The researcher samples a group of people and asks them questions, from which responses are obtained. Some tools used in the data collection are questionnaires, mail, and telephone or in-person interviews.

Experiment: Designed experiments normally involve strict control over the elements in the study. Two groups are formed: a treatment (experiment) group and a control group.

Observational study: The researcher observes the elements in their natural setting and records the variables of interest.

Regardless of the data collection method, it is likely that the data will be from a sample. Data is classified as primary or secondary. Primary data is collected by the person analyzing the data, while secondary data is obtained by the analyst from publications such as books, journals, newspapers etc.

Organization and Presentation of Data

After data is collected it is cleaned to remove unnecessary entries in the records, and it is then ready for analysis, the process that transforms the raw data into meaningful information which the analyst can use in the decision making process. The analysis process includes organization, presentation, description and inference from the results obtained from the sample to make generalizations about the population.

The techniques or methods used to present and analyze data depend on the type of data that was collected: quantitative or qualitative.

Qualitative Data

The data can be organized and presented in:

Summary tables: frequency, relative frequency or percentage frequency tables

Bar charts

Pie charts

Pareto charts

Example 1

A sample was taken of 25 high school seniors who were planning to join college. Each was asked which major he/she intended to choose from the following categories: business (BUS), economics (ECON), management information systems (MIS), behavioral science (BS) and others.

The responses of the students are listed below:

ECON   MIS    ECON   BUS    BUS
BUS    BUS    other  other  other
other  BS     MIS    other  MIS
ECON   BUS    MIS    BUS    other
BS     MIS    other  other  other

Required:

To organize and present data by constructing

i. Frequency distribution table



ii. Bar chart

iii. Pie chart

Solution

The above data is measured on a categorical basis. The analyst collected the data by identifying the major each student intended to choose from the above categories.

i. Frequency Distribution Table:

This is a summary table that organizes the raw data into a frequency distribution with three columns, as demonstrated below:

Categories   Tally        Frequency
BUS          ||||| |      6
ECON         |||          3
MIS          ||||| |      6
BS           ||           2
Other        ||||| |||    8
Total                     25

To make sure that all the responses from each category are included, the student goes through the raw data, putting a slash on each response and recording it as a tally in the tally column.

The tallies are recorded as slashes, bundled in groups of five, and the frequency of each category is obtained by counting its tallies. To ensure that all items were considered, the student must write the total frequency as shown above; it must match the sample size of the data set.

The frequency distribution table gives us the number of students who chose each particular major. From the frequencies we can identify the categories with the highest or the lowest number of students, and generally describe how the frequency is distributed among the different categories.

A relative frequency distribution table can also be used as a summary table. In this case an additional column is added in which the frequency of each category is divided by the total frequency:

Categories   Tally        Frequency   Relative frequency
BUS          ||||| |      6           6/25 = 0.24
ECON         |||          3           3/25 = 0.12
MIS          ||||| |      6           0.24
BS           ||           2           0.08
Other        ||||| |||    8           0.32
Total                     25          1.00

The total relative frequency is equal to 1.00. From this table we obtain the proportion of the students in each category.

ii. Percentage frequency distribution Table

The summary table can also show percentage frequencies, with a column added to represent them (relative frequency × 100):

Categories   Tally        Frequency   Percentage frequency
BUS          ||||| |      6           6/25 × 100 = 24
ECON         |||          3           12
MIS          ||||| |      6           24
BS           ||           2           8
Other        ||||| |||    8           32
Total                     25          100

The column has a total of 100. Percentages are a simpler way of expressing proportions.

NB: To develop either a relative frequency distribution table or a percentage frequency distribution table, one must first construct the frequency distribution table. It is advisable to put all of them in the same table.
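The three summary columns can be produced together from the frequencies alone. The following is a minimal sketch, assuming the frequencies of the table above as input; the function name `summary_table` is an illustrative choice, not part of the module.

```python
# Frequencies taken from the frequency distribution table above.
frequencies = {"BUS": 6, "ECON": 3, "MIS": 6, "BS": 2, "Other": 8}


def summary_table(freqs):
    """Return (rows, total); each row is (category, f, relative f, percentage f)."""
    total = sum(freqs.values())
    rows = []
    for category, f in freqs.items():
        relative = round(f / total, 2)          # relative frequency = f / total
        percentage = round(100 * f / total, 1)  # percentage = relative frequency x 100
        rows.append((category, f, relative, percentage))
    return rows, total


rows, total = summary_table(frequencies)

print(f"{'Category':<10}{'f':>4}{'Rel. f':>9}{'%':>7}")
for category, f, relative, percentage in rows:
    print(f"{category:<10}{f:>4}{relative:>9.2f}{percentage:>7.1f}")
print(f"{'Total':<10}{total:>4}{1.0:>9.2f}{100.0:>7.1f}")
```

Building all columns in one pass mirrors the advice of keeping the frequency, relative frequency and percentage frequency in the same table.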

iii) Bar Chart

The bar chart presents the data using a horizontal and a vertical axis. The horizontal axis takes the categories, which are represented by bars of equal width separated from each other by uniform spaces. The frequency (or relative frequency or percentage) is marked on the vertical axis.

The scale of the vertical axis is determined by the highest frequency among the categories. The scale must be easy to construct and read.

The vertical axis may also use the relative frequencies or percentages; the scale must be selected to include the highest relative frequency or percentage value. The bar chart is a good presentation of the data from which various information can be drawn.

For example, the business and MIS bars are equal in height: the numbers of students choosing business and MIS majors are equal.

The category "other" is the highest, which means that other courses not distinctly identified are also chosen. Behavioral science has the least number of students. Bar charts are best suited for comparing different categories by checking the heights of the bars.

iv) Pie Chart

The information could also be presented on a pie chart. The chart represents each category by a sector whose size reflects its proportion. The percentages (or relative frequencies) are converted to degrees by multiplying the relative frequency by 360.

Categories   RF     Degrees (RF × 360)
BUS          0.24   86.4
ECON         0.12   43.2
MIS          0.24   86.4
BS           0.08   28.8
Others       0.32   115.2
Total        1.00   360

Then draw the pie chart to show the different categories as sectors of different sizes.

The pie chart easily shows the size of the different portions and makes it easy to draw conclusions.
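The conversion from relative frequencies to sector angles is a single multiplication by 360. A short sketch, assuming the relative frequencies of the example as input (the function name `sector_angles` is illustrative):

```python
# Relative frequencies of the five categories in the example.
relative_frequencies = {"BUS": 0.24, "ECON": 0.12, "MIS": 0.24,
                        "BS": 0.08, "Others": 0.32}


def sector_angles(rel_freqs):
    """Convert each relative frequency to the angle (in degrees) of its sector."""
    return {category: round(rf * 360, 1) for category, rf in rel_freqs.items()}


angles = sector_angles(relative_frequencies)

for category, angle in angles.items():
    print(f"{category:<8}{angle:>7.1f} degrees")
print(f"{'Total':<8}{sum(angles.values()):>7.1f} degrees")
```

Checking that the angles add up to 360 is a quick way to confirm that no category was left out.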

v) Pareto charts

Pareto charts classify categories into the "vital few" and the "trivial many".

The Pareto principle holds when the majority of items in a set of data occur in a small number of categories and the few remaining items are spread out over a larger number of categories.

The separation helps to identify and focus on the important categories.

Example 2

The hotel XYZ samples complaints about the hotel rooms and categorizes them as in the table below. The sample that gave the responses was made up of 106 customers. The table summarizes the complaint categories and the number of customers that complained about each issue.

Required

a. Construct a Pareto chart

b. Which reasons for the complaints do you think the hotel managers should focus on if they want to reduce the number of complaints? Explain.

c. Construct also a pie chart and a bar chart and compare the suitability of each chart in presenting this data.

Solution

Order the categories from the one with the highest frequency to the one with the least, convert the frequencies to percentages and show this column. Also add columns for cumulative frequency and cumulative percentage frequency. The categories can be identified with symbols to avoid a lot of writing. Let the categories in the question be labelled A to H before rearrangement. Arranging them in order from the one with the highest frequency to the one with the lowest gives:

A B E C D F G H

The table has the following information:

     Reasons for complaint            Number of customers
A    Dirty room                       32
B    Not stocked                      17
C    Not ready                        12
D    Too noisy                        10
E    Needs maintenance                17
F    Has too few beds                 9
G    Doesn't have promised features   7
H    No special accommodation         2

Reasons   Frequency   Cumulative Frequency   %      Cumulative %
A         32          32                     30.2   30.2
B         17          49                     16.0   46.2
E         17          66                     16.0   62.3
C         12          78                     11.3   73.6
D         10          88                     9.4    83.0
F         9           97                     8.5    91.5
G         7           104                    6.6    98.1
H         2           106                    1.9    100.0
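The ordering and the cumulative columns of a Pareto table can be sketched as follows, assuming the complaint counts of the example as input. The function name `pareto_table` is illustrative. Each cumulative percentage is computed from the cumulative frequency rather than by adding rounded percentages, so rounding errors do not accumulate.

```python
# Complaint counts from the example (106 customers in total).
complaints = {
    "Dirty room": 32,
    "Not stocked": 17,
    "Not ready": 12,
    "Too noisy": 10,
    "Needs maintenance": 17,
    "Has too few beds": 9,
    "Doesn't have promised features": 7,
    "No special accommodation": 2,
}


def pareto_table(freqs):
    """Rows of (reason, f, cumulative f, %, cumulative %), highest frequency first."""
    total = sum(freqs.values())
    # sorted() is stable, so tied categories keep their original order.
    ordered = sorted(freqs.items(), key=lambda item: item[1], reverse=True)
    rows, cumulative = [], 0
    for reason, f in ordered:
        cumulative += f
        rows.append((reason, f, cumulative,
                     round(100 * f / total, 1),
                     round(100 * cumulative / total, 1)))
    return rows


rows = pareto_table(complaints)
for reason, f, cum_f, pct, cum_pct in rows:
    print(f"{reason:<32}{f:>4}{cum_f:>5}{pct:>7.1f}{cum_pct:>7.1f}")
```

Plotting the frequencies as bars and the cumulative percentage as a line on the same axes gives the Pareto chart.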

Summary:

Summary tables together with the charts are used to describe the proportion of items of interest in each category.

Each chart best suits certain situations, for example:

Bar charts are more suitable for comparing the sizes of categories, especially when they are not many in number; for our purposes, not more than six. With more than that, the chart becomes crowded.

Pie charts are best suited for situations where the main objective is to investigate the portion a category occupies in relation to the whole. Coloring the portions with different colors enhances the display. It is also best for few categories.

Pareto charts sort the frequencies in descending order and provide the cumulative curve on the same graph. This allows the viewer to see which categories matter most in the given situation. The chart allows presentation of many categories, including those with small differences in percentage, because the curve makes it easy to identify the additional proportion contributed by each added category.


Quantitative Data

These are measurements that are recorded on a naturally occurring numerical scale. They are measured on an interval or ratio scale as explained earlier.

Quantitative data can be organized and presented in a number of ways that include:

Ordered array

Stem-and-leaf display

Summary tables

Histogram

Frequency polygons

The cumulative percentage polygon: ogive

Quantitative data can be either discrete or continuous.

Discrete data: It is a variable whose values are countable, i.e. they assume whole number values, e.g. number of persons, cars, companies etc.

Continuous data: It is a variable that can assume any numerical value over a certain interval or intervals, e.g. time taken to serve a customer in a bank, amount of money, height of individuals etc.

Discrete data:

It can be organized and presented in

Ordered Array


Bar chart

Summary tables

Example 1

The following data represent the stock prices of 25 companies.

31 15 13 17 23
16 22 12 23 30
22 18 33 21 18
13 26 16 26 27
22 27 20 20 22

Required: Construct

i. Ordered array

ii. Stem-and-leaf display

i) Ordered Array:

This requires that the data is written in ascending or descending order.

12 13 13 15 16 16 17 18 18

20 20 21 22 22 22 22 23 23

26 26 27 27 30 31 33

An ordered array is most applicable when the data set is not large.

ii) Stem-and-leaf display:

It creates a suitable stem (the leading part: one, two or three digits, depending on the nature of the data) and assigns the remaining digits to what is referred to as the leaf.

Since the above data values have two digits, the tens digit forms the stem and the ones digit the leaf. The stems 1, 2 and 3 represent the tens, twenties and thirties.

Stem | Leaf
1    | 2 3 3 5 6 6 7 8 8
2    | 0 0 1 2 2 2 2 3 3 6 6 7 7
3    | 0 1 3

The ones digits are written after the appropriate tens digit. From the display, values in the twenties are the most frequent and values in the thirties the least.
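The ordered array and the stem-and-leaf display can both be derived from the raw stock prices. The following is a minimal sketch for two-digit values, where the tens digit is the stem and the ones digit is the leaf; the function name `stem_and_leaf` and its `divisor` parameter are illustrative choices.

```python
from collections import defaultdict

# Stock prices of the 25 companies in Example 1.
prices = [31, 15, 13, 17, 23, 16, 22, 12, 23, 30, 22, 18, 33,
          21, 18, 13, 26, 16, 26, 27, 22, 27, 20, 20, 22]

# Ordered array: the data written in ascending order.
ordered_array = sorted(prices)


def stem_and_leaf(values, divisor=10):
    """Group values by stem (value // divisor); leaves are the remainders, in order."""
    display = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, divisor)
        display[stem].append(leaf)
    return dict(display)


display = stem_and_leaf(prices)

print("Ordered array:", ordered_array)
for stem, leaves in display.items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
```

Because every leaf is kept, the original observations can be read back from the display, which is not possible from a frequency table.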

Example 2

The following data represent the monthly rents paid by a sample of 30 households selected from a city.

 429   732   550  1020   750
 540   956  1070   871   880
 650   950   780   900   750
 585   675   989   620   660
 578  1030   930   765   975
1020   840   870   800   820

Solution:

The values contain either 3 or 4 digits. We can take a one-digit stem for the three-digit numbers and a two-digit stem for the four-digit numbers. The leaf is then a two-digit number.

The stem-and-leaf display does not require the data to be arranged in order first; if the leaves are ordered afterwards, the pattern obtained is maintained.

Stem | Leaf
4    | 29
5    | 85 50 40 78
6    | 75 20 60 50
7    | 32 50 65 80 50
8    | 71 80 40 70 00 20
9    | 89 56 30 75 50 00
10   | 20 30 70 20

By looking at the stem-and-leaf display we can observe how the data values are distributed. The stem-and-leaf display does not lose the information on individual observations or measurements.

Example 3

The following data give the number of computer courses taken by 30 business majors who recently graduated from a university.

2 3 2 3 1 4 2 2 3 4
2 3 4 1 2 3 2 1 4 2
1 2 3 1 1 3 2 2 4 1

Required

a. Prepare a frequency distribution table.

b. Compute relative frequency and percentage distributions

c. Draw a bar graph for the frequency distributions

d. What percentage of the graduates takes 2 or 3 computer courses?

Solution

Identify all the values present in the data set: 1, 2, 3 and 4.

Construct the summary table with the columns: number of courses, tally and frequency. The relative and percentage frequency distributions can be included in the same table.

Number of courses   Tally            Frequency (f)   Relative frequency (f/30)   Percentage frequency (× 100)
1                   ||||| ||         7               0.2333                      23.33
2                   ||||| ||||| |    11              0.3667                      36.67
3                   ||||| ||         7               0.2333                      23.33
4                   |||||            5               0.1667                      16.67
Total                                30              1.000                       100

Bar graph (chart)

The graduates who took 2 or 3 courses are the total of those who took 2 and those who took 3: (36.67 + 23.33)% = 60%
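For discrete data each distinct value acts as a category, so the table can be built by counting. A short sketch using the 30 observations of this example (the variable names are illustrative):

```python
from collections import Counter

# Number of computer courses taken by the 30 graduates.
courses = [2, 3, 2, 3, 1, 4, 2, 2, 3, 4,
           2, 3, 4, 1, 2, 3, 2, 1, 4, 2,
           1, 2, 3, 1, 1, 3, 2, 2, 4, 1]

counts = Counter(courses)
n = len(courses)

# value -> (frequency, relative frequency, percentage frequency)
table = {
    value: (counts[value],
            round(counts[value] / n, 4),
            round(100 * counts[value] / n, 2))
    for value in sorted(counts)
}

# Percentage of graduates who took 2 or 3 courses.
share_2_or_3 = 100 * (counts[2] + counts[3]) / n

for value, (f, relative, percentage) in table.items():
    print(f"{value:>2}{f:>5}{relative:>9.4f}{percentage:>8.2f}")
print("Took 2 or 3 courses (%):", share_2_or_3)
```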

Grouped/Continuous data

Discrete data can be presented like categorical data in a bar graph where the values take the horizontal axis. Frequency distribution, relative frequency distribution or percentage distribution tables can be constructed as for categorical variables, with each discrete data value standing as a category. For grouped data, however, the frequencies, relative frequencies and percentages are assigned to an interval of numbers in the table.

A stem-and-leaf display may not be very applicable, and in place of a bar chart grouped data is presented in a histogram.

Sometimes it becomes necessary to look at the values in a data set in the form of classes or groups. Each class gives the total number of values that fall within a given range. One must identify the class width, that is, the range of values accommodated in the class.

Number of classes or groups: there should be neither too few (not less than 3 classes) nor too many (not more than 10) in the context of our class work. In real life, however, we may have data grouped into many more classes.

This is necessary because we are interested in presenting data in a more organized, easily interpretable form and in a way that makes sense.

Example 1

The data on the stock prices of 25 companies:

31 15 13 33 23 16 12 12 23 26 22 18 27 21 18 13 26 16 17 27 22 22 26 20 30

To group the data we can choose a class width of 4 or 5. The approximate number of classes is the range of the data divided by the class width. The range is 33 − 12 = 21, so a width of 4 gives 21/4 = 5.25, rounded up to 6 classes.

A width of 5 gives 21/5 = 4.2, rounded up to 5 classes. Either can be used.

Let's use a class width of 5. Identify the smallest value = 12.

The first class can start at the lowest value in the data, or we can decide to start at 10. This means the first class will contain 10, 11, 12, 13, 14, i.e. 10-14. The next class will contain 15, 16, 17, 18, 19, i.e. 15-19, and so on (20-24 etc.) until all the values are included. The lowest and highest values in each class are included in the interval.

The above classes can also be written as 10 to less than 15, 15 to less than 20 etc. When we use this style the upper value in each class is not included. In each case, however, the class width is five. Be careful to use each style correctly.

The summary table: We can consider including the relative frequency distribution and percentage distribution in the table.

Grouped data can be presented in

i. summary table

ii. histogram

iii. frequency polygon

iv. cumulative frequency curve (ogive)

i) Summary table

Class   Tally        Frequency   Relative frequency (f/25)   % frequency (× 100)   Cumulative %
10-14   ||||         4           0.16                        16                    16
15-19   ||||| |      6           0.24                        24                    40
20-24   ||||| |||    8           0.32                        32                    72
25-29   ||||         4           0.16                        16                    88
30-34   |||          3           0.12                        12                    100
Total                25          1.00                        100

ii) Histogram

This is a graph in which classes are marked on the horizontal axis. The classes are written using class boundaries: each class is adjusted so that 0.5 is subtracted from its lower limit and 0.5 is added to its upper limit, i.e. 9.5-14.5, 14.5-19.5, 19.5-24.5 etc.

The vertical axis takes the frequency, relative frequency or percentage frequency. The scale must include the highest frequency, in this case 8.

Draw bars with heights corresponding to the frequency in each class, making sure that the bars are adjacent (touch), because the data is continuous and any value can be included. The information that can be obtained from a histogram is much like that from a bar chart for discrete or qualitative data. Like the stem-and-leaf display, the histogram can also display the distribution pattern of the data.

Data can be normally (or approximately normally) distributed or skewed, and a histogram displays this information well.
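The relative, percentage and cumulative percentage columns of the grouped summary table follow directly from the class frequencies. The sketch below takes the class frequencies of the table as given (it does not re-tally the raw data) and also derives the class boundaries used for the histogram.

```python
from itertools import accumulate

# Classes and frequencies as given in the summary table.
class_limits = [(10, 14), (15, 19), (20, 24), (25, 29), (30, 34)]
frequencies = [4, 6, 8, 4, 3]

total = sum(frequencies)
relative = [f / total for f in frequencies]
percent = [100 * f / total for f in frequencies]
cumulative = list(accumulate(frequencies))            # running total of frequencies
cumulative_percent = [100 * c / total for c in cumulative]

# Class boundaries: subtract 0.5 from the lower limit, add 0.5 to the upper limit.
boundaries = [(low - 0.5, high + 0.5) for low, high in class_limits]

for (low, high), f, r, p, cp in zip(class_limits, frequencies, relative,
                                    percent, cumulative_percent):
    print(f"{low}-{high}{f:>5}{r:>7.2f}{p:>6.0f}{cp:>6.0f}")
print("Boundaries:", boundaries)
```

The cumulative percentages plotted against the upper class boundaries give the points of the ogive.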



iii) Frequency polygon

It is formed by plotting the middle of each class against the frequency and joining

the points with straight lines. The polygon can be drawn in the histogram by

marking the middle of the bars and joining the points. It is also effectively used to

display the pattern of the data across the classes.

iv) Cumulative frequency graph: Ogive

The graph is drawn by plotting the upper limit of each class against the cumulative frequency, cumulative relative frequency or cumulative percentage frequency. From the curve we can read off, for example, the proportion of companies that had their stock prices between 15 and 22.

We may also want to know the number of companies whose stock prices were 27 and below.

Locate 27 along the price axis and draw a vertical line to meet the curve. Then draw a horizontal line to read the cumulative frequency = 21. Therefore 21 companies had their stock prices at 27 and below, and 4 companies had their stock prices above 27.

The cumulative frequency curve: ogive

Lorenz curve: It is a special ogive that can be used to plot either the income or the wealth of a country against the population. It shows how wealth is distributed in a given country. Many Lorenz curves form a long S, showing some level of unequal distribution of wealth among citizens.

Tax policies can be used to level out the inequality by charging higher tax rates for the wealthier and lower rates for the less wealthy population. For a perfectly equal distribution the long S shape becomes a straight line. This is an ideal situation, but the more equitably wealth is distributed, the nearer the shape is to a straight line.

The curve is drawn with the cumulative percentage of the population on the vertical axis and the amounts of wealth or income on the horizontal axis.

3. NUMERICAL DESCRIPTIVE MEASURES AND ANALYSIS

The descriptive measures can be classified as:

- Measures of central tendency: mode, median and mean

- Measures of dispersion or spread: range, variance, standard deviation, semi-interquartile range, coefficient of variation etc.

Measures of Central Tendency

These are summary measures that give averages. The measures of central tendency can be calculated for discrete (ungrouped) or continuous (grouped) data.

Discrete data:

a. Mode: this is the most popular or common item in the data set. It is the value with the highest frequency. A data set can be unimodal (one mode), bimodal (two modes) or multimodal. Example:

29 31 35 39 39 40 43 44 44

The above set is bimodal, with modes 39 and 44.

b. Median: it is the value of the middle term in a data set that has been ranked in ascending or descending order. The position of the median is identified as:

\[ \text{Median position} = \frac{N+1}{2} \]

where N is the total frequency.

The median in the above data is at position (9 + 1)/2 = 5th, which is 39.

Example 1

23 36 210 249 257 506 385 13 50 97 234 275

Find the median

Solution

Arrange the data in ascending order

13 23 36 50 97 210 234 249 257 275 385 506

Middle position = (n + 1)/2 = 13/2 = 6.5

The median lies between the 6th and 7th positions: (210 + 234)/2 = 222

The advantage of using the median as a measure of central tendency is that it is not influenced by outliers. It is preferred to the mean for data sets that contain outliers.

Outliers are the few figures in the data that have extreme values compared with the rest: either very low or very high.
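The mode and median of the two data sets above can be checked with Python's standard `statistics` module (`multimode` requires Python 3.8 or later). The helper `median_position` is an illustrative name for the (N + 1)/2 rule; the second list is the ordered data of Example 1.

```python
import statistics

first_set = [29, 31, 35, 39, 39, 40, 43, 44, 44]
ordered_example = [13, 23, 36, 50, 97, 210, 234, 249, 257, 275, 385, 506]


def median_position(n):
    """Position of the median in ranked data, using the (N + 1) / 2 rule."""
    return (n + 1) / 2


modes = statistics.multimode(first_set)          # every value with the highest frequency
median_first = statistics.median(first_set)      # odd count: the middle value
median_example = statistics.median(ordered_example)  # even count: mean of the two middle values

print("Modes:", modes)
print("Median position:", median_position(len(first_set)), "-> median", median_first)
print("Median position:", median_position(len(ordered_example)), "-> median", median_example)
```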

c. Mean (arithmetic mean)

It is the most frequently used measure of central tendency. It is the sum of all values divided by the total frequency (the number of values). The mean is preferred in that it represents the whole data set from which it is computed.

For sample data:

\[ \bar{x} = \frac{\sum x}{n} \]

where n = sample size.

For population data:

\[ \mu = \frac{\sum X}{N} \]

where N = population size.

Example 2

The following data give the profits (in thousand dollars) of a sample of five companies in a given year.

4725 1884 3807 4939 and 1625

\[ \bar{x} = \frac{\sum x}{n} = \frac{16980}{5} = 3396 \]

The average profit of the 5 companies is $3,396,000. A major shortcoming of the mean is that it is very sensitive to outliers.

Example 3

The following data give the number of years eight employees have been with their current employers

11 9 13 12 8 9 24 10

a) Identify the outlier.

b) What would be the mean if the outlier was i) excluded ii) included

Solution

a) The outlier is 24, which is an extreme number of years compared with the other employees.

b) i) Mean excluding 24:

\[ \frac{11+9+13+12+8+9+10}{7} = \frac{72}{7} = 10.286 \]

ii) Mean including 24:

\[ \frac{11+9+13+12+8+9+10+24}{8} = \frac{96}{8} = 12 \]

The one extreme value changes the mean by almost 2 units, i.e. from 10.286 to 12 (a difference of 1.714).

The mean is very sensitive to outliers. For example the mean mark of a BCM 307 test can easily be affected by a few very poorly performing students or a few very well performing students. The mean may then not accurately represent the whole class.
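The effect of the outlier can be reproduced in a few lines. This sketch compares the mean and the median of the years-of-service data with and without the value 24; the comparison with the median is added here only to illustrate the earlier remark that the median is not influenced by outliers.

```python
import statistics

years = [11, 9, 13, 12, 8, 9, 24, 10]
outlier = 24
without_outlier = [y for y in years if y != outlier]

mean_with = statistics.mean(years)
mean_without = statistics.mean(without_outlier)
median_with = statistics.median(years)
median_without = statistics.median(without_outlier)

print("Mean including 24:  ", mean_with)
print("Mean excluding 24:  ", round(mean_without, 3))
print("Change in the mean: ", round(mean_with - mean_without, 3))
print("Median including 24:", median_with)
print("Median excluding 24:", median_without)
```

The mean moves by about 1.7 years when the outlier is included, while the median moves by only half a year.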

Example 4

The mean of 60, 80, 90, 120:

\[ \frac{60+80+90+120}{4} = \frac{350}{4} = 87.5 \]

The arithmetic mean is very useful because it represents the values of most observations in the population.

The mean therefore describes the population quite well in terms of the magnitudes attained by most of the members of the population.

Measures of Dispersion

Discrete data

These are statistics or measures that show how data is dispersed. The measures include

Range

Inter-quartile range

Variance

Standard deviation

Range: The difference between the highest and the lowest value.

Example 1

The example on the number of years the employees have stayed with the employer.

11 9 13 12 8 9 24 10

Range: 24-8 = 16

The range is influenced by outliers as it is based on only two values. Its disadvantage is that it ignores the rest of the values in a data set, so it is not a satisfactory measure of dispersion.

Inter Quartile Range: The difference between the upper quartile Q3 and the lower quartile Q1. It contains the middle 50% of the data.

Example 1

Arrange the data in order

8 9 9 10 11 12 13 24

1st Quartile (Q1): position = (1/4) × 8 = 2nd; the 2nd value = 9

Q1 = 9

3rd Quartile (Q3): position = (3/4) × 8 = 6th; the 6th value = 12

Q3 = 12

Inter quartile range: 12 − 9 = 3.

Example 2

The following is a set of discrete data:

2, 5, 8, 10, 11, 14, 17, 20

Required:

(i) Find the 30th percentile

(ii) The quartiles.

Solution

With n = 8 values, the position of a percentile is p(n + 1). A fractional position is resolved by interpolating between the two neighbouring values.

30th percentile

Position = 0.3(n + 1) = 0.3(9) = 2.7

30th percentile = 5 + 0.7(8 - 5) = 5 + 2.1 = 7.1

Lower Quartile (25th percentile)

Position = 0.25(n + 1) = 0.25(9) = 2.25

Q1 = 5 + 0.25(8 - 5) = 5 + 0.75 = 5.75

Median (50th percentile)

Position = 0.5(n + 1) = 0.5(9) = 4.5

Median: Q2 = 10 + 0.5(11 - 10) = 10.5

Upper Quartile (75th percentile)

Position = 0.75(n + 1) = 0.75(9) = 6.75

Q3 = 14 + 0.75(17 - 14) = 16.25

Inter-quartile range

IQR = Q3 - Q1 = 16.25 - 5.75 = 10.50
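The position-and-interpolation method used in this example can be sketched in Python as follows. The function name `percentile` is hypothetical, and the handling of positions before the first or after the last value is an assumption, since the example does not cover those cases.

```python
def percentile(values, p):
    """Value at position p*(n+1) in the sorted data, interpolating linearly."""
    data = sorted(values)
    n = len(data)
    position = p * (n + 1)       # 1-based position, may be fractional
    lower = int(position)        # whole part of the position
    fraction = position - lower  # fractional part of the position
    if lower < 1:                # assumed: clamp to the smallest value
        return data[0]
    if lower >= n:               # assumed: clamp to the largest value
        return data[-1]
    # Move the fractional distance from the lower value towards the next one
    return data[lower - 1] + fraction * (data[lower] - data[lower - 1])

data = [2, 5, 8, 10, 11, 14, 17, 20]
p30 = percentile(data, 0.30)
q1 = percentile(data, 0.25)
q2 = percentile(data, 0.50)
q3 = percentile(data, 0.75)
iqr = q3 - q1

print(round(p30, 2), q1, q2, q3, iqr)
```

Other textbooks and software use slightly different position rules, so results for small data sets may differ a little from this method.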

Example 2 (Grouped Data)

The following table shows the levels of retirement benefits given to a group of workers in a given establishment.

Retirement benefits ('000)   No. of retirees (f)   Upper class limit   cf
20 - 29                      50                    29.5                50
30 - 39                      69                    39.5                119
40 - 49                      70                    49.5                189
50 - 59                      90                    59.5                279
60 - 69                      52                    69.5                331
70 - 79                      40                    79.5                371
80 - 89                      11                    89.5                382



Required

i. Determine the semi interquartile range for the above data

ii. Determine the minimum value for the top ten per cent.(10%)

iii. Determine the maximum value for the lower 40% of the retirees

Solution

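i. The lower quartile (Q1) lies at position

(N + 1)/4 = (382 + 1)/4 = 95.75

This position falls in the class 30 - 39 (lower boundary 29.5, cumulative frequency before the class 50, frequency 69, class width 10), so the value of

Q1 = 29.5 + ((95.75 - 50)/69) x 10

= 29.5 + 6.63

= 36.13

The upper quartile (Q3) lies at position

3(N + 1)/4 = 3(382 + 1)/4 = 287.25

This position falls in the class 60 - 69 (lower boundary 59.5, cumulative frequency before the class 279, frequency 52), so the value of

Q3 = 59.5 + ((287.25 - 279)/52) x 10

= 59.5 + 1.59

= 61.09

The semi-interquartile range = (Q3 - Q1)/2

= (61.09 - 36.13)/2

= 12.48 (in thousands), i.e. about 12,480

ii. The minimum value for the top 10% is the value below which the lower 90% of the retirees fall (the 90th percentile).

The position corresponding to the lower 90%

= (90/100)(N + 1) = 0.9(382 + 1)

= 0.9 x 383

= 344.7

This position falls in the class 70 - 79. The benefit (value) corresponding to the minimum value for the top 10%

= 69.5 + ((344.7 - 331)/40) x 10

= 72.925 (in thousands)

= 72,925

iii. The lower 40% corresponds to position

(40/100)(382 + 1)

= 153.20

This position falls in the class 40 - 49. The retirement benefit corresponding to this position

= 39.5 + ((153.2 - 119)/70) x 10

= 39.5 + 4.89

= 44.39 (in thousands), i.e. about 44,390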
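The interpolation within a class used for the quartiles and percentiles above can be sketched in Python. The function name `grouped_percentile` is hypothetical, and the lower boundary 19.5 of the first class is an assumption derived from the class width of 10.

```python
# (lower class boundary, frequency) for each class; width is 10 throughout.
# The boundary 19.5 for the first class is assumed from the class width.
classes = [(19.5, 50), (29.5, 69), (39.5, 70), (49.5, 90),
           (59.5, 52), (69.5, 40), (79.5, 11)]
WIDTH = 10

def grouped_percentile(classes, p, width=WIDTH):
    """Value at position p*(N+1), interpolated inside the class holding it."""
    total = sum(f for _, f in classes)
    position = p * (total + 1)
    cumulative = 0  # cumulative frequency before the current class
    for lower, freq in classes:
        if position <= cumulative + freq:
            return lower + (position - cumulative) / freq * width
        cumulative += freq
    # Position beyond the last class: return the top boundary
    return classes[-1][0] + width

q1 = grouped_percentile(classes, 0.25)
q3 = grouped_percentile(classes, 0.75)
semi_iqr = (q3 - q1) / 2
p90 = grouped_percentile(classes, 0.90)  # minimum for the top 10%
p40 = grouped_percentile(classes, 0.40)  # maximum for the lower 40%

print(round(q1, 2), round(q3, 2), round(semi_iqr, 2), round(p90, 3), round(p40, 2))
```

All results are in thousands, as in the table.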

e. The 10th - 90th percentile range

This is a measure of dispersion which uses percentiles. A percentile is a value which separates one division from the next when a given set of data is divided into 100 equal divisions. The range is the difference between the 90th and the 10th percentiles.

This measure of dispersion is very important when calculating the coefficient of skewness.

Variance: Variance is the square of the standard deviation. Formula for population data:

$\sigma^2 = \frac{\sum (x - \mu)^2}{N}$

where x: the values in the data

N: size of the population

$\mu$: the mean

Standard Deviation: It measures how far, on average, the values of a variable deviate from the mean. The deviation of each value from the mean is squared, and the sum of all the squared deviations is divided by the total frequency (N) for population data, or by the sample size less 1 (n - 1) if sample data were used. The square root of the result is then taken.

Formula for calculation:

Population data

$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum (x - \mu)^2}{N}}$

Example 1

Assume the data on the number of years employees remained with the employer were collected from a sample:

11 9 13 12 8 9 24 10

Variance $S^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$

Mean = $\bar{x}$ = 12 (obtained earlier)

$S^2 = \frac{(11-12)^2 + (9-12)^2 + (13-12)^2 + (12-12)^2 + (8-12)^2 + (9-12)^2 + (24-12)^2 + (10-12)^2}{8 - 1}$

$S^2 = \frac{1 + 9 + 1 + 0 + 16 + 9 + 144 + 4}{7} = \frac{184}{7} = 26.286$

On average, the squared deviation of each value from the mean is 26.286.

Standard Deviation: Square root of variance

$S = \sqrt{26.286} = 5.127$

On average, each value deviates from the mean by about 5.127 years.

In general, a lower standard deviation means the values in a data set lie close to the mean and to one another, while a higher standard deviation indicates that the values are relatively spread out or scattered.

If the standard deviation of the scores obtained by students in a BCM 307 class is higher than that of the scores obtained in a different class, it means the abilities of the BCM 307 students are more spread out. Some perform very poorly while others perform well.

If the data set is larger, the working can be done from a frequency distribution table.
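The sample variance and standard deviation for the years-of-service data in Example 1 can be reproduced step by step in Python. This is a minimal sketch; the cross-check uses `variance` and `stdev` from the standard `statistics` module, which divide by n - 1.

```python
from statistics import mean, variance, stdev

# Years with current employer, treated as a sample
years = [11, 9, 13, 12, 8, 9, 24, 10]

x_bar = mean(years)
squared_deviations = [(x - x_bar) ** 2 for x in years]
sum_of_squares = sum(squared_deviations)

# Sample formulas: divide by n - 1, then take the square root
sample_variance = sum_of_squares / (len(years) - 1)
sample_sd = sample_variance ** 0.5

# Cross-check against the library functions
assert abs(sample_variance - variance(years)) < 1e-9
assert abs(sample_sd - stdev(years)) < 1e-9

print(sum_of_squares, round(sample_variance, 3), round(sample_sd, 3))
```

Listing the squared deviations one by one makes it easy to confirm that every observation has been included in the sum.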

Example 2

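A sample comprises the following observations: 14, 18, 17, 16, 25, 31

Determine the standard deviation of this sample.

$\bar{x} = \frac{\sum x}{n} = \frac{121}{6} = 20.17$

x       $x - \bar{x}$   $(x - \bar{x})^2$
14      -6.17           38.03
18      -2.17           4.69
17      -3.17           10.03
16      -4.17           17.36
25      4.83            23.36
31      10.83           117.36
121                     210.83

Standard deviation, $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n}} = \sqrt{\frac{210.83}{6}} = 5.93$

Here the sum of squares is divided by n = 6. If the sample formula with n - 1 is used instead, $s = \sqrt{210.83/5} = 6.49$.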

Example 3

The data represent the number of bedrooms in homes owned by 30 families:

3 5 2 3 2 3 1 2 1 3

4 1 4 3 1 3 3 2 2 3

3 4 3 1 2 4 2 2 5 3

Required:

a) Identify the mode.

b) Calculate the

i) mean

ii) variance and standard deviation

Solution

Construct a frequency distribution table.

Number of rooms (x)   Frequency (f)   fx    $x - \bar{x}$   $f(x - \bar{x})^2$
1                     5               5     -1.67           13.9445
2                     8               16    -0.67           3.5912
3                     11              33    0.33            1.1979
4                     4               16    1.33            7.0756
5                     2               10    2.33            10.8578
Total                 30              80                    36.667

$\sum f = 30$, $\sum fx = 80$, $\sum f(x - \bar{x})^2 = 36.667$

a) The mode is 3 bedrooms (the highest frequency, 11).

b) i) Mean: $\bar{x} = \frac{\sum fx}{\sum f} = \frac{80}{30} = 2.67$

ii) Variance: $S^2 = \frac{\sum f(x - \bar{x})^2}{n - 1} = \frac{36.667}{30 - 1} = 1.264$

Standard deviation: $S = \sqrt{1.264} = 1.124$

On average, each value differs from the mean by about 1.124 bedrooms.

Coefficient of Variation

The variances or standard deviations of different data sets are not easy to compare directly. The coefficient of variation makes it possible to compare different data sets by relating a measure of dispersion (normally the standard deviation) to a measure of central tendency (normally the mean).

Coefficient of variation: $CV = \frac{\text{standard deviation}}{\text{mean}}$

In the above example: $CV = \frac{1.124}{2.67} = 0.421$

CV can also be written as a percentage: $CV = \frac{1.124}{2.67} \times 100 = 0.421 \times 100 = 42.1\%$

The lower the CV, the smaller the spread of the values relative to the mean, i.e. the values are closer together.
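The frequency-table working for the bedroom data, including the coefficient of variation, can be sketched in Python. The variable names are illustrative, and the sample formula (division by n - 1) is used as in the worked solution.

```python
from collections import Counter

# Number of bedrooms in the homes of 30 families
bedrooms = [3, 5, 2, 3, 2, 3, 1, 2, 1, 3,
            4, 1, 4, 3, 1, 3, 3, 2, 2, 3,
            3, 4, 3, 1, 2, 4, 2, 2, 5, 3]

freq = Counter(bedrooms)           # maps each value x to its frequency f
n = sum(freq.values())
mode = max(freq, key=freq.get)     # value with the highest frequency

mean = sum(x * f for x, f in freq.items()) / n
sum_sq = sum(f * (x - mean) ** 2 for x, f in freq.items())
variance = sum_sq / (n - 1)        # sample formula
sd = variance ** 0.5
cv = sd / mean                     # coefficient of variation

print(sorted(freq.items()))
print(mode, round(mean, 2), round(variance, 3), round(sd, 3), round(cv, 3))
```

Because the code keeps the unrounded mean, the coefficient of variation differs in the third decimal place from a hand calculation that uses the rounded mean 2.67.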

Measures of Central Tendency and Measures of Dispersion for Continuous Data

Example 1

The table gives the frequency distribution of the daily commuting time from home to work for all employees of a company.

Time (minutes)        Number of employees
0 to less than 10     4
10 to less than 20    9
20 to less than 30    6
30 to less than 40    4
40 to less than 50    2
Total                 25

Solution:

The measures are computed as for discrete data, with the value of x taken as the mid-point of each class:

x = (sum of the class boundaries)/2, e.g. (0 + 10)/2 = 5 is the mid-point of the 1st class

Time                  Mid-point (x)   Frequency (f)   fx     $(x - \mu)^2 f$
0 to less than 10     5               4               20     1075.84
10 to less than 20    15              9               135    368.64
20 to less than 30    25              6               150    77.76
30 to less than 40    35              4               140    739.84
40 to less than 50    45              2               90     1113.92
Total                                 25              535    3376.00

Because the data come from a population, the mean is denoted by $\mu$ instead of $\bar{x}$. Whether the column is headed $(x - \bar{x})^2$ or $(x - \mu)^2$ makes no difference to the values.

Mean = $\mu = \frac{\sum fx}{N} = \frac{535}{25} = 21.4$

Standard deviation = $\sigma = \sqrt{\frac{\sum (x - \mu)^2 f}{N}} = \sqrt{\frac{3376}{25}} = \sqrt{135.04} = 11.621$

NB: For continuous data the mode is replaced by the modal class, which is simply the class with the highest frequency. For the above example the modal class is 10 to less than 20.
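The mid-point method for the commuting-time data can be sketched in Python. The variable names are illustrative, and the population formula (division by N) is used because the data cover all employees.

```python
# (lower bound, upper bound, frequency) for each class of commuting time
classes = [(0, 10, 4), (10, 20, 9), (20, 30, 6), (30, 40, 4), (40, 50, 2)]

midpoints = [(lo + hi) / 2 for lo, hi, _ in classes]
freqs = [f for _, _, f in classes]
N = sum(freqs)

# Mean from mid-points weighted by frequency
mean = sum(x * f for x, f in zip(midpoints, freqs)) / N

# Population variance and standard deviation (divide by N)
sum_sq = sum(f * (x - mean) ** 2 for x, f in zip(midpoints, freqs))
variance = sum_sq / N
sd = variance ** 0.5

# Modal class: the class with the highest frequency
modal_class = max(classes, key=lambda c: c[2])

print(mean, round(sum_sq, 2), round(variance, 2), round(sd, 3), modal_class[:2])
```

Summing the last column in code is a quick check on the column total of a hand-built table.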

Practice Questions:

1. The following data represent the age of a sample of 10 employees of a given

company

39 29 43 52 39 44 40 31 44 35

Required:

i) Identify the mode and the median

ii) Compute the mean

iii) Compute the standard deviation

iv) Compute the coefficient of variation

2. The data give the frequency distribution of the number of orders received each day during the past 50 days at the office of a mail order company.

Number of orders   Number of days
10 - 12            4
13 - 15            12
16 - 18            20
19 - 21            14

a) Identify the modal class

b) Calculate

i) mean

ii) variance and standard deviation

iii) coefficient of variation

3. The price of the ordinary 25p shares of Manco PLC quoted on the stock exchange at the close of business on successive Fridays is tabulated below:

126 120 122 105 129 119 131 138

125 127 113 112 130 122 134 136

128 126 117 114 120 123 127 140

124 127 114 111 116 131 128 137

127 122 106 121 116 135 142 130

Required

a) Group the above data into eight classes.

b) Calculate the cumulative frequency, the median value, the quartile values and the semi-interquartile range.

c) Calculate the mean and standard deviation of your frequency distribution.

d) Compute :

i) The median and mean

ii) The semi-interquartile range and the standard deviation

5. The managers of an import agency are investigating the length of time that customers take to pay their invoices, the normal terms for which are 30 days net. They have checked the payment record of 100 customers chosen at random and have compiled the following table:

Payment in        Number of customers
5 to 9 days       4
10 to 14 days     10
15 to 19 days     17
20 to 24 days     20
25 to 29 days     22
30 to 34 days     16
35 to 39 days     8
40 to 44 days     3

Required:

a) Calculate the arithmetic mean.

b) Calculate the standard deviation

c) Construct a histogram and insert the modal value.

d) Estimate the probability that an unpaid invoice chosen at random will be between

30 and 39 days old.

4. PROBABILITY DISTRIBUTIONS

A probability distribution can be either discrete or continuous. The distribution can also take a uniform, normal or skewed shape.

For numerical data in any distribution (discrete, continuous or probability), the mean and standard deviation can be used to find the proportion or percentage of the total observations that fall within a given interval about the mean.

The pattern of any distribution of data values throughout the entire range of all values gives a certain shape. The shape can be identified from a bar chart for discrete data or a histogram for continuous data. The shape of the distribution can be

i) Uniform

ii) Bell-shaped, that is, symmetrical

iii) Skewed

i) Uniform or rectangular

ii) Symmetrical (bell-shaped)

For a symmetrical continuous distribution the measures of central tendency (mode, median and mean) are equal, and the value is at the middle of the shape. A symmetrical bell-shaped distribution of this kind is called the normal distribution or Gaussian distribution.

a) DISCRETE DATA

A probability distribution for a discrete random variable is a mutually exclusive listing of all the possible numerical outcomes together with the probability of occurrence of each outcome.

Example 1

The following table contains the probability distribution for the number of traffic accidents daily in a small city.

Number of accidents (x)   Probability P(x)
0                         0.10
1                         0.20
2                         0.45
3                         0.15
4                         0.05
5                         0.05

Required: Compute:

a) The expected number of accidents

b) The variance and standard deviation

c) The coefficient of variation

Solution

Probability is a term that reflects uncertainty. It is used to make predictions by assigning to an event the probability of its happening.

The mean or average of such a distribution is referred to as the expected value E(x):

$E(x) = \sum x_i P(x_i)$

where $x_i$ is a value of the variable and $P(x_i)$ is the probability that $x_i$ will occur.

The variance: $\sigma^2 = \sum (x_i - E(x))^2 P(x_i)$

Number of accidents ($x_i$)   $P(x_i)$   $x_i P(x_i)$   $(x_i - E(x))^2 P(x_i)$
0                             0.10       0.00           0.40
1                             0.20       0.20           0.20
2                             0.45       0.90           0.00
3                             0.15       0.45           0.15
4                             0.05       0.20           0.20
5                             0.05       0.25           0.45
Total                         1.00       2.00           1.40

$\sum P(x_i) = 1.00$, $\sum x_i P(x_i) = 2.00$, $\sum (x_i - E(x))^2 P(x_i) = 1.4$

i) Expected value E(x) = 2.00

ii) Variance = $\sigma^2 = \sum (x_i - E(x))^2 P(x_i) = 1.4$

iii) Standard deviation = $\sigma = \sqrt{\sigma^2} = \sqrt{1.4} = 1.1832$

iv) Coefficient of variation: $CV = \frac{1.1832}{2} = 0.592$ (59.2%)
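The expected value, variance, standard deviation and coefficient of variation of a discrete probability distribution can be computed with a small Python sketch. The function name `summarize` is hypothetical, and the distribution is entered as a dictionary of value-to-probability pairs.

```python
def summarize(distribution):
    """Return (expected value, variance, standard deviation, CV)."""
    # A valid distribution must have probabilities that sum to 1
    assert abs(sum(distribution.values()) - 1.0) < 1e-9
    expected = sum(x * p for x, p in distribution.items())
    variance = sum((x - expected) ** 2 * p for x, p in distribution.items())
    sd = variance ** 0.5
    cv = sd / expected
    return expected, variance, sd, cv

# Daily number of traffic accidents and their probabilities
accidents = {0: 0.10, 1: 0.20, 2: 0.45, 3: 0.15, 4: 0.05, 5: 0.05}
expected, variance, sd, cv = summarize(accidents)

print(round(expected, 2), round(variance, 2), round(sd, 4), round(cv, 3))
```

The same function can be reused for any other discrete distribution by passing a different dictionary.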

Example 2

Given the following probability distributions A and B

Distribution A        Distribution B
x     P(x)            x     P(x)
0     0.25            0     0.15
1     0.25            1     0.25
2     0.25            2     0.45
3     0.25            3     0.15

a) Compute:

i) The expected value for each distribution

ii) The standard deviation for each distribution

iii) Compare the results of distribution A and B.

Distribution A

x       P(x)    x P(x)   $[x - E(x)]^2 P(x)$
0       0.25    0.00     0.5625
1       0.25    0.25     0.0625
2       0.25    0.50     0.0625
3       0.25    0.75     0.5625
Total   1.00    1.50     1.2500

Distribution B

x       P(x)    x P(x)   $[x - E(x)]^2 P(x)$
0       0.15    0.00     0.384
1       0.25    0.25     0.090
2       0.45    0.90     0.072
3       0.15    0.45     0.294
Total   1.00    1.60     0.840

Distribution A: E(x) = 1.5, $\sigma^2 = 1.25$, $\sigma = \sqrt{1.25} = 1.118$

Distribution B: E(x) = 1.6, $\sigma^2 = 0.84$, $\sigma = \sqrt{0.84} = 0.917$

Distribution A is uniform and symmetric. Distribution B has one mode, at x = 2, and a smaller variance, so its values are more concentrated around the expected value.

b) CONTINUOUS DISTRIBUTION: NORMAL DISTRIBUTION

A frequency distribution for continuous data can be converted to a probability distribution by calculating the relative frequency for each class. This column is taken as the equivalent of the probabilities for each class.

Like the sum of the relative frequencies, the total probability is equal to 1, i.e. $\sum P(x) = 1$

The normal distribution is the most common continuous distribution used in statistics, for the following main reasons.

Numerous continuous variables common in business and other natural occurrences have distributions that closely resemble the normal distribution.

The normal distribution can be used to approximate various discrete probability distributions.

It provides the basis for classical statistical inference.

The normal distribution is represented by the classical bell shape. Probabilities are calculated from its probability density function, which is denoted by the symbol f(x).

The mean ($\mu$) is in the middle of the symmetrical distribution. The standard deviation ($\sigma$) is the unit in which the distance from the mean to a point on the x (horizontal) axis is measured. In order to work with a set of standard values it is necessary to convert or transform any normal distribution to the standard normal distribution, which has a mean of 0 and a standard deviation of 1.

The total area under the curve is 1, and each half of the curve has an area of 0.5. Any value of x in a distribution can be converted to a value called the z value or z-score by the formula:

$Z = \frac{x - \mu}{\sigma}$

where x is the value of the variable

$\mu$ is the mean

$\sigma$ is the standard deviation

The area (probability) corresponding to a Z value is obtained from the standard normal probability tables. It corresponds to the shaded area under the normal curve.
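The conversion to a z-score and the table look-up can be sketched in Python. The function names are hypothetical; the area between 0 and z is computed from the error function `math.erf` instead of being read from printed tables, and the figures (mean 170 cm, standard deviation 10 cm) are those of the height example that follows.

```python
import math

def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean."""
    return (x - mu) / sigma

def area_zero_to_z(z):
    """Area under the standard normal curve between 0 and z,
    i.e. the quantity tabulated in standard normal tables."""
    return 0.5 * math.erf(abs(z) / math.sqrt(2))

mu, sigma = 170, 10  # height example: mean and standard deviation in cm

z_180 = z_score(180, mu, sigma)
z_190 = z_score(190, mu, sigma)

between_180_190 = area_zero_to_z(z_190) - area_zero_to_z(z_180)
taller_than_190 = 0.5 - area_zero_to_z(z_190)

print(z_180, z_190)
print(round(area_zero_to_z(z_180), 4), round(area_zero_to_z(z_190), 4))
print(round(between_180_190, 4), round(taller_than_190, 4))
```

Because each half of the curve has area 0.5, a tail area is found by subtracting the tabulated area from 0.5.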

Example 1

The heights of adult males are normally distributed with mean 170 cm and standard deviation 10 cm.

Find the probability that the height of an adult male is:

a) Between 180 cm and 190 cm

b) Taller than 190 cm

c) Shorter than 180 cm

d) Shorter than 165 cm

Solution

The heights are normally distributed with

$\mu = 170$, $\sigma = 10$

x = 170 is the mean, at the middle of the curve.

Use the formula to find Z (the number of standard deviations from the mean), then find the area (probability).

a) The height of an adult is between 180 cm and 190 cm

$Z = \frac{x - \mu}{\sigma} = \frac{180 - 170}{10} = 1$ (180 is 1 standard deviation above the mean)

$Z = \frac{190 - 170}{10} = 2$ (190 is 2 standard deviations above the mean)

Find the areas in the normal tables, P(z):

Z     Area under the curve between 0 and Z
1     0.3413
2     0.4772

P(180 < x < 190) = P(1 < z < 2) = 0.4772 - 0.3413 = 0.1359

b) Taller than 190 cm: P(x > 190)

$Z = \frac{190 - 170}{10} = 2$

For Z = 2, P(z) or area = 0.4772

P(z > 2) = 0.5 - 0.4772 = 0.0228

c) Shorter than 180 cm

$Z = \frac{180 - 170}{10} = 1$

P(x < 180) = P(z < 1) = 0.5 + 0.3413 = 0.8413


P(-0.59).

Solution

(i) P(2 < X < 5) = P(-0.33 < Z < 0.67)

= .3779.

(ii) P(X > 0) = P(Z > -1) = P(Z < 1)

= .8413.

(iii) P(X > 9) = P(Z > 2.0)

= 0.5 - 0.4772 = .0228

Example 3

A sample of students had a mean age of 35 years with a standard deviation of 5 years. A student was randomly picked from a group of 200 students. Find the probability that the age of the student turned out to be as follows:

i. Lying between 35 and 40

ii. Lying between 30 and 40

iii. Lying between 25 and 30

iv. Lying beyond 45 yrs

v. Lying beyond 30 yrs

vi. Lying below 25 years

Solution

(i) The standardized value for 35 years:

$Z = \frac{x - \mu}{\sigma} = \frac{35 - 35}{5} = 0$

The standardized value for 40 years:

$Z = \frac{x - \mu}{\sigma} = \frac{40 - 35}{5} = 1$

The values from the standard normal curve tables (see appendix) are:

When z = 0, p = 0

And when z = 1, p = 0.3413

The area under the curve between z = 0 and z = 1 is

0.3413 - 0 = 0.3413

The probability of the age lying between 35 and 40 yrs is 0.3413.

(ii) 30 and 40 years

$Z = \frac{30 - 35}{5} = \frac{-5}{5} = -1$

$Z = \frac{40 - 35}{5} = 1$

The area between Z = -1 and Z = 1 is

0.3413 (lying on the negative side of zero) + 0.3413 (lying on the positive side of zero)

P = 0.6826

The probability of the age lying between 30 and 40 yrs is 0.6826.

(iii) 25 and 30 years

$Z = \frac{25 - 35}{5} = \frac{-10}{5} = -2$

$Z = \frac{30 - 35}{5} = -1$

The area required lies between Z = -2 and Z = -1.

Probability area corresponding to Z = -2

= 0.4772 (the z value to check from the tables is 2)

Probability area corresponding to Z = -1

= 0.3413 (the z value for this case is 1)

The probability that the age lies between 25 and 30 yrs

= 0.4772 - 0.3413 (the area under this part of the curve)

P(Z) = 0.1359

(iv) P(beyond 45 years) is determined as follows: P(x > 45)

$Z = \frac{45 - 35}{5} = \frac{10}{5} = +2$

The area corresponding to Z = 2 is 0.4772, which is the probability of an age between 35 and 45.

P(Age > 45 yrs) = 0.5000 - 0.4772

= 0.0228
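The probabilities in Example 3 can be cross-checked with `NormalDist` from Python's standard `statistics` module, which works with the cumulative area to the left of a value instead of the area between 0 and z. This is a sketch that assumes the ages are normally distributed with mean 35 and standard deviation 5; the last two lines cover parts v and vi of the question.

```python
from statistics import NormalDist

ages = NormalDist(mu=35, sigma=5)  # assumed normal model for the ages

# cdf(x) is the area to the left of x, so differences give interval areas
p_35_to_40 = ages.cdf(40) - ages.cdf(35)
p_30_to_40 = ages.cdf(40) - ages.cdf(30)
p_25_to_30 = ages.cdf(30) - ages.cdf(25)
p_beyond_45 = 1 - ages.cdf(45)
p_beyond_30 = 1 - ages.cdf(30)
p_below_25 = ages.cdf(25)

for label, p in [("35-40", p_35_to_40), ("30-40", p_30_to_40),
                 ("25-30", p_25_to_30), (">45", p_beyond_45),
                 (">30", p_beyond_30), ("<25", p_below_25)]:
    print(label, round(p, 4))
```

Small differences in the fourth decimal place compared with table-based answers come from the rounding of the printed table values.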

Practice Questions

1. Identify the following as discrete or continuous random variables.

(i) The market value of a publicly listed security on a given day

(ii) The number of printing errors observed in an article in a weekly news magazine

(iii) The time to assemble a product (e.g. a chair)

(iv) The number of emergency cases arriving at a city hospital

(v) The number of sophomores in a randomly selected Math. class at a university

(vi) The rate of interest paid by your local bank on a given day

2. A random variable X has the following probability distribution:

X      1     2     3     4     5

P(x)   .05   .10   .15   .45   .25

(i) Verify that X has a valid probability distribution.

(ii) Find the probability that X is greater than 3, i.e. P(X > 3).

(iii) Find the probability that X is greater than or equal to 3, i.e. P(X ≥ 3).

(iv) Find the probability that X is less than or equal to 2, i.e. P(X ≤ 2).

(v) Find the probability that X is an odd number.

(vi) Graph the probability distribution for X.

3. Calculate the area under the standard normal curve between the following values.

(i) Z = 0 and z = 1.6 (i.e. P(0 ≤ Z ≤ 1.6))

(ii) Z = 0 and z = -1.6 (i.e. P(-1.6 ≤ Z ≤ 0))

(iii) Z = .86 and z = 1.75 (i.e. P(.86 ≤ Z ≤ 1.75))

(iv) Z = -1.75 and z = -.86 (i.e. P(-1.75 ≤ Z ≤ -.86))

(v) Z = 1.26 and z = 1.86 (i.e. P(1.26 ≤ Z ≤ 1.86))

(vi) Z = -1.0 and z = 1.0 (i.e. P(-1.0 ≤ Z ≤ 1.0))

(vii) Z = -2.0 and z = 2.0 (i.e. P(-2.0 ≤ Z ≤ 2.0))

(viii) Z = -3.0 and z = 3.0 (i.e. P(-3.0 ≤ Z ≤ 3.0))

4. Let Z be a standard normal random variable. Find z0 such that

(i) P (Z z0) = 0.05

(ii) P (Z z0) = 0.99

(iii) P (Z z0) = 0.0708

(iv) P (Z z0) = 0.0708

(v) P(-z0 ≤ Z ≤ z0) = 0.68

(vi) P(-z0 ≤ Z ≤ z0) = 0.95

5. A normally distributed random variable X possesses a mean of μ = 10 and a standard deviation of σ = 5.

Find the following probabilities.

(i) X falls between 10 and 12 (i.e. P(10 ≤ X ≤ 12)).

(ii) X falls between 6 and 14 (i.e. P(6 ≤ X ≤ 14)).

(iii) X is less than 12 (i.e. P(X ≤ 12)).

(iv) X exceeds 10 (i.e. P(X ≥ 10)).

6. CONFIDENCE INTERVAL

The interval estimate or confidence interval consists of a range (an upper confidence limit and a lower confidence limit) within which we are confident that a population parameter lies, and we assign a probability that this interval contains the true population value.

The confidence interval is the interval between the confidence limits. The higher the confidence level, the wider the confidence interval.

For example

A normal distribution has the following characteristics:

i. Mean ± 1.960σ includes 95% of the population

ii. Mean ± 2.576σ includes 99% of the population

Large Samples

The Central Limit Theorem: The theorem states that if we select a large number of simple random samples of size n from any population and determine the mean of each sample, the distribution of these sample means will tend to be described by the normal probability distribution with mean $\mu$ and variance $\sigma^2/n$. This is true even if the population itself is not normally distributed. In other words, the sampling distribution of sample means approaches a normal distribution irrespective of the distribution of the population from which the samples are taken, and the approximation to the normal distribution becomes increasingly close as the sample size increases.

Large samples are those with a sample size greater than 30 (i.e. n > 30). Such samples can use levels of confidence based on the normal distribution.

Estimation of population mean

Here we assume that if we take a large sample from a population, then the mean of the population is very close to the mean of the sample.

The steps to follow to estimate the population mean are:

i. Take a random sample of n items, where n > 30; n is the sample size

ii. Compute the sample mean ($\bar{x}$) and standard deviation (s)

iii. Compute the standard error of the mean using the following formula

$S_{\bar{x}} = \frac{s}{\sqrt{n}}$

where $S_{\bar{x}}$ = standard error of the mean

s = standard deviation of the sample

n = sample size

iv. Choose a confidence level, e.g. 95% or 99%

v. Estimate the population mean as follows:

Population mean $\mu = \bar{x} \pm$ (appropriate number) $\times S_{\bar{x}}$

The appropriate number depends on the confidence level, e.g. at the 95% confidence level it is 1.96. This number is usually denoted by Z and is obtained from the normal tables. It is the z value that corresponds to the chosen confidence level expressed as a probability.

Example 1

The quality department of a wire manufacturing company periodically selects a sample of wire specimens in order to test for breaking strength. Past experience has shown that the breaking strengths of a certain type of wire are normally distributed with a standard deviation of 200 kg. A random sample of 64 specimens gave a mean of 6200 kg. Find the population mean at the 95% level of confidence.

Solution

Population mean $\mu = \bar{x} \pm 1.96 S_{\bar{x}}$

Note that the sample size is already n > 30 and that s and $\bar{x}$ are given, so steps i), ii) and iv) are already provided.

Here: $\bar{x}$ = 6200 kg

$S_{\bar{x}} = \frac{s}{\sqrt{n}} = \frac{200}{\sqrt{64}} = 25$

Population mean = 6200 ± 1.96(25)

= 6200 ± 49

= 6151 to 6249

At the 95% level of confidence, the population mean lies between 6151 kg and 6249 kg.
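The estimation steps for a large sample can be sketched in Python. The function name `mean_confidence_interval` is hypothetical; the z value is obtained from `NormalDist().inv_cdf` in the standard `statistics` module instead of from printed tables, so it carries more decimal places than 1.96.

```python
import math
from statistics import NormalDist

def mean_confidence_interval(x_bar, s, n, confidence):
    """Large-sample confidence interval for a population mean."""
    # Two-sided z value, e.g. about 1.96 for 95% confidence
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    standard_error = s / math.sqrt(n)
    margin = z * standard_error
    return x_bar - margin, x_bar + margin

# Wire breaking-strength example: mean 6200 kg, s = 200 kg, n = 64
low, high = mean_confidence_interval(6200, 200, 64, 0.95)
print(round(low), round(high))

# A higher confidence level gives a wider interval
low_99, high_99 = mean_confidence_interval(6200, 200, 64, 0.99)
print(round(low_99, 1), round(high_99, 1))
```

The second call illustrates that raising the confidence level from 95% to 99% widens the interval.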

Estimation of population proportions

This type of estimation applies when information cannot be given as a mean or as a measure but only as a fraction or percentage.

Sampling theory stipulates that if repeated large random samples are taken from a population, the sample proportion p will be normally distributed with mean equal to the population proportion and standard error equal to

$S_p = \sqrt{\frac{pq}{n}}$ = standard error for sampling of population proportions

where n is the sample size and q = 1 - p.

The procedure for estimating a proportion is similar to that for estimating a mean; only the formula for calculating the standard error is different.

Example 1

In a sample of 800 candidates, 560 were male. Estimate the population proportion at the 95% confidence level.

Solution

Here

Sample proportion $p = \frac{560}{800} = 0.70$

q = 1 - p = 1 - 0.70 = 0.30

n = 800

$S_p = \sqrt{\frac{pq}{n}} = \sqrt{\frac{(0.70)(0.30)}{800}} = 0.016$

Population proportion

= p ± 1.96 $S_p$, where 1.96 = Z

= 0.70 ± 1.96(0.016)

= 0.70 ± 0.03

= 0.67 to 0.73

= between 67% and 73%

Example 2



A sample of 600 accounts was taken to test the accuracy of posting and balancing of accounts, wherein 45 mistakes were found. Find the population proportion. Use the 99% level of confidence.

Solution

Here

n = 600; $p = \frac{45}{600} = 0.075$

q = 1 - 0.075 = 0.925

$S_p = \sqrt{\frac{pq}{n}} = \sqrt{\frac{(0.075)(0.925)}{600}} = 0.011$

Population proportion

= p ± 2.58($S_p$)

= 0.075 ± 2.58(0.011)

= 0.075 ± 0.028

= 0.047 to 0.103

= between 4.7% and 10.3%
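Both proportion examples follow the same steps, which can be sketched in Python. The function name `proportion_interval` is hypothetical, and the z values 1.96 and 2.58 are the table values used in the worked solutions.

```python
import math

def proportion_interval(successes, n, z):
    """Return the sample proportion, its standard error and the interval."""
    p = successes / n
    q = 1 - p
    standard_error = math.sqrt(p * q / n)
    margin = z * standard_error
    return p, standard_error, (p - margin, p + margin)

# Example 1: 560 males in a sample of 800, 95% confidence (z = 1.96)
p1, se1, ci1 = proportion_interval(560, 800, 1.96)
print(p1, round(se1, 3), round(ci1[0], 2), round(ci1[1], 2))

# Example 2: 45 mistakes in 600 accounts, 99% confidence (z = 2.58)
p2, se2, ci2 = proportion_interval(45, 600, 2.58)
print(p2, round(se2, 3), round(ci2[0], 3), round(ci2[1], 3))
```

The code keeps the unrounded standard error, so the limits can differ in the last digit from a hand calculation that rounds the standard error first.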

Small Samples

Estimation of population mean

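If the sample size is small (n < 30), the population mean is estimated using the t distribution in place of the normal distribution:

$\mu = \bar{x} \pm t S_{\bar{x}}$

where

$S_{\bar{x}} = \frac{s}{\sqrt{n}}$

s = standard deviation of the sample = $\sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$ for small samples

n = sample size

v = n - 1 degrees of freedom

The value of t is obtained from Student's t distribution tables for the required confidence level.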

Example

A random sample of 12 items is taken and is found to have a mean weight of 50 grams and a standard deviation of 9 grams.

What is the mean weight of the population

a) with 95% confidence

b) with 99% confidence

Solution

$\bar{x} = 50$; s = 9; v = n - 1 = 12 - 1 = 11; $S_{\bar{x}} = \frac{s}{\sqrt{n}} = \frac{9}{\sqrt{12}}$

$\mu = \bar{x} \pm t S_{\bar{x}}$

At the 95% confidence level (t = 2.201)

$\mu = 50 \pm 2.201 \times \frac{9}{\sqrt{12}}$

= 50 ± 5.72 grams

Therefore we can state with 95% confidence that the population mean is between 44.28 and 55.72 grams.

At the 99% confidence level (t = 3.106)

$\mu = 50 \pm 3.106 \times \frac{9}{\sqrt{12}}$

= 50 ± 8.07 grams

Therefore we can state with 99% confidence that the population mean is between 41.93 and 58.07 grams.

Note: To use the t distribution tables it is important to find the degrees of freedom (v = n - 1). In the example above v = 12 - 1 = 11.

From the tables we find that at the 95% confidence level, against 11 degrees of freedom and under 0.05, the value of t = 2.201. Similarly, at the 99% confidence level (under 0.01), t = 3.106.
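The small-sample interval can be sketched in Python. The function name `small_sample_interval` is hypothetical, and the critical t values for 11 degrees of freedom are hard-coded as read from standard t tables, because the Python standard library has no t distribution.

```python
import math

# Two-tailed critical t values for 11 degrees of freedom, from t tables
T_CRITICAL_11_DF = {0.95: 2.201, 0.99: 3.106}

def small_sample_interval(x_bar, s, n, t):
    """Return the margin of error and the confidence interval for the mean."""
    standard_error = s / math.sqrt(n)
    margin = t * standard_error
    return margin, (x_bar - margin, x_bar + margin)

# Sample of 12 items: mean weight 50 g, standard deviation 9 g
margin_95, ci_95 = small_sample_interval(50, 9, 12, T_CRITICAL_11_DF[0.95])
margin_99, ci_99 = small_sample_interval(50, 9, 12, T_CRITICAL_11_DF[0.99])

print(round(margin_95, 2), [round(v, 2) for v in ci_95])
print(round(margin_99, 2), [round(v, 2) for v in ci_99])
```

For a different sample size the table values must be replaced with those for the new degrees of freedom.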

7. SIMPLE LINEAR REGRESSION EQUATION

A regression model is a mathematical equation that describes the relationship between two or more variables. A simple regression model includes only two variables:

Independent variable: the variable used to explain the variation in the dependent variable, i.e. it is used to make predictions about the dependent variable.

Dependent variable: the variable being explained.

A linear regression model gives the equation of a linear relationship between two variables y (dependent) and x (independent), as shown below:

y = a + bx

The constant a is the y-intercept: the value of y where the line cuts the y-axis. The constant b is the slope or gradient of the line.

The linear relationship between x and y is defined once the values of the constants a and b are determined. The values of a and b can be determined in the following ways.

Scatter plots:

Scatter plots are used to examine the relationship between two variables. One variable takes the horizontal axis (x) while the other takes the vertical axis (y). The variation between the variables can show a relationship that is positive or negative.

A positive relationship that is linear or close to linear indicates that the variables move together in a linear manner. The scatter will show points lying in a region that resembles a line. When one variable increases the other also increases, and when one decreases the other also decreases.

In a negative relationship, an increase in one variable is accompanied by a decrease in the other variable. A linear relationship shows points scattered in such a way that they lie along a line.

Relationships between variables can also be non-linear. In such cases the points will concentrate in a region that reflects a curve. The relationship between two variables may therefore assume many possible shapes, which can be classified as linear relationships or as non-linear relationships that are complicated mathematical functions. The simplest relationship is a straight-line or linear relationship.

Check the scatter plots on page 606 of the main text book.

From a scatter plot, a line that best fits the points (the line of best fit) can be drawn as an approximately straight line through the scatter.

This requires good and refined skill: the better the line fits the points, the greater the accuracy of the line obtained. However, there will always be an error due to random causes, called the random error term. This is the difference between the actual value of y obtained from the survey and the value of y estimated by assuming the points fall along the line. For every value of x, a different value of y may be obtained by estimating from the line. The error for each value of y can be written as

$e = y - \hat{y}$, where y is the actual value and $\hat{y}$ the estimated value.

Over all the values of y, some estimated values are less than the actual values while others are more than the actual values.

For the line of best fit, the sum of the negative differences and the sum of the positive differences add up to zero.

Example:


The following data represent a sample of seven households, showing their incomes and food expenditures for a given month.

Income (hundreds of dollars)   Food expenditure (hundreds of dollars)
35                             9
49                             15
21                             7
39                             11
15                             5
28                             8
25                             9

Required:

Construct a scatter diagram, with income on the x-axis and food expenditure on the y-axis

Draw the prediction line

Identify the y-intercept (a) and the slope (gradient) (b).

Write the simple linear regression equation.

Scatter diagram


Depending on one's skill, different lines such as L1 and L2 can be drawn. Using L2:

y-intercept, i.e. constant a = 1.2

The gradient, i.e. coefficient of x: $b = \frac{12 - 6}{46 - 22} = 0.25$

The linear equation is generally written as

y = a + bx

= 1.2 + 0.25x

The line is an estimate of the values of y for different values of x. It can be used to predict values of y given x.

However, since the line is an estimate, there is a difference between the observed or actual value of y and the value obtained from the prediction line. This error is called the random error, also called the residual. It measures the surplus (positive or negative) difference. The random error obtained from a population is denoted by $\epsilon$, while that of a sample is denoted by e. In the above example:

e = actual food expenditure - predicted food expenditure = $y - \hat{y}$

If the predicted line is the line of best fit, the sum of the positive errors and the negative errors is equal to zero.


Drawing a line on a scatter diagram by eye may not give the line of best fit. The other option, which results in a sum of errors equal to zero,

$\sum e = \sum (y - \hat{y}) = 0$

is the use of the least squares method.

The Least squares method

The least squares method minimizes the random error. It helps to determine the constants a and b for the equation

$\hat{y} = a + bx$

that results in the line of best fit. The method gives the values of a and b for the equation (model) such that the sum of squared errors (SSE) is a minimum:

$SSE = \sum e^2 = \sum (y - \hat{y})^2$

The values of a and b which give the minimum SSE are called the least squares estimates, and the line is called the least squares line.

For the line $\hat{y} = a + bx$

$b = \frac{SS_{xy}}{SS_{xx}}$ and $a = \bar{y} - b\bar{x}$

where $SS_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n}$

$SS_{xx} = \sum x^2 - \frac{(\sum x)^2}{n}$

To find the least squares regression line for the data on incomes and food expenditures of seven households, we construct a table to guide the computation of a and b.

The table has the following:

Income (x)   Food expenditure (y)   xy      $x^2$
35           9                      315     1225
49           15                     735     2401
21           7                      147     441
39           11                     429     1521
15           5                      75      225
28           8                      224     784
25           9                      225     625
Total        $\sum x = 212$   $\sum y = 64$   $\sum xy = 2150$   $\sum x^2 = 7222$

$\bar{x} = \frac{\sum x}{n} = \frac{212}{7} = 30.286$

$\bar{y} = \frac{\sum y}{n} = \frac{64}{7} = 9.143$

$SS_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n} = 2150 - \frac{(212)(64)}{7} = 211.714$

$SS_{xx} = \sum x^2 - \frac{(\sum x)^2}{n} = 7222 - \frac{(212)^2}{7} = 801.429$

$b = \frac{SS_{xy}}{SS_{xx}} = \frac{211.714}{801.429} = 0.2642$

$a = \bar{y} - b\bar{x} = \frac{64}{7} - 0.2642 \left(\frac{212}{7}\right)$

a = 1.1414

$\hat{y} = 1.1414 + 0.2642x$
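The least squares computation for the income and food expenditure data can be sketched in Python. The function name `least_squares` is hypothetical. The code keeps b unrounded when computing a, so the intercept comes out at about 1.1422, slightly different from the 1.1414 obtained by hand with b rounded to 0.2642.

```python
def least_squares(xs, ys):
    """Return intercept a, slope b, SSxy and SSxx for the line y = a + b*x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    ss_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    ss_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    b = ss_xy / ss_xx
    a = sum_y / n - b * sum_x / n  # a = mean of y minus b times mean of x
    return a, b, ss_xy, ss_xx

income = [35, 49, 21, 39, 15, 28, 25]  # hundreds of dollars
food = [9, 15, 7, 11, 5, 8, 9]         # hundreds of dollars

a, b, ss_xy, ss_xx = least_squares(income, food)

# Residuals e = y - predicted y; for the least squares line they sum to zero
residuals = [y - (a + b * x) for x, y in zip(income, food)]

print(round(ss_xy, 3), round(ss_xx, 3), round(b, 4), round(a, 4))
print("sum of residuals:", round(sum(residuals), 10))
```

The residual check illustrates the property that the positive and negative errors of the least squares line cancel out.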

Interpretation:


The line gives the coefficients a and b to four decimal places, making it more accurate to use for prediction. We can check the accuracy as follows:

A household with a monthly income of $3500, denoted by x = 35, would be expected to spend on food:

$\hat{y} = 1.1414 + 0.2642x$

x = 35

$\hat{y} = 1.1414 + 0.2642(35) = 10.3884$

i.e. in hundreds of dollars ($1038.84)

The actual value in the data gives y = 9. The value 10.3884 can be regarded as an average, i.e. households having an income of $3500 (x = 35) spend an average of 10.3884 hundred dollars ($1038.84) on food.

The constant a is the value of y when x = 0, that is, the amount of money a household would spend on food per month if it had no income.

It means that food expenditure does not depend only on income; there could be other factors.

For purposes of prediction using the linear regression line obtained, we can only predict values of y for values of x that lie within the range of our data.

For example, the incomes lie between $1500 and $4900, i.e. between x = 15 and x = 49. We can only predict values of y for values of x between 15 and 49.

Prediction outside this range may not hold true (the prediction is not reliable). x = 0 is a value outside the range, so the prediction that households with no income spend $114.14 per month cannot be supported by our equation.


The constant b in the model gives the gradient, i.e. the change in y due to a change of one unit in x.

Example: when x increases by one unit of income (in hundreds of dollars), y increases by 0.2642 (in hundreds of dollars) spent on food.

Example:

If the income of a household changes from x = 30 to x = 31, y will change as follows:

$\hat{y} = 1.1414 + 0.2642(30) = 9.0674$

$\hat{y} = 1.1414 + 0.2642(31) = 9.3316$

9.3316 - 9.0674 = 0.2642

When b is positive, as x increases y also increases, and as x decreases y also decreases. There is a positive linear relationship between the variables, i.e. the change in y and the change in x are in the same direction; the variables move together.

If the value of b is negative, the change in y is in the opposite direction to the change in x, i.e. there is a negative linear relationship between the variables.

When b is greater than zero (b > 0) the line slopes upwards from left to right. If b < 0 the line slopes downwards from left to right.

Assumptions of the regression model

1. The mean value of the error is zero. From the above example, among the households with the same income some spend more on food and others less. The sum of the differences (positive errors and negative errors) is equal to zero.

2. The errors associated with different observations are independent. That is, all households decide independently how much to spend on food.

3. For any given x, the distribution of errors is normal, i.e. in the above example the food expenditure of all households with the same income is normally distributed.

4. The distribution of population errors for each x has the same (constant) standard deviation. The assumption is that the spread of points around the regression line is similar for all x values.

Example 2

A random sample of eight auto drivers insured with a company and having similar auto insurance policies was selected. The following table lists their driving experience (in years) and the monthly auto insurance premium (in dollars) paid by them. Find the linear regression equation using the least squares method (LSM).

Driving experience (years), x    Monthly auto insurance premium (dollars), y
 5                               64
 2                               87
12                               50
 9                               71
15                               44
 6                               56
25                               42
16                               60

Σx = 90, Σy = 474

\[ \bar{x} = \frac{\sum x}{n} = \frac{90}{8} = 11.25 \]

\[ \bar{y} = \frac{\sum y}{n} = \frac{474}{8} = 59.25 \]

Σxy = 4739, Σx² = 1396

\[ SS_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n} = 4739 - \frac{(90)(474)}{8} = -593.5 \]

\[ SS_{xx} = \sum x^2 - \frac{(\sum x)^2}{n} = 1396 - \frac{(90)^2}{8} = 383.5 \]

\[ b = \frac{SS_{xy}}{SS_{xx}} = \frac{-593.5}{383.5} = -1.5476 \]

\[ a = \bar{y} - b\bar{x} = 59.25 - (-1.5476)(11.25) = 76.6605 \]

\[ \hat{y} = 76.6605 - 1.5476x \]
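The least squares arithmetic for Example 2 can be checked with a short Python sketch. The variable names are chosen here for illustration, and the data are the eight (experience, premium) pairs from the example.

```python
# Driving experience (years) and monthly premium (dollars) for the eight drivers
experience = [5, 2, 12, 9, 15, 6, 25, 16]
premium = [64, 87, 50, 71, 44, 56, 42, 60]

n = len(experience)
sum_x = sum(experience)
sum_y = sum(premium)
sum_xy = sum(x * y for x, y in zip(experience, premium))
sum_xx = sum(x * x for x in experience)

# Sums of squares used by the least squares method
ss_xy = sum_xy - sum_x * sum_y / n
ss_xx = sum_xx - sum_x ** 2 / n

# Slope b and intercept a of the fitted line y = a + bx
b = ss_xy / ss_xx
a = sum_y / n - b * sum_x / n

# Residuals (actual minus fitted); they sum to zero up to rounding error
residuals = [y - (a + b * x) for x, y in zip(experience, premium)]

print(f"SSxy = {ss_xy}, SSxx = {ss_xx}")
print(f"b = {b:.4f}, a = {a:.4f}")
print(f"sum of residuals is close to zero: {abs(sum(residuals)) < 1e-9}")
```

Because the code keeps b unrounded, the intercept it reports can differ from 76.6605 in the last decimal place; the worked solution rounds b to -1.5476 before computing a.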

Practice Questions

1. The age and price of printers are reported in the table below. Age is in years while prices are in hundreds of dollars.

Age (years), x    Price ($00), y
5                 80
7                 57
6                 58
6                 55
5                 70
4                 88
7                 43
6                 60
5                 69
5                 63
2                 118

Required:

i. Find the equation of the regression line.

ii. Describe the apparent relationship between age and price for the printers

iii. What does the slope of the regression equation represent in terms of the price for

printers?

iv. Panama Enterprise wants to buy a 3-year-old and a 4-year-old printer from the firm. How much do you predict it will spend in buying the two printers?

8.HYPOTHESIS TESTING

Definition



A hypothesis is a claim or an opinion about an item or issue. It therefore has to be tested statistically in order to establish whether or not it is correct.

When testing a hypothesis, one must fully understand the two basic hypotheses to be tested, namely:

i. The null hypothesis (H0)

ii. The alternative hypothesis(H1)

The null hypothesis

This is the hypothesis being tested, i.e. the belief about a certain characteristic. For example, the Kenya Bureau of Standards (KBS) may visit a sugar-making company with the intention of confirming that the 2 kg bags of sugar produced actually weigh 2 kg and not less. They conduct a hypothesis test with the null hypothesis being:

H0: each bag weighs 2 kg

The testing will set out to confirm this or to refute it.

The alternative hypothesis

While formulating a null hypothesis we also consider the fact that the belief might be found to be untrue, in which case we will reject it. We therefore formulate an alternative hypothesis which contradicts the null hypothesis; thus when we reject the null hypothesis we accept the alternative hypothesis.

In our example the alternative hypothesis would be:

H1: each bag does not weigh 2 kg

Acceptance and rejection regions

All possible values that a test statistic may assume fall into one of two regions: values consistent with the null hypothesis (the acceptance region) and values that lead to the rejection of the null hypothesis (the rejection region or critical region).

The values which separate the rejection region from the acceptance region are called critical values.

Type I and type II errors

While testing a hypothesis (H0) and deciding to either accept or reject it, there are four possible occurrences.

a) Acceptance of a true hypothesis (correct decision): the null hypothesis is accepted and this happens to be the correct decision. Note that statistics does not give absolute information; its conclusion could be wrong, only the probability of it being right is high.

b) Rejection of a false hypothesis (correct decision).

c) Rejection of a true hypothesis (incorrect decision): this is called a type I error, with probability = α.

d) Acceptance of a false hypothesis (incorrect decision): this is called a type II error, with probability = β.

Levels of significance

A level of significance is a probability value which is used when conducting tests of hypothesis. It is the probability of making an incorrect decision by rejecting a null hypothesis that is in fact true (a type I error). The probabilities used are usually very small, e.g. 1% or 5%.

[Figure: normal curve with areas 0.5000 and 0.4900 and a 1% tail marked as the provision for errors]

Hypothesis testing procedure

Whenever a business complaint comes up there is a recommended procedure for conducting a statistical test. The purpose of such a test is to establish whether the null hypothesis or the alternative hypothesis is to be accepted.

The following steps are normally adopted:

1. Statement of the null and alternative hypotheses

2. Statement of the level of significance to be used

3. Statement about the test statistic, i.e. what is to be tested, e.g. the sample mean, sample proportion, difference between sample means or sample proportions

4. Type of test, whether two-tailed or one-tailed

5. Statement of the critical values using the appropriate level of significance

6. Standardizing the test statistic

7. Conclusion showing whether to accept or reject the null hypothesis
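Step 5 normally relies on standard normal tables. As a rough cross-check, the critical values can also be computed with Python's standard library. The helper name critical_z is an assumption made here for illustration.

```python
from statistics import NormalDist


def critical_z(alpha, tails=1):
    """Return the positive critical z value for a test at level alpha.

    tails=1 puts all of alpha in one tail; tails=2 splits it equally
    between the two tails.
    """
    if tails not in (1, 2):
        raise ValueError("tails must be 1 or 2")
    return NormalDist().inv_cdf(1 - alpha / tails)


for alpha, tails in [(0.05, 1), (0.05, 2), (0.01, 2)]:
    print(f"alpha = {alpha}, tails = {tails}: z = {critical_z(alpha, tails):.3f}")
```

For a left-tailed test the critical value is the negative of the one-tailed value. Table-based values such as 1.65 or 2.575 differ slightly from the computed ones only because of rounding.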

Hypothesis testing for the mean

Example 1

A certain NGO carried out a survey in a certain community in order to establish the average age at which girls are married. The results of the survey indicated that the marriage age for the girls is 19 years.

In order to establish the validity of the mean marital age, a sample of 50 women was interviewed, and the average age at which they got married was 16 years. The ages at which they were married varied, with a standard deviation of 2.1 years.

The sample data indicates that the marital age is less than 19 years. Is this conclusion true or not?

Required

Conduct a statistical test to either support or reject the above conclusion drawn from the sample statistics, i.e. that the marriage age is less than 19 years. Use a level of significance of 5%.

Solution

1. Null hypothesis H0: μ (mean marital age) = 19 years

Alternative hypothesis H1: μ (mean marital age) < 19 years

2. The level of significance is 5%

3. The test statistic is the sample mean age, X̄ = 16 years

4. The critical value of the one-tailed test (one-tailed because the alternative hypothesis is an inequality) at the 5% level of significance is 1.65. Because the rejection region lies in the left tail, the cut-off is -1.65.

The standardized value is computed as:

\[ Z = \frac{\bar{X} - \mu}{S_{\bar{x}}}, \qquad \text{where } S_{\bar{x}} = \frac{S}{\sqrt{n}} \]

Where X̄ = sample mean

μ = population mean

S = sample standard deviation

n = sample size

Z = standard value (as per computation)

5. The standard value Z must fall within the acceptance region for us to accept the null hypothesis. Thus it must be > -1.65; otherwise we accept the alternative hypothesis.

\[ Z = \frac{16 - 19}{2.1/\sqrt{50}} = -10.1 \]

[Figure: standard normal curve from -3 to 3, with the rejection region to the left of -1.65 and the acceptance region to its right]

6. Since -10.1 < -1.65, we reject the null hypothesis and accept the alternative hypothesis at the 5% level of significance, i.e. the marriage age in this community is significantly lower than 19 years.
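The arithmetic in this example can be reproduced with a short Python sketch. The variable names are illustrative, and the critical value is computed rather than read from a table, so it comes out as about -1.645 instead of -1.65.

```python
from math import sqrt
from statistics import NormalDist

# Figures from Example 1
x_bar = 16     # sample mean marital age
mu0 = 19       # mean claimed under H0
s = 2.1        # sample standard deviation
n = 50         # sample size
alpha = 0.05   # level of significance

standard_error = s / sqrt(n)
z = (x_bar - mu0) / standard_error

# Left-tailed test: reject H0 when z falls below the lower critical value
z_crit = NormalDist().inv_cdf(alpha)
reject_h0 = z < z_crit

print(f"z = {z:.1f}, critical value = {z_crit:.3f}, reject H0: {reject_h0}")
```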

Example 2

Test the hypothesis that weight loss in a new diet program exceeds 20 pounds during

the first month.

Sample data: n = 36, x̄ = 21, s² = 25, μ0 = 20, α = 0.05

H0: μ = 20 (μ is not larger than 20)

Ha: μ > 20 (μ is larger than 20)

\[ Z = \frac{\bar{X} - \mu_0}{s/\sqrt{n}} = \frac{21 - 20}{5/\sqrt{36}} = 1.2 \]

[Figure: standard normal curve from -3 to 3, with the acceptance region to the left of 1.645 and the rejection region to its right]

At 5%, the critical value is z = 1.645

RR: Reject H0 if z > 1.645

Decision: Do not reject H0, because the calculated value (Z = 1.2) is outside the rejection region.

Conclusion: At the 5% significance level there is insufficient statistical evidence to conclude that weight loss in the new diet program exceeds 20 pounds during the first month.
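The same check for this right-tailed test can be sketched in Python, using the sample figures given above. The variable names are chosen for illustration only.

```python
from math import sqrt
from statistics import NormalDist

# Figures from Example 2
n = 36           # sample size
x_bar = 21       # sample mean weight loss (pounds)
variance = 25    # sample variance s^2
mu0 = 20         # mean under H0
alpha = 0.05     # level of significance

s = sqrt(variance)
z = (x_bar - mu0) / (s / sqrt(n))

# Right-tailed test: reject H0 only when z exceeds the upper critical value
z_crit = NormalDist().inv_cdf(1 - alpha)
reject_h0 = z > z_crit

print(f"z = {z:.2f}, critical value = {z_crit:.3f}, reject H0: {reject_h0}")
```

Here z is below the critical value, so the sketch agrees with the decision not to reject H0.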

Exercise: Test the claim that the mean weight loss is not equal to 19.5.

Example 3

A machine is set to cut out bars to an average length of 150 mm. An operator wants to check whether the setting is accurate. She samples 50 bars and finds a mean of 148 mm. The standard deviation is known to be 5 mm. Is the machine still reliable? Test this at the 1% significance level.

Solution

H0: μ = 150 (machine may be reliable)

Ha: μ ≠ 150 (two-tailed test; machine not reliable, as it may produce lengths that are too long or too short. We cannot get a direction from the wording of the question.)

Alpha: α = 0.01, Critical value: z(α/2) = z(0.005) = 2.575
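The remaining steps of this two-tailed test, standardizing the sample mean and comparing it with the critical value, can be sketched in Python as follows. The variable names are illustrative, and the computed critical value (about 2.576) differs from the table value 2.575 only by rounding.

```python
from math import sqrt
from statistics import NormalDist

# Figures from Example 3
mu0 = 150      # length the machine is set to (mm)
x_bar = 148    # sample mean length (mm)
sigma = 5      # known standard deviation (mm)
n = 50         # number of bars sampled
alpha = 0.01   # level of significance

z = (x_bar - mu0) / (sigma / sqrt(n))

# Two-tailed test: alpha is split equally between the two tails
z_crit = NormalDist().inv_cdf(1 - alpha / 2)
reject_h0 = abs(z) > z_crit

print(f"z = {z:.2f}, critical values = ±{z_crit:.3f}, reject H0: {reject_h0}")
```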


\[ Z = \frac{\bar{X} - \mu_0}{s/\sqrt{n}} = \frac{148 - 150}{5/\sqrt{50}} = -2.83 \]

[Figure: standard normal curve from -3 to 3, with a rejection region of area 0.005 in each tail]

Z (sample) = -2.83

This falls in the region of Ha, since -2.83 < -2.575.

Conclusion: reject H0. There is enough evidence to support that the machine is no longer reliable.

Hypothesis testing for proportion

A member of parliament (MP) claims that in his constituency only 50% of the total youth population lacks university education. A local media company wanted to ascertain that claim, so it conducted a survey taking a sample of 400 youths; of these, 54% lacked university education.

Required:

At the 5% level of significance, confirm if the MP's claim is wrong.

Solution.

Note: This is a two-tailed test, since we wish to test the hypothesis that the proportion is different (≠) and not against a specific alternative hypothesis, e.g. < (less than) or > (more than).

H0: π = 50% of all youth in the constituency lack university education.

H1: π ≠ 50% of all youth in the constituency lack university education.

\[ S_p = \sqrt{\frac{pq}{n}} = \sqrt{\frac{0.5 \times 0.5}{400}} = 0.025 \]

\[ Z = \frac{0.54 - 0.50}{0.025} = 1.6 \]

At the 5% level of significance for a two-tailed test the critical value is 1.96. Since the calculated Z value (1.6) is less than 1.96, it falls within the acceptance region and we do not reject H0. At the 5% level there is insufficient evidence that the MP's claim is wrong.
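The proportion test for the MP's claim can be checked with the following Python sketch. The variable names are illustrative, and the standard error is computed under H0 (p = 0.5), as in the working above.

```python
from math import sqrt
from statistics import NormalDist

# Figures from the MP example
p0 = 0.50      # proportion claimed under H0
p_hat = 0.54   # sample proportion lacking university education
n = 400        # sample size
alpha = 0.05   # level of significance

# Standard error of the sample proportion, using the H0 value of p
standard_error = sqrt(p0 * (1 - p0) / n)
z = (p_hat - p0) / standard_error

# Two-tailed test: compare |z| with the upper critical value
z_crit = NormalDist().inv_cdf(1 - alpha / 2)
reject_h0 = abs(z) > z_crit

print(f"Sp = {standard_error:.3f}, z = {z:.2f}, "
      f"critical value = {z_crit:.2f}, reject H0: {reject_h0}")
```

Since |z| = 1.6 is below 1.96, the sketch does not reject H0 at the 5% level.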

Practice Questions

Kenya Commercial Bank Ltd. commissioned research whose results indicated that an automatic teller machine (ATM) reduces the cost of routine banking transactions. Following this information, the bank installed an ATM facility at the premises of Joy Processing Company Ltd., which for the last several months has been used exclusively by Joy's 605 employees. A survey on the usage of the ATM facility by 100 of the employees in a month indicated the following:

Number of times ATM used    Frequency
0                           20
1                           32
2                           20
3                           13
4                           10
5                            5

Required:

a) An estimate of the proportion of Joy's employees who do not use the ATM facility in a month

b) i) Determine the 95% confidence interval for the estimate in (a) above

ii) Can the bank be certain that at least 40% of Joy's employees will use the ATM facility?

c) The number of ATM transactions on average an employee of Joy makes per month

d) Determine the 95% confidence interval of the mean number of transactions made by an employee in a month.

e) Is it possible that the population mean number of transactions is four? Explain.
