7/28/2019 Business Statistics May Module
Business Statistics BCM 307
INTRODUCTION
Expected Learning Outcomes: At the end of the course a student should be able to:
Demonstrate an understanding of statistics and its importance for business
and management
Demonstrate proficiency with qualitative and quantitative measures through the
ability to organize and present data in tables, charts, graphs and polygons
Demonstrate an understanding of measures of central tendency and measures of
dispersion
Demonstrate an understanding of time series and their application
Draw scatter diagrams, construct a simple linear regression equation and
apply it
Demonstrate proficiency in contingency tables, probability concepts and their
application in business
Demonstrate some level of understanding and application of hypothesis testing
1.Introduction
Definition of statistics; Application of statistics in business; Terminologies used in
statistics
2.Collection, Organization and Presentation of Data
Qualitative data: Summary tables, bar, pie and Pareto charts,
Quantitative data: Summary tables, histogram, graphs, polygons, ogive and
Lorenz curve
Time Series: Time series graphs, application and forecasting
3.Numerical Descriptive Measures: Discrete and Continuous Data
Measures of Central Tendency: Mode, Median, Mean
Measures of dispersion: Significance of the measures, Range, Interquartile range, Variance, Standard deviation, Coefficient of variation
4. Probability Distributions.
a) Discrete distribution
b) Normal distribution (Continuous)
Introduction
Standard Normal Distribution
Z-scores
Areas to the Left and Right of x
Calculations of Probabilities Using the Central Limit Theorem
5.Confidence Interval
Introduction
Confidence Interval
Single Population Mean, Population Standard Deviation Known
Confidence Interval, Single Population Mean, Standard Deviation
6.Linear Regression Model and Scatter Diagram (Simple and Multi-linear)
Drawing scatter diagram, Describing relationship, equation relationship between
variables
Use least square method to derive the simple regression equation
Explain the coefficients of the variables and their significance
7.Hypothesis Testing
Introduction
Definition of Hypothesis Testing
Null and Alternate
Hypothesis Testing for the Mean
Hypothesis testing for the Proportion
Support One of the Hypothesis
Decision and Conclusion
Course Texts:
1. Dean, S. and Illowsky, B.; Principles of Business Statistics
2. Berenson, M., et al.; Basic Business Statistics: Concepts and Applications. 11th edition (2009)
3. Lucey, T.; Quantitative Techniques. 6th edition (2002)
4. R.I. et al.; Quantitative Approaches to Management. 8th edition (1992)
INTRODUCTION
Statistics is a branch of mathematics that transforms numbers into useful information
for decision making. It does this by providing a set of methods for analyzing the
numbers.
Statistics is therefore the science of data: it involves collecting, classifying, summarizing, organizing, analyzing and interpreting numerical information.
Definition of Terms
The application of statistics can be divided into two broad areas:
Descriptive statistics
Inferential statistics
Descriptive statistics: It uses numerical and graphical methods to look for patterns
in a data set, to summarize the information revealed and to present the information in
a convenient form. This can be referred to as analysis of data.
The data is usually presented in the form of tables, charts and graphs, and analyzed using
statistics such as the mean, median, mode, variance, standard deviation and coefficient of
variation.
Inferential statistics: It uses the data collected from a small group to draw
conclusions about a larger group. The conclusions may be decisions, predictions or other
generalizations about a larger set of data.
Important applications can therefore be summarized as:
Summarizing business data
Drawing conclusions from the data
Making reliable forecasts about business activities
Improving business processes
Statistics: The word statistics has two meanings:
Numerical facts derived from the analysis of sample data, for example the mean,
standard deviation and proportions. Any numerical fact can also be referred
to as a statistic, e.g. the number of people, the number of countries or the marks scored in a
test.
A field or discipline of study. It is a branch of mathematics that transforms
numbers into useful information for decision making. It does this by
providing a set of methods for analyzing the numbers. These methods help to
find patterns in the numbers and enable one to determine whether
differences in the numbers are just due to chance. In this sense statistics can
be seen as the science of data. It involves collecting, classifying, summarizing,
analyzing and interpreting numerical information.
Population (target population): It is the set of units that we are interested in studying
and about which we need to draw a conclusion: the whole set of elements of focus.
Characteristics of a population are called parameters, e.g. population mean, μ, population standard deviation, σ, population proportion, p.
Sample: It is the portion or subset of the population that is selected for analysis. The
sample is randomly selected so that it reflects the characteristics of the population.
Characteristics of a sample are called statistics, e.g. sample mean, x̄, sample standard deviation, s,
sample proportion, p.
Representative sample: The sample selected is representative if it exhibits the typical
characteristics possessed by the population of interest (the target population). The most common way to satisfy the representative sample requirement is
to use methods that allow us to select a random sample, that is, giving every element
an equal chance of being selected.
Random sample: A random sample is selected from the population in such a way that
every element, and every possible sample of a given size, has an equal chance of selection.
Element/member: An element of a sample or population is a specific subject or object
about which information is collected, e.g. a firm, a country, a person or a university.
Variable: It is a characteristic under study that assumes different values for different
elements. For example, scores in a test: different scores are expected for different
students when they do a given test. In the case of a firm, the profit made at different
times may differ; people's tastes for a given product may differ, etc.
Observation/measurement: The value of a variable for an element, e.g. 120 cm as a
height, a yes for an opinion, Sh. 20 000 as an income.
Data set: A collection of observations on one or more variables.
Types of variables
A variable may be classified as qualitative or quantitative.
Qualitative Variable
These are characteristics that cannot be measured on a natural numerical scale but
can only be classified into groups or categories.
Data from a categorical variable are measured on one of two scales: nominal or
ordinal.
Nominal scale: It classifies data into distinct categories that cannot be ranked. For
example gender (female or male), preference for a product or a service (e.g. a soft drink), a yes or no response etc. This is the weakest form of measurement.
Ordinal scale: It classifies data into distinct categories that can be ranked, e.g.
responses such as excellent, very good, fair, poor etc. Though it is possible to
rank the categories, the scale is still weak in that the size of the difference between
categories cannot be measured.
Quantitative Variable
The measurements are recorded on a naturally occurring numerical scale. The scale
can be either an interval or a ratio scale. Values on these two scales can be ranked, and the
difference between two values can also be calculated and interpreted.
Interval scale: Differences between values are meaningful, but ratios are not. For example, a student who
scores 100% is not twice as intelligent as one who scores 50%.
Ratio scale: It refers to data whose values can also be compared as ratios. All the
arithmetic operations (addition, subtraction, multiplication and
division) are meaningful. For example sales of a company, income of families, returns from an
investment etc.
2. COLLECTION, ORGANIZATION AND PRESENTATION OF DATA
The first step is to identify the type of data one wants to collect: quantitative or qualitative. The second step is to devise a suitable method for collecting the data. There
are various methods used to collect data, for example surveys, designed
experiments and observational studies.
Survey: The researcher samples a group of people and asks them questions, from which
responses are obtained. Some tools used in data collection are questionnaires,
mail, and telephone or in-person interviews.
Experiment: Designed experiments normally involve strict control over the elements in the study. Two groups are formed: an experimental (treatment)
group and a control group.
Observational study: The researcher observes the elements in their natural setting
and records the variables of interest.
Regardless of the data collection method, it is likely that the data will be from a
sample. Data is classified as primary or secondary. Primary data is collected by
the person analyzing the data, while secondary data is obtained by the analyst from
publications such as books, journals and newspapers.
Organization and Presentation of Data
After data is collected it is cleaned to remove unnecessary material from the record. It is then
ready for analysis, the process that transforms the raw data into meaningful
information which the analyst can use in the decision-making process. The analysis process
includes organization, presentation, description and inference from the results
obtained from the sample to make generalizations about the population.
The techniques or methods used to present and analyze data depend on the type of
data: quantitative or qualitative.
Qualitative Data
The data can be organized and presented in:
Summary tables: frequency, relative frequency or percentage frequency tables
Bar charts
Pie charts
Pareto charts
Example 1
A sample was taken of 25 high school seniors who were planning to join college. The
following are the categories of majors they intended to choose: business (BUS),
economics (ECON), management information science systems (MIS), behavioral science (BS) and others.
The responses of the students, when asked their choice, are listed below:
ECON MIS ECON BUS BUS
BUS BUS other other other
other BS MIS other MIS
ECON BUS MIS BUS other
BS MIS other other other
Required:
Organize and present the data by constructing:
i. Frequency distribution table
ii. Bar chart
iii. Pie chart
Solution
The above data is measured on a categorical scale. The analyst collected the data by
recording the category of major each student intended to choose.
i. Frequency Distribution Table:
This is a summary table that organizes the raw data into
three columns, as demonstrated below:
Categories   Tally        Frequency
BUS          ||||| |      6
ECON         |||          3
MIS          ||||| |      6
BS           ||           2
Other        ||||| |||    8
Total                     25
To make sure that all the responses from each category are included, the student goes
through the raw data, crossing out each response and recording it as a stroke in the
tally column.
The tallies are bundled in groups of five, and the frequency of each category is obtained by counting the
strokes. To ensure all items were considered, the student must write
the total frequency as shown above; it must match the sample size of the data set.
The frequency distribution table gives us the number of students who chose
each major. From the frequencies we can identify the categories
with the highest or lowest number of students, and generally describe how
the frequency is distributed among the different categories.
A relative frequency distribution table can also be used as a summary table. In this case
an additional column gives the relative frequency of each category, obtained
by dividing its frequency by the total frequency:
Categories   Tally        Frequency   Relative frequency
BUS          ||||| |      6           6/25 = 0.24
ECON         |||          3           3/25 = 0.12
MIS          ||||| |      6           0.24
BS           ||           2           0.08
Other        ||||| |||    8           0.32
Total                     25          1.00
The total relative frequency is equal to 1.00. From this table we obtain the
proportion of the students in each category.
ii. Percentage frequency distribution Table
The summary table can also show percentage frequencies, with a column added
for them. The table may look like:
Categories   Tally        Frequency   Percentage frequency
BUS          ||||| |      6           6/25 × 100 = 24
ECON         |||          3           12
MIS          ||||| |      6           24
BS           ||           2           8
Other        ||||| |||    8           32
Total                     25          100
The column has a total of 100. Percentages are a simpler way of expressing
proportions.
NB: To develop either a relative frequency distribution table or a percentage frequency
distribution table, one must first construct the frequency distribution table.
It is advisable to put all of them in the same table.
iii) Bar Chart
The bar chart presents the data using a horizontal and a vertical axis. The horizontal
axis takes the categories, which are represented by bars of equal width
separated from each other by uniform spaces. The frequency (or relative frequency
or percentage) is shown on the vertical axis.
The scale of the vertical axis is determined by the highest frequency among the
categories. The scale must be easy to construct and read.
If the vertical axis uses relative frequencies or percentages, the scale must
be chosen to include the highest relative frequency or percentage. The bar chart
is a good presentation of the data from which various information can be drawn.
For example, the business and MIS bars are equal in height: the numbers of students choosing
business and MIS majors are equal.
The category "other" is the highest, which means that majors not separately
identified are also chosen. Behavioral science has the least number of students. Bar
charts are best suited for comparing different categories by checking the heights of
the bars.
iv) Pie Chart
The information can also be presented in a pie chart. The chart represents each
category by a sector whose size reflects its proportion. The
relative frequencies are converted to degrees by multiplying by 360 (percentages by 3.6).
Categories   RF     Degrees
BUS          0.24   0.24 × 360 = 86.4
ECON         0.12   43.2
MIS          0.24   86.4
BS           0.08   28.8
Others       0.32   115.2
Total        1.00   360
Then draw the pie chart to show the different categories as sectors of different sizes.
The pie chart makes the size of each portion easy to see
and makes it easy to draw conclusions.
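The conversions used in this example (frequency to relative frequency, percentage and pie-chart degrees) can be sketched in Python. This is only an illustration and not part of the original module; the names `frequencies` and `summary` are assumptions made for the sketch, and the frequencies are those of the frequency distribution table.

```python
# Frequencies as given in the frequency distribution table of Example 1.
frequencies = {"BUS": 6, "ECON": 3, "MIS": 6, "BS": 2, "Other": 8}

total = sum(frequencies.values())  # must match the sample size (25)

summary = {}
for category, f in frequencies.items():
    relative = f / total  # relative frequency = f / total frequency
    summary[category] = {
        "frequency": f,
        "relative": round(relative, 2),
        "percentage": round(relative * 100, 1),
        "degrees": round(relative * 360, 1),  # size of the pie sector
    }

# Print one row per category, in the layout of the summary table.
for category, row in summary.items():
    print(f"{category:<6}{row['frequency']:>4}{row['relative']:>7.2f}"
          f"{row['percentage']:>7.1f}{row['degrees']:>8.1f}")
```

Keeping all the derived columns in one structure mirrors the advice to build the frequency table first and add the other columns to the same table.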
v) Pareto charts
Pareto charts separate the categories into the "vital few" and the "trivial many".
The Pareto principle holds when the majority of items in a set of data occur in a small
number of categories and the few remaining items are spread out over a larger number
of categories.
The separation helps to identify and focus on the important categories.
Example 2
Hotel XYZ samples complaints about its rooms and categorizes them as in
the table below. The sample that gave the responses was made up of 106 customers. The table summarizes the complaint categories and the number of customers that
complained about each issue.
Required
a. Construct a Pareto chart
b. Which reasons for complaint do you think the hotel managers should focus on
if they want to reduce the number of complaints? Explain.
c. Also construct a pie chart and a bar chart and compare the suitability of each chart for
presenting this data
Solution
Order the categories from the one with the highest frequency to the one with the least,
convert the frequencies to percentages and show them in a column. Also add columns for
cumulative frequency and cumulative percentage frequency. The categories can be
identified with symbols to avoid a lot of writing. Let the categories in the question be
labelled A to H before rearrangement. Arranging them in order from the one with the highest frequency to the one with the lowest gives:
A B E C D F G H
The table has the following information
     Reasons for complaint             Number of customers
A    Dirty room                        32
B    Not stocked                       17
C    Not ready                         12
D    Too noisy                         10
E    Needs maintenance                 17
F    Has too few beds                  9
G    Doesn't have promised features    7
H    No special accommodation          2
Reasons   Frequency   Cumulative frequency   %      Cumulative %
A         32          32                     30.2   30.2
B         17          49                     16.0   46.2
E         17          66                     16.0   62.2
C         12          78                     11.3   73.5
D         10          88                     9.4    82.9
F         9           97                     8.5    91.4
G         7           104                    6.6    98.0
H         2           106                    2.0    100
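The ordering and accumulation steps of the Pareto table can be sketched in Python as follows. This is an illustration only; the names `counts`, `ordered` and `rows` are assumptions made for the sketch. Note that the cumulative percentages here are computed from the cumulative counts, so they can differ in the last digit from a table that adds up already rounded percentages.

```python
# Complaint counts, keyed by the category letters A to H used in the example.
counts = {"A": 32, "B": 17, "C": 12, "D": 10, "E": 17, "F": 9, "G": 7, "H": 2}
total = sum(counts.values())  # 106 customers

# Sort from highest to lowest frequency. Python's sort is stable,
# so the tied categories B and E keep their original relative order.
ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)

rows = []
running = 0
for reason, f in ordered:
    running += f  # cumulative frequency
    rows.append((
        reason,
        f,
        running,
        round(100 * f / total, 1),        # percentage
        round(100 * running / total, 1),  # cumulative percentage
    ))

for row in rows:
    print(*row, sep="\t")
```

The cumulative column is what the Pareto curve plots: the first three categories (A, B and E) already account for 66 of the 106 complaints.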
Summary:
Summary tables together with the charts are used to describe the portion of items of
interest in each category.
Each chart best suits certain situations, for example:
Bar charts are more suitable for comparing the sizes of categories,
especially when they are not many in number; for our purposes, not more
than six. With more than that, the chart becomes crowded.
Pie charts are best suited to situations where the main objective is to
investigate the portion a category occupies in relation to the whole. Coloring
the portions with different colors enhances the display. Pie charts are also best for few
categories.
Pareto charts sort the frequencies in descending order and provide the
cumulative curve on the same graph. This allows the viewer to see which
categories matter most in the given situation. The chart allows
presentation of many categories, including those with small differences in
percentage, because the curve makes it easy to identify the additional
proportion contributed by each added category.
Quantitative Data
These are measurements that are recorded on a naturally occurring numerical scale. They are measured on an interval or ratio scale as explained earlier.
Quantitative data can be organized and presented in a number of ways that
include:
Ordered array
Stem-and-leaf display
Summary tables
Histogram
Frequency polygons
The cumulative percentage polygon: ogive
Quantitative data can be either discrete or continuous.
Discrete data: A variable whose values are countable, i.e. they assume
whole number values, e.g. number of persons, cars, companies etc.
Continuous data: A variable that can assume any numerical value over
a continuum within a certain interval or intervals, e.g. time taken to serve a customer
in a bank, amount of money, height of individuals etc.
Discrete data:
It can be organized and presented in
Ordered Array
Stem-and-leaf display
Bar chart
Summary tables
Example 1
The following data represents the stock prices of 25 companies.
31 15 13 17 23
16 22 12 23 30
22 18 33 21 18
13 26 16 26 27
22 27 20 20 22
Required: Construct
i. Ordered array
ii. Stem-and-leaf display
i) Ordered Array:
This requires that the data is written in ascending or descending order.
12 13 13 15 16 16 17 18 18
20 20 21 22 22 22 22 23 23
26 26 27 27 30 31 33
An ordered array is most useful when the data set is not too large.
ii) Stem-and-leaf display:
A suitable stem (the leading part: one, two or three digits) is chosen depending on the nature of
the data. The remaining digits of each value form the leaf.
Since the above data values have two digits, the tens digit forms the stem and the ones
digit the leaf. The stems 1, 2 and 3 represent the tens, twenties and thirties, while
the ones digits form the leaves.
Stem | Leaf
1 | 2 3 3 5 6 6 7 8 8
2 | 0 0 1 2 2 2 2 3 3 6 6 7 7
3 | 0 1 3
The ones digits are listed after the appropriate tens digit. From the display, values in the twenties are the
most frequent and those in the thirties the least.
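A stem-and-leaf display for two-digit data can be sketched in Python by splitting each value into its tens digit (stem) and ones digit (leaf). This is a minimal illustration; the names `prices` and `stems` are assumptions made for the sketch.

```python
from collections import defaultdict

# Stock prices of the 25 companies from the example.
prices = [31, 15, 13, 17, 23, 16, 22, 12, 23, 30, 22, 18, 33,
          21, 18, 13, 26, 16, 26, 27, 22, 27, 20, 20, 22]

stems = defaultdict(list)
for value in sorted(prices):        # sorting first gives ordered leaves
    stem, leaf = divmod(value, 10)  # tens digit is the stem, ones digit the leaf
    stems[stem].append(leaf)

for stem in sorted(stems):
    leaves = " ".join(str(leaf) for leaf in stems[stem])
    print(f"{stem} | {leaves}")
```

Counting the leaves on each stem gives the frequency of the tens, twenties and thirties, and the total number of leaves should equal the sample size.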
Example 2
The following data represent the monthly rents paid by a sample of 30 households
selected from a city.
429 732 550 1020 750
540 956 1070 871 880
650 950 780 900 750
585 675 989 620 660
578 1030 930 765 975
1020 840 870 800 820
Solution:
The values have either 3 or 4 digits. We can take a 1-digit stem for the 3-digit
numbers and a 2-digit stem for the 4-digit numbers. The leaf is then a two-digit
number.
The stem-and-leaf display does not require the data to be arranged in order
first. If the data is arranged, the pattern obtained is the same.
Stem-and-leaf display
4 | 29
5 | 85 50 40 78
6 | 75 20 60 50
7 | 32 50 65 80 50
8 | 71 80 40 70 00 20
9 | 89 56 30 75 50 00
10 | 20 30 70 20
By looking at the stem-and-leaf display we can observe how the data values are
distributed. The stem-and-leaf display does not lose the information on individual
observations or measurements.
Example 3
The following data give the number of computer courses taken by 30 business majors
who recently graduated from a university.
2 3 2 3 1 4 2 2 3 4
2 3 4 1 2 3 2 1 4 2
1 2 3 1 1 3 2 2 4 1
Required
a. Prepare a frequency distribution table.
b. Compute relative frequency and percentage distributions
c. Draw a bar graph for the frequency distributions
d. What percentage of the graduates took 2 or 3 computer courses?
Solution
Identify all the distinct values in the data set: 1, 2, 3 and 4.
Construct a summary table with the columns: number of courses, tally and
frequency. The relative and percentage frequency distributions can be included in the same table.
Number of courses   Tally            Frequency (f)   Relative frequency (f/30)   Percentage frequency (× 100)
1                   ||||| ||         7               0.2333                      23.33
2                   ||||| ||||| |    11              0.3667                      36.67
3                   ||||| ||         7               0.2333                      23.33
4                   |||||            5               0.1667                      16.67
Total                                30              1.000
Bar graph (chart)
The graduates who took 2 or 3 courses are the total of those who took 2
and those who took 3: (36.67 + 23.33)% = 60%
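The tallying for this example can be checked with a short Python sketch using `collections.Counter` from the standard library. The variable names (`courses`, `counts`, `table`, `share_2_or_3`) are assumptions made for the illustration.

```python
from collections import Counter

# Number of computer courses taken by the 30 graduates in the example.
courses = [2, 3, 2, 3, 1, 4, 2, 2, 3, 4,
           2, 3, 4, 1, 2, 3, 2, 1, 4, 2,
           1, 2, 3, 1, 1, 3, 2, 2, 4, 1]

counts = Counter(courses)  # frequency of each distinct value
n = len(courses)

# value -> (frequency, relative frequency, percentage frequency)
table = {
    value: (counts[value], counts[value] / n, 100 * counts[value] / n)
    for value in sorted(counts)
}

for value, (f, rel, pct) in table.items():
    print(f"{value}  {f:>2}  {rel:.4f}  {pct:.2f}")

# Part (d): percentage of graduates who took 2 or 3 courses.
share_2_or_3 = 100 * (counts[2] + counts[3]) / n
print(share_2_or_3)
```

Working from the counts (18 out of 30) rather than from the rounded percentages avoids any rounding error in the answer to part (d).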
Grouped/Continuous data
Discrete data can be presented like categorical data in a bar graph where the values
take the horizontal axis. Frequency, relative frequency and
percentage distribution tables can be constructed as for categorical variables, with each
discrete data value standing as a category. For grouped data, however, the frequencies,
relative frequencies and percentages are assigned to intervals of values in the
table.
A stem-and-leaf display may not be very applicable, and in place of a bar chart grouped
data is presented in a histogram.
Sometimes it becomes necessary to look at the values in a data set in the form of classes or
groups. Each class gives the total number of values that fall within a given range. It is
required that one identifies the class width, that is, the range of values
covered by the class.
Number of classes or groups: there should be neither too few (not less than 3 classes) nor too many (not more than 10) in the context of our class work. In real life, however,
we may have data grouped into many more classes.
Grouping is necessary because we are interested in presenting data in a more organized,
easily interpretable form and in a way that makes sense.
Example 1
The data on the stock prices of 25 companies:
31 15 13 33 23 16 12 12 23 26 22 18 27 21 18 13 26 16 17 27 22 22 26
20 30
To group the data we can choose a class width of 4 or 5. The approximate number of
classes is the range of the data divided by the class width. The range is 33 - 12 = 21, so a width of 4 gives 21/4 = 5.25, i.e. about 6 classes,
and a width of 5 gives 21/5 = 4.2, i.e. about 5 classes. Either can be used.
Let us use a class width of 5. Identify the smallest value = 12.
The first class can start at the lowest value in the data, or we can decide to start at 10. This means the
first class contains 10, 11, 12, 13, 14, i.e. 10-14. The next class contains 15, 16,
17, 18, 19, i.e. 15-19, then 20-24 and so on until all the values are covered. The lowest and highest values in each class are included in
the interval.
The above classes can also be written as 10 to less than 15, 15 to less than 20 etc.
When we use this style the upper value in each class is not included. In each
case the class width is five. Be careful to use each style correctly.
The summary table: We can consider including relative frequency distribution and
percentage distribution in the table.
Grouped data can be presented in
i. summary table,
ii. histogram
iii. frequency polygon
iv. cumulative frequency curve (ogive)
i) Summary table
ii) Histogram
This is a graph in which the classes are marked on the horizontal axis. The classes are written
using class boundaries: each class is adjusted so that 0.5 is
subtracted from its lower limit and 0.5 is added to its upper limit, giving 9.5-14.5, 14.5-19.5, 19.5-24.5 etc.
The vertical axis takes the frequency, relative frequency or percentage
frequency. The scale must include the highest frequency, in this case 8.
Draw bars with heights corresponding to the frequency in each class, making sure that
the bars are adjacent (touching) because the data is continuous and any value can
occur. The information that can be obtained from a histogram is much
like that from a bar chart for discrete or qualitative data. Like a stem-and-leaf
display, a histogram also shows the distribution pattern of the data.
Data can be normally (or approximately normally) distributed or skewed, and a histogram
displays this information well.
Class   Tally        Frequency   Relative frequency (f/25)   % frequency (× 100)   Cumulative %
10-14   ||||         4           0.16                        16                    16
15-19   ||||| |      6           0.24                        24                    40
20-24   ||||| |||    8           0.32                        32                    72
25-29   ||||         4           0.16                        16                    88
30-34   |||          3           0.12                        12                    100
Total                25          1.00                        100
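The columns of the grouped summary table, together with the class boundaries used for the histogram and the class midpoints used for the frequency polygon, can be sketched in Python. This is an illustration only: the names `classes`, `frequencies` and `rows` are assumptions, and the frequencies are taken from the summary table rather than recounted from the raw data.

```python
# Class limits and frequencies as given in the summary table.
classes = [(10, 14), (15, 19), (20, 24), (25, 29), (30, 34)]
frequencies = [4, 6, 8, 4, 3]

total = sum(frequencies)  # 25 companies

rows = []
cumulative = 0
for (lower, upper), f in zip(classes, frequencies):
    cumulative += f
    rows.append({
        "boundaries": (lower - 0.5, upper + 0.5),  # histogram bar edges
        "midpoint": (lower + upper) / 2,           # frequency polygon x-value
        "frequency": f,
        "percent": 100 * f / total,
        "cumulative_percent": 100 * cumulative / total,  # ogive y-value
    })

for row in rows:
    print(row)
```

The upper boundary of each class paired with its cumulative percentage gives the points of the ogive; the midpoint paired with the frequency gives the points of the frequency polygon.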
iii) Frequency polygon
It is formed by plotting the middle of each class against the frequency and joining
the points with straight lines. The polygon can be drawn in the histogram by
marking the middle of the bars and joining the points. It is also effectively used to
display the pattern of the data across the classes.
iv) Cumulative frequency graph: Ogive
The graph is drawn by plotting the upper boundary of each class against the
cumulative frequency, cumulative relative frequency or cumulative percentage frequency. It can be used to read off, for example, the number of
companies that had their stock prices between 15 and 22.
We may also want to know the number of companies whose stock prices
were 27 and below.
Locate 27 along the price axis and draw a vertical line to meet the curve. Draw a horizontal line to read the frequency = 21. Therefore 21 companies
had their stock prices at 27 and below, and 4 companies
have their stock prices above 27.
The cumulative frequency curve: ogive
Lorenz curve: It is a special ogive that plots the cumulative income or wealth of a
country against the cumulative population. It shows how wealth is distributed in a given
country. Many Lorenz curves form a long S, showing some level of unequal
distribution of wealth among citizens.
Tax policies can be used to level out the inequality by charging higher tax rates for the
wealthier and lower rates for the less wealthy population. For perfectly equal distribution the
curve becomes a straight line, an ideal situation; the more equitably wealth
is distributed, the nearer the curve is to a straight line.
The curve is drawn with the cumulative percentage of the population on the vertical axis
and the cumulative percentage of wealth or income on the horizontal axis.
3. NUMERICAL DESCRIPTIVE MEASURES AND ANALYSIS
The descriptive measures can be classified as:
- Measures of central tendency: mode, median and mean
- Measures of dispersion or spread: range, variance, standard deviation, semi-interquartile
range, coefficient of variation etc.
Measures of Central Tendency
These are summary measures that give averages. The measures of central tendency
can be calculated for discrete (ungrouped) or continuous (grouped) data.
Discrete data:
a. Mode: this is the most popular or common item in the data set. It is the
value with the highest frequency. A data set can be unimodal (one
mode), bimodal (two modes) or multimodal. Example:
29 31 35 39 39 40 43 44 44
The above set is bimodal, with modes 39 and 44.
b. Median: it is the value of the middle term in a data set that has been
ranked in ascending or descending order. The position of the median is
identified as:
(N + 1)/2, where N is the total frequency
The median in the above data is at position (9 + 1)/2 = 5th, which is 39.
Example 1
23 36 210 249 257 506 385 13 50 97 234 275
Find the median
Solution
Arrange the data in ascending order
13 23 36 50 97 210 234 249 257 275 385 506
Middle position = (n + 1)/2 = 13/2 = 6.5
The median lies between the 6th and 7th values: (210 + 234)/2 = 222
The advantage of using the median as a measure of central tendency is that it is not
influenced by outliers. It is preferred to the mean for data sets that contain outliers.
Outliers are the few values in the data that are extreme compared with the rest: either very
low or very high.
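The mode and median calculations above can be checked with Python's standard `statistics` module. This is a small sketch; the variable names are assumptions made for the illustration, and `statistics.multimode` requires Python 3.8 or later.

```python
import statistics

# Data set used to illustrate the mode (nine values).
data_set = [29, 31, 35, 39, 39, 40, 43, 44, 44]
modes = statistics.multimode(data_set)     # every value with the highest frequency
median_odd = statistics.median(data_set)   # middle (5th) value of 9 ordered values

# Ordered data from the median example (twelve values).
ordered = [13, 23, 36, 50, 97, 210, 234, 249, 257, 275, 385, 506]
position = (len(ordered) + 1) / 2          # 6.5, i.e. between the 6th and 7th values
median_even = statistics.median(ordered)   # average of the 6th and 7th values

print(modes, median_odd, position, median_even)
```

For an even number of values `statistics.median` averages the two middle values, which is the same rule applied by hand in the example.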
Mean = Arithmetic mean
It is the most frequently used measure of central tendency. It is the
sum of all values divided by the total frequency. The mean is preferred in that it
represents the whole data set from which it is computed.
Sample mean: x̄ = Σx / n, where n = sample size
Population mean: μ = Σx / N, where N = population size
Example 2
The following data give the profits (in thousands of dollars) of a sample of five companies in a
given year.
4725 1884 3807 4939 and 1625
x̄ = Σx / n = 16980 / 5 = 3396
The average profit of the 5 companies is $3,396,000. A major shortcoming of the mean is that
it is very sensitive to outliers.
Example 3
The following data give the number of years eight employees have been with their
current employers
11 9 13 12 8 9 24 10
a) Identify the outlier.
b) What would be the mean if the outlier was i) excluded ii) included
Solution
a) The outlier is 24, which is an extreme number of years compared with the other
employees.
b) i) Mean excluding 24:
(11 + 9 + 13 + 12 + 8 + 9 + 10)/7 = 72/7 = 10.286
ii) Mean including 24:
(11 + 9 + 13 + 12 + 8 + 9 + 10 + 24)/8 = 96/8 = 12
The one extreme value changes the mean by almost 2 units, i.e. from 10.286 to
12 (a difference of 1.714).
The mean is very sensitive to outliers. For example, the mean mark of a BCM 307 test can
easily be affected by a few very poorly performing students or a few very well performing students. The mean may then not accurately represent the whole class.
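The effect of the outlier on the mean can be reproduced with a few lines of Python. This is a sketch for illustration; the names `years`, `outlier`, `mean_with` and `mean_without` are assumptions.

```python
# Years of service of the eight employees in the example.
years = [11, 9, 13, 12, 8, 9, 24, 10]
outlier = 24

# Mean with every value included.
mean_with = sum(years) / len(years)

# Mean after dropping the outlier.
without = [y for y in years if y != outlier]
mean_without = sum(without) / len(without)

print(round(mean_without, 3), mean_with, round(mean_with - mean_without, 3))
```

A single value out of eight shifts the mean by about 1.7 years, while the median of the same data moves only from 10 (without the outlier) to 10.5 (with it).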
Example 4
The mean of 60, 80, 90, 120:
(60 + 80 + 90 + 120)/4 = 350/4 = 87.5
The arithmetic mean is very useful because it represents the values of most
observations in the population.
The mean therefore describes the population quite well in terms of the magnitudes
attained by most of the members of the population
Measures of Dispersion
Discrete data
These are statistics or measures that show how data is dispersed. The measures
include
Range
Inter-quartile range
Variance
Standard deviation
Range: The difference between the highest and the lowest value.
Example 1
The example on the number of years the employees have stayed with their employers.
11 9 13 12 8 9 24 10
Range: 24 - 8 = 16
The range is influenced by outliers as it is based on only two values. Its disadvantage is
that it ignores the rest of the values in a data set, and so it is not a satisfactory measure of
dispersion.
Inter Quartile Range: The difference between the upper quartile Q3 and the lower quartile Q1. It contains the middle 50% of the data.
Example 1
Arrange the data in order:
8 9 9 10 11 12 13 24
1st Quartile (Q1): position = 1/4 × 8 = 2nd value, so Q1 = 9
3rd Quartile (Q3): position = 3/4 × 8 = 6th value, so Q3 = 12
Inter quartile range: 12 - 9 = 3.
Example 2
The following is a set of discrete data:
2, 5, 8, 10, 11, 14, 17, 20
Required:
(i) Find the 30th percentile
(ii) Find the quartiles.
Solution
Position = 0.3(n + 1) = 0.3(9) = 2.7
30th percentile = 5 + 0.7(8 - 5) = 5 + 2.1 = 7.1
Lower Quartile (25th percentile)
Position = 0.25(n + 1) = 0.25(9) = 2.25
Q1 = 5 + 0.25(8 - 5) = 5 + 0.75 = 5.75
Median (50th percentile)
Position = 0.5(n + 1) = 0.5(9) = 4.5
Median: Q2 = 10 + 0.5(11 - 10) = 10.5
Upper Quartile (75th percentile)
Position = 0.75(n + 1) = 0.75(9) = 6.75
Q3 = 14 + 0.75(17 - 14) = 16.25
Interquartile range
IQR = Q3 - Q1 = 16.25 - 5.75 = 10.50
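The position rule used above, p(n + 1) with linear interpolation between neighbouring values, can be sketched in Python as follows. The function name `percentile_n_plus_1` is a hypothetical helper, not part of any library.

```python
def percentile_n_plus_1(data, p):
    """Value at fraction p (0 < p < 1), using position p*(n + 1)
    and linear interpolation between the two neighbouring values."""
    values = sorted(data)
    n = len(values)
    pos = p * (n + 1)      # 1-based position, may be fractional
    k = int(pos)           # rank of the lower neighbour
    frac = pos - k         # fractional part used for interpolation
    if k < 1:              # position falls before the first value
        return values[0]
    if k >= n:             # position falls at or after the last value
        return values[-1]
    return values[k - 1] + frac * (values[k] - values[k - 1])

data = [2, 5, 8, 10, 11, 14, 17, 20]
p30 = percentile_n_plus_1(data, 0.30)
q1 = percentile_n_plus_1(data, 0.25)
q2 = percentile_n_plus_1(data, 0.50)
q3 = percentile_n_plus_1(data, 0.75)
iqr = q3 - q1
```

Other software may use a different position rule and so return slightly different quartiles; Python's `statistics.quantiles` with its default "exclusive" method is understood to follow the same (n + 1) convention.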
Example 2 (Grouped Data)
The following table shows the levels of retirement benefits given to a group of workers
in a given establishment.
Retirement benefits ('000)   No. of retirees (f)   Upper class boundary   cf
20 - 29                      50                    29.5                   50
30 - 39                      69                    39.5                   119
40 - 49                      70                    49.5                   189
50 - 59                      90                    59.5                   279
60 - 69                      52                    69.5                   331
70 - 79                      40                    79.5                   371
80 - 89                      11                    89.5                   382
Required
i. Determine the semi interquartile range for the above data
ii. Determine the minimum value for the top ten per cent (10%)
iii. Determine the maximum value for the lower 40% of the retirees
Solution
i. The lower quartile (Q1) lies at position
(N + 1)/4 = (382 + 1)/4 = 95.75
Value of Q1 = 29.5 + ((95.75 - 50)/69) × 10
= 29.5 + 6.63
= 36.13
The upper quartile (Q3) lies at position
3(N + 1)/4 = 3(382 + 1)/4 = 287.25
Value of Q3 = 59.5 + ((287.25 - 279)/52) × 10
= 59.5 + 1.59
= 61.09
The semi interquartile range = (Q3 - Q1)/2
= (61.09 - 36.13)/2
= 12.48
i.e. about 12,480
ii. The minimum value for the top 10% is the maximum value for the lower 90% of the retirees.
The position corresponding to the lower 90%
= (90/100)(N + 1) = 0.9 (382 + 1)
= 0.9 × 383
= 344.7
The benefit corresponding to the minimum value for the top 10%
= 69.5 + ((344.7 - 331)/40) × 10
= 72.925
i.e. 72,925
iii. The lower 40% corresponds to position
(40/100)(382 + 1)
= 153.20
Retirement benefit corresponding to this position
= 39.5 + ((153.2 - 119)/70) × 10
= 39.5 + 4.89
= 44.39
i.e. 44,390
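The grouped-data interpolation used in this example can be sketched in Python. The helper `grouped_quantile` is hypothetical; it assumes equal class widths and takes the upper class boundaries and frequencies from the table above.

```python
def grouped_quantile(upper_boundaries, freqs, p, width):
    """Value at fraction p for grouped data, using position p*(N + 1)
    and linear interpolation inside the class that contains the position."""
    total = sum(freqs)
    pos = p * (total + 1)
    cum = 0                                  # cumulative frequency before the class
    for upper, f in zip(upper_boundaries, freqs):
        if cum + f >= pos:                   # position falls in this class
            lower = upper - width            # lower class boundary
            return lower + (pos - cum) / f * width
        cum += f
    return upper_boundaries[-1]

upper = [29.5, 39.5, 49.5, 59.5, 69.5, 79.5, 89.5]
freqs = [50, 69, 70, 90, 52, 40, 11]

q1 = grouped_quantile(upper, freqs, 0.25, 10)
q3 = grouped_quantile(upper, freqs, 0.75, 10)
semi_iqr = (q3 - q1) / 2
p90 = grouped_quantile(upper, freqs, 0.90, 10)   # minimum for the top 10%
p40 = grouped_quantile(upper, freqs, 0.40, 10)   # maximum for the lower 40%
```

The results are in thousands, so they should be multiplied by 1000 to express the benefits in full.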
e. The 10th - 90th percentile range
This is a measure of dispersion which uses percentiles. A percentile is a value which separates one division from the next when the data are divided into 100 equal divisions.
This measure of dispersion is very important when calculating the coefficient of skewness.
Variance: The variance is the square of the standard deviation. For population data:
\( \sigma^2 = \frac{\sum (x - \mu)^2}{N} \)
where x: the values in the data
N: size of the population
\( \mu \): the mean
Standard Deviation: It is the square root of the average of the squared deviations of the values of a variable from the mean. The deviation of each value from the mean is squared, and the sum of all the squared deviations is divided by the total frequency (N) for population data, or by the sample size less 1 (n - 1) if sample data were used; then the square root is taken.
Formula for calculation:
Population data
\( \sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum (x - \mu)^2}{N}} \)
Example 1
Assume the data on the number of years employees have remained with their employer were collected from a sample:
Variance \( S^2 = \frac{\sum (x - \bar{x})^2}{n - 1} \)
Mean = \( \bar{x} \) = 12 (obtained earlier)
S² = [(11-12)² + (9-12)² + (13-12)² + (12-12)² + (8-12)² + (9-12)² + (24-12)² + (10-12)²] / (8-1)
S² = [(-1)² + (-3)² + (1)² + (0)² + (-4)² + (-3)² + (12)² + (-2)²] / 7
S² = (1 + 9 + 1 + 0 + 16 + 9 + 144 + 4) / 7 = 184 / 7 = 26.286
The average squared deviation of the values from the mean is 26.286.
Standard Deviation: the square root of the variance
S = √26.286 = 5.127
On average each value deviates from the mean by about 5.127.
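As a check, the sample variance and standard deviation can be recomputed directly from the eight values with Python's standard library. This is a minimal sketch; `statistics.variance` and `statistics.stdev` use the sample divisor n - 1.

```python
from statistics import mean, variance, stdev

years = [11, 9, 13, 12, 8, 9, 24, 10]

x_bar = mean(years)                               # sample mean
sum_sq = sum((x - x_bar) ** 2 for x in years)     # sum of squared deviations
s2 = variance(years)                              # divides by n - 1
s = stdev(years)                                  # square root of the variance

print(sum_sq, round(s2, 3), round(s, 3))
```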
In general, the lower the standard deviation of a data set, the closer together the values lie around the mean; a higher standard deviation indicates that the values are relatively spread out or scattered.
If the standard deviation of the scores obtained by students in a BCM 307 class is higher than that of the scores in a different class, it means the abilities of the BCM 307 students are more spread out: some perform very poorly while others perform well.
If the data set is large, the working can be done from a frequency distribution table.
Example 2
A sample comprises the following observations: 14, 18, 17, 16, 25, 31
Determine the standard deviation of this sample.
Mean: \( \bar{x} \) = 121 / 6 = 20.17
x       x - x̄      (x - x̄)²
14      -6.17      38.03
18      -2.17      4.69
17      -3.17      10.03
16      -4.17      17.36
25      4.83       23.36
31      10.83      117.36
121                210.83
Standard deviation = \( \sqrt{\frac{\sum (x - \bar{x})^2}{n}} \) = √(210.83 / 6)
= 5.93
Because the data are from a sample, the divisor n - 1 = 5 described above gives √(210.83 / 5) = 6.49.
Example 3
The data represents the number of bedrooms in homes owned by 30 families
3 5 2 3 2 3 1 2 1 3
4 1 4 3 1 3 3 2 2 3
3 4 3 1 2 4 2 2 5 3
Required:
a) Identify the mode
b) Calculate the
i) mean
ii) variance and standard deviation
Solution
Construct a frequency distribution table.
Number of rooms (x)   Frequency (f)   fx    x - x̄    f(x - x̄)²
1                     5               5     -1.67    13.9445
2                     8               16    -0.67    3.5912
3                     11              33    0.33     1.1979
4                     4               16    1.33     7.0756
5                     2               10    2.33     10.8578
Total                 30              80             36.667
Σf = 30, Σfx = 80, Σf(x - x̄)² = 36.667
a) The mode is 3 bedrooms
b) i) \( \bar{x} \) = Σfx / Σf = 80 / 30 = 2.67
ii) Variance = S² = Σf(x - x̄)² / (n - 1) = 36.667 / (30 - 1) = 1.264
Standard deviation = S = √1.264 = 1.124
On average, the number of bedrooms in a home differs from the mean by about 1.124.
Coefficient of Variation
The variances or standard deviations of different data sets are not easy to compare directly. The coefficient of variation makes it possible to compare different data sets, based on a measure of central tendency (normally the mean) and a measure of dispersion (normally the standard deviation).
Coefficient of variation: CV = standard deviation / mean
In the above example: CV = 1.124 / 2.67 = 0.421
CV can also be written as a percentage: CV = (1.124 / 2.67) × 100 = 0.421 × 100 = 42.1%
The lower the CV the less the spread of the values from the mean i.e. the values are
closer together.
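The bedroom example can be reproduced from its frequency table with a short Python sketch. The dictionary `freq` (value: frequency) is an assumed representation of that table. Because the code carries full precision rather than the rounded 2.67 and 1.124, the coefficient of variation comes out marginally above 0.421.

```python
from math import sqrt

# Number of bedrooms (x) mapped to frequency (f)
freq = {1: 5, 2: 8, 3: 11, 4: 4, 5: 2}

n = sum(freq.values())
mode = max(freq, key=freq.get)                        # value with the highest frequency
x_bar = sum(x * f for x, f in freq.items()) / n       # mean = sum(fx) / sum(f)
ss = sum(f * (x - x_bar) ** 2 for x, f in freq.items())
s2 = ss / (n - 1)                                     # sample variance
s = sqrt(s2)                                          # sample standard deviation
cv = s / x_bar                                        # coefficient of variation

print(mode, round(x_bar, 2), round(s2, 3), round(s, 3), round(cv * 100, 1))
```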
Measures of Central Tendency and Measures of Dispersion for Continuous Data
Example 1
The table gives the frequency distribution of the daily commuting time from home to work for all employees of a company.
Time (minutes)        Number of employees
0 to less than 10     4
10 to less than 20    9
20 to less than 30    6
30 to less than 40    4
40 to less than 50    2
Total                 25
Solution:
The measures are computed as for discrete data, with the value of x taken as the mid-point of each class:
x = (sum of the class boundaries) / 2, e.g. (0 + 10) / 2 = 5 is the mid-point of the 1st class.
Time                  Mid-point (x)   Frequency (f)   fx     f(x - x̄)²
0 to less than 10     5               4               20     1075.84
10 to less than 20    15              9               135    368.64
20 to less than 30    25              6               150    77.76
30 to less than 40    35              4               140    739.84
40 to less than 50    45              2               90     1113.92
Total                                 25              535    3376.00
The symbol μ can be used for the mean instead of x̄, given that the data are from a population. However, whether the column is headed f(x - x̄)² or f(x - μ)² makes no difference to the value.
Mean = μ = Σfx / N = 535 / 25
= 21.4
Standard deviation = σ = √(Σf(x - μ)² / N) = √(3376 / 25)
= √135.04 = 11.621
NB: For continuous data the mode is replaced by the modal class, which is simply the class with the highest frequency. For the above example the modal class is 10 to less than 20.
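The grouped computation can be checked with a minimal Python sketch. The list `classes` holds (lower boundary, upper boundary, frequency) for each class of the commuting-time table; the layout and names are assumptions made for illustration.

```python
from math import sqrt

# (lower boundary, upper boundary, frequency) for each class
classes = [(0, 10, 4), (10, 20, 9), (20, 30, 6), (30, 40, 4), (40, 50, 2)]

N = sum(f for _, _, f in classes)
mids = [((lo + hi) / 2, f) for lo, hi, f in classes]   # class mid-points
mu = sum(m * f for m, f in mids) / N                   # population mean
ss = sum(f * (m - mu) ** 2 for m, f in mids)           # sum of f(x - mu)^2
sigma = sqrt(ss / N)                                   # population standard deviation

# Modal class: the class with the highest frequency
modal_class = max(classes, key=lambda c: c[2])[:2]

print(mu, round(ss, 2), round(sigma, 3), modal_class)
```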
Practice Questions:
1. The following data represent the age of a sample of 10 employees of a given
company
39 29 43 52 39 44 40 31 44 35
Required:
i) Identify the mode and the median
ii) Compute the mean, the standard deviation and the coefficient of variation
2. The data give the frequency distribution of the number of orders received each day during the past 50 days at the office of a mail order company.
Number of orders   Number of days
10 - 12            4
13 - 15            12
16 - 18            20
19 - 21            14
a) Identify the modal class
b) Calculate
i) mean
ii) variance and standard deviation
iii) coefficient of variation
3. The price of the ordinary 25p shares of Manco PLC quoted on the stock exchange at the close of business on successive Fridays is tabulated below
126 120 122 105 129 119 131 138
125 127 113 112 130 122 134 136
128 126 117 114 120 123 127 140
124 127 114 111 116 131 128 137
127 122 106 121 116 135 142 130
Required
a) Group the above data into eight classes.
b) Calculate the cumulative frequency, the median value, the quartile values and the semi-interquartile range
c) Calculate the mean and standard deviation of your frequency distribution.
d) Compute:
i) The median and mean
ii) The semi-interquartile range and the standard deviation
5. The managers of an import agency are investigating the length of time that customers take to pay their invoices, the normal terms for which are 30 days net. They have checked the payment records of 100 customers chosen at random and have compiled the following table:
Payment in        Number of customers
5 to 9 days       4
10 to 14 days     10
15 to 19 days     17
20 to 24 days     20
25 to 29 days     22
30 to 34 days     16
35 to 39 days     8
40 to 44 days     3
Required:
a) Calculate the arithmetic mean.
b) Calculate the standard deviation
c) Construct a histogram and insert the modal value.
d) Estimate the probability that an unpaid invoice chosen at random will be between
30 and 39 days old.
4. PROBABILITY DISTRIBUTIONS
A probability distribution can be either discrete or continuous. Its shape can be uniform, normal or skewed.
For numerical data, whatever the distribution (discrete, continuous or probability), the mean and standard deviation can be used to find the proportion or percentage of the total observations that fall within a given interval about the mean.
The pattern of the data values over the entire range of values gives the distribution a certain shape. The shape can be identified from a bar chart for discrete data or a histogram for continuous data. The shape of the distribution can be
i) Uniform or rectangular
ii) Bell-shaped, that is, symmetrical
iii) Skewed
(Figures: i) uniform or rectangular distribution; ii) symmetrical, bell-shaped distribution.)
For a symmetrical continuous distribution the measures of central tendency (mode, median and mean) are equal and lie at the middle of the shape. Such a distribution is called the normal distribution or Gaussian distribution.
a) DISCRETE DATA
A probability distribution for a discrete random variable is a mutually exclusive listing of all the possible numerical outcomes together with the probability of occurrence of each outcome.
Example 1
The following table contains the probability distribution for the number of traffic
accidents daily in a small city.
Number of accidents (x)   Probability P(x)
0                         0.10
1                         0.20
2                         0.45
3                         0.15
4                         0.05
5                         0.05
Required: Compute:
a) The expected number of accidents
b) The variance and standard deviation
c) The coefficient of variation
Solution
Probability is a term that reflects uncertainty. It is used to make predictions about events by assigning a probability to each event happening.
The mean or average of such a distribution is referred to as the expected value E(x):
\( E(x) = \sum x_i P(x_i) \)
where \( x_i \) is a value of the variable and \( P(x_i) \) is the probability that event \( x_i \) will occur.
The variance: \( \sigma^2 = \sum (x_i - E(x))^2 P(x_i) \)
Number of accidents (xi)   P(xi)   xi P(xi)   (xi - E(x))² P(xi)
0                          0.10    0.00       0.40
1                          0.20    0.20       0.20
2                          0.45    0.90       0.00
3                          0.15    0.45       0.15
4                          0.05    0.20       0.20
5                          0.05    0.25       0.45
Total                      1.00    2.00       1.40
ΣP(xi) = 1.00, Σxi P(xi) = 2.00, Σ(xi - E(x))² P(xi) = 1.4
i) Expected value E(x) = 2.00
ii) Variance = σ² = Σ(xi - E(x))² P(xi) = 1.4
iii) Standard deviation = σ = √1.4 = 1.1832
iv) Coefficient of variation CV = 1.1832 / 2
= 0.592 (59.2%)
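The expected value, variance and coefficient of variation of a discrete distribution can be computed in a few lines of Python. This is an illustrative sketch; the dictionary `dist` (outcome: probability) is an assumed representation of the accident table.

```python
from math import sqrt

# Number of accidents (x) mapped to probability P(x)
dist = {0: 0.10, 1: 0.20, 2: 0.45, 3: 0.15, 4: 0.05, 5: 0.05}

total_p = sum(dist.values())                                   # must equal 1
expected = sum(x * p for x, p in dist.items())                 # E(x)
var = sum((x - expected) ** 2 * p for x, p in dist.items())    # sigma^2
sd = sqrt(var)                                                 # sigma
cv = sd / expected                                             # coefficient of variation

print(round(expected, 2), round(var, 2), round(sd, 4), round(cv, 3))
```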
Example 2
Given the following probability distributions A and B
Distribution A      Distribution B
X    P(x)           X    P(x)
0    0.25           0    0.15
1    0.25           1    0.25
2    0.25           2    0.45
3    0.25           3    0.15
a) Compute:
i) The expected value for each distribution
ii) The standard deviation for each distribution
iii) Compare the results of distribution A and B.
Distribution A                            Distribution B
X   P(X)   X P(X)   [X-E(X)]² P(X)        X   P(X)   X P(X)   [X-E(X)]² P(X)
0   0.25   0.00     0.5625                0   0.15   0.00     0.384
1   0.25   0.25     0.0625                1   0.25   0.25     0.090
2   0.25   0.50     0.0625                2   0.45   0.90     0.072
3   0.25   0.75     0.5625                3   0.15   0.45     0.294
Distribution A: E(x) = 1.5, σ² = 1.25, σ = √1.25 = 1.118
Distribution B: E(x) = 1.6, σ² = 0.84, σ = √0.84 = 0.917
Distribution A is uniform and symmetric. Distribution B has one mode, at X = 2, and the smaller variance, so its values are less spread out about the mean.
b) CONTINUOUS DISTRIBUTION: NORMAL DISTRIBUTION
A frequency distribution for continuous data can be converted to a probability distribution by calculating the relative frequency for each class. This column is taken as the equivalent of the probability for each class.
Like the total of the relative frequencies, the total probability is also equal to 1, i.e. \( \sum P(x) = 1 \).
The normal distribution is the most common continuous distribution used in statistics, for the following main reasons.
Numerous continuous variables common in business and other natural occurrences have distributions that closely resemble the normal distribution.
The normal distribution can be used to approximate various discrete probability distributions.
It provides the basis for classical statistical inference.
The normal distribution is represented by the classical bell shape. Its probability density function is denoted by the symbol f(x).
The mean (μ) is in the middle of the symmetrical distribution. The standard deviation (σ) measures the distance from the mean to a point on the x (horizontal) axis. In order to work with a set of standard values it is necessary to convert or transform any normal distribution to a standard normal distribution, which has a mean of 0 and a standard deviation of 1.
The total area under the curve is 1, and each half of the curve has area 0.5. Any value of x in a distribution can be converted to a value called the z value or z-score, by the formula:
\( Z = \frac{x - \mu}{\sigma} \)
Where x - the variable
μ - mean
σ - standard deviation
Probabilities for Z values are obtained from normal probability distribution tables. Each Z value corresponds to an area under the normal curve (the shaded area identified on the curve).
Example 1
The heights of adult males are normally distributed with mean 170 cm and standard
deviation 10cm.
Find the probability that the height of an adult male is:
a) Between 180cm and 190cm
b) Taller than 190cm
c) Shorter than 180cm
d) Shorter than 165cm
Solution
The heights are normally distributed with μ = 170 and σ = 10.
x = 170 is the mean, at the middle of the curve.
Use the formula to find Z for each value of x, then find the area (probability) from the normal tables.
a) The height of an adult is between 180cm and 190cm
Z = (x - μ)/σ = (180 - 170)/10 = 1 (180 is 1 standard deviation above the mean)
Z = (190 - 170)/10 = 2 (190 is 2 standard deviations above the mean)
Find the areas in the normal tables = P(z)
Z    Area under the curve between 0 and Z
1    0.3413
2    0.4772
P(180 ≤ x ≤ 190) = P(1 ≤ z ≤ 2) = 0.4772 - 0.3413 = 0.1359
b) Taller than 190cm: P(x > 190)
Z = (190 - 170)/10 = 2
Z = 2 and P(z) or area = 0.4772
P(z > 2) = 0.5 - 0.4772 = 0.0228
c) Shorter than 180cm: P(x < 180)
Z = (180 - 170)/10 = 1
P(x < 180) = P(z < 1) = 0.5 + 0.3413 = 0.8413
d) Shorter than 165cm: P(x < 165)
Z = (165 - 170)/10 = -0.5
P(x < 165) = P(z < -0.5) = 0.5 - 0.1915 = 0.3085
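Example 2
X is normally distributed; the z-scores in the solution correspond to a mean of 3 and a standard deviation of 3. Find (i) P(2 < X < 5), (ii) P(X > 0) and (iii) P(X > 9).
Solution
(i) P(2 < X < 5) = P(-0.33 < Z < 0.67)
= 0.1293 + 0.2486 = 0.3779
(ii) P(X > 0) = P(Z > -1) = P(Z < 1)
= 0.8413
(iii) P(X > 9) = P(Z > 2.0)
= 0.5 - 0.4772 = 0.0228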
Example 3
A sample of students had a mean age of 35 years with a standard deviation of 5 years. A student was randomly picked from a group of 200 students. Find the probability that the age of the student turned out to be as follows:
i. Lying between 35 and 40
ii. Lying between 30 and 40
iii. Lying between 25 and 30
iv. Lying beyond 45 yrs
v. Lying beyond 30 yrs
vi. Lying below 25 years
Solution
(i) The standardized value for 35 years:
Z = (x - μ)/σ = (35 - 35)/5 = 0
The standardized value for 40 years:
Z = (x - μ)/σ = (40 - 35)/5 = 1
The area between Z = 0 and Z = 1 is 0.3413 (these values are checked from the normal tables; see appendix).
The values from the standard normal curve tables:
When z = 0, p = 0
and when z = 1, p = 0.3413
The area under the curve between z = 0 and z = 1
= 0.3413 - 0 = 0.3413
The probability of the age lying between 35 and 40 yrs is 0.3413
(ii) 30 and 40 years
Z = (30 - 35)/5 = -5/5 = -1
Z = (40 - 35)/5 = 1
The area between Z = -1 and Z = 1 is
= 0.3413 (lying on the positive side of zero) + 0.3413 (lying on the negative side of zero)
P = 0.6826
The probability of the age lying between 30 and 40 yrs is 0.6826
(iii) 25 and 30 years
Z = (25 - 35)/5 = -10/5 = -2
Z = (30 - 35)/5 = -1
The area between Z = -2 and Z = -1:
Probability area corresponding to Z = -2
= 0.4772 (the z value to check from the tables is 2)
Probability area corresponding to Z = -1
= 0.3413 (the z value for this case is 1)
The probability that the age lies between 25 and 30 yrs
= 0.4772 - 0.3413 (the area under the curve)
P = 0.1359
(iv) P(beyond 45 years) is determined as follows: P(x > 45)
Z = (45 - 35)/5 = 10/5 = +2
The area corresponding to Z = 2 is 0.4772, which is the probability of an age between 35 and 45.
P(Age > 45 yrs) = 0.5000 - 0.4772
= 0.0228
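The table look-ups above can be cross-checked with `statistics.NormalDist` from Python's standard library (available from Python 3.8). This is a sketch under the example's assumptions of a mean of 35 and a standard deviation of 5; small differences from the printed tables, such as 0.6827 against 0.6826, come from table rounding.

```python
from statistics import NormalDist

age = NormalDist(mu=35, sigma=5)

z_40 = (40 - 35) / 5                        # z-score of 40 years
p_35_40 = age.cdf(40) - age.cdf(35)         # (i)   between 35 and 40
p_30_40 = age.cdf(40) - age.cdf(30)         # (ii)  between 30 and 40
p_25_30 = age.cdf(30) - age.cdf(25)         # (iii) between 25 and 30
p_over_45 = 1 - age.cdf(45)                 # (iv)  beyond 45

print(round(p_35_40, 4), round(p_30_40, 4), round(p_25_30, 4), round(p_over_45, 4))
```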
Practice Questions
1. Identify the following as discrete or continuous random variables.
(i) The market value of a publicly listed security on a given day
(ii) The number of printing errors observed in an article in a weekly news magazine
(iii) The time to assemble a product (e.g. a chair)
(iv) The number of emergency cases arriving at a city hospital
(v) The number of sophomores in a randomly selected Math. class at a university
(vi) The rate of interest paid by your local bank on a given day
2. A random variable X has the following probability distribution:
X      1     2     3     4     5
P(x)   .05   .10   .15   .45   .25
(i) Verify that X has a valid probability distribution.
(ii) Find the probability that X is greater than 3, i.e. P(X > 3).
(iii) Find the probability that X is greater than or equal to 3, i.e. P(X ≥ 3).
(iv) Find the probability that X is less than or equal to 2, i.e. P(X ≤ 2).
(v) Find the probability that X is an odd number.
(vi) Graph the probability distribution for X.
3. Calculate the area under the standard normal curve between the following values.
(i) Z = 0 and z = 1.6 (i.e. P(0 ≤ Z ≤ 1.6))
(ii) Z = 0 and z = -1.6 (i.e. P(-1.6 ≤ Z ≤ 0))
(iii) Z = .86 and z = 1.75 (i.e. P(.86 ≤ Z ≤ 1.75))
(iv) Z = -1.75 and z = -.86 (i.e. P(-1.75 ≤ Z ≤ -.86))
(v) Z = 1.26 and z = 1.86 (i.e. P(1.26 ≤ Z ≤ 1.86))
(vi) Z = -1.0 and z = 1.0 (i.e. P(-1.0 ≤ Z ≤ 1.0))
(vii) Z = -2.0 and z = 2.0 (i.e. P(-2.0 ≤ Z ≤ 2.0))
(viii) Z = -3.0 and z = 3.0 (i.e. P(-3.0 ≤ Z ≤ 3.0))
4. Let Z be a standard normal random variable. Find z0 such that
(i) P (Z z0) = 0.05
(ii) P (Z z0) = 0.99
(iii) P (Z z0) = 0.0708
(iv) P (Z z0) = 0.0708
(v) P (-z0 ≤ Z ≤ z0) = 0.68
(vi) P (-z0 ≤ Z ≤ z0) = 0.95
5. A normally distributed random variable X possesses a mean of μ = 10 and a standard deviation of σ = 5.
Find the following probabilities.
(i) X falls between 10 and 12 (i.e. P(10 ≤ X ≤ 12)).
(ii) X falls between 6 and 14 (i.e. P(6 ≤ X ≤ 14)).
(iii) X is less than 12 (i.e. P(X < 12)).
(iv) X exceeds 10 (i.e. P(X > 10)).
6. CONFIDENCE INTERVAL
The interval estimate or confidence interval consists of a range (bounded by an upper confidence limit and a lower confidence limit) within which we are confident that a population parameter lies, and we assign a probability that this interval contains the true population value.
The confidence interval is the interval between the confidence limits. The higher the confidence level, the wider the confidence interval.
For example
A normal distribution has the following characteristics
i. Mean ± 1.96 standard deviations includes 95% of the population
ii. Mean ± 2.58 standard deviations includes 99% of the population
Large Samples
The Central Limit Theorem: The theorem states that if we select a large number of simple random samples of size n from any population and determine the mean of each sample, the distribution of these sample means will tend to be described by the normal probability distribution with mean μ and variance σ²/n. This is true even if the population itself is not normally distributed. In other words, the sampling distribution of sample means approaches a normal distribution irrespective of the distribution of the population from which the samples are taken, and the approximation to the normal distribution becomes increasingly close as the sample size increases.
Large samples are those with a sample size greater than 30 (i.e. n > 30). Such samples can use levels of confidence based on the normal distribution.
Estimation of population mean
Here we assume that if we take a large sample from a population then the mean of the
population is very close to the mean of the sample
Steps to follow to estimate the population mean:
i. Take a random sample of n items, where n > 30; n is the sample size
ii. Compute the sample mean (\( \bar{x} \)) and standard deviation (s)
iii. Compute the standard error of the mean by using the following formula
\( S_{\bar{x}} = \frac{s}{\sqrt{n}} \)
Where \( S_{\bar{x}} \) = standard error of the mean
s = standard deviation of the sample
n = sample size
iv. Choose a confidence level, e.g. 95% or 99%
v. Estimate the population mean as follows:
Population mean \( \mu = \bar{x} \pm z \, S_{\bar{x}} \)
Here z is the appropriate number for the chosen confidence level, e.g. 1.96 at the 95% confidence level. It is obtained from the normal tables as the z value corresponding to the confidence level expressed as a probability.
Example 1
The quality department of a wire manufacturing company periodically selects a sample
of wire specimens in order to test for breaking strength. Past experience has shown
that the breaking strengths of a certain type of wire are normally distributed with
standard deviation of 200 kg. A random sample of 64 specimens gave a mean of 6200 kg. Estimate the population mean at the 95% level of confidence.
Solution
Population mean = \( \bar{x} \pm 1.96 \, S_{\bar{x}} \)
Note that the sample size is already n > 30 and that s and \( \bar{x} \) are given, so steps i), ii) and iv) are already provided.
Here: \( \bar{x} \) = 6200 kg
\( S_{\bar{x}} = \frac{s}{\sqrt{n}} = \frac{200}{\sqrt{64}} = 25 \)
Population mean = 6200 ± 1.96(25)
= 6200 ± 49
= 6151 to 6249
At the 95% level of confidence, the population mean lies between 6151 kg and 6249 kg.
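The large-sample steps can be sketched in Python as follows. The function name `mean_confidence_interval` is a hypothetical helper, and the z value is passed in as read from the normal tables rather than computed.

```python
from math import sqrt

def mean_confidence_interval(x_bar, s, n, z):
    """Large-sample interval: x_bar +/- z * s / sqrt(n)."""
    standard_error = s / sqrt(n)
    margin = z * standard_error
    return x_bar - margin, x_bar + margin

# Wire breaking strength: mean 6200 kg, standard deviation 200 kg, n = 64, 95% level
low, high = mean_confidence_interval(6200, 200, 64, 1.96)
print(round(low, 2), round(high, 2))
```

Passing z = 2.58 instead would give the wider 99% interval for the same sample.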
Estimation of population proportions
This type of estimation applies at the times when information cannot be given as a
mean or as a measure but only as a fraction or percentage
Sampling theory stipulates that if repeated large random samples are taken from a population, the sample proportion p will be normally distributed with mean equal to the population proportion and standard error equal to
\( S_p = \sqrt{\frac{pq}{n}} \) = standard error for sampling of population proportions
where n is the sample size and q = 1 - p.
The procedure for estimating a proportion is similar to that for estimating a mean; only the formula for calculating the standard error is different.
Example 1
In a sample of 800 candidates, 560 were male. Estimate the population proportion at
95% confidence level.
Solution
Here
Sample proportion p = 560/800 = 0.70
q = 1 - p = 1 - 0.70 = 0.30
n = 800
\( S_p = \sqrt{\frac{pq}{n}} = \sqrt{\frac{(0.70)(0.30)}{800}} = 0.016 \)
Population proportion
= p ± 1.96 \( S_p \), where 1.96 = z
= 0.70 ± 1.96 (0.016)
= 0.70 ± 0.03
= 0.67 to 0.73
= between 67% and 73%
Example 2
A sample of 600 accounts was taken to test the accuracy of posting and balancing of accounts, and 45 mistakes were found. Estimate the population proportion at the 99% level of confidence.
Solution
Here
n = 600; p = 45/600 = 0.075
q = 1 - 0.075 = 0.925
\( S_p = \sqrt{\frac{pq}{n}} = \sqrt{\frac{(0.075)(0.925)}{600}} \)
= 0.011
Population proportion
= p ± 2.58 (\( S_p \))
= 0.075 ± 2.58 (0.011)
= 0.075 ± 0.028
= 0.047 to 0.103
= between 4.7% and 10.3%
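Both proportion examples follow the same pattern, which can be sketched in Python. The function name `proportion_confidence_interval` is hypothetical, and the z values are taken from the normal tables as in the text.

```python
from math import sqrt

def proportion_confidence_interval(successes, n, z):
    """Interval p +/- z * sqrt(p*q/n), with p = successes/n and q = 1 - p."""
    p = successes / n
    standard_error = sqrt(p * (1 - p) / n)
    margin = z * standard_error
    return p - margin, p + margin

# Example 1: 560 males in a sample of 800, 95% level
low_1, high_1 = proportion_confidence_interval(560, 800, 1.96)
# Example 2: 45 mistakes in a sample of 600 accounts, 99% level
low_2, high_2 = proportion_confidence_interval(45, 600, 2.58)

print(round(low_1, 2), round(high_1, 2), round(low_2, 3), round(high_2, 3))
```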
Small Samples
Estimation of population mean
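If the sample size is small (n < 30), the population mean is estimated using the t distribution in place of the normal distribution:
\( \mu = \bar{x} \pm t \, S_{\bar{x}} \)
where \( S_{\bar{x}} = \frac{s}{\sqrt{n}} \)
s = standard deviation of the sample = \( \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} \) for small samples
n = sample size
v = n - 1 degrees of freedom.
The value of t is obtained from Student's t distribution tables for the required confidence level.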
Example
A random sample of 12 items is taken and is found to have a mean weight of 50 grams
and a standard deviation of 9 grams
What is the mean weight of the population
a) with 95% confidence
b) with 99% confidence
Solution
\( \bar{x} \) = 50; s = 9; v = n - 1 = 12 - 1 = 11; \( S_{\bar{x}} = \frac{s}{\sqrt{n}} = \frac{9}{\sqrt{12}} \)
\( \mu = \bar{x} \pm t \, S_{\bar{x}} \)
At 95% confidence level
μ = 50 ± 2.201 (9/√12)
= 50 ± 5.72 grams
Therefore we can state with 95% confidence that the population mean is between 44.28 and 55.72 grams
At 99% confidence level
μ = 50 ± 3.106 (9/√12)
= 50 ± 8.07 grams
Therefore we can state with 99% confidence that the population mean is between 41.93 and 58.07 grams
Note: To use the t distribution tables it is important to find the degrees of freedom (v = n - 1). In the example above v = 12 - 1 = 11.
From the tables we find that at the 95% confidence level, against 11 and under 0.05, the value of t = 2.201
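For small samples the only change is that the multiplier comes from the t tables. The sketch below takes the critical values for 11 degrees of freedom as constants read from a t table (2.201 for 95% and 3.106 for 99%), since Python's standard library has no t distribution; the names `T_CRITICAL_DF11` and `t_interval` are assumptions for illustration.

```python
from math import sqrt

# Two-tailed critical values for v = 11 degrees of freedom, read from a t table
T_CRITICAL_DF11 = {0.95: 2.201, 0.99: 3.106}

def t_interval(x_bar, s, n, t):
    """Small-sample interval: x_bar +/- t * s / sqrt(n)."""
    margin = t * s / sqrt(n)
    return x_bar - margin, x_bar + margin

low_95, high_95 = t_interval(50, 9, 12, T_CRITICAL_DF11[0.95])
low_99, high_99 = t_interval(50, 9, 12, T_CRITICAL_DF11[0.99])

print(round(low_95, 2), round(high_95, 2), round(low_99, 2), round(high_99, 2))
```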
7. SIMPLE LINEAR REGRESSION EQUATION
A regression model is a mathematical equation that describes the relationship between two or more variables. A simple regression model includes only two variables:
The independent variable: the variable used to explain the variation in the dependent variable, i.e. it is used to make predictions about the dependent variable.
The dependent variable: the variable being explained.
A linear regression model gives the equation of a linear relationship between two variables Y (dependent) and X (independent), as shown below:
Y = a + bx
The constant a is the y-intercept: the value of y where the line cuts the y-axis. The constant b is the slope or gradient of the line.
The linear relationship between x and y is defined once the values of the constants a and b are determined. The values of a and b can be determined in the following ways.
Scatter plots:
Scatter plots are used to examine the relationship between two variables. One variable takes the horizontal axis (x) while the other takes the vertical axis (y). The pattern of the points can show a relationship that is positive or negative.
A positive relationship that is linear or close to linear indicates that the variables move together in a linear manner. The scatter will show points lying in a band along a line. When one variable increases the other also increases, and when one decreases the other also decreases.
In a negative relationship, an increase in one variable is accompanied by a decrease in the other. A linear relationship shows points scattered so as to lie along a line.
Relationships between variables can also be non-linear. In such cases the points will concentrate in a region that reflects a curve. The relationship between two variables may therefore assume many possible shapes, which can be classified as linear or as non-linear relationships described by more complicated mathematical functions. The simplest relationship is a straight-line or linear relationship.
Check the scatter plots on page 606 of the main text book.
A line of best fit can be drawn by eye through the points of a scatter plot so that they lie approximately along a straight line.
This requires good skill in identifying the line that best fits the points; the better the fit, the more accurate the line obtained. However, there will always be an error arising from random causes, called the random error term. This is the difference between the actual value of y obtained from the survey and the value of y estimated from the line. For every value of x, the line gives an estimated value of y that may differ from the actual one. The error for each value of y can be written as
e = y - ŷ, where y is the actual value and ŷ the estimated value.
Some of the estimated values of y are less than the actual values while others are more than the actual values.
For the line of best fit, the sum of the negative differences and the positive differences is zero.
Example:
The following data represent a sample of seven households, showing their incomes and food expenditures for a given month.
Income (hundreds of dollars)   Food expenditure (hundreds of dollars)
35                             9
49                             15
21                             7
39                             11
15                             5
28                             8
25                             9
Required:
Construct a scatter diagram, with income on the x-axis and food expenditure on the y-axis
Draw the prediction line
Identify the y-intercept (a) and the slope (gradient) (b).
Write the simple linear regression equation.
Scatter diagram
(Figure: scatter diagram of food expenditure against income, with candidate lines L1 and L2.)
Depending on one's skill, different lines such as L1 and L2 can be drawn. Using L2:
Y-intercept, i.e. constant a = 1.2
The gradient, i.e. coefficient of x: b = (12 - 6)/(46 - 22) = 0.25
The linear equation is generally written as
Y = a + bx
= 1.2 + 0.25x.
The line gives an estimate of the value of y for different values of x. It can be used to predict values of y given x.
However, since the line is an estimate, there is a difference between the observed or actual value of y and the value obtained from the prediction line. This difference is an error called the random error, also called the residual. It measures the surplus (positive or negative) difference. The random error obtained from a population is denoted by ε, while that of a sample is denoted by e. In the above example:
e = actual food expenditure - predicted food expenditure = y - ŷ.
If the predicted line is the line of best fit, the sum of the positive errors and the negative errors is equal to zero.
Drawing a line on a scatter diagram by eye may not give the line of best fit. The other option, which results in a sum of errors equal to zero,
\( \sum e = \sum (y - \hat{y}) = 0 \),
is the use of the least squares method.
The Least squares method
The least squares method minimizes the random error. It helps to determine the constants a and b for the equation Y = a + bx that results in the line of best fit. The method gives the values of a and b for the equation (model) such that the sum of squared errors (SSE) is a minimum:
\( SSE = \sum e^2 = \sum (y - \hat{y})^2 \)
The values of a and b which give the minimum SSE are called the least squares estimates, and the line is called the least squares line.
For the line \( \hat{y} = a + bx \)
\( b = \frac{SS_{xy}}{SS_{xx}} \) and \( a = \bar{y} - b\bar{x} \)
where \( SS_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n} \)
\( SS_{xx} = \sum x^2 - \frac{(\sum x)^2}{n} \)
To find the least squares regression line for the data on incomes and food expenditures of the seven households, we construct a table to guide the computation of a and b.
Income (x)   Food expenditure (y)   xy     x²
35           9                      315    1225
49           15                     735    2401
21           7                      147    441
39           11                     429    1521
15           5                      75     225
28           8                      224    784
25           9                      225    625
Σx = 212     Σy = 64                Σxy = 2150   Σx² = 7222
\( \bar{x} \) = Σx / n = 212 / 7 = 30.286
\( \bar{y} \) = Σy / n = 64 / 7 = 9.143
SSxy = Σxy - (Σx)(Σy)/n = 2150 - (212)(64)/7 = 211.714
SSxx = Σx² - (Σx)²/n = 7222 - (212)²/7 = 801.429
b = SSxy / SSxx = 211.714 / 801.429 = 0.2642
a = \( \bar{y} - b\bar{x} \) = 64/7 - 0.2642 (212/7)
a = 1.1414
ŷ = 1.1414 + 0.2642x
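The least squares computation can be reproduced with a minimal Python sketch using the SSxy and SSxx formulas. The variable names and the helper `predict` are assumptions for illustration. Because the code keeps b at full precision instead of rounding it to 0.2642 first, the intercept comes out near 1.142 rather than 1.1414.

```python
income = [35, 49, 21, 39, 15, 28, 25]   # x, hundreds of dollars
food = [9, 15, 7, 11, 5, 8, 9]          # y, hundreds of dollars

n = len(income)
sum_x, sum_y = sum(income), sum(food)
sum_xy = sum(x * y for x, y in zip(income, food))
sum_x2 = sum(x * x for x in income)

ss_xy = sum_xy - sum_x * sum_y / n      # SSxy
ss_xx = sum_x2 - sum_x ** 2 / n         # SSxx
b = ss_xy / ss_xx                       # slope
a = sum_y / n - b * sum_x / n           # intercept: y_bar - b * x_bar

def predict(x):
    """Predicted food expenditure (hundreds of dollars) for income x."""
    return a + b * x

# Residuals of a least squares line sum to zero (up to rounding error)
residual_sum = sum(y - predict(x) for x, y in zip(income, food))

print(round(ss_xy, 3), round(ss_xx, 3), round(b, 4), round(a, 4), round(predict(35), 2))
```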
Interpretation:
7/28/2019 Business Statistics May Module
61/72
The line gives coefficients of a and b to four decimal points making it more accurate to
be used for prediction. We can check the accuracy as follows:
A household with monthly income 35 (83500) dented by x=35 would be expected to
spend some money on food as follows:
y = 1.1414 + 0.2642x
x = 35
y = 1.1414 + 0.2642 (35) 810.3884
i.e. in hundred dollars (81038.84)
The acted value in the data gives y = 9. The value 810.3884 could be regarded as an
average i.e. for households having an average i.e. for households having an income of
83500 (x=35) they spend an average 810.3884 (81038.84) on food.
The constant a is the value of y when x=0. That is the amount of money a household
would spend on food per month if there was no income.
It means that food expenditure does not only depend on income but there could be
other factors.
For purposes of prediction using the linear regression line obtained, we can only predict
values of y for values of x that lie within the range in our data.
For example, the incomes lie between $1500 and $4900, i.e. between x = 15 and x = 49,
so we can only predict values of y for values of x between 15 and 49.
Prediction outside this range may not hold true (the prediction is not reliable). x = 0
is not within the range, so the prediction that households with no income spend
$114.14 per month cannot be supported by our equation.
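The rule about predicting only within the observed range of x can be built into a small helper. This is a sketch under stated assumptions: the function name and the choice to raise an error outside the range are illustrative, while the coefficients (1.1414 and 0.2642) and the range (15 to 49) come from the example above.

```python
def predict_food_expenditure(x, a=1.1414, b=0.2642, x_min=15, x_max=49):
    """Predict food expenditure (hundreds of dollars) from income x (hundreds of dollars).

    Hypothetical helper: it refuses to extrapolate beyond the incomes in the data.
    """
    if not x_min <= x <= x_max:
        raise ValueError(
            f"x = {x} lies outside the observed range {x_min} to {x_max}; "
            "the prediction would not be reliable"
        )
    return a + b * x


# Income of $3500 (x = 35) lies inside the range, so a prediction is returned.
print(round(predict_food_expenditure(35), 4))

# Income of zero lies outside the range, so the helper declines to predict.
try:
    predict_food_expenditure(0)
except ValueError as err:
    print("No prediction:", err)
```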
The constant b in the model gives the gradient, i.e. the change in y due to a change of
one unit in x.
Example: when x increases by one unit of income (one hundred dollars), y increases by
0.2642 (in hundreds of dollars) spent on food, i.e. by $26.42.
Example;
If the income of a household changes from x = 30 to x = 31,
\( \hat{y} \) will change as follows:
\[ \hat{y} = 1.1414 + 0.2642(30) = 9.0674 \]
\[ \hat{y} = 1.1414 + 0.2642(31) = 9.3316 \]
\[ 9.3316 - 9.0674 = 0.2642 \]
When b is positive, y increases as x increases and decreases as x decreases. There is
a positive linear relationship between the variables, i.e. the change in y and the
change in x are in the same direction; the variables move together.
If the value of b is negative, the change in y is in the opposite direction to the
change in x, i.e. there is a negative linear relationship between the variables.
When b is greater than zero (b > 0) the line slopes upwards from left to right. If b < 0
the line slopes downwards from left to right.
Assumptions of the regression model
1. The mean value of the error is zero. From the above example, among the households
with the same income some spend more on food and others less. The sum of the
differences (positive errors and negative errors) is equal to zero.
2. The errors associated with different observations are independent. That is, all
households decide independently how much to spend on food.
3. For any given x, the distribution of errors is normal, i.e. in the above example
the food expenditures of all households with the same income are normally
distributed.
4. The distribution of population errors for each x has the same (constant) standard
deviation. The assumption is that the spread of points around the regression line
is similar for all x values.
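The first assumption has a counterpart that can be checked on the sample: for a least squares line with an intercept, the positive and negative residuals cancel. The sketch below illustrates this with the household data from the earlier example; the variable names are illustrative. It does not test independence, normality or constant spread, which are usually judged from residual plots.

```python
# Household data from the income and food expenditure example (hundreds of dollars).
x = [35, 49, 21, 39, 15, 28, 25]
y = [9, 15, 7, 11, 5, 8, 9]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Least squares slope and intercept, written in deviation form.
b = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sum(
    (xi - mean_x) ** 2 for xi in x
)
a = mean_y - b * mean_x

# Residual e = observed y minus the value predicted by the line.
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]

for xi, ei in zip(x, residuals):
    print(f"x = {xi:2d}  residual = {ei:+.3f}")
print(f"sum of residuals = {sum(residuals):.10f}")
```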
Example 2
A random sample of eight auto drivers insured with a company and having similar auto
insurance policies was selected. The following table lists their driving experience (in
years) and the monthly auto insurance premium (in dollars) paid by them. Find the
linear regression equation using the least squares method (L.S.M.).

Driving experience (years)   Monthly auto insurance premium (dollars)
5                            64
2                            87
12                           50
9                            71
15                           44
6                            56
25                           42
16                           60
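\[ \sum x = 90, \quad \sum y = 474, \quad \sum xy = 4739, \quad \sum x^2 = 1396 \]
\[ \bar{x} = \frac{\sum x}{n} = \frac{90}{8} = 11.25, \qquad \bar{y} = \frac{\sum y}{n} = \frac{474}{8} = 59.25 \]
\[ SS_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n} = 4739 - \frac{(90)(474)}{8} = -593.5 \]
\[ SS_{xx} = \sum x^2 - \frac{(\sum x)^2}{n} = 1396 - \frac{(90)^2}{8} = 383.5 \]
\[ b = \frac{SS_{xy}}{SS_{xx}} = \frac{-593.5}{383.5} = -1.5476 \]
\[ a = \bar{y} - b\bar{x} = 59.25 - (-1.5476)(11.25) = 76.6605 \]
\[ \hat{y} = 76.6605 - 1.5476x \]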
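The intermediate sums for the insurance example are easy to get wrong by hand, so it can help to recompute them. The following sketch repeats the calculation step by step; the variable names are illustrative only.

```python
# Driving experience in years (x) and monthly premium in dollars (y) for the eight drivers.
x = [5, 2, 12, 9, 15, 6, 25, 16]
y = [64, 87, 50, 71, 44, 56, 42, 60]

n = len(x)
sum_x = sum(x)
sum_y = sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi ** 2 for xi in x)

# Corrected sums of cross-products and of squares.
ss_xy = sum_xy - sum_x * sum_y / n
ss_xx = sum_x2 - sum_x ** 2 / n

b = ss_xy / ss_xx                  # slope: change in premium per extra year of experience
a = sum_y / n - b * sum_x / n      # intercept

print(f"sum x = {sum_x}, sum y = {sum_y}, sum xy = {sum_xy}, sum x^2 = {sum_x2}")
print(f"SSxy = {ss_xy}, SSxx = {ss_xx}")
print(f"b = {b:.4f}, a = {a:.4f}")
```

The negative slope indicates that the premium falls as driving experience increases.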
Practice Questions
1. The age and price of printers are reported in the table below. Age is in years and
price is in hundreds of dollars.

Age in years (x)   Price in hundreds of dollars (y)
5                  80
7                  57
6                  58
6                  55
5                  70
4                  88
7                  43
6                  60
5                  69
5                  63
2                  118
Required:
i. Find the equation of the regression line.
ii. Describe the apparent relationship between age and price for the printers
iii. What does the slope of the regression equation represent in terms of the price for
printers?
iv. Panama enterprise wants to buy 3 year old and 4 year old printers from the firm.
How much do you predict the firm will spend in buying the two printers?
8.HYPOTHESIS TESTING
Definition
A hypothesis is a claim or an opinion about an item or issue. It therefore has to be
tested statistically in order to establish whether or not it is correct.
When testing a hypothesis, one must fully understand the two basic hypotheses to be
tested, namely:
i. The null hypothesis (H0)
ii. The alternative hypothesis(H1)
The null hypothesis
This is the hypothesis being tested: the belief that a certain characteristic holds. For
example, the Kenya Bureau of Standards (KBS) may visit a sugar-making company with the
intention of confirming that the 2 kg bags of sugar produced actually weigh 2 kg and not
less. They conduct a hypothesis test with the null hypothesis being:
H0: each bag weighs 2 kg
The test will set out to confirm this or to refute it.
The alternative hypothesis
While formulating a null hypothesis we also consider the fact that the belief might be
found to be untrue hence we will reject it. We therefore formulate an alternative
hypothesis which is a contradiction to the null hypothesis, thus when we reject the null
hypothesis we accept the alternative hypothesis.
In our example the alternative hypothesis would be:
H1: each bag does not weigh 2 kg
Acceptance and rejection regions
The possible values of a test statistic fall into two sets: those consistent with the
null hypothesis (the acceptance region) and those that lead to the rejection of the
null hypothesis (the rejection region or critical region).
The values which separate the rejection region from the acceptance region are called
critical values.
Type I and type II errors
While testing a null hypothesis (H0) and deciding whether to accept or reject it,
there are four possible outcomes.
a) Acceptance of a true hypothesis (correct decision): we accept the null hypothesis
and this happens to be the correct decision. Note that statistics does not give
absolute information; its conclusion could be wrong, only the probability of
its being right is high.
b) Rejection of a false hypothesis (correct decision).
c) Rejection of a true hypothesis (incorrect decision): this is called a type I error,
with probability \( \alpha \).
d) Acceptance of a false hypothesis (incorrect decision): this is called a type II error,
with probability \( \beta \).
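The four outcomes can be summarised as a lookup on two yes/no questions: is H0 actually true, and did the test reject it? The function below is only an illustrative sketch; its name and the wording of its return values are not taken from the module.

```python
def classify_decision(h0_is_true: bool, h0_rejected: bool) -> str:
    """Name the outcome of a test given the truth of H0 and the decision taken."""
    if h0_is_true and not h0_rejected:
        return "correct decision (true H0 accepted)"
    if not h0_is_true and h0_rejected:
        return "correct decision (false H0 rejected)"
    if h0_is_true and h0_rejected:
        return "type I error (true H0 rejected), probability alpha"
    return "type II error (false H0 accepted), probability beta"


# Enumerate all four combinations of truth and decision.
for h0_is_true in (True, False):
    for h0_rejected in (False, True):
        print(
            f"H0 true: {h0_is_true!s:5}  H0 rejected: {h0_rejected!s:5}  ->",
            classify_decision(h0_is_true, h0_rejected),
        )
```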
Levels of significance
A level of significance is a probability value which is used when conducting tests of
hypothesis. It is basically the probability of making an incorrect decision, namely
rejecting a null hypothesis that is in fact true, after the statistical testing has
been done. The probabilities used are usually very small, e.g. 1% or 5%.
[Figure: normal curve with areas of 0.5000 and 0.4900 marked, leaving 1% as the provision for errors]
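The level of significance fixes the critical value of Z. As a hedged illustration, the standard normal critical values used in this section (about 1.645, 1.96 and 2.575) can be recovered with Python's standard library; the helper name `critical_z` is an assumption made for this sketch.

```python
from statistics import NormalDist


def critical_z(alpha: float, two_tailed: bool = False) -> float:
    """Return the positive critical value of Z for a given level of significance.

    For a two-tailed test the significance level is split equally between the tails.
    """
    tail_area = alpha / 2 if two_tailed else alpha
    return NormalDist().inv_cdf(1 - tail_area)


print(f"5%, one-tailed: {critical_z(0.05):.3f}")
print(f"5%, two-tailed: {critical_z(0.05, two_tailed=True):.3f}")
print(f"1%, two-tailed: {critical_z(0.01, two_tailed=True):.3f}")
```

For a lower-tail test the critical value is the negative of the one-tailed value returned here.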
Hypothesis testing procedure
Whenever a business complaint comes up there is a recommended procedure for
conducting a statistical test. The purpose of such a test is to establish whether the null
hypothesis or alternative hypothesis is to be accepted.
The following are steps normally adopted
1. Statement of the null and alternative hypothesis
2. Statement of the level of significance to be used.
3. Statement about the test statistic, i.e. what is to be tested, e.g. the sample mean,
the sample proportion, or the difference between sample means or sample proportions
4. Type of test, whether two-tailed or one-tailed
5. Statement on critical values using the appropriate level of significance
6. Standardizing the test statistic
7. Conclusion showing whether to accept or reject the null hypothesis
Hypothesis testing for the mean
Example 1
A certain NGO carried out a survey in a certain community in order to establish the
average age at which girls are married. The results of the survey indicated that the
marriage age for the girls is 19 years.
In order to establish the validity of the mean marital age, a sample of 50 women was
interviewed, and the average age at which they got married was 16 years. The ages at
which they were married varied, with a standard deviation of 2.1 years.
The sample data indicates that the marital age is less than 19 years. Is this conclusion
true or not?
Required
Conduct a statistical test to either support or refute the above conclusion drawn from
the sample statistics, i.e. that the marriage age is less than 19 years. Use a level of
significance of 5%.
Solution
1. Null hypothesis
H0: \( \mu \) (mean marital age) = 19 years
Alternative hypothesis H1: \( \mu \) (mean marital age) < 19 years
2. The level of significance is 5%.
3. The test statistic is the sample mean age, \( \bar{X} = 16 \) years.
4. The critical value of the one-tailed test (one-tailed because the alternative
hypothesis is a directional inequality) at the 5% level of significance is -1.65.
The test statistic is standardized as follows:
\[ Z = \frac{\bar{X} - \mu}{S_{\bar{x}}}, \quad \text{where } S_{\bar{x}} = \frac{S}{\sqrt{n}} \]
Where, \( \bar{X} \) = sample mean
\( \mu \) = population mean
S = sample standard deviation
n = sample size
Z = standard value (as per computation)
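5. The standard value Z must fall within the acceptance region for us to accept the
null hypothesis. Thus it must be greater than -1.65; otherwise we accept the
alternative hypothesis.
\[ Z = \frac{16 - 19}{2.1/\sqrt{50}} = -10.1 \]
[Figure: standard normal curve on a z-axis from -3 to 3, with the rejection region in the lower tail and the acceptance region to its right]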
6. Since -10.1 < -1.65, we reject the null hypothesis and accept the alternative
hypothesis at the 5% level of significance, i.e. the marriage age in this community is
significantly lower than 19 years.
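The steps of Example 1 can be traced in code. This is a minimal sketch using the figures from the example; the variable names are illustrative, and the critical value is taken from the standard normal distribution (about -1.645) instead of the rounded table value -1.65.

```python
from math import sqrt
from statistics import NormalDist

# Steps 1 to 3: hypotheses (H0: mu = 19, H1: mu < 19), significance level, sample data.
mu_0 = 19.0          # hypothesised mean marriage age (years)
sample_mean = 16.0   # mean age in the sample
s = 2.1              # sample standard deviation
n = 50               # sample size
alpha = 0.05         # level of significance

# Step 4: lower-tail critical value for a one-tailed test.
critical_value = NormalDist().inv_cdf(alpha)

# Standardize the test statistic using the standard error S / sqrt(n).
standard_error = s / sqrt(n)
z = (sample_mean - mu_0) / standard_error

# Steps 5 and 6: reject H0 if Z falls below the critical value.
reject_h0 = z < critical_value

print(f"Z = {z:.2f}, critical value = {critical_value:.3f}")
print("Reject H0" if reject_h0 else "Do not reject H0")
```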
Example 2
Test the hypothesis that weight loss in a new diet program exceeds 20 pounds during
the first month.
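Sample data: n = 36, \( \bar{x} = 21 \), \( s^2 = 25 \), \( \mu_0 = 20 \), \( \alpha = 0.05 \)
H0: \( \mu = 20 \) (\( \mu \) is not larger than 20)
Ha: \( \mu > 20 \) (\( \mu \) is larger than 20)
\[ Z = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{21 - 20}{5/\sqrt{36}} = 1.2 \]
Critical value: \( z_{\alpha} = z_{0.05} = 1.645 \)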
[Figure: standard normal curve on a z-axis from -3 to 3, with the acceptance region on the left and the rejection region in the upper tail]
At the 5% level the critical value is z = 1.645.
RR: Reject H0 if Z > 1.645
Decision: Do not reject H0, because the test statistic (Z = 1.2) is outside the rejection region.
Conclusion: At the 5% significance level there is insufficient statistical evidence to
conclude that weight loss in a new diet program exceeds 20 pounds during the first month.
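The same decision for the diet example can be reached through a p-value, which is not used in the text above but is an equivalent way of reading the test: H0 is rejected only when the upper-tail probability beyond Z is smaller than the level of significance. The sketch below assumes this p-value approach; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

n = 36              # sample size
sample_mean = 21.0  # mean weight loss in the sample (pounds)
variance = 25.0     # sample variance s^2
mu_0 = 20.0         # hypothesised mean weight loss
alpha = 0.05        # level of significance

# Standardized test statistic; the standard deviation is the square root of the variance.
z = (sample_mean - mu_0) / (sqrt(variance) / sqrt(n))

# Upper-tail test (Ha: mu > 20): the p-value is the area to the right of Z.
p_value = 1 - NormalDist().cdf(z)
reject_h0 = p_value < alpha

print(f"Z = {z:.2f}, p-value = {p_value:.4f}")
print("Reject H0" if reject_h0 else "Do not reject H0")
```

A p-value of roughly 0.115 exceeds 0.05, which matches the decision not to reject H0.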
Exercise:Test the claim that weight loss is not equal to 19.5.
Example 3
A machine is set to cut out bars to an average length of 150 mm. An operator wants to
check whether the setting is accurate. She samples 50 bars and finds a mean of
148 mm. The standard deviation is known to be 5 mm. Is the machine still reliable? Test
this at the 1% significance level.
Solution
H0: \( \mu = 150 \) (machine may be reliable)
Ha: \( \mu \neq 150 \) (two-tailed test: the machine is not reliable and may produce
lengths that are too long or too short; we cannot get a direction from the wording of
the question)
Alpha: \( \alpha = 0.01 \). Critical value: \( z_{\alpha/2} = z_{0.005} = 2.575 \)
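\[ Z = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{148 - 150}{5/\sqrt{50}} = -2.83 \]
[Figure: standard normal curve on a z-axis from -3 to 3, with a rejection region of area 0.005 in each tail]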
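For the two-tailed test in Example 3, the comparison is between the absolute value of Z and the critical value. A minimal sketch with illustrative variable names follows; the critical value is computed from the standard normal distribution and comes out at about 2.576, close to the table value 2.575.

```python
from math import sqrt
from statistics import NormalDist

mu_0 = 150.0         # target bar length (mm)
sample_mean = 148.0  # mean length of the sampled bars (mm)
sd = 5.0             # standard deviation (mm)
n = 50               # number of bars sampled
alpha = 0.01         # level of significance

# Two-tailed test: alpha is split equally between the two tails.
critical_value = NormalDist().inv_cdf(1 - alpha / 2)

z = (sample_mean - mu_0) / (sd / sqrt(n))

# Reject H0 when Z lies in either tail, i.e. when |Z| exceeds the critical value.
reject_h0 = abs(z) > critical_value

print(f"Z = {z:.2f}, critical values = +/-{critical_value:.3f}")
print("Reject H0" if reject_h0 else "Do not reject H0")
```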
Hypothesis testing for proportion
A member of parliament (MP) claims that in his constituency only 50% of the total
youth population lacks university education. A local media company wanted to
ascertain that claim, so they conducted a survey taking a sample of 400 youths. Of
these, 54% lacked university education.
Required:
At the 5% level of significance, determine whether the MP's claim is wrong.
Solution.
Note: This is a two-tailed test since we wish to test whether the proportion is
different from (\( \neq \)) 50%, and not against a specific alternative hypothesis,
e.g. < (less than) or > (more than).
H0: \( \pi = 50\% \) of all youth in the constituency lack university education.
H1: \( \pi \neq 50\% \) of all youth in the constituency lack university education.
\[ S_p = \sqrt{\frac{pq}{n}} = \sqrt{\frac{0.5 \times 0.5}{400}} = 0.025 \]
\[ Z = \frac{0.54 - 0.50}{0.025} = 1.6 \]
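At the 5% level of significance, the critical value for a two-tailed test is 1.96. Since
the calculated Z value (1.6) is less than 1.96, it falls in the acceptance region. We
therefore do not reject H0: at the 5% level the sample does not provide enough evidence
that the MP's claim is wrong.
Conclusion of Example 3 (machine setting):
Z (sample) = -2.83
Since -2.83 < -2.575, this falls in the region of Ha.
Conclusion: reject H0. There is enough evidence to support that the machine is no
longer reliable.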
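The proportion test of the MP's claim uses the same logic as the tests for the mean, with the standard error computed from the hypothesised proportion. The function below is a sketch; its name and return format are assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist


def proportion_z_test(p_hat, p_0, n, alpha=0.05):
    """Two-tailed Z test of H0: proportion = p_0 against H1: proportion != p_0.

    Hypothetical helper: returns the standard error, Z, the critical value
    and whether H0 is rejected.
    """
    standard_error = sqrt(p_0 * (1 - p_0) / n)   # S_p = sqrt(pq / n), using the H0 value
    z = (p_hat - p_0) / standard_error
    critical_value = NormalDist().inv_cdf(1 - alpha / 2)
    return standard_error, z, critical_value, abs(z) > critical_value


# MP's claim: 50% lack university education; the sample of 400 youths showed 54%.
se, z, critical_value, reject_h0 = proportion_z_test(p_hat=0.54, p_0=0.50, n=400)

print(f"S_p = {se:.3f}, Z = {z:.2f}, critical value = {critical_value:.2f}")
print("Reject H0" if reject_h0 else "Do not reject H0")
```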
Practice Questions
Kenya Commercial Bank Ltd. commissioned research whose results indicated that an
automatic teller machine (ATM) reduces the cost of routine banking transactions.
Following this information, the bank installed an ATM facility at the premises of Joy
Processing Company Ltd., which for the last several months has been used exclusively
by Joy's 605 employees. A survey on the usage of the ATM facility by 100 of the
employees in a month indicated the following:
Number of times ATM used   Frequency
0                          20
1                          32
2                          20
3                          13
4                          10
5                          5
Required:
a) An estimate of the proportion of Joy's employees who do not use the ATM facility in
a month.
b) i) Determine the 95% confidence interval for the estimate in (a) above.
ii) Can the bank be certain that at least 40% of Joy's employees will use the ATM
facility?
c) The number of ATM transactions that an employee of Joy makes per month on average.
d) Determine the 95% confidence interval of the mean number of transactions made
by an employee in a month.
e) Is it possible that the population mean number of transactions is four?
Explain.