STX1110 Module Notes 0607

STX1110 Introduction to Quantitative Methods © Middlesex University Business School

STX1110

Introduction to Quantitative Methods

Module Notes

Mathematics and Statistics Group


Contents

Statistics Section

S1 Collecting Data 1
S2 Collecting Data 2
S3 Summarising and Presenting Data 1

S6 Numerical summaries of Data 1
S7 Numerical summaries of Data 2
S8 Correlation and Regression 1
S9 Correlation and Regression 2
S10 Estimation

Mathematics Section

M1 Financial Maths 1
M2 Financial Maths 2
M3 Index Numbers
M4 Intro to Probability
M5 Standard Normal distribution
M6 General Normal distribution
M7 Linear Equations
M8 Linear Programming and Optimisation
M9 Time Series Analysis



STATISTICS SECTION

Original author: Cathy Minett-Smith

Revisions by: Thomas Bending, Alison Megeney

S1 Collecting Data 1


CHAPTER=DC1

STARTSECTION=scope_1.htm= SECTION~

Collecting Data 1

Context

Statistics is concerned with scientific methods for collecting, summarising, presenting and analysing data (information) which may be numerical or non numerical. Quite often we need to make decisions, or draw valid conclusions when we are given incomplete information. For example, we may need to say something about the total sales of a company when we only have a small selection of the company’s invoices. Statistical methods can help you make the ‘best decision’ or draw the ‘most reasonable’ conclusion.

ENDSECTION STARTSECTION=scope_2.htm= SECTION~

Objectives

Having worked through this unit, you should:

• appreciate the relevance of Statistics;

• be able to explain what is understood by the statistical term ‘population’;

• be able to explain what a statistical sample from a population is;

• understand why in certain situations it may be necessary to take a sample;

• be able to explain the necessary steps involved in a selection of sampling methods.

ENDSECTION STARTSECTION=content_1.htm= SECTION~

Why study Statistics?

There are a number of reasons why it makes sense to have a basic grasp of statistics. Below are listed just some of them.

• Much of the information we have to process at home, as consumers, at work and in the community comes in the form of numbers, graphs and charts. A statistical awareness helps us to make sense of the information we are confronted with, particularly when it is being used in a misleading way.

• Many of the important decisions we have to make involve handling numbers and weighing up risk. A knowledge of statistics won’t guarantee you make the right decision, but at least your decisions should be better informed.



• Almost every academic subject is becoming increasingly quantitative so it is becoming harder to hide from the need to have a basic understanding of statistics. This is particularly true of business.

• You may not believe me but Statistics can be enjoyable. By the end of this module you may find yourself agreeing with me!

ENDSECTION STARTSECTION=content_2.htm= SECTION~

Statistical quotes

“There are 3 kinds of lies: Lies, damned lies and Statistics”, Mark Twain

“Statistics is like a bikini; what they reveal is suggestive, what they conceal is vital”, Mr Motivator (GMTV 1996)


Statistical investigations

There are four key stages in a statistical investigation:

• pose a question;

• collect relevant data;

• analyse the data;

• interpret the results.

Whether data are relevant is determined by whether or not they will allow you to answer your central question. If your initial question is clearly formulated it will be easier to make sensible decisions at the other stages of the investigation.

We are going to spend time considering the second step of the investigation process: collecting relevant data. This falls into two main parts:

• Identifying individuals or items to question, test or measure.

• Collecting information from each person or item identified.

We will deal with the first point in this unit and leave the second point until the next unit.


Sampling methods

Before we get into the detail of various sampling methods we need to explain some terminology which we will be using.




Sampling terminology

Population

In statistics, the term ‘population’ is not restricted to a population of animals or humans; it is used to describe any large group of things that we are trying to measure.

Sample

A sample is a selection of items from the population which will be questioned or measured. A good sample is one which fairly represents the population from which it was taken.

Examples

Suppose the central question of our investigation is ‘What is the average amount of time each week that Middlesex University Students spend in the library?’

• Population: All Middlesex University Students

• Possible sample: Students registered for STX1110.

Would this sample be likely to generate an answer that is representative of the population?

Suppose the central question of our investigation is ‘What proportion of the components coming off a production line are defective?’

• Population: All products produced.

• Possible sample: Every 50th item produced.

Would this sample be likely to generate an answer that is representative of the population?


Taking a Census

If you measure or question all of the population this is called ‘taking a census’. In principle, if you question everyone and record the answers accurately then your conclusions will be correct. This may leave you thinking: ‘Why bother to take a sample then?’ Consider the following points.




Why take a Sample?

• In practice, a census rarely achieves the completeness required.

• A census is typically expensive both financially and in respect of the time required. The time and money may not be available and the cost of the census may exceed the value of the results.

• By the time you complete a census the results may be out of date.

• If your central question means the items measured need to be tested to destruction, a census will result in the manufacturer having no product left to sell!

• It may be the case that your population is unidentifiable. This is particularly true in situations such as market research.

The main disadvantage of taking a sample is that sometimes it can be difficult to convince other people that your sample results will generalise to the whole population. It is important to ensure that your sample is representative or unbiased. Generally, an unbiased sample result will generalise to a correct population result. Consequently, sampling is one of the most important subjects in quantitative methods. Various sampling methods have been developed to ensure that the resulting sample is unbiased.

When choosing a sample it is important that the individuals or items in the sample cover all areas of the population to be examined. If this requirement is not met the resulting sample will be biased.


Sampling Frame

Some sampling methods require all members of the population to be known and identifiable. The structure which supports this identification is called a sampling frame. Some methods require a sampling frame only as a listing of the population while other methods also need certain characteristics of each member to be known. For example, employees can be identified from company records; in addition we can identify if the employee is male or female, full time or part time etc. A sampling frame should have the following characteristics;

• Completeness. Are all the members of the population included on the list and are the necessary subgroups, e.g. male/female, identifiable?

• Accuracy. Is the information correct?

• Is the list up to date?

• Is the sampling frame readily available?

• Does each member appear only once in the list?



Two readily available sampling frames for the UK population are the council tax register (list of dwellings) and the electoral register (list of individuals).


Simple Random Sampling

A simple random sample is selected in such a way that every item in the population has an equal chance of being selected. The following steps are involved in taking a simple random sample.

• Acquire a sampling frame and number all the individuals from 1 to the size of the population.

• Generate a random number. You can use your calculator, a computer or random number tables to do this.

• Select the member of the population whose number matches the generated random number.

• Repeat the process to the required sample size.
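The steps above can be sketched in Python. This is a minimal illustration rather than a prescribed method: the sampling frame is a hypothetical list of 300 numbered employees, and the function name simple_random_sample is an assumption made for this example. The standard library function random.sample draws without replacement, so no member is selected twice.

```python
import random

def simple_random_sample(frame, sample_size, seed=None):
    """Select sample_size members of the frame, each with an equal chance."""
    rng = random.Random(seed)  # a seed makes the draw repeatable; omit it for a fresh draw
    return rng.sample(frame, sample_size)

# Hypothetical sampling frame: employees numbered 1 to 300.
frame = list(range(1, 301))
sample = simple_random_sample(frame, 30, seed=1)
print(sorted(sample))
```

Each run with a different seed gives a different, equally likely, set of 30 employee numbers.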

A sample generated this way will be unbiased but it does have some disadvantages.

• You need a sampling frame.

• Each selected person needs to be located and questioned. This may take a long time and the individual may be untraceable.

• Certain attributes may be over or under represented. For example the ratio of male/female employees in your sample may not reflect the ratio of male/female employees in your workforce.


Stratified Sampling

This is a method of selecting the right proportion of respondents from each attribute/subgroup of the population. To take a stratified sample you need to complete the following steps;

• Acquire a sampling frame with the required attributes known for each individual.

• Split the population into certain attributes or subgroups.

• Calculate how many individuals to sample from each subgroup as explained below.

• Use simple random sampling to select the relevant number of respondents from each subgroup.



Example

Suppose your workforce is made up of the following types of employee:

Number of employees by category

Type of employee    Number of employees
Manual              200
Administrative      70
Manager             30
Total               300

If a sample of size 50 was required, the sample would be made up as follows:

Number of employees by category in the sample

Type of employee    Number of employees in the sample
Manual              (200 ÷ 300) × 50 = 33.33, rounded to 33
Administrative      (70 ÷ 300) × 50 = 11.67, rounded to 12
Manager             (30 ÷ 300) × 50 = 5
Total               50

Select the 33 manual workers by completing a simple random sample on the list of all manual workers in the company etc.
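The allocation calculation in this example can be sketched in Python. The function name proportional_allocation is an assumption made for illustration; the figures are those of the workforce example above. Note that rounding each subgroup separately will not always add up exactly to the required sample size, so the total should be checked.

```python
def proportional_allocation(strata_sizes, sample_size):
    """Return the number to sample from each subgroup, in proportion to its size."""
    population_size = sum(strata_sizes.values())
    return {
        name: round(sample_size * size / population_size)
        for name, size in strata_sizes.items()
    }

workforce = {"Manual": 200, "Administrative": 70, "Manager": 30}
allocation = proportional_allocation(workforce, 50)
print(allocation)                 # {'Manual': 33, 'Administrative': 12, 'Manager': 5}
print(sum(allocation.values()))   # 50
```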

The need for a detailed sampling frame generally causes an increase in the time and effort required to obtain the sample. Consequently stratified sampling is only used when you feel the people in the different subgroups or strata will respond differently. In many situations the subgroups or strata could involve multiple classifications. In the above example I could classify each manual worker as male or female, full time or part time, and so on, and likewise for the other types of worker. As the number of strata or subgroups increases, the sample size also needs to increase to reflect the complexity in the structure of the sample.

We are not always fortunate enough to have a sampling frame. In such situations we need to resort to non-random sampling techniques to generate our sample.




Systematic Sampling

Systematic sampling can be used with or without a sampling frame and may provide a good approximation to random sampling. To take a systematic sample you randomly select a starting point and then select every nth item. For example, to select a sample of size 30 from the 300 students on a module list, select every 10th (300 ÷ 30) student after a random start in the first 10 students. It is particularly useful for situations where the population is physically in evidence, such as items coming off a production line or a row of houses.

Systematic sampling is very easy to use. However, bias can occur if there is a recurring pattern in the population. For example, if 4 machines feed a production line and we sample every 4th item, it is possible to end up with a sample consisting of products all made on the same machine, which would then be biased.
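A minimal Python sketch of the module list example, followed by the production line problem. The function name systematic_sample and the numbered student list are assumptions made for illustration, and the second part assumes the four machines feed the line in strict rotation, which is the situation in which the bias is certain to arise.

```python
import random

def systematic_sample(items, sample_size, seed=None):
    """Pick a random start in the first interval, then take every interval-th item."""
    interval = len(items) // sample_size              # 300 // 30 = 10
    start = random.Random(seed).randrange(interval)   # a position from 0 to interval - 1
    return items[start::interval][:sample_size]

students = list(range(1, 301))   # hypothetical module list of 300 students
sample = systematic_sample(students, 30, seed=1)
print(sample[:5])

# Assumed rotation: item i on the line was made by machine i % 4.
machines = [i % 4 for i in range(300)]
every_fourth = machines[2::4]    # sample every 4th item, starting at the third
print(set(every_fourth))         # {2}: every sampled item came from one machine
```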


Quota Sampling

This is the most popular method of sampling in areas such as market research. The method uses an interviewer or team of interviewers each with a set number (quota) of subjects to interview. This method of sampling places a lot of responsibility on the interviewer as the choice of the subject to be interviewed is entirely up to them which can lead to bias in the sample. This can be overcome to some extent by subdividing the quota into different types to ensure the sample represents the population. For example the interviewer could be told to interview 30 males between the ages of 20 and 30 etc. This method of sampling is not random and it is hard to eliminate interviewer bias but it is administratively easy and relatively cheap.

Refer to Essential Quantitative Methods for Business, Management and Finance, third edition, Les Oakshot chapter 4 (pages 59-80), for a more detailed discussion of these sampling techniques and details of techniques not discussed in this lecture such as cluster sampling and multi-stage sampling.


The Size of a Sample

As well as deciding on an appropriate sampling method for a given situation, the size of the sample selected also needs some consideration. There is no rule for determining a sample size but various points need to be considered when deciding on the size of the sample.



• The larger the size of the sample the more accurate will be the results. However, there reaches a point when there is little to be gained by increasing the sample size.

In practice, administrative considerations usually play the greater role in determining the sample size. These include:

• Money and time available;

• Aims of the study and the precision required;

• The number of subgroups or strata required.
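The point about diminishing returns from larger samples can be illustrated with a small simulation. This is only a sketch under stated assumptions: the population is a hypothetical list of the values 1 to 1000, the function name spread_of_sample_means is invented for this example, and the spread of the sample means across repeated samples is used as a rough stand-in for accuracy.

```python
import random
import statistics

def spread_of_sample_means(population, sample_size, repeats=200, seed=0):
    """Take repeated simple random samples and measure how much their means vary."""
    rng = random.Random(seed)
    means = [
        statistics.mean(rng.sample(population, sample_size))
        for _ in range(repeats)
    ]
    return statistics.stdev(means)

population = list(range(1, 1001))   # hypothetical population of 1000 values
spreads = {n: spread_of_sample_means(population, n) for n in (10, 100, 400)}
for n, spread in spreads.items():
    print(n, round(spread, 1))
```

The spread falls as the sample size grows, but the improvement from 100 to 400 is much smaller than the improvement from 10 to 100.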

Module requirements

Every week you will need to complete the following:

• Read through the notes covered in the lectures and complete the seminar questions at the end of the units covered in lectures before the following week’s seminar. You will need to make a note of any points you need to discuss further with your seminar tutor or an advice centre tutor.

• Go to the STX1110 Oasis page, attempt the quizzes and access the seminar solutions, which are released at regular intervals. Make a note of anything you wish to discuss with your STX1110 seminar tutor or an advice centre tutor.

• Check your e-mail and the STX1110 Oasis page for any announcements and information sent by the STX1110 team.

• Complete further reading from relevant chapters from Essential Quantitative Methods for Business, Management and Finance, third edition, Les Oakshot.



Seminar Questions

For each unit you are expected to complete the seminar questions that accompany each lecture and bring your answers to the seminars. Tutors will check your work has been completed.

Therefore, you must complete the following seminar questions before the next seminar.

You are advised that if you do not participate in this module by attempting questions and arriving prepared for seminars you will NOT receive a participation (attendance) mark for the seminar.

ENDSECTION STARTSECTION=activity_1.htm= SECTION~

Seminar Question S1.1

a. A workforce consists of 500 workers, 350 of whom are male and 150 female. How many males and females would be included in a sample of size 100 if the proportions of males and females in the sample are to be the same as those in the population?

b. If males and females are further classified as being full time or part time explain the four subgroups that the population can be split up into and indicate how many workers in the population are in each subgroup if 100 men and 50 women are part time. How many employees would be selected from each of these groups in a sample of size 100?



Identify the major sources of bias in each of these situations.

a. A survey is conducted to study the extent of use of convenience foods (such as frozen foods) by households in a community. A random sample of households is selected and the data collected by telephone interviews made during the hours of 8am to 5pm. Non respondents are ignored.

b. A radio station conducts a poll to identify what are the best restaurants in a community by asking its listeners to call the station and state their opinions.

c. An organisation is interested in monthly household expenditures for groceries. Representatives of the organisation conduct exit surveys of every 3rd shopper at several major supermarkets on weekday afternoons.





What biases might the following sampling methods pick up?

a. Estimating yearly sales figures of a shop over a ten year period by sampling sales figures systematically every 6 months.

b. Estimating a household’s monthly expenditure by sampling their bills and payments on the first week of every month.



A proposal was received by the Local Authority Planning Officer for a motel, public house and restaurant to be built on some private land in the city suburbs. Following an article by the builder in the local paper, the office received 300 letters of which only 28 supported the proposal. What conclusions can the planning officer draw from these statistics? Describe what action could be taken to gauge people’s views further.



The following is a list of general practice doctors. Also recorded is whether the doctors are in practice by themselves (S), have a partner (P), or are in a group practice (G).

a. Using the following list of 10 random numbers indicate which doctors will be included in a simple random sample of size 10. 0.9120 0.0124 0.5246 0.3287 0.7895 0.3366 0.0003 0.1025 0.1157 0.8425

b. Indicate which doctors would be included in a systematic sample if the sampling interval is 5 and you start with Mark Hillard.

c. The survey will consider issues of how doctors respond to out of hours calls from their patients and consequently the type of practice may be an important factor in the response. Indicate how many doctors from each type of practice would be selected if a stratified sample of size 15 were to be used. How would you then select the relevant numbers of doctors who are in partnership?



Physician                   Type of practice    Physician                   Type of practice
R. E. Scherbarth, M.D.      S                   Gregory Yost, M.D.          P
Crystal R. Goveia, M.D.     P                   J. Christian Zona, M.D.     P
Mark D. Hillard, M.D.       P                   Larry Johnson, M.D.         P
Jeanine S. Huttner, M.D.    P                   Sanford Kimmel, M.D.        P
Francis Aona, M.D.          P                   Harry Mayhew, M.D.          S
Janet Arrowsmith, M.D.      P                   Leroy Rogers, M.D.          S
David DeFrance, M.D.        S                   Thomas Tafelski, M.D.       S
Judith Furlong, M.D.        S                   Mark Zilkoski, M.D.         G
Leslie Jackson, M.D.        G                   Ken Bertka, M.D.            G
Paul Langenkamp, M.D.       S                   Mark DeMichie, M.D.         G
Philip Lepkowski, M.D.      S                   John Eggert, M.D.           P
Wendy Martin, M.D.          S                   Jeanne Fiorito, M.D.        P
Denny Mauricio, M.D.        P                   Michael Fitzpatrick, M.D.   P
Hasmukh Parmar, M.D.        P                   Charles Holt, M.D.          P
Ricardo Pena, M.D.          P                   Richard Koby, M.D.          P
David Reames, M.D.          P                   John Meier, M.D.            P
Ronald Reynolds, M.D.       G                   Douglas Smucker, M.D.       S
Mark Steinmetz, M.D.        G                   David Weldy, M.D.           P
Geza Torok, M.D.            S                   Cheryl Zabarowski, M.D.     P
Mark Young, M.D.            P

ENDSECTION STARTSECTION=think_1.htm= SECTION~

How much do you know?

If you have understood the content of this unit you should have knowledge of, and be able to answer any questions relating to, all of the following points:

• explain what is meant by the terms statistical population and sample;

• explain why sampling is used to collect data;

• explain the steps involved in taking a simple random sample;

• explain what a stratified sample is and why one might be used;

• calculate how many respondents should be surveyed in each group of a stratified sample;

• explain what systematic and quota samples are, give examples of where they are appropriate and list their strengths and weaknesses.



Extra Activities

Every week you should log on to the STX1110 Oasis page and attempt the quizzes in the extra section.

A guide on how to find the STX1110 Oasis quizzes is given below.

Step 1. Log on to Oasis and open the STX1110 Oasis page.

Step 2. Click the London Module Content icon.

Step 3. Choose whether you would like to try a Statistics or a Mathematics topic from STX1110.

Step 4. You will now see a page containing links for each topic in that section, for example S1: Collecting Data 1. Click on the topic that you would like to try.

Step 5. You will now see a table of contents for this topic, for example:

1. UnitS1 - Data Collection 1
2. UnitS1 - Content
3. UnitS1 - Activity
4. UnitS1 - Think
5. UnitS1 - Extra

To get to the quizzes go to the bottom of the list and click on Extra.

Step 6. You should now see the Additional Content and Activities page, which asks you to complete a quiz and questions. There are 7 questions in this formative (non-assessed) assessment, with two optional questions. Click on ‘Click here to begin quiz’ and have a go at the questions. Remember to click on the button at the top of the screen to move on to the next question. Keep a record of your working and answer so that you can compare it to the solution and feedback provided. If you need help drop in to an advice centre session in S206 or see your seminar tutor.

Do you want to know more?

For further examples, and exercises refer to Essential Quantitative Methods for Business, Management and Finance, third edition, Les Oakshot, or any of the books on the STX1110 reading list.

ENDSECTION ENDCHAPTER



CHAPTER=DC2


Collecting Data 2

Context

In the previous unit we said that a statistical investigation has four stages:

• pose a question;

• collect relevant data;

• analyse the data;

• interpret the results.

We began looking at the collection of data in the previous unit by introducing the ideas of sampling as a way of selecting items or individuals from whom to collect relevant data. We also said that the data which is collected will depend on the question or purpose of the investigation. In this unit we will move onto methods of collecting data or information from identified individuals.


Objectives

Having worked through this unit, you should be able to:

• appreciate the basic types of data;

• explain the difference between primary and secondary data;

• design questionnaires;

• appreciate the advantages and disadvantages of some methods of questionnaire distribution.


Secondary and Primary Data Sources

Data generation costs time and money so before you rush into trying to generate your own data take time to check that someone else hasn’t already done it for you. Data which have already been collected by someone else are called Secondary Data.

There are various sources of secondary data, such as the statistics published by the Office for National Statistics. The European Union and United Nations also publish statistics, as do many newspapers. Libraries are full of valuable sources of secondary data.



For a more detailed discussion of sources of secondary data refer to Essential Quantitative Methods for Business, Management and Finance, third edition, Les Oakshot chapter 4, page 76.

In the absence of suitable secondary data you may have to carry out an experiment or survey yourself. In doing so you would generate primary source data which would be specific to your particular investigation.

In this unit we will concentrate on survey methods of collecting primary data from people. There are two basic methods of conducting surveys. You can either observe people’s behaviours or ask people questions. We will be concentrating on the latter of these two options.

There are basically two ways of surveying the opinions of people.

• Interview them (which could involve the use of a questionnaire) • Conduct a postal questionnaire


Interviews

Interviews typically take one of two forms: face-to-face interviews or telephone interviews. In principle collecting information by conducting interviews is easy. It needs someone to pose questions, which may be provided in the form of a questionnaire, and then listen to and record the answers. In practice it is more complicated and interviewers need training to make sure they get reliable answers and that they record those answers accurately. For example, without training an interviewer may explain questions to people in such a way as to provoke a particular sort of response, introducing a bias known as interviewer bias into the results. Similarly interviewers should not direct respondents to a particular answer or response by their expression or tone of voice. One of the main drawbacks of personal interviews is the cost.


Postal Surveys

Sending a printed questionnaire through the post has the advantage of being cheap and easy to organise, so that very large samples can be used. One drawback of a postal survey is that the questions cannot be explained to respondents and there is no opportunity to clarify points that they do not understand, so it is important that the questionnaire is relatively straightforward. Perhaps the main disadvantage of postal surveys is the low response rate. Generally a survey can expect to get replies to about 20% of the questionnaires sent out. Response rates can be improved by making the questionnaires short or by making follow-up telephone calls. Alternatively a reward for completing the questionnaire, known as an incentive, can be offered. However, the use of incentives often introduces a bias to the survey.

For a greater discussion of personal interviews and postal surveys refer to Essential Quantitative Methods for Business, Management and Finance, third edition, Les Oakshot chapter 4.


Questionnaire Design

Usually each of the survey methods mentioned relies on the use of a questionnaire to collect the information. You have no way of checking that a questionnaire has been answered truthfully or that the respondent has understood it properly. It is therefore crucial that a questionnaire is well designed, to reduce inaccuracies in the answers. Such inaccuracies are termed response bias.


Questionnaire Design Guidelines

Although it may seem easy, designing a good questionnaire is difficult. An enormous amount of work has been done on the design of questionnaires which has led to some guidelines for good practice.

• Questionnaires should ask a series of related questions and should be as short as possible.

• Questions should follow a logical sequence.

• The questions should be simple, unambiguous and easy to understand. If people don’t understand the question they will give a convenient answer or no answer rather than a true one.

• Questions should use everyday language and not involve technical jargon.

• Questions shouldn’t involve calculations or tests of memory.

• Be careful with the wording of questions so that they are not offensive or leading. Even simple changes in phrasing can give quite different results. Use neutral phrases where possible; for example, instead of saying ‘do you like this cake?’ the question could be rephrased as ‘rate the taste of this cake on a scale of 1 to 5’.

• Do not ask irrelevant questions.

• Avoid vague questions such as ‘do you usually buy more meat than vegetables?’ This raises questions such as: what is usual, and what is more?



• Avoid open questions where possible. They are difficult to answer. Instead ask questions which allow precoded answers so that respondents are offered a series of choices and can select the most appropriate. They are easier to answer and analyse.

• Phrase all personal questions carefully. ‘Have you retired from paid work?’ might receive better responses and be just as useful as the more sensitive ‘How old are you?’

These are just a few pointers in the design of a good questionnaire. If you read around the subject you’ll discover many more.


Pilot Surveys

Having developed a questionnaire it is a good idea to trial it on a few respondents before using it to collect your data. This is called a pilot survey; it can help sort out any problems in your questionnaire, which may save lots of time and money later.


Errors in survey methods for collecting data

There are three main types of error that can appear in survey methods.

• Sampling error, which arises when the sample selected is not representative of the population. A great deal of consideration needs to be given at the sampling stage to decide what your target population is and ensure that the sample represents this target population.

• Response error, which can occur when respondents are unable to respond accurately, either because they didn’t understand the question or because they were guessing. It can also occur if an answer is incorrectly or inaccurately recorded.

• Non response error occurs when respondents refuse to take part in the survey and is a particular problem for postal questionnaires.




Validity and Precision

In addition to the correct recording of data there are two important ideas about data quality which relate to the way in which the data were obtained.

Validity: A valid method of measurement or observation is one which actually measures the concept you intended it to. A related concept is the absence of bias, where bias means systematically producing results that are different from (above or below) the true value.

For example: “Children like exams. Attendance is highest at exam time” This statement was reportedly said by Sir Rhodes Boyson MP former headmaster and education minister on ITN News at One 17th August 1988. What variable was Boyson trying to measure and what can you say about the validity of his measurement?

Precision: How precisely has a variable been measured? For example, is age measured to the nearest year, month or day? A related concept is reliability. A reliable method of measurement is one that will produce a similar answer if repeated.

For example: Suppose you have the annual salary of 5 employees

£13,458 £19,496 £12,752 £26,785 £16,220

How would they appear if they had been measured to the nearest £1000? What is the total salary of these employees to the nearest £1000?
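The effect of reduced precision on a total can be checked with a short Python sketch using the five salaries above. It relies on the built-in round function, where round(x, -3) rounds an integer to the nearest 1000; none of these figures falls exactly halfway between two thousands, so how ties are handled does not matter here.

```python
salaries = [13458, 19496, 12752, 26785, 16220]

# Round each salary to the nearest 1000 first, then add up.
rounded = [round(salary, -3) for salary in salaries]
print(rounded)                    # [13000, 19000, 13000, 27000, 16000]
print(sum(rounded))               # 88000

# Add up the exact salaries first, then round the total.
print(sum(salaries))              # 88711
print(round(sum(salaries), -3))   # 89000
```

The two totals differ by £1000, which shows that rounding before a calculation is not the same as rounding after it.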




Seminar Questions


a) Calculate the total for the following set of data.

41 62 87 96 32 39

b) Calculate the total again after first fairly rounding the data to the nearest 10.

c) Calculate the total again using biased rounding by rounding each data item down to the lowest 10.

d) Calculate the total again using biased rounding by rounding each data item up to the highest 10.

e) What do your answers suggest about the effect of rounding on subsequent calculations?

f) Can you think of an example of data that is typically rounded down when it is recorded? Can you also think of an example of data that is typically rounded up when it is recorded?



The questionnaire below, surveying people’s health practices and attitudes, is based on examples given in the book Surveys in Social Research by D.A. de Vaus. Have a go at answering the questionnaire and then give some thought to the sorts of biases and inaccuracies it might produce.

A Survey on Health

How healthy are you?

Are the health practices in your household run on matriarchal or patriarchal lines?

How often do your parents visit the doctor?

Do you oppose or favour cutting health spending, even if cuts threaten the health of children and pensioners?

Do you agree or disagree with the government's policy on the funding of medical training?





List some possible sources of secondary data. You will need to read a text such as Essential Quantitative Methods for Business, Management and Finance, or Business Basics to answer this question.



On 11th February 1985, under the banner headline ‘We’ve had ENOUGH!’, a tabloid newspaper produced the following questionnaire. Make a note of any criticisms of the way it was designed.






• explain the difference between primary and secondary sources of data;

• list some sources of secondary data;

• design/criticise questionnaires;

• explain sources of error in conducting a survey and suggest ways of minimising these errors;

• explain how interviews and postal surveys are conducted and discuss their strengths and weaknesses.

Extra Activities

Every week you should log on to the STX1110 Oasis page and attempt the quizzes in the extra section.

A guide on how to find the STX1110 Oasis quizzes is on page 16 of this book.




S3 Summarising and Presenting Data 1 Page 27


CHAPTER=SPD1


Summarising and Presenting Data 1 Context

Data is a term you will come across repeatedly during your study of quantitative methods but what does it mean? People tend to think of data as collections of numbers. Yet data may be non numerical and even numeric data can belong to different categories. Data is simply a scientific term for facts, figures, information and measurements. The nature of the data that you have will determine what form of statistical analysis is appropriate. From the outset it is important to determine what sort of data you have.


Objectives


• identify different types of data;
• compile a frequency table of a discrete variable;
• compile and discuss cross tabulations;
• appreciate the use of percentages for making comparisons;
• construct and interpret graphical methods useful in summarising discrete data.

ENDSECTION STARTSECTION=content_1.htm= SECTION~

Types of Data

Text books will describe a number of data classifications. However, all data falls into one of two basic types: attributes and variables.

Attributes

An attribute is something an object has either got or not got. For example, an individual will be classified as either male or female. Another word for attribute data is categorical or qualitative data. The measurements of a categorical variable fall into one and only one of a set of categories. For example, an employee may be considered to be full time, part time or contractual. The specific names or qualities do not contain any implicit ordering, i.e. a full time employee is not considered to be better than a part time employee.



Variables

A variable is something which can be measured or counted. For example, a person’s weight can be measured on a scale such as kilograms (kg), and the number of rooms in a house can be counted. Variables can be further classified as discrete or continuous.

Discrete variables

Discrete data is characterised by the fact that it can be measured precisely. For example, the number of defective items coming off a production line, shoe size and the number of children in a family are all discrete variables. It should be fairly obvious to you that the number of defective products and the number of children are discrete as these variables can only take integer values, i.e. 0, 1, 2, 3, 4 etc. Shoe size may appear a little different since we have half sizes in the U.K., i.e. someone could have a shoe size of 5½ which is not an integer. However, a British shoe size could not be 5.236, so the variable is discrete: it has a limited range of values and is measured precisely. Typically a discrete variable is recorded as a count, but not always.

Continuous Variables

Continuous data may take on any value and is typically measured rather than counted. Continuous variables are not measured precisely but are approximated. For example we typically measure someone’s height to the nearest centimetre but there is no reason why the measurements could not be made to the nearest one hundredth of a centimetre. Two people who have the same height to the nearest centimetre could almost certainly be distinguished if more precise measurements were taken.

If you read through some text books data will be further classified as being measured on the nominal, ordinal, interval or ratio scale. This level of detail is not required for this module but you will find an explanation of these scales in a text book such as Quantitative Methods for Business by Donald Waters (Addison-Wesley).




Summarising and presenting data

Having collected your data it needs to be presented in a concise and meaningful manner so that patterns and characteristics in the data are immediately apparent. Tables and graphs are a very effective way of displaying any patterns in the data or relationships that may exist between variables. During the next three weeks we will look at some of the tabular and graphical methods popular in presenting the data in a concise and easy to understand way. They are purely descriptive techniques and provide little opportunity for further detailed numerical analysis of the data.

In this unit we will concentrate on attribute and simple discrete variables as the presentation techniques used for each of these types of data are essentially the same.

Tables

One of the easiest and most effective ways of presenting data is in a table. This is perhaps the most widely used method of data presentation. Whenever you pick up a newspaper, magazine or report you are likely to see a table. Spreadsheets make the design and manipulation of tables very easy.


Frequency table or frequency distribution

A frequency table (or distribution) is a tabular summary of a set of data showing the frequency (or number) of data items in each of several categories or non-overlapping classes.

Example

A frequency table of the categorical data gender for a workforce could be

Table 1

Gender Frequency (count)

Male 12

Female 3

Total 15



The table shows the number of data items in each category. Sometimes a table of counts is presented so that the percentage or proportion of data items in each category can be seen.

The relative frequency of a class or category is the proportion of the total number of data items belonging to that class. For a data set with n observations, the relative frequency of each class or category is

relative frequency = frequency of the class ÷ n

This is easily converted to a percentage by multiplying by 100. The relative frequencies and percentages for the above example are given in the following table.

Gender    Frequency    Relative Frequency    %
Male      12           12/15 = 0.8           0.8 × 100 = 80%
Female    3            3/15 = 0.2            0.2 × 100 = 20%
Total     15           1                     100
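As a rough illustration, the relative frequency calculation can be sketched in Python. The names frequency_table and workforce are our own choices for this sketch and are not part of the module materials; the data is simply the workforce of Table 1 written out one employee at a time.

```python
from collections import Counter


def frequency_table(data):
    """Return (category, frequency, relative frequency, percentage) rows."""
    counts = Counter(data)   # frequency of each category
    n = len(data)            # total number of observations
    rows = []
    for category, frequency in counts.items():
        relative = frequency / n
        rows.append((category, frequency, relative, relative * 100))
    return rows


# The workforce of Table 1: 12 males and 3 females.
workforce = ["Male"] * 12 + ["Female"] * 3

for category, frequency, relative, percent in frequency_table(workforce):
    print(f"{category:<8}{frequency:>4}{relative:>8.2f}{percent:>6.0f}%")
```

The same function could be applied to any list of category or simple discrete values, such as the number of children in each family.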

Percentages are particularly useful for making comparisons. Now go and do Exercise S3.1


Exercise S3.1

Would it be true to say that the company detailed in table 2 has the same number of males in its workforce as the company whose workforce is summarised in table 1?

Table 2

Gender Frequency

Male 12

Female 10

Total 22



Example

The following data set refers to the number of children in each of 24 surveyed families.

0 1 2 0 3 0 1 1 0

2 3 2 1 1 2 4 3 4

2 2 2 1 0 3

The corresponding frequency distribution is

Number of children    Frequency    Relative frequency    %
0                     5            0.21                  21
1                     6            0.25                  25
2                     7            0.29                  29
3                     4            0.17                  17
4                     2            0.08                  8
Total                 24           1                     100

This form of frequency distribution can be used for discrete numeric data provided there are not too many distinct numeric outcomes.




Cross Tabulations

More complex tables of counts can summarise data on two variables simultaneously. Such cross tabulations (or contingency tables) allow investigation of the relationship between the tabulated variables. The values of one variable define the rows of the table and the values of the other variable define the columns. The number in each cell of the table (the intersection of a row and a column) represents the count of the corresponding combination of values. Often row and column totals are included; these give the ordinary frequency distributions of the row and column respectively and are referred to as the marginal distributions of the table. Cross tabulations provide a standard method of summarising the data from a survey and of presenting data in reports and publications.

Example

A personnel department produces a summary of its workforce by gender and marital status in the following cross tabulation.

Table 3: Tabulation of gender against marital status for a company.

                  Gender
Marital Status    Male    Female    Total
Single            1       1         2
Married           10      2         12
Widowed           1       0         1
Total             12      3         15

Now go and do Exercise S3.2


Exercise S3.2

• What does the value 10 in this cross tabulation represent?
• What is the total number of employees in this company?
• What would be the marginal, or ordinary, frequency distribution for the marital status of employees in this company?



It is possible to extend cross tabulations to include discrete data as well as category data. It would be a relatively easy exercise to produce a cross tabulation of gender against number of children for the employees in a company.

A cross tabulation can be summarised by calculating percentages of the row or column totals. If one variable (the explanatory variable) is believed to influence the other (the response variable), then one normally takes percentages of the totals for the explanatory variable.

Example

A random sample of 309 furniture defects was recorded and classified according to the type of defect (A, B, C, D) and the production shift (1, 2, 3) in which the item of furniture was manufactured.

Table 4: Production shift against type of defect for a furniture manufacturing process.

         Type of defect
Shift    A     B     C      D     Total
1        15    21    45     13    94
2        26    31    34     5     96
3        33    17    49     20    119
Total    74    69    128    38    309

We are interested in knowing if the different shifts can be used to explain the occurrence of the different types of defect. So we will be regarding the type of defect as the response variable and the shift as the explanatory variable. As the shift forms the different rows of the table we will use the row totals to calculate the percentages which can then be used to make comparisons.

Table 5: Comparison of type of defect by shift

                  Type of defect
Shift             A      B      C      D      Total
1 (%)             16%    22%    48%    14%    100%
2 (%)             27%    32%    35%    5%     100%
3 (%)             28%    14%    41%    17%    100%
All Shifts (%)    24%    22%    41%    12%    100%

The percentages for shift 1 are calculated as 15/94 × 100 = 16%, 21/94 × 100 = 22%, 45/94 × 100 = 48% and 13/94 × 100 = 14%.



The percentages calculated from the row totals give the proportions of the various types of defect for each shift. The percentages make it easier to describe the relationship between the two variables and attention should be paid to the way in which percentages differ from row to row. For instance, shift 1 produces a lower percentage of defect A and a higher percentage of defect C than the other 2 shifts.
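The row percentage calculation can be sketched in Python as follows. This is only an illustration: the names counts, row_percentages and all_shifts are our own, and the figures are the counts from Table 4. Each percentage is rounded to the nearest whole number, so a row need not add up to exactly 100.

```python
# Counts from Table 4: each row is a shift, columns are defect types A to D.
defect_types = ["A", "B", "C", "D"]
counts = {
    1: [15, 21, 45, 13],
    2: [26, 31, 34, 5],
    3: [33, 17, 49, 20],
}


def row_percentages(row):
    """Express each count as a whole-number percentage of the row total."""
    total = sum(row)
    return [round(100 * count / total) for count in row]


# Percentages of the row (shift) totals, because shift is the explanatory variable.
shift_percentages = {shift: row_percentages(row) for shift, row in counts.items()}

# The column totals give the "All Shifts" row.
column_totals = [sum(column) for column in zip(*counts.values())]
all_shifts = row_percentages(column_totals)

print("Shift", *defect_types)
for shift, percentages in shift_percentages.items():
    print(shift, *percentages)
print("All", *all_shifts)
```

Had the type of defect been the explanatory variable, the same function would be applied to the columns instead of the rows.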


Relationship between a category (or discrete) variable and a continuous variable

Cross tabulations can be used to summarise the relationship between a continuous variable and a category variable if the continuous numerical variable is first categorised (or grouped). In this way, the relationship between a category variable and a continuous variable can be explored.

Example

The following cross tabulation shows the price of a meal and whether the meal was rated as good, very good or excellent. Notice that the price of a meal is a continuous variable and is presented as the price of a meal being within a certain range: the price variable has been grouped!

Table 6: Cross tabulation of the quality of a meal by price.

                  Price of meal
Quality Rating    £5 – £9    £10 – £14    £15 – £19    £20 – £24    Total
Good              42         40           2            0            84
Very Good         34         64           46           6            150
Excellent         2          14           28           22           66
Total             78         118          76           28           300

To explore if the quality of a meal affects its price we will treat the quality as the explanatory variable and take row percentages.



Comparison of the quality of a meal by price

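                     Price of meal
Quality Rating       £5 – £9    £10 – £14    £15 – £19    £20 – £24    Total
Good (%)             50%        48%          2%           0%           100%
Very Good (%)        23%        43%          31%          4%           100%
Excellent (%)        3%         21%          42%          33%          100%
All Qualities (%)    26%        39%          25%          9%           100%

Each percentage is rounded to the nearest whole number, so the entries in a row may not add up to exactly 100%.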

Refer to Essential Quantitative Methods for Business, Management and Finance, third edition, Les Oakshott, chapter 5, for a discussion of good guidelines for tabulation.

Three (or more) variables can be summarised in more complicated tables. However, some care must be taken in interpreting such tables.


Graphical descriptions of discrete data

Instead of presenting data in a table it might be better to give a visual display in the form of a graph or chart. Visual displays are good for summarising the data and drawing attention to a particular point. They can also be useful for comparing data sets. Tables on the other hand usually give more detailed information about the data set.

The success of any presentation can be judged by how easy it is to understand. A good presentation should make information clear and allow us to see the overall picture. But good presentations do not happen by chance and need careful planning. If you look at a diagram or table and cannot understand it, it is most probable that the presentation is poor and the fault is with the presenter rather than the viewer.

Sometimes, even when a presentation seems clear, you can look closer and see that it does not give a true picture of the data. This may be a result of poor presentation but sometimes comes from a deliberate decision to present data in a form that is misleading. The problem is that diagrams are a powerful means of presenting data: first impressions last! But they only give a summary and this summary can be misleading, either intentionally or by mistake.




Pie charts

A pie chart is used to show pictorially the relative sizes of component elements of a total. The circle (or pie) represents the total of the data. The circle is then split into sectors (pieces of pie), the size of each one being drawn in proportion to the frequency of each data item.

Example

The costs of production at Factory A and Factory B during March of one year were as follows.

Table 7: Production costs at two factories

                  Factory A         Factory B
                  £000s     %       £000s     %
Materials         70        35      50        20
Labour            30        15      125       50
Overheads         90        45      50        20
Administration    10        5       25        10
Total             200       100     250       100

A pie chart presentation for the figures of each of these factories is presented below:

[Figure: pie chart for Factory A. Materials 35%, Labour 15%, Overheads 45%, Admin 5%]

[Figure: pie chart for Factory B. Materials 20%, Labour 50%, Overheads 20%, Admin 10%]

For a detailed discussion of how to draw these pie charts refer to Essential Quantitative Methods for Business, Management and Finance, third edition, Les Oakshott, chapter 5.
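As a brief sketch of the arithmetic behind a pie chart, each sector is given a share of the full 360° circle in proportion to its share of the total. The function name sector_angles and the dictionaries below are our own illustration, using the production costs from Table 7; the textbook should be consulted for the drawing method itself.

```python
def sector_angles(costs):
    """Return the angle in degrees of each sector, proportional to its cost."""
    total = sum(costs.values())
    return {item: 360 * cost / total for item, cost in costs.items()}


# Production costs in £000s, taken from Table 7.
factory_a = {"Materials": 70, "Labour": 30, "Overheads": 90, "Administration": 10}
factory_b = {"Materials": 50, "Labour": 125, "Overheads": 50, "Administration": 25}

for name, costs in [("Factory A", factory_a), ("Factory B", factory_b)]:
    print(name)
    for item, angle in sector_angles(costs).items():
        print(f"  {item:<15}{angle:>6.1f} degrees")
```

For Factory A, for instance, materials account for 35% of the total and so take up 35% of 360°, which is 126°.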

Pie charts are very good for comparing the relative sizes of elements of a total. They show very clearly when one element is a bigger or smaller proportion of the total. In this example they are very good for comparing the costs of production at the two factories. Factory A’s costs consist largely of overheads whereas at factory B labour is the largest cost. However, pie charts do have a number of disadvantages.

• Actual numbers or % associated with each category need to be presented on the diagram as they are virtually impossible to calculate from the pie chart.

• They are not a very good presentation method if there are too many different categories.

• The impression they can give is easily distorted, by presenting a 3 dimensional pie chart for example.




Bar Charts

The bar chart is one of the most common methods of presenting data in a visual form. There are 3 main types of bar chart:

• Simple bar charts
• Component bar charts, including percentage component bar charts
• Multiple (or compound) bar charts

A simple bar chart is a chart consisting of a set of non-joining bars. A separate bar for each data item is drawn to a height which is proportional to the frequency of the data item. The width of each bar is always the same. The bars are usually drawn vertically but they can be drawn horizontally.

Example

The following frequency distribution shows the number of computers sold by each of 5 computer companies in a sample of 50 sales.

Table 8: Number of computers sold by company

Company         Frequency
Apple           13
Compaq          12
Gateway         5
IBM             9
Packard Bell    11
Total           50

The corresponding bar chart for this data is

[Figure: simple bar chart of Table 8; horizontal axis: Apple, Compaq, Gateway, IBM, Packard Bell; vertical axis: frequency, 0 to 14]




Pareto Charts

A Pareto chart is essentially a bar chart but the categories are arranged according to frequency so that the tallest bar is at the left. Pareto charts can be extremely useful tools in business applications as attention is focused on the more important categories. Presented below is a Pareto chart of the above frequency distribution.

[Figure: Pareto chart of Table 8; horizontal axis: Apple, Compaq, Packard Bell, IBM, Gateway; vertical axis: frequency, 0 to 14]
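The only extra step needed to turn a simple bar chart into a Pareto chart is to sort the categories by frequency, largest first. A minimal Python sketch, using the figures from Table 8 and a crude text bar made of # characters (the names sales and pareto_order are our own), might look like this:

```python
# Frequencies from Table 8.
sales = {"Apple": 13, "Compaq": 12, "Gateway": 5, "IBM": 9, "Packard Bell": 11}

# Arrange the categories so that the largest frequency comes first.
pareto_order = sorted(sales.items(), key=lambda item: item[1], reverse=True)

# Print one text "bar" per company, tallest first.
for company, frequency in pareto_order:
    print(f"{company:<14}{'#' * frequency} {frequency}")
```

The resulting order is the one shown on the horizontal axis of the Pareto chart for this example.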

For a discussion of other forms of bar charts and their compilation refer to Essential Quantitative Methods for Business, Management and Finance, third edition, Les Oakshott.


Pictograms

Pictograms are becoming very popular as they are easy to generate with modern PCs. They are a very elementary form of visual representation but they can be informative and more effective than other methods of presenting data to the general public who, by and large, may lack the understanding and interest demanded by the less attractive forms of presentation. However, they are not accurate forms of presentation. Furthermore, they provide lots of scope for confusion or misleading interpretations of the data.

Type 1 Pictogram

A picture is selected which represents the data. The picture is then repeated the required number of times.



[Figure: pictogram of the number of chairs sold by Fred's factory for each year from 1997 to 2001; key: one picture = 5000 chairs]

Type 2 Pictogram

The representative picture is magnified instead of being repeated.

[Figure: pictogram of WONDA WASH detergent sales (kg); Year 1: 26,500; Year 2: 149,800; the picture for Year 2 is a magnified copy of the picture for Year 1]




Seminar Questions


A large express delivery company is intending to invest money to upgrade the quality of service it offers its customers. The following frequency distributions show:

A: How a sample of customers responded when asked what was important to them in terms of the quality of the service they received.

B: Results of some further investigations by the company.

Frequency Distribution A

Most important service requested by customers    Frequency
Lower cost                                       19
Less damage to goods                             11
Correct billing                                  60
Faster delivery                                  9
On time delivery                                 65

Frequency Distribution B

Reasons for late delivery                Frequency
Driver unavailable                       10
Van unavailable                          8
Waiting for supplies from another van    30
Van damage                               5

a) Draw a Pareto chart for each of the above frequency distributions.

b) What action should the company take to improve quality of service?





A random sample of 500 households was surveyed and data on three variables, household size, household income and number of cars owned were collected. The data are summarised in the following table.

                                        Number of cars
Income               Household size     2 or fewer    More than 2
Less than £20,000    4 or fewer         125           100
                     More than 4        15            60
£20,000 or more      4 or fewer         100           50
                     More than 4        10            40

a) Make a two way table of income by number of cars owned by summing entries in the table so that household size is not considered. Include the marginal totals.

b) Discuss the nature of the apparent association between household income and number of cars owned.

c) Make a two way table of household size by number of cars owned by summing entries in the table so that income is not considered. Include the marginal totals.

d) Discuss the nature of the apparent association between household size and number of cars owned.



A children’s charity is concerned about the number of children living in poverty. It decided to undertake an analysis to draw conclusions about what sort of families were most likely to have children affected by poverty. The following table was compiled using government statistics. The table includes all families whose income was below 50% of the national average. Households have been split into four categories and the number of children in each of these categories presented.



Status of household Number of children

Parent in full time work 1,026,000

Lone parent 938,000

Unemployed Parent 763,000

Pensioner 320,000

a) Convert the figures into a percentage of the total and represent these percentages on a bar chart. What other graphical method could you have used?

b) What does this graph tell you about the sort of household most likely to have children living in poverty?

The following table gives you additional information about the total number of children in each of these categories in the U.K.

Status of household         Number of children in poverty    Total number of children
Parent in full time work    1,026,000                        9,330,000
Lone parent                 938,000                          1,250,000
Unemployed Parent           763,000                          930,000
Pensioner                   320,000                          470,000
Total                                                        11,980,000

c) Using this table calculate the percentage of children in each category that are in poverty. For example, the calculation for the lone parent household category would be

(938,000 ÷ 1,250,000) × 100 = 75%

d) Draw these results as a bar chart. How do you interpret these figures?

e) Which of these two analyses do you think is more informative? Explain your answer.





The equal opportunities officer of a public sector organisation has been examining the results of applications for promotion over the last year.

‘In the last year, there were 450 female applicants for promotion, of whom 40 were successful and 410 were unsuccessful. There were 760 male applicants for promotion, of whom 124 were successful and 636 unsuccessful.’

a) Present this information in a suitable tabular form.

b) Summarise your table in such a way that comparisons between women and men can be made more easily.

c) Comment on the result obtained.

ENDSECTION STARTSECTION=activity_7.htm= SECTION~


A health authority has 5 hospitals in its district. The number of beds in each hospital is classified as follows.

A component and a component percentage bar chart are provided below. Referring to these diagrams, write a brief report to the health authority on the provision of beds for its patients. Briefly comment on the differences between these graphical representations.

                   Hospitals
Category of bed    Foothills    General    Southern    Heathview    St Johns
Maternity          24           38         6           0            0
Surgical           86           85         45          30           24
Medical            82           55         30          30           35
Psychiatric        25           22         30          65           76



[Figure: component bar chart; vertical axis: 0 to 250; legend: Psychiatric, Medical, Surgical, Maternity]

[Figure: percentage component bar chart; vertical axis: 0% to 100%; legend: Psychiatric, Medical, Surgical, Maternity]




• summarising discrete data in frequency tables;
• using cross tabulations to summarise information on 2 variables (one of which may be continuous);
• using % to compare counts and being able to identify whether to use row or column totals to calculate % when discussing cross tabulations;
• drawing and interpreting bar charts;
• discussing pie charts and pictograms;
• knowledge of the advantages and disadvantages of different methods of graphical summary.



Extra Activities

Log on to the STX1110 Oasis page and attempt the quizzes in the extra section.







CHAPTER=SPD2



In the last unit we began to look at how tables and graphs can be used to present the information in a data set in a manner which makes it easier to see the important features of the data set. The data sets we considered in the last unit were similar in that they consisted of category data or discrete data with a limited range of values. When the number of distinct data values in the data set is large (20 or more say), or the data is continuous in nature, the techniques covered in the last unit are of little use, so we need to do something different.


Objectives


• use grouped frequency distributions to summarise numerical data;
• present the information in a grouped frequency table graphically using a histogram;
• construct a histogram for grouped frequency tables with unequal class widths;
• use frequency polygons for comparing two grouped frequency distributions;
• appreciate how the choice of scale on a graph affects the visual impression of the graph.

ENDSECTION STARTSECTION=content_1.htm= SECTION~

Grouped Frequency Distribution

A grouped frequency distribution organises the data items into groups or classes of values. It then shows how many data items fall within each class; this count is referred to as the class frequency.

Example 1

The following data refer to the age of 25 employees:

63 27 46 47 22 64 30 19 69 36 65 60 40 66 55 33 47 42 49 23 22 46 62 30 20



To present this as a grouped frequency distribution we need to decide on what the classes should be. There is no set rule as to how you should choose your classes but it will depend to some extent on the size of the data set. Generally a grouped frequency is easier to read if the class intervals are in round numbers, i.e. multiples of 5 or 10 or 100 etc. In the above example we have ages ranging from 19 to 69. If we choose intervals of size 5 we will need a lot of classes (about 12) to cover a range of 19 to 69 and the data set is itself not very large so we would be spreading the data out among too many classifications. So instead we will use a class interval of size 10. This would mean that the classes of the grouped frequency would be;

10 – 19 (this includes the 10 values 10, 11, 12, 13, 14, 15, 16, 17, 18, 19),

20 – 29,

30 – 39, etc

Counting how many of the observations lie in each class produces the following grouped frequency distribution:

Table 1: Grouped frequency distribution of the age of 25 employees

Age of employee

Number of employees

10 – 19 1

20 – 29 5

30 – 39 4

40 – 49 7

50 – 59 1

60 – 69 7

Total 25
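The counting that produces Table 1 can be sketched in Python. This is only an illustration under our own naming (grouped_frequencies, with parameters start, width and classes); it assumes whole-number data and classes of equal width, as in this example.

```python
# Ages of the 25 employees in Example 1.
ages = [63, 27, 46, 47, 22, 64, 30, 19, 69, 36, 65, 60, 40,
        66, 55, 33, 47, 42, 49, 23, 22, 46, 62, 30, 20]


def grouped_frequencies(values, start, width, classes):
    """Count the values in each class: start to start+width-1, and so on."""
    frequencies = [0] * classes
    for value in values:
        index = (value - start) // width   # which class the value belongs to
        if not 0 <= index < classes:
            raise ValueError(f"{value} lies outside the chosen classes")
        frequencies[index] += 1
    return frequencies


frequencies = grouped_frequencies(ages, start=10, width=10, classes=6)
for i, frequency in enumerate(frequencies):
    lower = 10 + 10 * i
    print(f"{lower} - {lower + 9}: {frequency}")
print("Total:", sum(frequencies))
```

Changing width to 5 (with start=15 and classes=11) shows how thinly the 25 ages would be spread if narrower classes were used.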

Grouping allows you to see any pattern in the data. However, it is important to realise that grouping results in a loss of information. In the above example we know that there are 7 employees whose age is between 40 and 49 but, without access to the original data set, we do not know their exact ages. A clearer pattern has been bought at the cost of a loss of information. Consequently if you should wish to use a grouped frequency distribution to perform any mathematical calculations the answers will not be exact.



Example 2

The following grouped frequency distribution shows the height of 50 individuals. A different convention has been used to describe the classes because height is a continuous variable.

Table 2: Height of 50 individuals

Height (cm) Number of individuals

Up to and including 155 1

Over 155, up to and including 166 3




Over 195 4

Total 50

Choosing the intervals for a grouped frequency

The choice of the intervals or classes in a grouped frequency is entirely up to you. However, you do want to produce a grouped frequency that is easy to read so choosing intervals in 10s, 100s, etc is usually a good idea. There are some general rules to follow when compiling a grouped frequency distribution:

• You should not have too many classes and you should not have too few classes. If you have too few classes too much information is lost and hence important details of the data will also be lost. If you have too many classes the resulting grouped frequency distribution has too much detail and patterns in the data set are hard to observe. Somewhere between 5 and 12 classes should be enough.

• Classes must not overlap. If for example you had the two classes 10 – 20 and 20 – 30 which class would a data item with a value of 20 belong to? You can’t put it in both as this would result in double counting!

• Open-ended classes can be used but only at the two ends of the distribution.

For a fuller discussion of how to prepare grouped frequency distributions and the choice of class intervals refer to Essential Quantitative Methods for Business, Management and Finance, third edition, Les Oakshott, chapter 5.




Definitions associated with grouped frequency distribution

To illustrate some ideas associated with grouped frequency distributions we will consider the first three classes of the distribution in example 1.

10 – 19 20 – 29 30 – 39

Class Limits: These are the upper and lower values of the classes as physically described in the distribution. So for the 10 – 19 class above the lower limit is 10 and the upper limit is 19. You should be able to detail the limits for the remaining classes.

Class Boundaries: These are the upper and lower values of a class that mark common points between classes. Considering the 20 – 29 class, the lower boundary would be the common point of where the previous class (10 – 19) and this class (20 – 29) meet. This would be considered to be the middle of where the first class ends (19) and the next class starts (20) so the lower boundary is 19.5 which is half way between 19 and 20. A similar argument can be applied to the upper boundary which would then give 29.5 as the value for the upper boundary. Boundaries for example 2 would be more difficult. We will only be using boundaries when discussing the construction of histograms so we shall return to this later.

Class midpoint: This is what it says it is: the value half way through the range of the class. It is calculated as

(Upper class limit + Lower class limit) / 2

So for the 10 – 19 class, the class midpoint would be

(19 + 10) / 2 = 14.5.

We will be using class midpoints to draw frequency polygons. They are also useful for performing calculations with grouped frequency distributions. As we have already said, the grouping results in a loss of information. If we did have to do a calculation with the distribution in example 1 we would know that there are 7 data items in the range 40 – 49 but we don’t know their value. The best we can do is say that each of the 7 items has a value of 44.5 which is the midpoint of the class.



Class Width: This is the difference in the boundaries of the class or the difference in the lower limit of this class and the following class. So for the 20 – 29 class the class width is either calculated as;

the difference in the boundaries,

29.5 – 19.5 = 10

or as the difference in the lower limits of this class and the following class,

30 – 20 = 10.

ENDSECTION STARTSECTION=content_4.htm= SECTION~
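The definitions of class boundaries, class midpoint and class width can be collected into a short Python sketch. The function describe_class and its gap parameter are our own illustration, not a standard routine; gap is the distance between the upper limit of one class and the lower limit of the next, which is 1 for the age classes of example 1.

```python
# Class limits of the first three classes in example 1, as (lower, upper) pairs.
class_limits = [(10, 19), (20, 29), (30, 39)]


def describe_class(lower, upper, gap=1):
    """Return (lower boundary, upper boundary, midpoint, width) for a class."""
    lower_boundary = lower - gap / 2    # half way to the previous class
    upper_boundary = upper + gap / 2    # half way to the next class
    midpoint = (upper + lower) / 2
    width = upper_boundary - lower_boundary
    return lower_boundary, upper_boundary, midpoint, width


for lower, upper in class_limits:
    low_b, up_b, midpoint, width = describe_class(lower, upper)
    print(f"{lower} - {upper}: boundaries {low_b} to {up_b}, "
          f"midpoint {midpoint}, width {width}")
```

For the 20 – 29 class this gives boundaries of 19.5 and 29.5, a midpoint of 24.5 and a width of 10, in line with the calculations above.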

Histograms

A grouped frequency distribution can be represented graphically using a histogram. A histogram is similar to a bar chart with a rectangle (bar) being used to represent the frequency of each class. However, the rectangles of a histogram join up to distinguish it from a bar chart and remind us that we are typically dealing with continuous data. The horizontal axis of a histogram represents a continuous number scale and it is important to be aware that the bars on a histogram do not represent separate categories (as on a bar chart) but rather adjacent intervals on a number line. In other words, although a bar chart and a histogram look similar, they are designed to represent two different types of data. The bar chart is useful for depicting separate categories while the histogram describes the ‘shape’ of data that have been measured on a continuous number scale.


Histogram with classes of equal width

Provided each class in the grouped frequency distribution is of equal width, the height of the rectangle for each class represents the frequency of that class. In this respect, histograms are similar to bar charts for distributions with classes of equal width. The classes of the grouped frequency distribution are represented along the horizontal axis on a continuous number scale and the frequencies are presented along the vertical axis. The histogram for example 1 would be as follows.



[Figure: histogram for the ages of 25 employees; horizontal axis: age, labelled 14.5, 24.5, 34.5, 44.5, 54.5, 64.5; vertical axis: frequency, 0 to 7]


Histogram with classes of unequal width

If the frequency distribution has unequal class widths the area and not the height of each rectangle represents the frequency of each class. Put another way, the heights of the bars have to be adjusted for the fact that the bars do not have equal width.

Example 3

Consider the following distribution which shows the length of time in minutes that a computer help line advisor spends on the phone with each caller.

Time in Minutes Number of callers

Less than 10 8

10 or more, but less than 20 10










A histogram of this distribution would be as follows:

[Figure: histogram of the original distribution; horizontal axis: time in minutes, 0 to 100; vertical axis: 0 to 20]

It is clear from the distribution that few callers are on the phone for more than 50 minutes. This is also apparent from the histogram as it has a long tail at the higher end of the number scale where there are only 8 values in the last 4 classes. So it is decided to combine the last four classes together so that the distribution is now


Less than 10 8





50 or more, but less than 90 4 + 2 + 1 + 1 = 8

When drawing the histogram we cannot change the numerical scale on the horizontal axis so the last class is now four times as wide as the others.



If the heights of the bars were to still represent the frequency of each class the resulting histogram would be:

[Figure not reproduced. Horizontal axis: Time in minutes, from 0 to 100. Vertical axis scale: 0 to 20.]

It seems that the process of collapsing several class intervals together has the misleading effect of accentuating the importance of the wider interval. Since this interval is 4 times as wide as the others, it makes sense to adjust for this by reducing the height of this bar to a quarter of the height shown in the above histogram. The adjusted height is called the frequency density of the class. If the classes of a distribution are of unequal width then the frequency density of each class should be calculated prior to drawing the histogram. The frequency densities are calculated as

Frequency density = frequency of class/width of class


Width Frequency density

Less than 10 8 1 8/1 = 8

10 or more, but less than 20 10 1 10/1 = 10





NB: The width of the second class is really 20 – 10 = 10 and the width of the last class is 90 – 50 = 40. However, it is easier to say that the first 5 classes all have the same width which we call 1 and the last class is 4 times as wide.
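The frequency density calculation can be sketched in a few lines of Python. This is only an illustration and not part of the original notes: the function name `frequency_density` is our own choice, and only the classes whose figures are quoted in the text (the first two classes and the combined last class) are used.

```python
def frequency_density(frequency, width):
    """Frequency density = frequency of class / width of class."""
    return frequency / width

# Widths are measured in units of one standard class (10 minutes),
# so the combined last class has width 4.
classes = [
    ("Less than 10", 8, 1),
    ("10 or more, but less than 20", 10, 1),
    ("50 or more, but less than 90", 8, 4),
]

densities = {label: frequency_density(freq, width)
             for label, freq, width in classes}

for label, density in densities.items():
    print(f"{label}: frequency density {density}")
```

The height of the bar for the combined class falls to a quarter of its frequency, while the bars for the classes of standard width are unchanged.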



The resulting histogram of the frequency densities would then be:

[Figure not reproduced. Horizontal axis: Time in minutes, from 0 to 100. Vertical axis scale: 0 to 20.]

Refer to Essential Quantitative Methods for Business, Management and Finance, third edition, by Les Oakshott, chapter 5, for further discussion of histograms.


Frequency polygons

Instead of using a histogram, which is a series of rectangles, it might be preferable to display the frequency distribution as a single curve. This is known as a frequency polygon. Each class is represented by a single point and the height of each point represents the class frequency. The position of the point must be directly above the class midpoint. The points are then joined up to form the frequency polygon.

Example The following distribution details the prices of 80 cars sold at a car show room last month

Selling price (£ thousands) Midpoint Number of cars

6 and up to but not including 8 7 8







Total 80



A frequency polygon of this distribution follows. Notice that in order to complete the polygon, midpoints of 5 and 21 were added to the x-axis to ‘anchor’ the polygon at zero frequencies.
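The horizontal positions of the polygon's points can be worked out as in the following sketch. It assumes that the classes all have width 2 and run from 6 up to 20 (£ thousands); this is inferred from the first class (6 up to 8, midpoint 7) and the anchor midpoints 5 and 21, and is not stated explicitly. The function name `polygon_midpoints` is hypothetical.

```python
def polygon_midpoints(lower, upper, width):
    """Class midpoints, plus one extra midpoint at each end to anchor
    the polygon at zero frequency.

    Assumes (upper - lower) is a whole number of class widths.
    """
    n_classes = int((upper - lower) / width)
    mids = [lower + width * (i + 0.5) for i in range(n_classes)]
    return [mids[0] - width] + mids + [mids[-1] + width]

# Assumed class layout: 6 up to 8, 8 up to 10, ..., 18 up to 20
x_positions = polygon_midpoints(6, 20, 2)
print(x_positions)
```

Each class frequency is then plotted above its midpoint, with a frequency of zero plotted above the two anchor midpoints.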

Frequency polygon of the selling prices of cars

[Figure not reproduced. Horizontal axis: Selling price (£000s), from 0 to 25. Vertical axis: Frequency, from 0 to 25.]

The frequency polygon is particularly useful if you want to compare two frequency distributions. The following graph compares the car sales at two car show rooms. The total number of cars sold at these two show rooms is similar so a direct comparison is possible. If the difference in the total number of cars sold at each show room were large, converting the frequencies to relative frequencies and then plotting the two polygons would allow a clearer comparison.

Frequency polygons comparing car sales at two showrooms

[Figure not reproduced. Horizontal axis: Selling price (£000s), from 0 to 30. Vertical axis: Frequency, from 0 to 30.]


The effect of scale on line graphs

The last sort of graphical representation we will consider in this unit is a line graph. You will have seen examples of these in newspapers and textbooks. A classic use of the line graph is a time series plot, which displays figures recorded over time, such as monthly sales figures.




Seminar Questions


The following frequency distributions represent the number of days during a year that employees of a large retail company were absent due to illness. The table includes a summary for two departments, namely Customer Relations and Finance.

Number of days     Number of employees       Number of employees
absent             in Customer Relations     in Finance

0 and up to 3               5                        5
3 and up to 6               6                       12
6 and up to 9              11                       23
9 and up to 12             15                        8
12 and up to 15            10                        2

Total                      47                       50

a) Construct a histogram of the distribution for the employees in customer relations.

b) Construct a frequency polygon comparing the distributions for the two departments.

c) Discuss the rate of employee absenteeism for the two departments. ENDSECTION STARTSECTION=activity_2.htm= SECTION~


Driving under the influence of alcohol is a serious offence. The following data gives the ages of a random sample of 50 drivers arrested whilst driving under the influence of alcohol.

46 16 41 26 22 33 30 22 36 34 63 21 26 18 27 24 31 38 26 55 31 47 27 43 35 22 64 40 58 20 49 37 53 25 29 32 23 49 39 40 24 56 30 51 21 45 27 34 47 35

a) Construct a frequency distribution of these age figures and determine the relative frequency distribution.

b) Draw a histogram of the frequency distribution.





A bank is studying the number of times their automatic cash point located in a supermarket is used each day. The following data set details how many times it was used on each of the last 30 days.

83 64 84 76 84 54 75 59 70 61

63 80 84 73 68 52 65 90 52 77

95 36 78 61 59 84 95 47 87 60

a) Produce a frequency distribution for the number of times the cash point was used.

b) What were the smallest and largest numbers of times that the machine was used?

c) Around what values did the number of times the machine was used tend to cluster?

d) From the distribution, how many times would you say the machine was used on a typical day?



As a preliminary to a review of recruitment policy, a study was made of the age structure of a firm. The results for the 1,000 staff were:

Age in years Number of staff

20 but less than 25 60







Total 1000

Draw a histogram of this distribution, commenting on any difficulties you encounter.





The following data set details the strength of the wind for the 31 days of July in a given year.

Wind type Number of days

Strong wind 10

Calm 5

Gale 7

Light breeze 9

Total 31

A series of 7 graphical representations of this data set follow. Some of the graphs suggested are quite sensible whilst others are not. Look at the suggested graphs and try to decide which are useful and which are not.

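Graph A
[Figure not reproduced. Categories: Strong wind, Calm, Gale, Light breeze. Axis: Days, from 0 to 10.]

Graph B
[Figure not reproduced. Categories: Strong wind, Calm, Gale, Light breeze, Total. Axis: Days, from 0 to 35.]

Graph C
[Figure not reproduced. Categories: Strong wind, Calm, Gale, Light breeze, Total. Axis: Days, from 0 to 35.]

Graph D
[Figure not reproduced. Categories: Calm, Light breeze, Strong wind, Gale. Axis: Days, marked from 4 to 10.]

Graph E
[Figure not reproduced. Categories: Calm, Light breeze, Strong wind, Gale. Axis: Days, from 0 to 10.]

Graph F
[Figure not reproduced. Labels: Strong wind, Calm, Gale, Light breeze, Total.]

Graph G
[Figure not reproduced. Labels: Strong wind, Calm, Gale, Light breeze.]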




• compiling grouped frequency distributions of continuous data;
• constructing histograms of grouped frequency distributions;
• adjusting the frequencies for grouped frequency distributions to produce an undistorted histogram;
• using frequency polygons to compare distributions by considering questions of where the data cluster and how the data spread;
• appreciating how the choice of scale influences the visual impression of a graph.



Extra Activities








CHAPTER=SPD3



In the previous unit we considered how to organise data into a grouped frequency distribution and to present this graphically using a histogram. The major advantage of presenting data this way is that we get a quick visual picture of the shape of the distribution. That is, we can see where the data are concentrated and also determine whether there are any extremely large or small values. However, there are two disadvantages to organising data into a frequency distribution:

• we lose the exact identity of each data value;
• we are not sure how the data values within each class are distributed.

A stem and leaf display, or a stemplot, can be used as an alternative to a histogram. It allows us to see the general shape of the distribution of the data, but it has the advantage of not losing the value of each data item.


Objectives


• compile a stem and leaf display of numerical data;
• recognise the similarities and differences between a stem and leaf display and a frequency distribution;
• use back to back stem and leaf displays to compare two data sets by considering ideas of where the data are clustered and how spread out the data are;

• understand and be able to calculate a median value. ENDSECTION STARTSECTION=content_1.htm= SECTION~

Stem and Leaf Display

A stem and leaf display is a statistical technique to present a set of data. Each numerical value is divided into two parts. The leading digit(s) becomes the stem and the trailing digit(s) the leaf. The stems are located along the vertical axis, and the leaf for each observation along the horizontal axis. The following example will explain the details of developing a stem and leaf display.



Example 1

To illustrate the use of a stem and leaf display we shall return to the data set concerning the age of 25 employees which we used in the previous unit:

63 27 46 47 22 64 30 19 69

36 65 60 40 66 55 33 47 42

49 23 22 46 62 30 20

The data consists of two digit numbers so it is fairly obvious how we will split the numbers into stems and leaves. The first digit (the ‘tens’ digit) will form the stems and the second digit (the ‘units’ digit) will form the leaves. For example, for the data item 63, the stem is the 6 and the leaf is 3. For a stem and leaf display, write all the possible stems, in order, on the left hand side of a vertical line. Then go through the data values, in the order they are given, and record the leaf of the value opposite the corresponding stem. The first five values (63, 27, 46, 47, 22) are put in like this:

1 |
2 | 7 2
3 |
4 | 6 7
5 |
6 | 3

When all the values have been recorded the display looks like this:

1 | 9
2 | 7 2 3 2 0
3 | 0 6 3 0
4 | 6 7 0 7 2 9 6
5 | 5
6 | 3 4 9 5 0 6 2




To complete the stem and leaf display we need to indicate the scale, which is done by stating the unit value of a leaf. This is very important: no stem and leaf display is complete without an indication of what the scale of the numbers is. It is also a good idea to order the leaves from smallest to largest and to include a count column which indicates how many data items are on each stem. Then, to finish it all off, include a title!

Stem and leaf display of ages of part time employees in years

                    Count
1 | 9               (1)
2 | 0 2 2 3 7       (5)
3 | 0 0 3 6         (4)
4 | 0 2 6 6 7 7 9   (7)
5 | 5               (1)
6 | 0 2 3 4 5 6 9   (7)

Leaf unit = 1 year

Notice that, as well as sorting the data into order, the stem and leaf provides a visual display of the data: it is easy to compute the numbers of employees in different age groups. Note that it is therefore essential to space out the leaves evenly. The leaves on each stem can be counted and these counts have been shown in brackets on the right of the leaves. Each count shows the frequency associated with each stem. In this example the counts show the number of employees in each age group. So there is 1 employee in the age range 10 – 19, 5 in the age range 20 – 29 etc. If you take this information and present it in a table you get:

Age of employee Number of employees

10 – 19 1

20 – 29 5

30 – 39 4

40 – 49 7

50 – 59 1

60 – 69 7

Total 25

This is exactly the same as the grouped frequency distribution for this data set presented as table 1 in the lecture for the previous unit. So stem and leaf displays can be useful as a method of compiling grouped frequency distributions.
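The procedure described above (split each value into a stem and a leaf, record the leaves against their stems, sort them and count them) can be sketched in Python. This is a minimal illustration rather than a definitive implementation; the function name `stem_and_leaf` and the layout of the printed rows are our own choices, and the sketch assumes non-negative whole numbers with the last digit as the leaf.

```python
from collections import defaultdict

ages = [63, 27, 46, 47, 22, 64, 30, 19, 69,
        36, 65, 60, 40, 66, 55, 33, 47, 42,
        49, 23, 22, 46, 62, 30, 20]

def stem_and_leaf(values):
    """Return {stem: sorted list of leaves}, using the tens as stems
    and the units as leaves."""
    rows = defaultdict(list)
    for value in values:
        stem, leaf = divmod(value, 10)
        rows[stem].append(leaf)
    # Include every stem between the smallest and largest, even if it
    # has no leaves, so that gaps in the data remain visible.
    return {stem: sorted(rows[stem])
            for stem in range(min(rows), max(rows) + 1)}

display = stem_and_leaf(ages)
counts = [len(leaves) for leaves in display.values()]

for stem, leaves in display.items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}  ({len(leaves)})")
print("Leaf unit = 1 year")
```

The list `counts` holds the frequency for each stem, which is the grouped frequency distribution shown in the table above.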



Suppose boxes are put around each row of leaves as follows:

Stem and leaf display of ages of part time employees in years

1 | 9
2 | 0 2 2 3 7
3 | 0 0 3 6
4 | 0 2 6 6 7 7 9
5 | 5
6 | 0 2 3 4 5 6 9

Leaf unit = 1 year

Now remove the leaves but keep the boxes. This shows the shape of the stem and leaf display but not the individual values. Each stem is represented by a box or bar whose length represents its frequency. This is what we have previously referred to as a histogram but here it is presented on its side rather than vertically. Notice that the stems have been replaced by intervals like 20 – 29, which in this example represents age groups.

Ages of 25 employees

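[Figure not reproduced. A histogram drawn on its side, with one horizontal bar for each of the age groups 10 – 19, 20 – 29, 30 – 39, 40 – 49, 50 – 59 and 60 – 69.]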

So stem and leaf displays are very similar to grouped frequency distributions and histograms. However, they have certain advantages over histograms.

• The actual values of the raw data from which it has been drawn have been preserved.

• When the stem and leaf display is drawn in its sorted form, the data are displayed in rank order from lowest to highest. As we will see later, this can be very useful.




Comparing data sets

When two data sets are to be compared, they can be drawn onto separate stemplots and placed back to back.

Example 2

The following data set details the age of 40 successful women. The data set was taken from the birthday column of The Independent newspaper over a period of several consecutive days.

77 32 55 55 59 67 55 60 51 82

66 66 100 29 61 47 52 46 53 63

74 47 58 72 55 50 36 52 58 48

80 41 54 53 70 68 42 62 98 45


ENDSECTION STARTSECTION=activity_1.htm= SECTION~ Exercise S5.1

Show that the stem and leaf display of this data set is:

 2 | 9
 3 | 2 6
 4 | 1 2 5 6 7 7 8
 5 | 0 1 2 2 3 3 4 5 5 5 5 8 8 9
 6 | 0 1 2 3 6 6 7 8
 7 | 0 2 4 7
 8 | 0 2
 9 | 8
10 | 0

Leaf unit = 1 year



The same newspaper also gave the ages of 40 successful men. This data set is :

38 43 46 48 49 49 50 51 51 51

51 51 53 53 56 58 62 64 64 66

66 66 67 69 69 71 71 74 74 76

77 79 80 80 80 81 81 82 82 87

If you present this data set in a stem and leaf display you get the following:

3 | 8
4 | 3 6 8 9 9
5 | 0 1 1 1 1 1 3 3 6 8
6 | 2 4 4 6 6 6 7 9 9
7 | 1 1 4 4 6 7 9
8 | 0 0 0 1 1 2 2 7

Leaf unit = 1 year

Trying to compare the two data sets from separate stem and leaf displays can be difficult. It is much easier if we draw them on back to back stem and leaf displays as shown below.

Back to back stem plot comparing the ages of successful men and women

                  Men |    | Women
                      |  2 | 9
                    8 |  3 | 2 6
            9 9 8 6 3 |  4 | 1 2 5 6 7 7 8
  8 6 3 3 1 1 1 1 1 0 |  5 | 0 1 2 2 3 3 4 5 5 5 5 8 8 9
    9 9 7 6 6 6 4 4 2 |  6 | 0 1 2 3 6 6 7 8
        9 7 6 4 4 1 1 |  7 | 0 2 4 7
      7 2 2 1 1 0 0 0 |  8 | 0 2
                      |  9 | 8
                      | 10 | 0

Leaf unit = 1 year



When interpreting a back to back stem plot you need to ask yourself two questions.

• Do the two data sets appear to be clustering in the same place? • Do the two data sets appear to have similar spread?

It is clear from the back to back stem plot that the women’s ages are more widely spread than those of the men (a range of 29 – 100 years as opposed to 38 – 87 for the men). Also, the women were, overall, slightly younger than the men. There are 15 men aged 70 or over as opposed to only 8 women.
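The two questions about clustering and spread can also be checked numerically. The sketch below is only an illustration: the function name `summarise` and the choice of 70 as the threshold are ours. It finds the smallest and largest ages in each data set and counts how many people are aged 70 or over.

```python
women = [77, 32, 55, 55, 59, 67, 55, 60, 51, 82,
         66, 66, 100, 29, 61, 47, 52, 46, 53, 63,
         74, 47, 58, 72, 55, 50, 36, 52, 58, 48,
         80, 41, 54, 53, 70, 68, 42, 62, 98, 45]

men = [38, 43, 46, 48, 49, 49, 50, 51, 51, 51,
       51, 51, 53, 53, 56, 58, 62, 64, 64, 66,
       66, 66, 67, 69, 69, 71, 71, 74, 74, 76,
       77, 79, 80, 80, 80, 81, 81, 82, 82, 87]

def summarise(ages, threshold=70):
    """Smallest age, largest age, and the number of people aged
    `threshold` or over."""
    older = sum(1 for age in ages if age >= threshold)
    return min(ages), max(ages), older

for label, ages in (("Men", men), ("Women", women)):
    low, high, older = summarise(ages)
    print(f"{label}: ages {low} to {high}, {older} aged 70 or over")
```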

Stem and leaf displays should be used in a fairly flexible way; there are no hard and fast rules. The examples we have covered here have used data with only two digit numbers. With values like these it is fairly obvious that the stems should represent the ‘tens’ and the leaves should correspond to the ‘units’. However, if the data to be displayed were all decimal numbers less than 1 or very large numbers bigger than, say, 1000, then the stems and leaves would need to be redefined accordingly. For three digit numbers the stems could be the hundreds and the leaves the tens, with the unit digits dropped.

Sometimes there are too many leaves to fit onto a single stem so you can split each stem into two: leaves with values 0 to 4 go on the first half of the stem and those with values 5 to 9 go on the second half of the stem. You can continue and split the stem into smaller and smaller categories if necessary.

Example 3

Suppose we want to compile a stem and leaf display of the ages of 20 students in a class.

17 18 19 19 20 20 21 22 22 23

23 24 24 24 27 28 29 29 31 32

If we represent this as a stem and leaf display using the tens as the stem we would get;

Ages of 20 students

1 | 7 8 9 9
2 | 0 0 1 2 2 3 3 4 4 4 7 8 9 9
3 | 1 2

Leaf unit = 1 year



As the range of ages is quite limited we end up with a stem and leaf display with a lot of information on the middle row. Maybe it would be better if we split each stem in half. That is, instead of looking at how many students are aged between 20 and 29, look at how many students are in each of the ranges 20 – 24 and 25 – 29. This would give us:

Ages of 20 students

1  |
1* | 7 8 9 9
2  | 0 0 1 2 2 3 3 4 4 4
2* | 7 8 9 9
3  | 1 2
3* |

Leaf unit = 1 year

The notation 2* is used to indicate that the stem has been split into two and it is assumed that the split is even so that the part of the stem labelled as 2 has leaves from 0 to 4 and the part of the stem labelled as 2* has leaves from 5 to 9. If the split is more complicated than this you may need to include a key explaining the split.
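The even split described above can be sketched as follows. This is an illustrative sketch only: the function name `split_stem_rows` is hypothetical, and it assumes two digit whole numbers with leaves 0 to 4 on the plain stem and leaves 5 to 9 on the starred stem.

```python
students = [17, 18, 19, 19, 20, 20, 21, 22, 22, 23,
            23, 24, 24, 24, 27, 28, 29, 29, 31, 32]

def split_stem_rows(values):
    """Map each half-stem label (for example "2" or "2*") to its
    sorted leaves."""
    rows = {}
    # Create both halves of every stem first, so empty halves still appear.
    for stem in range(min(values) // 10, max(values) // 10 + 1):
        rows[str(stem)] = []
        rows[f"{stem}*"] = []
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        label = str(stem) if leaf <= 4 else f"{stem}*"
        rows[label].append(leaf)
    return rows

rows = split_stem_rows(students)
for label, leaves in rows.items():
    print(f"{label:<2} | {' '.join(str(leaf) for leaf in leaves)}")
print("Leaf unit = 1 year")
```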


Numerical summaries based on ordered data

Simple numerical summaries of data can be obtained if the data is first sorted into ascending numerical order. As a stem and leaf display produces the ordered data this is the perfect opportunity to begin to discuss these ideas. If the data are ordered from smallest value to largest value, three important summary values are;

• the minimum: the smallest data item;
• the maximum: the largest data item;
• the median: the middle data item.

Example 4

Suppose we have the following set of 5 exam marks.

65% 61% 60% 58% 71%

If we order these values from smallest to largest we get

58% 60% 61% 65% 71%



The minimum is obviously 58% and the maximum is 71%. The median is the middle value. As this data set is quite small it is easy to see that the middle value is 61%. However, if the data set were larger it would be much harder to pick out the median value by eye. So we need to find another method for calculating a median. For larger data sets, provided we know the position of the median we can then locate it in the list of ordered data.

We calculate the position of the median as (n + 1)/2 where n is the number of data items.

In this example, there are n = 5 observations so the median is at position

(5+1)/2 = 6/2 = 3.

So the median is the third data item. The value of the third data item is 61%, so the median is 61%.

In this example the number of data items, n = 5, was odd so there is a unique data item in the middle of the data. What happens when the number of data items is even?



Exercise S5.2

Consider the following set of exam marks.

58% 60% 61% 65% 71% 73%

Again the minimum and the maximum values are easily found but what value would you now quote as a median?

A difficulty arises as there is no unique middle value. We have two middle values: 61% and 65%. When this happens the convention is to take the average of the two middle values so that the median is

(61% + 65%)/2 = 63%.

What happens to our technique for finding the position of the median if the data set has an even number of data items? In this example n = 6 so the position of the median is

(n + 1)/2 = (6 + 1)/2 = 7/2 = 3.5.



The decimal point in this calculation alerts you to the fact that you have an even number of data items so you need to take an average of two values. So we will take the average of the 3rd and the 4th data items as shown above.
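The position rule covers both the odd and the even case, as the following sketch shows. It is a minimal illustration under our own naming (the function `median` is not taken from the notes), with the percentage marks written as plain numbers.

```python
def median(values):
    """Median found from its position (n + 1)/2 in the ordered data."""
    ordered = sorted(values)
    n = len(ordered)
    position = (n + 1) / 2             # 1-based; ends in .5 when n is even
    lower = ordered[int(position) - 1]
    if position == int(position):      # n is odd: a unique middle item
        return lower
    upper = ordered[int(position)]     # n is even: average the two middle items
    return (lower + upper) / 2

five_marks = [65, 61, 60, 58, 71]
six_marks = [58, 60, 61, 65, 71, 73]

print(median(five_marks))
print(median(six_marks))
```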


Using a stem and leaf display to calculate a median

As a stem and leaf display orders the data it can be very useful when trying to calculate a median.

Example 5

Returning to the stem and leaf display of the ages of 25 employees, calculate the minimum, maximum and median. The stem and leaf display is repeated below.

Stem and Leaf Display of Ages of part time employees in years

                    Count
1 | 9               (1)
2 | 0 2 2 3 7       (5)
3 | 0 0 3 6         (4)
4 | 0 2 6 6 7 7 9   (7)
5 | 5               (1)
6 | 0 2 3 4 5 6 9   (7)

Leaf unit = 1 year

The minimum for this data set is easily read off as 19 from the first row, and the maximum as 69 from the last value of the last row. To calculate the median we need its position first.

Position of the median = (n + 1)/2 = (25 + 1)/2 = 26/2 = 13.

So the median is the 13th data item. We can either count through the stem and leaf display or use the counts at the end of each row to help. The 13th data item is on the 4th row and its value is 46. So the median is 46.
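Using the row counts to locate a data item amounts to keeping a running total, as in this sketch. The dictionary `rows` is our own transcription of the sorted display, and the function name `item_at_position` is hypothetical.

```python
# Sorted stem and leaf display of the 25 ages: stem -> leaves
rows = {
    1: [9],
    2: [0, 2, 2, 3, 7],
    3: [0, 0, 3, 6],
    4: [0, 2, 6, 6, 7, 7, 9],
    5: [5],
    6: [0, 2, 3, 4, 5, 6, 9],
}

def item_at_position(rows, position):
    """Value of the data item at a 1-based position in a sorted display."""
    seen = 0                                   # running total of the row counts
    for stem, leaves in rows.items():
        if position <= seen + len(leaves):
            return 10 * stem + leaves[position - seen - 1]
        seen += len(leaves)
    raise ValueError("position is beyond the end of the data")

n = sum(len(leaves) for leaves in rows.values())
median_position = (n + 1) // 2                 # n is odd here, so this is exact
median_age = item_at_position(rows, median_position)
print(n, median_position, median_age)
```

The first three rows account for 1 + 5 + 4 = 10 items, so the 13th item is the third leaf on the fourth row.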




Seminar Questions


The following stem and leaf display shows the number of units produced per day in a factory.

                          Count
 3 | 8                    (1)
 4 |                      (0)
 5 | 6                    (1)
 6 | 0 1 3 3 5 5 7 9      (8)
 7 | 0 2 3 6 7            (5)
 8 | 3 5 9                (3)
 9 | 0 0 1 5 6            (5)
10 | 3 6                  (2)

Leaf unit = 1 unit

a) How many days were studied in this survey of production?

b) List the actual data values in the fourth row.

c) What are the minimum, maximum and median values?

d) What percentage of days did the factory produce 80 or more units? ENDSECTION STARTSECTION=activity_4.htm= SECTION~


A community college requires all its students to take a basic maths test before beginning any degree program. The scores on the test can range from 0 to 100. For 70 students interested in the associate of management degree the scores were as follows.



22 60 80 75 87 92 65 46 33 95

72 98 100 37 58 75 86 92 77 85

86 97 83 81 87 42 91 89 87 84

72 86 63 42 26 97 93 98 72 82

85 79 84 75 83 92 89 63 86 68

80 97 81 87 72 89 87 73 65 52

76 86 91 53 67 67 69 72 92 81

a) Make a stem and leaf display of these scores.

b) Briefly discuss the distribution of marks commenting on where the marks cluster and how spread out they are.

c) Calculate the minimum, maximum and median score.

d) One of the financial maths options on this degree course requires the student to have a score of at least 70 on this test. What percentage of these students are eligible to take this financial maths option?



The Boston Marathon is the oldest in the U.S. The distance is approximately 26 miles. Boston University has a record of all the winning times for the Boston Marathon: they are all over 2 hours. The following data are the minutes over 2 hours for the winning male runner.

Years 1953 – 1972

18 20 18 14 20 25 22 20 23 23

18 19 16 17 15 22 13 10 18 15

Years 1973 – 1992

16 13 9 20 14 10 9 12 9 8

9 10 14 7 11 8 9 8 11 8



a) Construct a back to back stem plot of the two data sets for the minutes over 2 hours of the winning times. You will need to use a split stem so split each stem into two equal parts.

b) Compare the two distributions. How many times under 15 minutes are there in each distribution? What does the display show you about how the winning times have changed?



The following table shows the daily number of absentees for a company of 500 employees over a period of eight weeks.

a) Form a stem and leaf display of the absenteeism figures.

Week   Monday   Tuesday   Wednesday   Thursday   Friday
1      20       8         9           5          34
2      15       6         11          12         19
3      26       8         12          8          19
4      21       12        16          16         24
5      13       9         5           12         35
6      23       13        14          8          33
7      14       13        5           13         31
8      26       11        10          9          35

b) Calculate the minimum, maximum and median number of absentees.

c) Write a short verbal summary of your results. Consider whether there are any other features of the data that you think should be explored.






• the construction of stem and leaf displays;
• the similarities and differences between stem and leaf displays and histograms;
• using back to back stem and leaf displays to compare data sets by considering issues such as where the data sets cluster and how spread out the data is;
• using ordered data to calculate numerical summaries of data, minimum, maximum and median.

Extra Activities






S6 Numerical Summaries of Data 1 Page 77


CHAPTER=NSD1


Numerical Summaries of Data 1 Context

In the last 3 units we have looked at how data sets can be summarised and presented using tables and diagrams. We’ve also begun to look at how data sets can be compared using back to back stem and leaf plots. When comparing data sets we need to be able to summarise the information as briefly and usefully as possible. This unit will begin to look at ways to do this.


Objectives


• appreciate what a statistical measure of location is;
• appreciate what a statistical measure of dispersion is;
• understand that for a measure of location there is a partner measure of dispersion;
• understand and be able to calculate a mode and discuss its strengths and weaknesses;
• extend your knowledge of the median to calculate a five figure summary;
• understand the use of a median and be able to discuss its strengths and weaknesses;
• appreciate that the quartile deviation is the partner measure of dispersion for a median. ENDSECTION STARTSECTION=content_1.htm= SECTION~

Measures of Location and Dispersion

In practice, the two most useful considerations which help to summarise a mass of figures are:

• What is a typical or average value for this data set? • How widely spread are the figures?

Numerical summary statistics are single numbers that represent particular features of the data set and try to answer the two questions above. These single number summaries may be of interest in their own right or they may be used in conjunction with histograms etc to allow more objective comparisons of data sets. Single number summaries of data sets are important because they provide immediate impressions of order of magnitude and they allow simple comparisons.



For example, to decide if one department’s production is higher than another’s you don’t necessarily need all the production figures for the past month for both departments. If you know department A typically completed 300 orders and department B completed 175 orders a quick comparison of these numbers will tell you that department A seems to be performing better.


Measure of Location

A measure of location summarises the information from a set of data into just one number which gives us an idea of what the numbers in the data set are typically like and is useful for answering the first question above. The measure of location should reflect where the data are tending to cluster on a histogram. If the data set is nicely behaved, in the sense that there are not too many extremely high or low values, the data should tend to cluster in the centre of its range so measures of location are often referred to as measures of central tendency.

How do we go about choosing the single number which will be representative of all the numbers in the data set? If the data is clustering in the middle of its range then an obvious candidate for the measure of the location is the middle data item or the Median. We considered the calculation of a median in the last unit and will return to this later. In statistics there are three main measures of location:

• Arithmetic mean: we will refer to this simply as the mean. This is the average in the usual sense of the word; the sum of the observations divided by the number of observations.

• Mode: the value of the data item that occurs most frequently in the data set.

• Median: the value of the middle data item when the observations are arranged in numerical order.

Often in publications all three of these measures of location are referred to as an average. As we shall see later, for some data sets all three of these measures of location may be different, so it is useful to know which of these averages is actually being used.


Measure of Dispersion

As well as knowing what a typical value for the data set is and where the data is clustering, it is useful to have some idea of how spread out the data are. A measure of dispersion is a single number which tries to describe how spread out the data are and attempts to answer the second question above.

Two sets of data could quite easily have the same location (be clustering in the same place) but the spread, or dispersion, of the data in each data set may be very different. The measure of dispersion gives you information about how the individual data items vary around the measure of location. Do the data items cluster tightly around the measure of location or are they more spread out?

Some commonly used measures of dispersion are

• Range: Spread between the highest and lowest data values.
• Interquartile range: Spread in the middle 50% of the data.
• Standard deviation: A measure of how the data cluster around the mean.

Measures of location and measures of dispersion are related in that when you quote a measure of location you should also quote an accompanying measure of dispersion. The measures of location tend to have a natural partner from amongst the measures of dispersion.


Medians Revisited and Five Figure Summaries

We have discussed medians and the calculation of a median value in the last unit. The median is defined to be the value of the data item at the middle of the ordered data set. It is calculated by first evaluating its position using the formula (n+1)/2 where n is the number of data items. Once you have the position of the median you look through the ordered data to find the value of the data item at the relevant position.

Example 1 (recap)

The following data set gives the salary of 12 part time employees.

4,800 5,110 5,520 5,570 6,325 6,750

6,785 7,320 7,320 7,320 8,894 9,500

The minimum value is 4,800 and the maximum value is 9,500.

To evaluate the median we need to know its position.

Position of the median is (n + 1)/2 = (12 + 1)/2 = 13/2 = 6.5.

As the position is not an exact integer (it ends in .5) we know that the data set does not have a unique middle value and we need to average the 6th and 7th data items to find the median.



The 6th data item has a value of 6,750 and the 7th has a value of 6,785 so the median is evaluated as;

Median = (6,750 + 6,785)/2 = 6,767.50

The Median has a number of advantages and disadvantages as a measure of location.

• As the median only uses the middle observation in a data set it is not sensitive to extreme high or low values in a data set. Extreme values in data sets are often mistakes and so this means that a median is not distorted by mistakes in the data. As we shall see later, other measures of location are not so resilient to extreme values in data.

• The median is based purely on the data positions. The actual values of the data items are not used in its calculation so its use in more advanced statistical work is limited.

• The median often takes a value equal to one of the original data items in the data set.

• The median is useful for situations where data sets are difficult or expensive to obtain. For example, the median life of 100 light bulbs can be tested by waiting until the 50th one goes out.


Quartiles

The median divides the ordered data into two halves, each with the same number of observations. Each of these halves may, in turn, be divided into two by quartiles, so that the data is split into four quarters.

Minimum                 Median                  Maximum
   o-----------•-----------o-----------•-----------o
         Lower quartile          Upper quartile

This cannot be done exactly unless the number of observations is divisible by 4, but it is easiest to define the lower quartile as the median of the lower half of the data and the upper quartile as the median of the upper half of the data.

As with calculating the median, we calculate the value of the upper and lower quartiles by first evaluating their position in the data set.

Position of the lower quartile is given by (n + 1)/4 and the position of the upper quartile is given by 3(n + 1)/4.



Example continued

Returning to the example of the part time employees’ salaries, evaluate the lower and upper quartiles.

The position of the lower quartile is (n + 1)/4 = (12 + 1)/4 = 13/4 = 3.25.

Obviously there is no data item at position 3.25 so we need to make a decision. There are a number of things you can do but the easiest thing to do is round this position to the nearest whole number. So, 3.25 rounded gives us 3 so we will use the data item at position 3 to be the lower quartile.

The value of the third data item is 5,520 so the lower quartile is 5,520.

The position of the upper quartile is

3(n + 1)/4 = 3 × (12 + 1)/4 = 3 × (13/4) = 3 × 3.25 = 9.75.

Again, there is no data item at position 9.75 so we will round this value to 10 and use the data item at position 10 to represent the upper quartile.

The value of the 10th data item is 7,320 so the upper quartile is 7,320. ENDSECTION STARTSECTION=content_6.htm= SECTION~

Five figure summary

Having calculated the median and quartiles the five figure summary is then made up as:

• Minimum

• Lower Quartile (Q1)

• Median (Q2)

• Upper Quartile (Q3)

• Maximum

For the example above the five figure summary would be;

Minimum 4,800

Lower quartile (Q1) 5,520

Median (Q2) 6,767.50

Upper quartile (Q3) 7,320

Maximum 9,500
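The steps above can be sketched in a few lines of Python. This is only an illustration, under some assumptions of our own: the data set `sample` is hypothetical (it is not the salary data), a position ending in exactly .5 is rounded up, and there are at least two observations.

```python
import math


def round_half_up(position):
    """Round a position to the nearest whole number (halves go up: our assumption)."""
    return math.floor(position + 0.5)


def median(ordered):
    """Median of a list that is already in ascending order."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def five_figure_summary(data):
    """Return (minimum, Q1, median, Q3, maximum) using the position rules above."""
    ordered = sorted(data)
    n = len(ordered)
    q1_position = round_half_up((n + 1) / 4)       # position of the lower quartile
    q3_position = round_half_up(3 * (n + 1) / 4)   # position of the upper quartile
    # Positions count from 1 but Python lists count from 0, hence the "- 1".
    return (ordered[0], ordered[q1_position - 1], median(ordered),
            ordered[q3_position - 1], ordered[-1])


# Hypothetical data set of 12 values, chosen only for illustration.
sample = [2, 4, 4, 5, 7, 8, 9, 10, 12, 13, 15, 20]
summary = five_figure_summary(sample)
print(summary)   # (2, 4, 8.5, 13, 20)
```

With 12 values the quartile positions are 3.25 and 9.75, exactly as in the example, so the sketch picks the 3rd and 10th ordered values.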





Exercise S6.1

The following data set gives the ages of the 25 employees considered in previous lectures. Having been provided with the data set and the stem and leaf plot, calculate the five figure summary.

63 27 46 47 22 64 30 19 69

36 65 60 40 66 55 33 47 42

49 23 22 46 62 30 20

Stem and Leaf Display of Ages of part time employees in years

Count

1 9 (1)

2 0 2 2 3 7 (5)

3 0 0 3 6 (4)

4 0 2 6 6 7 7 9 (7)

5 5 (1)

6 0 2 3 4 5 6 9 (7)

Leaf Unit = 1 year

The measure of location which forms part of the five figure summary is the Median. We indicated earlier that whenever we use a measure of location we need to quote an accompanying measure of dispersion. A five figure summary allows us to calculate two measures of dispersion; the range and the interquartile range.


Range

The range is simply the difference between the minimum and the maximum values of a data set. The five figure summary quotes the minimum and maximum as two of its values. The five figure summary for example 1 is:

Minimum 4,800

Lower quartile (Q1) 5,520

Median (Q2) 6,767.50

Upper quartile (Q3) 7,320

Maximum 9,500



So the range is evaluated as 9,500 – 4,800 = 4,700.

As you can see the range is very easy to calculate and understand. However, it does have a number of disadvantages.

• As only two values are used to calculate the range it is very sensitive to extreme values (outliers) in the data set.

• The range indicates the variation between the smallest and largest values in the data set but does not tell us how much the values vary from one another.

• The range has no natural partner amongst the measures of location and is not used in further advanced statistical work.

For the above reasons, the range has limited practical use except in the area of quality control.


Interquartile Range

The Interquartile Range is the difference between the upper and lower quartiles and hence it shows the range of the values in the middle half of the data set.

Interquartile Range = Q3 - Q1.

In the above example : Interquartile Range = 7,320 – 5,520 = 1,800.

• The range is inappropriate as a measure of dispersion when there are extreme values in a data set. As the Interquartile Range only uses the middle 50% of the data the extreme values are eliminated from the calculations and hence the Interquartile Range is not influenced by extreme values.

• The Interquartile Range is the natural partner to the Median. The smaller the Interquartile Range, the less dispersed are the data and the data is clustered quite close to the median. So, it could be argued, that the smaller the Interquartile Range is, the better the Median is at representing a typical value of the data set.
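As a quick check of the two calculations, and of the claim about extreme values, here is a small Python sketch. The figures come from the five figure summary quoted earlier; the alternative maximum of 95,000 is a made-up value used only to show the effect of an outlier.

```python
# Five figure summary quoted earlier for the salary example.
minimum, q1, median, q3, maximum = 4800, 5520, 6767.50, 7320, 9500

data_range = maximum - minimum        # uses only the two end values
interquartile_range = q3 - q1         # spread of the middle half of the data

print(data_range)            # 4700
print(interquartile_range)   # 1800

# Hypothetical: suppose the largest salary had been 95,000 instead of 9,500.
range_with_outlier = 95000 - minimum  # the range jumps to 90200
iqr_with_outlier = q3 - q1            # unchanged, because the quartiles do not move
print(range_with_outlier, iqr_with_outlier)
```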




Box plots

A Box plot (or box and whiskers plot) is a graphical method for representing a five figure summary. A box plot for the five figure summary in example 1 is;

[Figure: box plot drawn against a horizontal axis labelled Age, running from 10 to 70]

The central rectangle which marks out the two quartiles is called the ‘box’ while the horizontal lines on either side are the ‘whiskers.’ Just by observing the size and balance of the box and the whiskered components we can gain a quick and useful overall impression about how the data is distributed.

Box plots are not as informative as stem and leaf plots or histograms because they do not show the patterns of the data between the points of the five figure summary. That is, you know the range of the middle half of the data but you do not know how the data is spread within this range. Box plots however, are particularly useful for comparing two or more data sets. The following Box plot compares the verbal reasoning scores for students admitted to graduate study in departments in America classified according to the general categories displayed.

[Figure: box plots of GRE verbal scores, on a scale from 200 to 800, for All departments, Natural sciences, Engineering, Social sciences, Humanities and arts, and Education]



The centre, spread and range of the distributions of average score are immediately apparent. For example, the scores for Engineering departments are tightly concentrated about a median average score of about 540. The highest median verbal score occurs for students admitted to departments in the Humanities and Arts and the lowest median score is for students admitted to Education departments. The interquartile range is about the same for all the categories with the exception of Engineering, where it is smaller. Finally, although there are some differences in overall spread as measured by the range, the median scores do not vary a great deal.


Mode

The Mode is another example of a measure of location. The mode of a data set is defined to be the value of the most frequently occurring item. Therefore, it could be argued that the mode is the best measure of a typical value for a data set if it quotes the value of the item that occurs the most often.

Example

Consider the following hotel room prices.

£ per night 49 52 55 55 55 55 55 60 69

The mode of this data set is £55 as this is the value which occurs most frequently. This value occurs 5 times and no other value occurs more than once.

This data set has 9 observations, so the median is the 5th observation which is £55. For this set of data the median and the mode are equal, however, this will not be the case in general.
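If you want to check a mode by computer, one possible approach is to count how often each value occurs. The sketch below uses Python's standard `collections.Counter` on the hotel prices above; the variable names are our own.

```python
from collections import Counter
from statistics import median

prices = [49, 52, 55, 55, 55, 55, 55, 60, 69]   # £ per night, from the example

counts = Counter(prices)                          # value -> number of occurrences
mode_value, mode_count = counts.most_common(1)[0]

print(mode_value, mode_count)   # 55 5
print(median(prices))           # 55
```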




Frequency Tables

Frequency table of discrete data

Finding the mode for a frequency table of discrete values is very easy. The frequency column in the table tells you how many times each data value occurred in the data set. The mode is the value of the data item with the highest frequency. To find the mode look down the frequency column to identify the highest frequency and read off the corresponding modal value.

The following frequency table of the number of children in 23 surveyed families was discussed in the unit Summarising and Presenting Data 1.

Number of children Frequency

0 5

1 6

2 7

3 4

4 2

Total 23

Looking down the frequency column the highest frequency is 7. This tells us that the data item 2 occurred 7 times in the data set and was the most frequently occurring item. So the modal value for this data set is 2 children.

Frequency tables of grouped data

When data is presented in a grouped frequency table it is possible to identify an interval with the highest frequency but it is not really possible to identify a modal value. The following frequency table shows the distance travelled by a group of 120 salespeople.

Distance travelled in Kms Number of salespeople (Frequency)

400 – 419 12

420 – 439 27

440 – 459 34

460 – 479 24

480 – 499 15

500 – 529 8

Total 120



From the table it is clear that the highest frequency is 34, which corresponds to the interval 440 – 459. So we could say that the modal class for this distribution is 440 – 459, i.e. salespeople are most likely to travel distances between 440 and 459 km.

Some text books detail techniques for calculating a modal value for grouped frequency distributions using a formula or a graphical technique based on a histogram. Both of these techniques assume that the modal value lies in the modal class. There is no reason to suggest that this assumption is true; maybe every data item is different (which is highly likely for a distribution such as the one presented here), or maybe the data item which occurs most frequently isn’t in the modal class. Therefore, I think it is better just to quote a modal class and not attempt to estimate a modal value.


Notes on the Mode

The mode has a very specific use in statistical summaries of data. It is only used when the purpose of the summary is to say what happens most often in the data set. If this is not the prime objective of the analysis the mode is not used as a regular measure of location for a number of reasons.

• Sometimes a data set does not have a modal value. This is particularly true of continuous data where every single value is likely to be different.

• Not all the data values are used when you calculate the mode so it can’t be used in more advanced statistical work.

• The mode is the only measure of location which you can use with category or attribute data.

• It is possible to have more than 1 mode. For example, the following table has 2 modal values, 2 and 4. If asked for the mode in such a situation you should quote both values (Don’t average them!).

Data 1 2 3 4 5

Frequency 2 6 4 6 5
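Reading the mode off a frequency table amounts to finding the highest frequency and reporting every value that attains it. A minimal Python sketch follows; the function name `modes` and the use of a dictionary to hold each table are our own choices, and the frequencies are copied from the tables in this unit as printed.

```python
def modes(frequency_table):
    """Return every value whose frequency equals the highest frequency."""
    highest = max(frequency_table.values())
    return [value for value, freq in frequency_table.items() if freq == highest]


children = {0: 5, 1: 6, 2: 7, 3: 4, 4: 2}       # number of children -> frequency
two_modes = {1: 2, 2: 6, 3: 4, 4: 6, 5: 5}      # the table with two modal values
distance = {"400-419": 12, "420-439": 27, "440-459": 34,
            "460-479": 24, "480-499": 15, "500-529": 8}   # grouped data

print(modes(children))    # [2]
print(modes(two_modes))   # [2, 4]
print(modes(distance))    # ['440-459']
```

For the grouped table the result is a modal class, not a modal value, and when two values tie both are returned instead of being averaged.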




Seminar Questions


At an inner city hospital there is concern about the high turnover of nurses. A survey was done to determine how long (in months) nurses had been in their current positions. The responses of 20 nurses were

23 2 5 14 25 36 27 42 12 8

7 23 29 26 28 11 20 31 8 36

Another survey was done at the hospital to determine how long (in months) clerical staff had been in their current positions. The responses of 20 clerical staff were

25 22 7 24 26 31 18 14 17 20

31 42 6 25 22 3 29 32 15 72

a) Rank each set of data.

b) Calculate the five figure summary for each set of data.

c) Compare the data sets using box and whiskers plots. (Discuss the location of the median, the location of the middle halves of the data sets etc.). Does the turnover of nursing staff appear to be different from that of clerical staff?



The following box plot shows the height of 50 women.

1.55 1.6 1.65 1.7 1.75

Height (m)

a) From the box plot estimate the values of the five figure summary.

b) Calculate the interquartile range

c) Does the data appear to be evenly spread throughout its range? (HINT: Think about the spread of the first 50% of the data and the last 50% of the data.)





A sixth form college needs to make a report to the budget committee about the average number of hours a student spends in timetabled classes each week. A student needs to have 12 timetabled hours a week to be classified as full time but all students can participate in up to 20 hours of timetabled classes. A random sample of 40 students yielded the following information about the amount of hours they were spending in the classroom each week.

12 12 12 12 12 12 12 12 12 12 13 13 13 13 14 14 14 14 15 15 15 15 15 15 16 16 16 16 17 17 17 17 17 18 18 18 19 19 20 20

a) What is the modal amount of time that students spend in the classroom?

b) Calculate the five figure summary of the length of time students spend in the classroom.

c) Are the median and mode the same?

d) If the budget committee is going to fund the college according to the average amount of time students spend in the classroom, which of these two measures of location do you think the college will use? (This funding structure implies that there will be more money if the classroom time load is higher.)




• what is understood by the terms measure of location and measure of dispersion;

• calculation of a five figure summary;

• compilation of a box plot and their use in comparing data sets;

• calculation of a mode;

• knowledge of what measure of dispersion is the partner to the median.



Extra Activities







CHAPTER=NSD2


Numerical Summaries of Data 2 Context

In the previous unit we explained what is understood by the statistical terms measure of location and measure of dispersion. We discussed the median, mode, range and interquartile range. In this unit we will be considering the mean and standard deviation, which are the most usual forms of numerical summary statistics that are used in statistical analyses.


Objectives


• calculate a mean and discuss its strengths and weaknesses;

• understand that the mean can be distorted by extreme values in a data set;

• calculate a standard deviation and appreciate that it is the partner measure of dispersion to use with the mean;

• appreciate that in many situations the three measures of location we have considered will all give different values, and understand which measure may be more appropriate.


Mean

This is the summary statistic which should be familiar to you. It is calculated by dividing the sum of the observations by the number of observations. Notation:

$\bar{x} = \frac{\sum x}{n}$

In this notation;

• The data set is referred to by the letter $x$.

• Putting a line, or bar, above $x$ is the standard method of denoting a mean calculation. So $\bar{x}$ denotes the mean of the data set $x$. If we had a second data set we could refer to this data set using the letter $y$ and its mean would be denoted by $\bar{y}$.

• The statistical notation for the mean of a set of data uses the symbol $\sum$ (sigma). $\sum$ means ‘the sum of’ and is used as shorthand to represent the ‘sum of a set of values.’ So, $\sum x$ represents the sum of the data set $x$.

• $n$ represents the number of data items in the data set in question.

Example 1 Returning to the data set of the age of 25 employees which we have used in previous units.

63 27 46 47 22 64 30 19 69

36 65 60 40 66 55 33 47 42

49 23 22 46 62 30 20

To calculate the mean of this data set we first add up the data so;

$\sum x = 63 + 27 + 46 + 47 + \dots = 1083.$

This data set has 25 data items so $n = 25$.

With these two numbers the mean of the data set is calculated as;

$\bar{x} = \frac{\sum x}{n} = \frac{1083}{25} = 43.32$ years.

Note that no-one actually had an age of 43.32 years. The mean is merely representative of how old an employee will typically be.
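The same calculation can be checked with a few lines of Python. This is just a sketch using the 25 ages above; the variable names are our own.

```python
ages = [63, 27, 46, 47, 22, 64, 30, 19, 69,
        36, 65, 60, 40, 66, 55, 33, 47, 42,
        49, 23, 22, 46, 62, 30, 20]

total = sum(ages)    # the sum of the data set
n = len(ages)        # the number of data items
mean = total / n     # sum divided by the number of items

print(total, n, round(mean, 2))   # 1083 25 43.32
```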


ENDSECTION STARTSECTION=activity_1.htm= SECTION~ Exercise S7.1

The number of issues of a particular monthly magazine read by 20 people in a year were as follows;

0 1 11 0 0 0 2 12 0 0

12 1 0 0 0 0 12 0 11 0

i) Calculate the median, mode and mean of this data set.

ii) To what extent do these three values provide an adequate summary of the data set? What are the most important features of the data?




Notes about the mean

Advantages

• The mean is easy to calculate and widely understood as an average value.

• All of the values in the data set are used to calculate the mean so it is representative of the whole data set.

• It is supported by mathematical theory and is suited to further statistical analysis.

Disadvantages

• Its value may not correspond to any actual value. For example, the average family may have 2.3 children but no family can have exactly 2.3 children.

• The mean may be distorted by high or low values in a data set. For example, the mean of the numbers, 100, 105, 110 and 110 is 106.25. However, the mean of the numbers 100, 105, 110, 110 and 500 is 185. The high value of 500 distorts the mean and in some cases the mean would be a misleading and inappropriate figure. Extreme values are not uncommon in financial data!
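The distortion described in the last point is easy to see numerically. The short Python sketch below uses the standard `statistics` module on the two sets of numbers quoted above and, for comparison, also prints the median, which moves far less.

```python
from statistics import mean, median

without_extreme = [100, 105, 110, 110]
with_extreme = without_extreme + [500]

# mean 106.25, median 107.5
print(mean(without_extreme), median(without_extreme))
# mean 185, median 110
print(mean(with_extreme), median(with_extreme))
```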

In the seminar we will be looking at how you calculate the mean of a frequency distribution of discrete data. We will not be considering the mean of grouped frequency distributions on this module. Essential Quantitative Methods for Business, Management and Finance, third edition, by Les Oakshott (chapter 6) covers both of these situations.


Standard Deviation

As we already know, along with a measure of location we need a measure of dispersion which tells us how tightly the data cluster around the measure of location. If the measure of location being used is the mean, the appropriate measure of dispersion is the standard deviation. The standard deviation is based on measuring how far each data item deviates from the mean. There are various ways that you could do this but the standard deviation is the method most commonly used.



Rationale for the Standard Deviation

Before we get into calculating a standard deviation we’ll use the following set of numbers to try and explain the rationale behind the formula used in its calculation.

Suppose we have the 5 numbers

£1 £2 £3 £4 £5

It’s very easy to calculate the mean of this data set as $\bar{x} = £3$.

The measure of dispersion that we need to quote along with this mean should reflect how close the individual data items are to the mean. So the obvious thing to do is calculate how far each data item is away from the mean (we’ll ignore the units for a while so that the calculations are less cluttered).

$x - \bar{x}$:

1 – 3 = –2

2 – 3 = –1

3 – 3 = 0

4 – 3 = 1

5 – 3 = 2

These numbers (–2, –1, 0, 1, 2) are the deviations, $(x - \bar{x})$, of each data item from the mean. We’re not interested in the individual deviations of each data item from the mean. What we want is an idea of a typical or average deviation of a data item from the mean. So what we need to do is calculate an average of these deviations.

Average Deviation = $\frac{(-2) + (-1) + 0 + 1 + 2}{5} = \frac{0}{5} = 0$

This average has come out to be 0 which would imply that there is no deviation (space) between the mean value and the data items. This could only happen if all values in the data set were the same. Clearly this is not the case! The problem is that the negative deviations (e.g. –2) and positive deviations (e.g. 2) have cancelled out in this calculation. The deviation is negative if the data item is less in value than the mean and it is positive otherwise. When looking at the measure of spread we don’t actually need to worry if the data item is less or more in value than the mean, we just want to know how far away from the mean it is. In other words we can ignore the minus signs (–) in the above lists of deviations; the deviations would then be

2 1 0 1 2



And the average of these deviations would be

Average Deviation = $\frac{2 + 1 + 0 + 1 + 2}{5} = \frac{6}{5} = 1.2.$

This average deviation is referred to as Mean Deviation and is a valid measure of dispersion. It tells us that, on average, the data items are 1.2 units away from the mean. The units in this case are £.

Refer to Essential Quantitative Methods for Business, Management and Finance, third edition, by Les Oakshott (chapter 6) for further discussion of the mean deviation.

The mean deviation is the ideal measure of dispersion in terms of describing how the data spread about the mean. However, this process of ignoring the minus signs, although easy to do, is mathematically difficult to manipulate so the mean deviation is not suitable for further statistical analysis. We were ignoring the minus signs to get around the problem of negative and positive deviations cancelling each other out when we calculated the average deviation. Is there something else we could have done to deal with this problem? We could have squared each deviation instead because when you multiply two negative numbers together the answer is positive. So let’s list the deviations again. Notice I’ve included the units (£) again as we’ll need them in a minute to explain part of the rationale of the calculation of a standard deviation.

$x - \bar{x}$:

£1 – £3 = –£2

£2 – £3 = –£1

£3 – £3 = £0

£4 – £3 = £1

£5 – £3 = £2

Instead of ignoring the minus signs we will now square each deviation instead. This gives us:

$(x - \bar{x})^2$:

$(-£2)^2 = 4\,(£^2)$

$(-£1)^2 = 1\,(£^2)$

$(£0)^2 = 0\,(£^2)$

$(£1)^2 = 1\,(£^2)$

$(£2)^2 = 4\,(£^2)$

If we now calculate the average of these squared deviations we get

$\frac{\sum (x - \bar{x})^2}{n} = \frac{(4 + 1 + 0 + 1 + 4)\,(£^2)}{5} = \frac{10\,(£^2)}{5} = 2\,(£^2).$

The average of the squared deviations is called the variance and is a valid measure of dispersion.




Variance and Standard Deviation

Variance = $\frac{\sum (x - \bar{x})^2}{n}$

The problem with using this as a measure of dispersion to go with the mean is that it has different units. The units of the mean are £ and the units of the variance are $£^2$ (pounds squared). We get around this problem by taking the square root of the variance, which gives us the standard deviation.

Standard Deviation = $\sqrt{\frac{\sum (x - \bar{x})^2}{n}}$

In the above example the standard deviation would be

$\sqrt{\frac{\sum (x - \bar{x})^2}{n}} = \sqrt{\frac{10\,(£^2)}{5}} = \sqrt{2\,(£^2)} = £1.4142.$
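The whole chain of reasoning for the £1 to £5 example (deviations, their average, the mean deviation, the variance and finally the standard deviation) can be retraced in Python. This is a sketch that divides by $n$ throughout, as in the formulas above; the variable names are our own.

```python
import math

data = [1, 2, 3, 4, 5]                     # the £1 to £5 example
n = len(data)
mean = sum(data) / n                       # 3.0

deviations = [x - mean for x in data]      # -2, -1, 0, 1, 2
average_deviation = sum(deviations) / n    # 0: positives and negatives cancel

mean_deviation = sum(abs(d) for d in deviations) / n   # 1.2 (in £)
variance = sum(d ** 2 for d in deviations) / n         # 2.0 (in £ squared)
standard_deviation = math.sqrt(variance)               # about 1.4142 (in £)

print(average_deviation, mean_deviation, variance, round(standard_deviation, 4))
```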

Notes about the standard deviation:

• The standard deviation is by far the most important of the measures of dispersion but its importance is due to its mathematical properties rather than its descriptive properties. However, the more the individual data items differ from the mean, the greater will be $(x - \bar{x})^2$ and so the standard deviation will be larger. Hence the greater the dispersion, the larger the standard deviation will be.

• The standard deviation is the natural partner to the mean as the mean is used in its calculation.

• The standard deviation is truly representative of the data set as all the data items are used in its calculation.

• The standard deviation gives too much importance to extreme values in the data set and hence is a distorted measure of dispersion when extreme values exist in the data set.

• Many people use definitions of the variance and standard deviation which are slightly different from the one given above. They use the following formula

Variance = $\frac{\sum (x - \bar{x})^2}{n - 1}.$



This will obviously give a different numerical result. However the reasons for dividing by n – 1 instead of n are related to sampling theory and cannot be easily understood in the context of describing the dispersion of a single data set. For large data sets the two formulas will produce similar results. However, you should be aware that many (but not all!) computer packages divide by n – 1 when working out the variance and the standard deviation. Similarly a standard deviation button on a calculator may divide by n – 1. It doesn’t actually matter which definition you use as long as you stick to the same one when comparing two or more data sets.
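Python's standard `statistics` module happens to offer both definitions, which makes the difference easy to see: `pvariance` and `pstdev` divide by $n$, while `variance` and `stdev` divide by $n - 1$. The sketch below applies them to the £1 to £5 example.

```python
from statistics import pstdev, pvariance, stdev, variance

data = [1, 2, 3, 4, 5]

# Dividing by n (the definition used in these notes): variance 2, s.d. about 1.4142
print(pvariance(data), round(pstdev(data), 4))

# Dividing by n - 1: variance 2.5, s.d. about 1.5811
print(variance(data), round(stdev(data), 4))
```

With only five data items the two answers differ noticeably; the gap shrinks as the data set gets larger.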


Hand Calculation of the Standard Deviation

Unless all the observations and their mean are reasonably small integers, hand calculation of the variance can be long and messy. For this reason it is usual to use the fact that

Variance = $\frac{\sum (x - \bar{x})^2}{n} = \frac{\sum x^2}{n} - \bar{x}^2.$

This is just a mathematical result which can be derived using algebra. Using this formula you calculate the squares of the original data values rather than the squared deviations.
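For those who would like to see the algebra, one way of deriving the result is sketched below. It uses only the definition $\bar{x} = \frac{\sum x}{n}$ and the fact that $\bar{x}$ is the same number for every data item, so that $\sum \bar{x}^2 = n\bar{x}^2$.

```latex
\begin{align*}
\frac{\sum (x - \bar{x})^2}{n}
  &= \frac{\sum \left(x^2 - 2\bar{x}x + \bar{x}^2\right)}{n} \\
  &= \frac{\sum x^2}{n} - 2\bar{x}\,\frac{\sum x}{n} + \frac{n\bar{x}^2}{n} \\
  &= \frac{\sum x^2}{n} - 2\bar{x}^2 + \bar{x}^2 \\
  &= \frac{\sum x^2}{n} - \bar{x}^2 .
\end{align*}
```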

Returning to the above example

£1 £2 £3 £4 £5

We already know that the mean of this data set is £3.

To use the hand calculation formula we now need to calculate the squared data items and add them up.

$x^2$:

$1^2 = 1\,(£^2)$

$2^2 = 4\,(£^2)$

$3^2 = 9\,(£^2)$

$4^2 = 16\,(£^2)$

$5^2 = 25\,(£^2)$

$\sum x^2 = 1 + 4 + 9 + 16 + 25 = 55\,(£^2).$

The variance is now calculated as

$\frac{\sum x^2}{n} - \bar{x}^2 = \frac{55\,(£^2)}{5} - (£3)^2 = 11\,(£^2) - 9\,(£^2) = 2\,(£^2).$



The standard deviation is then the square root of the variance:

Standard deviation = $\sqrt{\frac{\sum x^2}{n} - \bar{x}^2} = \sqrt{2\,(£^2)} = £1.4142.$

So we get exactly the same answer as we had before. Now go and do Exercise S7.2


Exercise S7.2

Calculate the standard deviation for the data set concerning hotel room prices which we used in the last unit.

£ per night

49 52 55 55 55 55 55 60 69

We showed in the last unit that the median and mode for this data set are both £55. If you do the mean calculation it gives you $\bar{x}$ = £56.11. Using this value you now need to calculate the variance using the technique described above.


Coefficient of Variation

Sometimes we need to compare two data sets to see if one set of numbers is more variable than another.

Example

Consider the two following sets of summary statistics

Data Set 1 Data Set 2

Mean ($\bar{x}$) 10 100

Standard Deviation (s) 2.5 25

Which of the data sets is more variable?

Comparing the standard deviations you would conclude that data set 2 is more variable as it has the higher standard deviation. However, the two data sets relate to sets of figures of quite different orders of size which is clearly demonstrated when you compare the mean values (10 compared to 100).



We can obtain some idea of the degree of relative variability if we relate the size of the variation to the average of the figures it was derived from. This leads to the coefficient of variation.

Coefficient of Variation = $\frac{s}{\bar{x}} \times 100$

where s stands for the standard deviation. Returning to the summary statistics for these two data sets.

Data Set 1 Data Set 2

Mean ($\bar{x}$) 10 100

Standard Deviation (s) 2.5 25

Coefficient of Variation (Data Set 1) $\frac{2.5}{10} \times 100 = 25$

Coefficient of Variation (Data Set 2) $\frac{25}{100} \times 100 = 25$
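The two coefficients of variation can be checked with a small Python function. The function name is our own; the figures are the summary statistics from the table.

```python
def coefficient_of_variation(standard_deviation, mean):
    """Standard deviation expressed as a percentage of the mean."""
    return standard_deviation / mean * 100


cv_1 = coefficient_of_variation(2.5, 10)    # Data Set 1
cv_2 = coefficient_of_variation(25, 100)    # Data Set 2

print(cv_1, cv_2)   # 25.0 25.0
```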

So, using the coefficient of variation, both data sets are equally variable. ENDSECTION STARTSECTION=content_7.htm= SECTION~

Comparing Mean, Median and Mode

As we have seen during the last few units, for a given data set the mean, median and mode may all be different. This raises the obvious question of which we should use in a given situation.

If the mean, median and mode of a set of data are all the same, the distribution of the data is said to be symmetric.

[Figure: Symmetric (zero skewness). Frequency plotted against Years (17 to 23); mode = mean = median = 20]

The median and mean are both located in the middle of this distribution. The modal value is the one with the highest frequency, so it will be the value corresponding to the highest point of the above curve and so the mode is also the same as the median and the mean.



In practice, the frequency distribution will tend to lean in one direction or the other because the data set will either have predominantly low values with a few high values or vice versa. This is called skew.

If a data set is made up predominantly of small values with a few high extreme values the distribution would resemble the one below. The mode is again located under the highest part of the curve so would be on the left of the median. The mean is distorted by the high values and would be bigger in value than the median.

This is known as right skew or positive skew.

[Figure: Skewed to the right (positively skewed). Frequency plotted against Weekly income (0 to 1800); mode = £300, median = £510, mean = £600]

If instead the data is largely made up of high values with a few low extreme values, the mean will be distorted by the low values and be less in value than the median. The mode is again under the highest part of the curve so the mode is bigger than the median.

This is known as left skew or negative skew.

[Figure: Skewed to the left (negatively skewed). Frequency plotted against Tensile strength (0 to 900); mode = 750, median = 645, mean = 600]
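The relationship between the mean and the median described above can be turned into a simple rule of thumb. The Python sketch below is only that, a rule of thumb based on comparing the two values; the function name is our own and the numbers are the ones quoted for the three distributions in this section.

```python
def skew_direction(mean, median):
    """Suggest the direction of skew by comparing the mean with the median."""
    if mean > median:
        return "right (positive) skew"
    if mean < median:
        return "left (negative) skew"
    return "no skew indicated"


print(skew_direction(mean=20, median=20))     # years: no skew indicated
print(skew_direction(mean=600, median=510))   # weekly income: right (positive) skew
print(skew_direction(mean=600, median=645))   # tensile strength: left (negative) skew
```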



So if it is possible for all the values to be different which one should we use? The following points may be useful as a guide to which measure of location to use.

• The mode is only used when you want to talk about the most frequently occurring data item.

• If you’re not using a mode then generally you could use a median or a mean. Most usually, you would use the mean as a measure of location unless you are in the position described in the next bullet point.

• If you have either a few extremely low or extremely high data values the mean may provide a distorted measure of location. So in this situation it may be better to quote the median instead of the mean as the measure of location.

• When you want to compare two or more data sets you will usually use a mean. Only a mean uses all the data items in its calculation so only the mean will really reflect differences in the data sets. The only exception to this is when you are comparing data sets using box plots to make a quick comparison, as the median is a fundamental part of a box plot.

Having decided which measure of location to use, which measure of dispersion should you use?

You should never really be in the position where you are trying to answer this question. Measures of dispersion form natural partners with the measures of location. So the choice of the measure of dispersion is straightforward:

• use a range and interquartile range if you’re quoting the median as the measure of location;

• use a standard deviation if you’re quoting the mean as the measure of location.




Seminar Questions


Jane purchased a new home computer and has been having trouble with voltage spikes on the power line. Such voltage jumps can be caused by the operation of appliances such as clothes dryers and electric irons or just by a power surge on the outside power line. The following data was obtained about voltages when certain appliances were turned on and off. (The normal voltage in the U.K. is 240.)

146 280 156 284 160 280 180 266

i) Compute the mean, standard deviation, coefficient of variation and range of this data set.

Jane was advised to buy a power surge protector which protects the computer from strong voltage spikes. The voltages were again measured when certain appliances were turned off and on. The results were

200 240 216 228 210 234 206 228

ii) Compute the mean, standard deviation, coefficient of variation and range of the voltages using the power surge protector.

iii) Compare your answers to parts i) and ii). Were the means about the same? Were the voltage distributions different with and without the power surge protector? How did the standard deviation, coefficient of variation and range reflect this when the mean did not?



A production department uses a sampling procedure to test the quality of newly produced items. The department employs the following decision rule at an inspection station: If a sample of 14 items has a variance of more than 0.005, the production line must be shut down for repairs. Suppose the following data has just been collected:

3.43 3.45 3.43 3.48 3.52 3.50 3.39

3.48 3.41 3.38 3.49 3.45 3.51 3.50

Should the production line be shut down? Explain your answer. ENDSECTION STARTSECTION=activity_5.htm= SECTION~




The following data sets show the number of days of work missed by employees in two departments of the same company.

Department A : 20 employees

Number of days missed by each employee in one year.

0 0 0 0 1 1 1 2 2 2

3 3 3 5 5 5 8 10 15 95

Department B : 30 employees

Number of days missed by each employee in one year.

2 2 2 2 2 2 3 3 3 4

4 5 5 5 6 6 7 7 7 7

8 8 8 8 8 8 10 10 12 15

For each department calculate the five figure summary, the mean and the standard deviation of these absenteeism figures. Describe any difference between the two departments.



This question will take you through the process of calculating the mean for a frequency table of discrete data.

The following table summarises how many questions a group of 11 students answered correctly on a diagnostic test.

Number of correct questions Number of students

12 3

13 4

14 2

15 2

i) Using this table write down the raw data set, i.e. there are 3 occurrences of the item 12 etc so the raw data set will begin as 12 12 12 etc

ii) Write down, in long hand, the calculation of the mean; i.e. on the top line of the formula write down the expression which adds up all the numbers, i.e. 12+12+12 etc



iii) Is there a quicker way of writing down the expression on the top line? (HINT: use multiplication.)

iv) Can you now see how this quicker expression relates to the frequency table above?



A production process randomly selects 30 boxes of components for inspection during its quality control process. Each box contains 20 items and the quality control process counts how many defective components are in each of the 30 boxes. If, on average, there are more than 2 defective components in each box the process is shut down. During the first quality inspection of the day the number of defective components found in the 30 sample boxes is summarised below.

Number of defective components Number of boxes

0 9

1 11

2 5

3 3

4 2

i) Write down the data set and show that the mean, median and mode of the number of defective components are 1.3, 1 and 1 respectively.

ii) Five hours later the quality inspection is repeated and the number of defective components found in 30 boxes is summarised below.

Number of defective components

Number of boxes

0 6

1 10

2 5

3 5

4 2

5 2

The median and mode of this data set are both 1 and the mean is 1.8. Comment on the strengths and weaknesses of using each of these measures of location in this situation.
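If you want to check the two quoted means once you have worked them out by hand, a frequency table can be handled directly by multiplying each value by its frequency. The Python sketch below is one possible way of doing this; the function name and the dictionary layout are our own.

```python
def frequency_table_mean(table):
    """Mean of discrete data given as {value: frequency}."""
    total = sum(value * freq for value, freq in table.items())   # sum of all items
    count = sum(table.values())                                  # number of items
    return total / count


first_inspection = {0: 9, 1: 11, 2: 5, 3: 3, 4: 2}
later_inspection = {0: 6, 1: 10, 2: 5, 3: 5, 4: 2, 5: 2}

print(round(frequency_table_mean(first_inspection), 1))   # 1.3
print(round(frequency_table_mean(later_inspection), 1))   # 1.8
```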






• calculating a mean;

• calculating a standard deviation;

• using the coefficient of variation to compare the spread of two data sets as measured by the standard deviation;

• appreciation of why the values for mean, median and mode can all be different for some data sets and the concept of skew;

• understanding of which measure of location is appropriate in a given situation;

• knowledge of which measure of dispersion to use given you have selected an appropriate measure of location.

Extra Activities






S8 Correlation and Regression 1 Page 107


CHAPTER=CR1


Correlation and Regression 1 Context

The statistical techniques we have considered so far in this module have concentrated on methods of describing a single variable (set of data). We now turn our attention to methods which will allow us to examine two variables to see to what extent they are related. This is called bivariate analysis.

It is often the case that two variables are related, i.e. changes in one variable are accompanied by changes in the other variable. Bivariate analysis is concerned with assessing how strong the relationship between two variables is (correlation) and modelling that relationship (regression).


Objectives


• understand the concept of correlation;
• construct and use a scatter plot to assess the existence of a linear relationship between two variables;
• know the difference between positive and negative correlation;
• be able to calculate and interpret the product moment correlation coefficient;
• be able to calculate and interpret Spearman’s rank correlation coefficient;
• be aware of the problems of spurious correlation.


Correlation Analysis

The basic idea of correlation analysis is to assess the strength of the linear (straight line) association that may exist between two variables.

Scatter Plot

A useful method of investigating if there is a relationship between two variables is to draw a scatter plot. Plot one of the variables along the x axis and the other one along the y axis and examine the resulting scatter plot for any pattern. The correlation analysis we are considering assesses the strength of a linear relationship so we are hoping to see the pattern in the scatter plot suggesting a straight line relationship between the two variables.



Example

A sales manager has collected information for ten of his staff relating to their length of experience in years and their annual sales. The data that was collected is shown below.

Experience (in years) Annual sales (£000’s)

1 40

2 49

3 46

4 51

5 52

6 56

7 60

8 62

9 59

10 68

In this example we would expect the length of experience to explain the annual sales. So we will make the length of experience the explanatory variable and the annual sales the response variable (we discussed the ideas of explanatory and response variables in the unit Summarising and Presenting Data 1). Having decided this we will produce a scatter plot with the explanatory variable (years of experience) on the horizontal (x) axis and response variable (annual sales) on the vertical (y) axis.

N.B. It doesn’t matter which way round we plot the variables for correlation analysis, i.e. we don’t need to decide which variable is the explanatory and which is the response at this stage. However, we will need to be able to make this distinction in the next unit.



A scatter plot of this data set is as follows:

[Scatter plot: annual sales (£000s) on the vertical axis against years of experience on the horizontal axis.]

To produce this plot we begin with the first pair of observations. This relates to an individual with 1 year of experience who made 40 thousand pounds worth of sales. To plot this point move along the horizontal axis to x = 1, then go vertically to y = 40 and place a dot at the intersection. This process is repeated for the remaining pairs of data.

This scatter plot does suggest that there is some relationship or correlation between length of experience and annual sales. As the length of experience increases, the annual sales also increase. Furthermore, the points seem to follow the pattern of a straight line. This means that the correlation analysis techniques that we will be discussing can be used to assess the strength of this relationship.


Degrees of Correlation

The relationship between two variables can be classified as one of the following.

• A perfect relationship (perfect correlation)
• A partial relationship (partly correlated)
• No relationship (uncorrelated)

Furthermore, the relationship can be described as positive or negative.

Positive correlation

As one variable increases so does the other. Low values of one variable are associated with low values of the other variable and high values of one variable are associated with high values of the other.



Negative correlation

As one variable increases the other one decreases. So high values of one variable are associated with low values of the other variable.

All these differing degrees of correlation and the positive or negative nature of the relationship can be illustrated using scatter plots.

[Figure: two scatter plots of y against x. Perfect negative correlation: the points lie on a line with negative slope, r = –1. Perfect positive correlation: the points lie on a line with positive slope, r = 1.]

If the relationship between the two variables is perfect then all the values lie on a straight line. An exact linear relationship exists between the two variables. If the line slopes down the relationship is negative, if the line slopes up the relationship is positive.

As the points begin to move away from the line the relationship gets weaker. The following two plots indicate what you can expect to see on a scatter plot as the relationship between the two variables weakens. These plots demonstrate partial correlation which is what you’re most likely to meet in practice.

[Figure: two scatter plots. Weak negative correlation (x and y linearly related to some extent): quantity sold (y) against price (x). Strong positive correlation (x and y linearly related to a large extent): college GPA (y) against school GPA (x).]



If there is no relationship between the two variables then the scatter plot will resemble a random scatter of points as shown below.

[Figure: scatter plot of income (y) against height (x). Zero correlation, r = 0 (x and y not linearly related).]


Product Moment Correlation Coefficient

The degree of the relationship between the two variables can be measured and we can decide numerically if the relationship is perfect or partial. If two variables are partially correlated we can decide if the relationship is strong or weak.

The degree of correlation is measured by the product moment correlation coefficient which we will denote by the letter r.

The correlation coefficient will always be a number between –1 and +1. If you get a value outside of this range you have made a mistake.

• If r = +1 we have perfect positive correlation.
• If 0 < r < 1 we have partial positive correlation.
• If r = 0 we have no correlation.
• If –1 < r < 0 we have partial negative correlation.
• If r = –1 we have perfect negative correlation.
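As a rough illustration of the rules above, the following Python sketch turns a value of r into the matching description. The function name describe_correlation is an assumption made for this illustration and is not part of the module.

```python
def describe_correlation(r):
    """Return the verbal description of a correlation coefficient r."""
    if r < -1 or r > 1:
        # r can never fall outside -1 to +1, so this signals a calculation error
        raise ValueError("r must lie between -1 and +1")
    if r == 1:
        return "perfect positive correlation"
    if r == -1:
        return "perfect negative correlation"
    if r == 0:
        return "no correlation"
    if r > 0:
        return "partial positive correlation"
    return "partial negative correlation"

print(describe_correlation(0.96))
print(describe_correlation(-0.4))
```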

The correlation coefficient is calculated using the following formula

\[
r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left(n\sum x^2 - \left(\sum x\right)^2\right)\left(n\sum y^2 - \left(\sum y\right)^2\right)}}.
\]

The letters x and y represent the pairs of data for the two variables and n is the number of pairs of data used in the analysis.

The formula may look complicated but all it involves is calculating relevant sums and substituting them into the formula.



Example continued

In this example, the length of experience is referred to as x and the annual sales is referred to as y. The formula requires us to calculate 5 sums:

• ∑x. This is calculated by adding up the variable referred to as x, experience.

• ∑y. This is calculated by adding up the variable referred to as y, annual sales.

• ∑x². This is calculated by first squaring each value of the variable x and then adding these values up.

• ∑y². This is calculated by first squaring each value of the variable y and then adding these values up.

• ∑xy. This is calculated by multiplying each value of x by its corresponding y value and adding the results up.


Product Moment Correlation Coefficient Continued

The best way to perform the calculations is to set them out in a table.

Experience (x)   Annual Sales (y)   x²          y²              xy
1                40                 1×1=1       40×40=1600      1×40=40
2                49                 2×2=4       49×49=2401      2×49=98
3                46                 9           2116            138
4                51                 16          2601            204
5                52                 25          2704            260
6                56                 36          3136            336
7                60                 49          3600            420
8                62                 64          3844            496
9                59                 81          3481            531
10               68                 100         4624            680

∑x = 55          ∑y = 543           ∑x² = 385   ∑y² = 30107     ∑xy = 3203



Having calculated the sums we now need to substitute them into the formula.

\[
r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left(n\sum x^2 - \left(\sum x\right)^2\right)\left(n\sum y^2 - \left(\sum y\right)^2\right)}}.
\]

We have 10 pairs of data in this example so n = 10. Substituting this value for n and the sum (Σ) values from the table gives:

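\[
r = \frac{(10 \times 3203) - (55 \times 543)}{\sqrt{\left[(10 \times 385) - 55^2\right]\left[(10 \times 30107) - 543^2\right]}}
\]

\[
r = \frac{32030 - 29865}{\sqrt{[3850 - 3025][301070 - 294849]}} = \frac{2165}{\sqrt{825 \times 6221}}
\]

\[
r = \frac{2165}{\sqrt{5132325}} = \frac{2165}{2265.4635} = 0.96.
\]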

This confirms what we saw on the scatter plot. The correlation between years of experience and annual sales is positive. So as a member of staff gains more experience their sales increase. The relationship is also very strong as r=0.96 is quite close to 1. This is substantiated by the points on the graph following the pattern of a straight line quite closely.
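The hand calculation above can be reproduced with a short Python sketch. It is offered as a check on the arithmetic rather than as the module's prescribed method; the function name product_moment_r is an assumption made for this illustration. It builds the same five sums as the table and substitutes them into the formula for r.

```python
from math import sqrt

def product_moment_r(xs, ys):
    """Product moment correlation coefficient from paired data."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

experience = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]        # years (x)
sales = [40, 49, 46, 51, 52, 56, 60, 62, 59, 68]    # annual sales in £000s (y)

print(round(product_moment_r(experience, sales), 2))  # 0.96
```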



Exercise S8.1

A large industrial plant has seven divisions that do the same type of work. A safety inspector visits each division of 20 workers regularly. The number of work hours devoted to safety training and the number of work hours lost due to industry related accidents are recorded for each separate division in the following table and scatter plot.

Hours in safety training 10.0 19.5 30.0 45.0 50.0 65.0 80.0

Hours lost due to accidents 80 65 68 55 35 10 12



[Scatter plot of number of hours lost due to accidents (vertical axis) against number of hours of safety training (horizontal axis).]

i) What does this information tell you about the relationship between safety training and the number of accidents?

ii) If x represents hours in safety training and y represents hours lost due to accident, calculate the product moment correlation coefficient. You may find it easier to make use of the following table.

x y x² y² xy

10.0 80

19.5 65

30.0 68

45.0 55

50.0 35

65.0 10

80.0 12

∑x =   ∑y =




Spearman's Rank Correlation Coefficient

In the example and exercise above the data were given in terms of the values of the relevant variables. So in the example we knew how many years of experience the salesmen had and the actual value of their annual sales. Sometimes, however, the data may be in terms of the order or rank of the data rather than actual values. So in the example we could have been given the salesmen's ranks in terms of length of service and their sales ranked according to who sold the most. So the salesman with the highest sales would get a rank of 1, the next highest would get a rank of 2 and so on. When this happens the product moment correlation coefficient is no longer appropriate and a correlation coefficient known as Spearman’s Rank Correlation Coefficient, rs, should be calculated using the following formula.

\[
r_s = 1 - \left[\frac{6 \sum d^2}{n\left(n^2 - 1\right)}\right]
\]

where n = number of pairs of data as in the product moment correlation coefficient.

d = the difference between the rankings in each set of data.

As with the product moment correlation coefficient, Spearman’s Rank Correlation Coefficient, rs, will be a number between –1 and +1 and it is interpreted in the same way.

Example

The following data set shows the placing of seven students in their statistics and economics examination. A 1 indicates the student who performed the best.

Student A B C D E F G

Statistics placing 2 1 4 6 5 3 7

Economics placing 1 3 7 5 6 2 4

From the table it is clear that Student B produced the best performance on the Statistics examination and student G produced the worst performance. Similarly, Student A produced the best performance on the Economics paper and student C produced the worst performance. What we now want to know is if there is any relationship between the placing on the two papers. In other words, if you do well in statistics are you also likely to do well in economics and vice versa.



To do this we will calculate Spearman’s Rank Correlation Coefficient, rS. The first thing we need to do is calculate the differences in the ranks and square these differences, which we will do in a table.

Student   Statistics Rank   Economics Rank   d               d²
A         2                 1                (2 – 1) = 1     1² = 1
B         1                 3                (1 – 3) = –2    (–2)² = 4
C         4                 7                –3              (–3)² = 9
D         6                 5                1               1
E         5                 6                –1              1
F         3                 2                1               1
G         7                 4                3               9

∑d² = 26

Then we need to substitute ∑d² and n into the formula for Spearman’s Rank Correlation Coefficient.

\[
r_s = 1 - \left[\frac{6 \sum d^2}{n\left(n^2 - 1\right)}\right]
\]

\[
r_s = 1 - \left[\frac{6 \times 26}{7\left(7^2 - 1\right)}\right] = 1 - \frac{156}{7(49 - 1)} = 1 - \frac{156}{7 \times 48} = 1 - \frac{156}{336} = 1 - 0.4643 = 0.5357.
\]

The correlation is positive, which suggests that a high placing on the Statistics paper corresponds to a high placing on the Economics paper. However, the correlation is not strong as the value of the correlation coefficient is only 0.5357.
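The same calculation can be sketched in Python as a check on the arithmetic. The function name spearman_rs is an assumption made for this illustration, and the sketch assumes, as the formula above does, that there are no tied ranks.

```python
def spearman_rs(ranks_a, ranks_b):
    """Spearman's rank correlation coefficient for two sets of untied ranks."""
    n = len(ranks_a)
    # d is the difference between the two rankings for each item
    sum_d2 = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - (6 * sum_d2) / (n * (n ** 2 - 1))

# Placings for students A to G, taken from the table above
statistics_rank = [2, 1, 4, 6, 5, 3, 7]
economics_rank = [1, 3, 7, 5, 6, 2, 4]

print(round(spearman_rs(statistics_rank, economics_rank), 4))  # 0.5357
```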





Exercise S8.2

A student organisation surveyed both recent graduates and current students to obtain information on the quality of teaching at a university. An analysis of the responses produced the teaching ability ranking shown in the following table.

Professor A B C D E F G H I J

Current Students 4 6 8 3 1 2 5 10 7 9

Recent Graduates 6 8 5 1 2 3 7 9 4 10

Is there a relationship between the rankings of the current students and the recent graduates? You may make use of the following table in calculating your answer.

Professor Current Students Recent Graduates

A 4 6

B 6 8

C 8 5

D 3 1

E 1 2

F 2 3

G 5 7

H 10 9

I 7 4

J 9 10



It is possible to get tied ranks when presenting data in this way. This would happen, for example, if two students did equally well in the Statistics paper and got joint first place. Chapter 11 of Essential Quantitative Methods for Business, Management and Finance (third edition) by Les Oakshott discusses how to calculate Spearman’s Rank Correlation Coefficient when you have tied ranks.

Note

It is worth remembering that ranked data are less precise than actual data values. An item ranked first may be slightly better than an item ranked second or it may be much better. It follows that the results of rank correlations are less precise and we must interpret them carefully. Wherever possible, the product moment correlation coefficient must be used. There is a question in the seminar problems which illustrates this point.


Correlation and Causation

A causal relationship is one where the value of one variable is directly attributable to the value of the other. A causal relationship between two variables implies a strong correlation. However a strong correlation does not imply a causal relationship. When interpreting r it is important to realise that there may be no direct connection at all between strongly correlated variables. Such correlation is termed Spurious Correlation.

What we can conclude when we find two variables with a strong correlation is that there is a relationship between the two variables, not that a change in one causes a change in the other. The relationship may be due to the dependence of both variables on a third variable. For example, sales of ice cream and sunglasses are strongly correlated, not because of a direct causal link but because the weather influences both variables.

For a discussion of correlation and causation see Business Basics Quantitative methods or any other suitable text from the reading list.




Seminar Questions


The following figures give (in units of £10m) the turnover and profit before taxation for a firm. Calculate the product moment correlation coefficient and comment on the result.

Turnover 106 125 147 167 187 220

Profit 10 12 16 17 18 22

ENDSECTION STARTSECTION=activity_4.htm= SECTION~


The finance division of a large company is investigating its procedures for the selection of new accountancy trainees. Potential applicants are given, prior to appointment, both a written test and a formal interview. The performances of eight successful applicants were rated after their first full year with the company. The independent rankings of written test, interview assessment and job performance for the eight trainees are given as follows.

Trainee A B C D E F G H

Written test 6 2 7 4 1 5 3 8

Interview 1 4 2 3 6 5 8 7

Job Performance 1 2 3 4 5 6 7 8

a) Calculate the Spearman’s Rank Correlation coefficient between:

i) job performance and written test

ii) job performance and interview assessment.

b) Which of the variables, written test and interview assessment is more strongly related to job performance?

c) Can you conclude that the variable most weakly correlated with job performance is not necessary as part of the selection process? Explain your answer.





Produce a scatter plot of the following set of data.

X 2 4 6 8

Y 10 20 30 40

a) Without performing any calculations what would be the value of both the product moment correlation coefficient and the Spearman’s rank correlation coefficient? Explain your answer.

b) Suppose the data were altered slightly due to a typing error and the data is now

X 2 4 6 8

Y 10 24 30 47

Produce another scatter plot of this data set. From this plot and the data set would you expect there to be a change in the values of;

i) The Product Moment Correlation Coefficient

ii) Spearman's Rank correlation Coefficient

Explain your answer. What does this question illustrate about the precision of each of these correlation coefficients?



Consider the following data set.

Year 1986 1987 1988 1989 1990

Girls' average weekly pocket money (in pence) 114 120 122 136 147

Cautions for violent offences (thousands) 9.5 11.3 12.7 14.7 16.8

a) Produce a scatter plot of the data and calculate the product moment correlation coefficient between pocket money and the number of cautions for violent offences.

b) What explanation can you come up with for the resulting strong correlation?





If you have understood the content of this unit you should have knowledge of and be able to answer any questions relating to all of the following points:

• scatter plots and discussing the nature of any apparent pattern between two sets of data;
• calculating and interpreting the product moment correlation coefficient;
• calculating and interpreting the Spearman’s Rank Correlation Coefficient;
• understanding the way in which the product moment correlation coefficient is more precise than Spearman’s Rank Correlation Coefficient;
• understanding the idea of spurious correlation.

Extra Activities








CHAPTER=CR2


Correlation and Regression 2 Context

In the last unit we discussed the use of correlation and scatter plots as tools to assess if a relationship existed between two sets of data and a means of deciding how strong the relationship was.

We said that correlation tests the strength of a linear relationship that may exist between two sets of data. So on the scatter plot we were looking for the points to be following a pattern that suggested a straight line. The word used to describe the overall shape of points plotted on a scatter plot is the trend. In the first example in the last unit the trend of the points was to follow a straight line sloping upwards and in the exercise the trend of the points was to follow a straight line sloping downwards. However, it is possible to be a bit more precise about this pattern. Imagine drawing a straight line through the middle of the points on the scatter plots. Finding the equation of such a line would allow us to move from vague, wordy descriptions of the trend to a more precise mathematical description of the relationship in question. Once the line has been defined in this more formal way, it becomes possible to make predictions about where we think other points may lie.


Objectives


• produce a scatter plot and identify a line of best fit by eye;
• understand the basis of the least squares estimation of the line of best fit;
• use the formula to calculate the equation of the line of best fit;
• assess the fit of the line using the coefficient of determination;
• use a line of best fit for estimation and prediction and be aware of the reliability of the resulting estimates.




Recap on the equation of a straight line

This process of fitting the best line through the points on a scatter plot involves a few basic mathematical ideas about graphs and straight lines.

The equation that describes a linear (straight line) relationship between two variables x and y is

y = a + bx

This is the general equation of a straight line and describes all possible straight lines that you could draw on a set of axes. What we want is the very specific straight line which best describes or best fits the points on the scatter plot. In order to define a specific straight line we need to know two pieces of information:

• The slope of the line (b)
• Where the line crosses the y axis (a)

The slope of the line can be either positive or negative depending on whether the line slopes upwards or downwards.

So calculating a line of best fit for a set of data means we need to find appropriate values for a and b in the above equation of a straight line.

For more discussion of the mathematical ideas of straight lines refer to Chapter 1 of Essential Quantitative Methods for Business, Management and Finance (third edition) by Les Oakshott. Also work through unit M7.


Fitting the best line by eye

One method of finding the best line that passes through a set of points on a scatter plot is to use your own judgement as to what is the best line. Using a ruler draw a line through the middle of the points so that there are just as many points above the line as below the line. Having done this there are methods that will allow you to calculate values for a and b resulting in the equation of this line. However, this effort is unnecessary really as you can use the line itself on the scatter plot for estimation.




The Method of Least Squares

When fitting a line through the data points by eye you will probably draw a line which goes as close to the points as possible. The method of least squares is a mathematical method for calculating a and b so that the sum of the squares of the vertical distances from the points to the resulting line is a minimum.

Suppose we have a scatter plot of the data such as the one that follows. The line of best fit will not pass perfectly through all the points which means there is a deviation from each point to the line. Obviously the “best line” will be one such that these deviations are as small as possible. Another way of saying this is that the best line will be one such that the sum of the deviations from the points to the line will be as small as possible (i.e. the sum is minimised). However, as some of the points are above the line and some are below the line this means some of the deviations are positive and some are negative. As we already know from the lecture on standard deviation, when you add positive and negative numbers together they cancel out. So instead of finding values for a and b such that the sum of the deviations is minimised, we will square all the deviations and find values for a and b such that the sum of the squared deviations is minimised.

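[Figure: scatter plot of y against x with a fitted straight line, showing the deviations d₁, d₂, ..., dₙ of the points from the line.]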

So the method of least squares chooses values for a and b so that the sum of the squared deviations;

\[
d_1^2 + d_2^2 + \dots + d_n^2
\]

is as small as possible.
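A small Python sketch may help show why the deviations are squared before they are added. It uses the experience and sales data from the last unit and two candidate lines chosen purely for illustration: a flat line at the mean of y (a = 54.3, b = 0) and a sloping line (a = 40, b = 2.6). Both candidate lines and the function names are assumptions made for this sketch, not values from the module. For both lines the plain deviations cancel to zero, so the plain sum cannot tell them apart, whereas the sum of squared deviations is far smaller for the sloping line.

```python
def sum_deviations(a, b, xs, ys):
    """Sum of the signed deviations of each point from the line y = a + bx."""
    return sum(y - (a + b * x) for x, y in zip(xs, ys))

def sum_squared_deviations(a, b, xs, ys):
    """Sum of the squared deviations of each point from the line y = a + bx."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

experience = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]        # years (x)
sales = [40, 49, 46, 51, 52, 56, 60, 62, 59, 68]    # annual sales in £000s (y)

# Two hypothetical candidate lines, given as (a, b)
candidates = {"flat line": (54.3, 0.0), "sloping line": (40.0, 2.6)}

for name, (a, b) in candidates.items():
    plain = sum_deviations(a, b, experience, sales)
    squared = sum_squared_deviations(a, b, experience, sales)
    print(name, round(plain, 6), round(squared, 2))
```

The least squares method searches over all possible values of a and b for the pair giving the smallest sum of squared deviations.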



The mathematics involved in proving how to calculate a and b to achieve this is quite involved. However, it can be proved that if we use the formulae

\[
b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2}
\]

and

\[
a = \frac{\sum y}{n} - b\,\frac{\sum x}{n} = \bar{y} - b\bar{x}
\]

the resulting straight line will be such that the sum of the squares of the vertical distances from the points to the line is a minimum. The resulting regression line is called the Least Squares Regression Line of y on x. If you compare the formula for b to the formula for r last week you will see that they are very similar:

\[
r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left(n\sum x^2 - \left(\sum x\right)^2\right)\left(n\sum y^2 - \left(\sum y\right)^2\right)}}
\]

The top lines (numerators) of r and b are identical and the bottom line (denominator) of b is part of the expression in the denominator of r. This emphasises the fact that correlation and regression are closely related and the calculations involved in regression will be pretty similar to those of correlation.


The Method of Least Squares continued

Example

For this example we will return to the data set which we used in the last unit when we discussed correlation.

A sales manager has collected information for ten of his staff relating to their length of experience in years and their annual sales. The data that was collected is shown below.



Experience (in years) Annual sales (£000’s)

1 40

2 49

3 46

4 51

5 52

6 56

7 60

8 62

9 59

10 68

As we discussed last week, we would expect years of experience to explain annual sales so we will make experience the x variable or explanatory variable and annual sales the y variable or response variable. (We shall return to this idea later.)

Just to remind ourselves a scatter plot of this data set was:

[Scatter plot: annual sales (£000s) on the vertical axis against years of experience on the horizontal axis.]

These points clearly lie close to a straight line so to continue with a least squares regression analysis is sensible.

As with correlation, the calculations involve evaluating various sums and then substituting them into the formula. So the process of the calculation will be very similar to that in the last unit and is best performed in a table.



The table of calculations from last week was:

Experience (x)   Annual Sales (y)   x²          y²              xy
1                40                 1×1=1       40×40=1600      1×40=40
2                49                 2×2=4       49×49=2401      2×49=98
3                46                 9           2116            138
4                51                 16          2601            204
5                52                 25          2704            260
6                56                 36          3136            336
7                60                 49          3600            420
8                62                 64          3844            496
9                59                 81          3481            531
10               68                 100         4624            680

∑x = 55          ∑y = 543           ∑x² = 385   ∑y² = 30107     ∑xy = 3203

To calculate the regression line we need all of these sums apart from ∑y² = 30107.

To calculate the equation of the regression line we will first calculate the value of b:

\[
b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2}
\]

Substituting in the values for the various sums from the above table we get;

\[
b = \frac{(10 \times 3203) - (55 \times 543)}{(10 \times 385) - 55^2} = \frac{32030 - 29865}{3850 - 3025} = \frac{2165}{825} = 2.62.
\]

Having calculated a value for b we need to use this value in the calculation of a.

\[
a = \frac{\sum y}{n} - b\left(\frac{\sum x}{n}\right) = \frac{543}{10} - 2.62 \times \left(\frac{55}{10}\right) = 54.3 - (2.62 \times 5.5) = 54.3 - 14.41 = 39.89.
\]



The equation of the resulting regression line of y (annual sales ) on x (years of experience) is

y = 39.89 + 2.62x.
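The calculation of a and b can be sketched in Python as a check on the arithmetic. The function name least_squares_line is an assumption made for this illustration. Note that the sketch does not round b before using it to calculate a, so the intercept comes out as roughly 39.87 rather than the 39.89 obtained above with b rounded to 2.62; the difference is purely a rounding effect.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the least squares regression line y = a + bx of y on x."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * sum_x / n
    return a, b

experience = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]        # years (x)
sales = [40, 49, 46, 51, 52, 56, 60, 62, 59, 68]    # annual sales in £000s (y)

a, b = least_squares_line(experience, sales)
print(round(a, 3), round(b, 4))  # 39.867 2.6242
```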

The scatter plot with the line of best fit drawn on it is:

[Scatter plot: annual sales (£000s) against years of experience, with the line of best fit drawn through the points. The chart labels the fitted line y = 2.6242x + 39.867.]




Interpreting the coefficients

The terms a and b are called the coefficients of the least squares regression line. We already know that, in the straight line,

• a is the intercept of the line with the y axis. Or a is the value of y when x = 0.

• b is the slope of the line.

But what do the values of a and b actually mean in the context of this data?

• a is the value of the y variable (annual sales) when the x variable (experience) = 0. So in the context of this question, when x (experience) = 0, y (annual sales) = a (39.89). So this means we would expect a salesman with no experience to make, on average, 39.89 (£000’s) worth of sales in a year.

• b is the value of the slope. In this example b is positive because the line slopes upwards. The value of b tells you what the change in y will be when x increases by 1. So in this question, if x (years of experience) increases by 1, y (annual sales) will change by 2.62. This is a positive change so as a salesman gains 1 more year of experience you should expect to see his annual sales increase by 2.62 (£000’s).

For a fuller discussion of the interpretation of the coefficients in a least squares regression line refer to Chapter 11 of Essential Quantitative Methods for Business, Management and Finance (third edition) by Les Oakshott.


ENDSECTION STARTECTION=activity_1.htm= SECTION~



Exercise S9.1

Return to the exercise from the last unit concerning the number of hours spent on safety training and the number of hours lost due to accidents. The data set and scatter plot were as follows:

Hours in safety training 10.0 19.5 30.0 45.0 50.0 65.0 80.0

Hours lost due to accidents 80 65 68 55 35 10 12

[Scatter plot: hours lost due to accidents on the vertical axis against hours in safety training on the horizontal axis.]

• Would it be sensible to continue with a least squares regression analysis on this data? Explain your answer.

• Calculate the least squares regression line of hours lost due to accidents (y) on time spent in safety training (x).

• Interpret the coefficients in your least squares regression line.

You can use the following information,

∑x = 299.5   ∑y = 325   ∑xy = 9942.5

∑x² = 16530.25   ∑y² = 19743




Uses of Regression

Having fitted the relationship between the two variables it can be used to estimate values of y for given values of x or to exercise control.

Estimation

If we know there is a very close relationship between the years of experience a salesman has and their annual sales, we could estimate the annual sales for a salesman with a given amount of experience.

Example

What would be the estimated annual sales for a salesman with 6 years of experience? Obviously within the data set we have a salesman with precisely 6 years of experience. However, the y value of 56 which corresponds with this individual in the data set is his annual sales, not the general amount of sales you could expect of any salesman who had 6 years of experience.

The regression equation for this example was

y = 39.89 + 2.62x.

We want to know the estimated value of y when x = 6.

y = 39.89 + (2.62 × 6) = 39.89 + 15.72 = 55.61.

So we would expect a salesman with 6 years of experience to make, on average, 55.61 (£000’s) worth of sales in a year.

Obviously the mathematics involved in using a regression line for estimation is fairly simple. However, you do need to be careful when using the estimated values and somehow judge their reliability. Obviously part of the reliability of the estimate will be attributable to how well the line fits the data which is a point we will be returning to. If the data points very closely follow the suggested line, the line will be a good fit to the data and the estimates will be fairly good. Another important part of the reliability of the estimate is the value of x that you are trying to estimate from.



Interpolation

If the given x value which you are estimating from is within the range of x values used to fit the regression line, then you can estimate y with a fair degree of confidence. In the estimation above, we were estimating y when x = 6. The regression line was fitted on values of x ranging from 1 to 10, so this is an interpolated estimate and we can be fairly confident in our answer.

Extrapolation

If the given x value is outside the range of values of x used to fit the regression line, then the estimate for y needs to be treated with some caution. There is no evidence to suggest that the regression relationship holds outside the fitted range of x as you have no data there.
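The distinction between interpolation and extrapolation can be sketched in Python. The function name estimate_sales and its default arguments are assumptions made for this illustration: a and b are the rounded coefficients from the worked example, and x_min and x_max are the smallest and largest years of experience in the data set (1 and 10).

```python
def estimate_sales(x, a=39.89, b=2.62, x_min=1, x_max=10):
    """Estimate y from y = a + bx and say whether x lies inside the fitted range."""
    y = a + b * x
    if x_min <= x <= x_max:
        kind = "interpolation"   # inside the range of the data: fairly reliable
    else:
        kind = "extrapolation"   # outside the range of the data: treat with caution
    return round(y, 2), kind

print(estimate_sales(6))    # (55.61, 'interpolation')
print(estimate_sales(15))   # an extrapolated estimate
```

The arithmetic is the same in both cases; only the confidence we can place in the answer differs.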

Refer to Chapter 11 of Essential Quantitative Methods for Business, Management and Finance (third edition) by Les Oakshott for more discussion on the reliability of estimates.

Control

In the above example, we know from the regression relationship that a salesman with 6 years of experience should be making, on average, 55.61 (£000’s) worth of sales in a year. If a salesman with 6 years of experience in the company was only making 20 (£000’s) in a year it would alert you to a potential problem that may be worth investigating.

Similarly, suppose we have a regression relationship of the cost of maintenance of a machine on the age of the machine. We can use the regression line to estimate the maintenance cost of a machine of a given age. If the actual maintenance cost is higher than expected, it indicates that the machine is not functioning as it should. An overhaul of the machine may rectify matters and reduce the maintenance costs.




Explanatory and Response Variables

When dealing with two variables it is important to know which is the response (or dependent) variable and which is the explanatory variable. We have looked at this before but a brief summary follows:

• The explanatory variable (x) is not affected by changes in the other variable but it can be used to help explain changes in the other variable. In the example of experience and annual sales you would expect salesmen with greater experience to make more sales. So the length of experience can be used to explain the sales, and experience is the explanatory (x) variable.

• The response variable (y) is affected by changes in the other variable.

Sometimes it is obvious which variable is the response and which variable is the explanatory. In other situations it is not. Suppose we collect data from individuals which consists of recording their weight and height. In trying to decide between the explanatory variable and the response, does your height explain your weight or does your weight explain your height? It could be either. If you’re not sure which variable to make the response (y) and which the explanatory (x), always make the variable to be estimated y. So in the situation of wanting to use someone’s height to estimate their weight, you would make the weight the response, y.


Coefficient of Determination

The product moment correlation coefficient, r, can be used to evaluate the coefficient of determination, r². The coefficient of determination specifies how much of the variation in the response variable is explained by variation in the explanatory variable.

Example

Last week we showed that the correlation between length of experience and annual sales of salesmen was

\[ r = 0.96. \]

The coefficient of determination is then

\[ r^2 = 0.96^2 = 0.9216. \]

This is usually quoted as a percentage so r² = 92.16%. This tells us that 92.16% of the variation in annual sales can be explained by variation in years of experience. This is obviously very high so the fit is good and interpolated estimates will be fairly reliable.

Clearly, for a regression to be useful we would want r² to be quite large. ENDSECTION STARTSECTION=content_8.htm= SECTION~

Extreme Values

As we have already discussed when we were calculating the mean, extreme values can distort the results of statistical analyses. This is particularly true of regression analyses. Extreme values can have a big effect on a regression line and need to be given careful consideration. Consider the following data set.

x 10 8 9 11 14 6 4 12 7 5

y 7.46 6.77 7.11 7.81 8.84 6.08 5.39 8.15 6.42 5.73

The scatter plot with the fitted regression line indicates that the points follow a very good straight line pattern, the correlation is very high and the resulting regression line would give us excellent estimates:

y = 0.35x + 4.01

r = 0.99

[Figure: scatter plot of y (vertical axis, 0 to 10) against x (horizontal axis, 0 to 16) with the fitted regression line.]

If we now take the same data set but add in the additional point x =13, y =23.5, (this point is shown in bold in the table below) what happens to the fitted regression line?

x 10 8 9 11 14 6 4 12 7 5 13

y 7.46 6.77 7.11 7.81 8.84 6.08 5.39 8.15 6.42 5.73 23.5



As the scatter graph and regression line show, although there is only one point which is different between these two data sets, the correlation in the second data set has dropped considerably, from 0.99 to 0.58, which means estimates will be less reliable. The plot also shows that the fitted regression line does not fit the majority of the data well at all. The one extra point, which is an extreme value, has pulled the line away from a very good fit to quite a poor fit. So this one extreme point will have quite a large detrimental effect on the estimation of y from this data set.

y = 0.89x + 0.46

r = 0.58

[Figure: scatter plot of y (vertical axis, 0 to 25) against x (horizontal axis, 0 to 16) with the fitted regression line.]
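The effect of the extra point can be checked numerically. The sketch below is an illustration in Python rather than the method used in the notes: it fits a least squares line and computes the product moment correlation coefficient for the two data sets, using the usual sums of squares. The function name `fit_line` is made up for this example.

```python
from math import sqrt


def fit_line(xs, ys):
    """Return (slope, intercept, r) for the least squares line of y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)
    return slope, intercept, r


# Original ten points from the table.
x1 = [10, 8, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [7.46, 6.77, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]

# Same data with the extreme point (13, 23.5) added.
x2 = x1 + [13]
y2 = y1 + [23.5]

for label, xs, ys in [("without extreme point", x1, y1),
                      ("with extreme point", x2, y2)]:
    slope, intercept, r = fit_line(xs, ys)
    print(f"{label}: y = {slope:.2f}x + {intercept:.2f}, r = {r:.2f}, r^2 = {r * r:.2f}")
```

Running the sketch shows how a single extreme value changes the slope, the intercept and the correlation at once.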




Seminar Questions


A company is interested in the effectiveness of its advertising expenditures. An experiment is conducted to study how the amount spent on advertising affects the sales of a soft drink the company produces. Ten sales areas are included in the experiment. Each area spends its allocated advertising budget (in 10,000’s of £) on a prime time television commercial. The observations recorded below show the expenditures and sales for the 10 areas.

Advertising expenditure (x) 2 2 3 4 5 5 6 7 7 8

Sales (y) 8 4 10 7 11 15 19 16 23 20

i) Explain why the sales variable has been designated as the response (y) and the advertising as the explanatory variable (x).

ii) Produce a scatter plot of this data and indicate if regression analysis is suitable for this data (use a scale of 0 to 16 on the x axis).

iii) Calculate the regression line of sales on advertising expenditure. Interpret the coefficients in this regression line.

iv) Calculate the coefficient of determination and interpret this value.

v) Estimate the sales for an area with an advertising budget of 3.5 (10,000’s of £) and estimate the sales for an area with an advertising budget of 15 (10,000’s of £). Comment on the reliability of these estimates.

vi) Additional data becomes available for 4 further areas which is detailed below.

Additional x values 11 12 14 16

Additional y values 22 21 22 20

Add these points to your plot. What does this additional information tell you about the reliability of your estimate above for a sales advertising budget of 15?





A city council is considering increasing the number of police at public events, such as football matches, in an effort to reduce crime. Before making a final decision, the council asks the chief of police to survey other public events of similar size to determine the relationship between the number of police and the number of crimes reported. The chief gathered the following information.

Number of police (in hundreds) 15 17 25 27 17 12 11 22

Number of crimes (in tens) 17 13 5 7 7 21 19 6

i) If we want to estimate crimes based on the number of police, which variable is the response variable and which is the explanatory variable?

ii) Draw a scatter plot and comment on the suitability of using least squares regression analysis for this data.

iii) Calculate the appropriate least squares regression line.

iv) Calculate and interpret the coefficient of determination. ENDSECTION STARTSECTION=activity_4.htm= SECTION~


Consider the following set of data.

x 2 4 1 5 3

y 15 25 10 40 30

i) Produce a scatter plot of this data. Use a scale from -50 to 50 on the y axis.

ii) Does the relationship appear to be quite strong between these two sets of data? Calculate the product moment correlation coefficient to support your conclusion.

iii) A further data item becomes available so that the data set is:

x 2 4 1 5 3 6

y 15 25 10 40 30 -50



iv) Add this additional data point to the plot. What would you expect to happen to the value of the correlation coefficient and the fitted line? What does this show you about extreme values in regression analysis?



If you have understood the content of this unit you should have knowledge of, and be able to answer any questions relating to, all of the following points:

• the ideas underlying the method of least squares estimation of a regression line;

• calculating a least squares regression line;

• interpreting the coefficients of a least squares regression line;

• estimating and predicting values of the response variable from a least squares regression line and understanding the reliability of these estimates;

• calculating and interpreting the coefficient of determination;

• understanding the effect of extreme values on least squares regression lines.

Extra Activities






S10 Estimation Page 141


CHAPTER=ESTIM


Estimation Context

This unit will bring together ideas and concepts which you have studied in previous lectures. In particular we will revisit ideas covered in

• Sampling

• Means and standard deviations

• Normal distribution


Objectives


• use the information in a sample to calculate a point estimate of an unknown population mean;

• understand the concept of a sampling distribution and know that the sample mean, x̄, follows a normal distribution;

• calculate the standard error of x̄;

• calculate an interval estimate of an unknown population mean;

• discuss how the sample size and level of required confidence affect the calculated interval estimate.


Revision of Sampling

Earlier in the module we considered ways of selecting a sub group or sample from a population. Having done so, we only collect statistical information (data) from people, or items, in the sample. We also said that, provided the sample was selected in a representative or unbiased way, any results that are true for the sample will generalise to be a correct population result.

Example

Suppose we want to know the mean income of all single males in the U.K. To answer this question we decide to take a random sample of 20 single males and ask them their income, which produces the following results (in £).



12,000 26,000 35,000 24,500 18,000 15,500 28,500 18,000 54,000 43,000

17,500 22,000 21,500 26,000 16,000 16,500 24,500 27,500 29,000 17,000

We can calculate the mean salary for these 20 men using

\[ \bar{x} = \frac{\sum x}{n} = \frac{\pounds 492{,}000}{20} = \pounds 24{,}600. \]

This is the exact average salary for these 20 men. However, we want to know the average salary for all single men in the U.K. So we assume that our sample of 20 men was representative of all single men in the U.K. and say the average salary of all single men in the U.K. is also £24,600.

So we take a sample in order to estimate something about the population of interest as a whole. In the above example we asked a sample of 20 single men how much they earned and worked out the sample average, x̄. This then gives us an idea of the average earnings (population average) of all single males in the U.K. The sample average is likely to be close to the population average, provided the sample is representative, but it is unlikely to be totally precise. (The sample average x̄ is unlikely to be exactly the same as the population average.)

Obviously, if we had asked all the single males in the U.K. their salary and then worked out the average the answer would be precise. However, in the first unit on collecting data we discussed lots of reasons why it is often impractical to ask the whole population. Usually you have to use imprecise or incomplete sample information to find out something relating to the population.


Estimation notation

So our situation is that we have a population (all single males) and we want to know some information about the population. Usually we want information about the mean and the standard deviation for the population. The notation we will use for these two quantities is:

• μ = population mean (mean salary of all U.K. single males)

• σ = population standard deviation (standard deviation of salary of all U.K. single males)



We don’t ask all the population so we don’t know the values of μ and σ . We take a representative sample and calculate statistics from the sample information.

• x̄ = sample mean (mean salary of 20 single men in the sample)

• s = sample standard deviation (standard deviation of salary of 20 single men in the sample)

• n = sample size.

We use the information in the sample to talk about the population. ENDSECTION STARTSECTION=content_3.htm= SECTION~

Point Estimate of an unknown population mean μ

A point estimate is a single number which is used to estimate the population mean μ. The obvious point estimate of μ is the sample mean x̄.

So, in the last example we calculated x̄ = £24,600 from the sample of 20 men. What we are really interested in is the mean for all single men, the population mean μ. So we say the value of x̄ (sample mean) is a point estimate of μ (population mean).

An estimate of μ is x̄, so an estimate of μ is £24,600.

The average salary of all single men in the U.K. is approximately £24,600.

Problem

Someone else takes another sample of 20 single men. The average salary they calculate from their sample is £23,250! This is different to the average we calculated from the sample in the first example.

Both of these sample results are trying to estimate the value of the population mean μ (mean for all single men). Is one of the estimates better than the other? If so, which one is better?

The value we compute for x̄ will vary in a random manner from sample to sample. We are using the sample mean x̄ to estimate the population mean μ. We can’t expect x̄ to estimate μ perfectly as it is based on incomplete information. What we are concerned with in this lecture is the accuracy of the estimate of μ.




Sampling distribution of x̄

The fact that the value of x̄ changes from sample to sample can actually be used to help us. Imagine that we take lots of samples and calculate the sample mean from each. We are now left with a collection of sample means which we could plot as a frequency distribution. This distribution is called the sampling distribution of the mean. Theoretically it is the set of means of all possible samples of size 20 which could be taken from the population.

Suppose for the example concerning the salary of single men in the U.K. that, instead of taking 1 sample, we actually took 250 samples all of size 20. We could calculate the mean x̄ for each of these samples, which would give us 250 separate x̄ calculations. Some of these sample means may turn out to be the same but there would be a degree of variation between the 250 values calculated for x̄. A frequency distribution summary of these 250 values for x̄ might be:

Mean salary from sample (x̄) in £     Frequency f (number of samples)

14,000 but less than 16,000            3
16,000 but less than 18,000            7
18,000 but less than 20,000           16
20,000 but less than 22,000           30
22,000 but less than 24,000           44
24,000 but less than 26,000           50
26,000 but less than 28,000           44
28,000 but less than 30,000           30
30,000 but less than 32,000           16
32,000 but less than 34,000            7
34,000 but less than 36,000            3

                                      Σf = 250



A histogram of this distribution is:

[Figure: histogram of the 250 sample means. Horizontal axis: mean salary from sample in £, with class midpoints from 15,000 to 35,000; vertical axis: frequency, from 0 to 60.]

This suggests that the average salary of 20 men in a sample (x̄) might range from £14,000 to £36,000. The true mean of the population (μ), that is the true mean salary of all U.K. single males, presumably also lies somewhere in this range.

If we know about the sampling distribution of the mean, particularly the variability in the sampling distribution, we have information about how good a sample mean (x̄) is as an estimate of the population mean (μ). So we can use the sampling distribution to tell us something about how accurate x̄ is.
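The idea of taking 250 samples of size 20 can be imitated on a computer. The sketch below is only an illustration in Python: the population of salaries is invented (a deliberately skewed set of 100,000 values), so the numbers it prints will not match the table above. It shows that the sample means cluster around the population mean and vary much less than individual salaries do.

```python
import random
import statistics

random.seed(1)  # fixed seed so that the run is repeatable

# Hypothetical, deliberately skewed population of 100,000 salaries (in £).
population = [12000 + random.expovariate(1 / 12000) for _ in range(100_000)]

# Take 250 random samples of size 20 and record the mean of each one.
sample_means = []
for _ in range(250):
    sample = random.sample(population, 20)
    sample_means.append(statistics.mean(sample))

print("population mean:              ", round(statistics.mean(population)))
print("mean of the 250 sample means: ", round(statistics.mean(sample_means)))
print("smallest sample mean:         ", round(min(sample_means)))
print("largest sample mean:          ", round(max(sample_means)))
print("st. dev. of population:       ", round(statistics.pstdev(population)))
print("st. dev. of the sample means: ", round(statistics.pstdev(sample_means)))
```

The list `sample_means` plays the role of the 250 values of the sample mean summarised in the frequency distribution.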


Properties of the sampling distribution of the mean

A sampling distribution of the mean has the following important properties.

• It is very close to being normally distributed. This is true even if the distribution of the population from which the samples are drawn is nowhere near normal. The larger the sample, the more closely will the sampling distribution approximate to a normal distribution.

N.B. Remember, to characterise a normal distribution we need to know its mean and standard deviation.

• The mean of the sampling distribution is the same as the population mean μ .

• The sampling distribution has a standard deviation which is called the standard error of the mean (SE) and is calculated as

\[ SE = \frac{\sigma}{\sqrt{n}}. \]

We don’t know σ as this is the standard deviation of the population. However, we can calculate the standard deviation for the sample, s, and use this as an estimate of σ, so we can calculate the standard error of the mean using the formula

\[ SE = \frac{s}{\sqrt{n}}. \]

Using this information we can use x̄ to quote an estimate for μ and use information in the sampling distribution to tell us how good the estimate is by quoting a range of likely values for μ.

Pause for thought

Am I now suggesting that we take lots and lots of samples, calculate the mean of each and look at the distribution of the resulting set of sample means? Surely this would involve a lot of work, and one of the reasons for taking a sample in the first place was to reduce the amount of work we need to do? Fortunately, there is a very important statistical result which comes to our rescue and allows us to use the information contained in just one sample to come up with the sampling distribution of the mean. This result is called the Central Limit Theorem and is crucial to statistics.


Central Limit Theorem

Suppose we have a population with mean μ and standard deviation σ . Imagine that we take repeated samples of size n, where n is large, by which we mean that n ≥30.

The central limit theorem tells us that:

• The sampling distribution of the means is a normal distribution.

• The mean of the sampling distribution is the population mean μ.

• The standard deviation of the sampling distribution is the standard error of the mean, SE = σ/√n.

Putting this all together we can say that the sample mean x̄ follows a normal distribution with mean μ and standard deviation σ/√n:

\[ \bar{x} \sim N\left( \mu, \left( \frac{\sigma}{\sqrt{n}} \right)^{2} \right). \]

[Figure: normal curve centred on μ.]

We can use this fact to get extra information from our sample results.

Example on salary continued

The data concerning salary for our sample of 20 single men was

12,000 26,000 35,000 24,500 18,000 15,500 28,500 18,000 54,000 43,000

17,500 22,000 21,500 26,000 16,000 16,500 24,500 27,500 29,000 17,000

We already know that the mean for this data is

\[ \bar{x} = \frac{\sum x}{n} = \frac{\pounds 492{,}000}{20} = \pounds 24{,}600. \]

We can also calculate the standard deviation using

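\[
\begin{aligned}
s = \sqrt{\frac{\sum x^2}{n} - \bar{x}^2} &= \sqrt{\frac{14{,}054{,}000{,}000}{20} - 24{,}600^2} \\
&= \sqrt{702{,}700{,}000 - 605{,}160{,}000} \\
&= \sqrt{97{,}540{,}000} = 9876.2341
\end{aligned}
\]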

So the standard error of the means is σ/√n. Using s as an estimate for σ, the standard error of the means is

\[ \frac{\sigma}{\sqrt{n}} \approx \frac{s}{\sqrt{n}} = \frac{9876.2341}{\sqrt{20}} = 2208.3931. \]
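The calculations of the sample mean, standard deviation and standard error for the salary data can be reproduced as follows. This is a sketch in Python added for illustration; the variable names are arbitrary, and the standard deviation uses the same "divide by n" formula as the hand calculation.

```python
from math import sqrt

# Salaries (in £) of the 20 single men in the sample.
salaries = [12000, 26000, 35000, 24500, 18000, 15500, 28500, 18000, 54000, 43000,
            17500, 22000, 21500, 26000, 16000, 16500, 24500, 27500, 29000, 17000]

n = len(salaries)
mean = sum(salaries) / n                                   # sample mean
s = sqrt(sum(x ** 2 for x in salaries) / n - mean ** 2)    # standard deviation (divide by n)
standard_error = s / sqrt(n)                               # s used as an estimate of sigma

print(f"n = {n}")
print(f"mean = {mean:.2f}")
print(f"s = {s:.4f}")
print(f"standard error = {standard_error:.4f}")
```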



Using the central limit theorem we know that

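\[ \bar{x} \sim N\left( \mu, \left( \frac{\sigma}{\sqrt{n}} \right)^{2} \right), \quad \text{i.e.} \quad \bar{x} \sim N\left( \mu, (2208.3931)^{2} \right). \]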

Since almost all the normal distribution lies within 3 standard deviations of the mean, the range of this sampling distribution is

μ – (3 × 2208.3931) up to μ + (3 × 2208.3931).

Using x̄ as the estimate for μ this gives a range of

x̄ – (3 × 2208.3931) up to x̄ + (3 × 2208.3931)

i.e. 24,600 – (3 × 2208.3931) up to 24,600 + (3 × 2208.3931)

i.e. £17,974.82 up to £31,225.18.

So although x̄ is not a precise estimate of μ, we can be fairly confident that μ is in the range £17,974.82 up to £31,225.18.

[Figure: sampling distribution with £17,974.82, μ and £31,225.18 marked on the horizontal axis.]




Confidence Interval

One of the ways of expressing the accuracy associated with a particular sample mean is to quote a range of values which μ is likely to be within, or an interval estimate for μ . This is called a confidence interval.

A confidence interval gives us a range of values which we say will contain the true value of μ with a certain degree of confidence. So a confidence interval is a range of values which is likely to contain the true value of the population mean μ .

From our knowledge of a normal distribution together with the information that the sample means are normally distributed we can calculate a confidence interval for μ using the formula

\[ \bar{x} \pm \left( z \times \frac{\sigma}{\sqrt{n}} \right) \]

where the value of z is found from standard normal tables to satisfy a required level of confidence.

Salary example continued

Suppose we want a 95% confidence interval for the mean salary of single males in the U.K. From our sample we know that x̄ = 24,600 and s = 9876.2341.

To calculate the 95% confidence interval we need to know what value of z to use.

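We choose the value of z so that the confidence interval is centred on the middle 95% of the normal curve as shown below:

[Figure: standard normal curve centred on 0. The middle area is 95%; the remaining 100% – 95% = 5% is split into two tails of 5% / 2 = 2.5% = 0.025 each.]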

We want the value of z so that the area in each of the two tails is 0.025 (2.5%).



Using the left hand table of the normal tables on page 207 of the handbook we need to look for 0.025 in the main body of the table and then work out what z value gives us a lower tail area of 0.025. The value 0.025 is in the middle chunk of the table and corresponds to a z value of –1.96:

[Figure: standard normal curve centred on 0 with a lower tail area of 2.5% = 0.025 to the left of z = –1.96.]

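By symmetry, the upper tail area of 0.025 lies above z = +1.96, so we use z = 1.96 in the formula. The confidence interval is then

\[
\begin{aligned}
\bar{x} \pm \left( z \times \frac{\sigma}{\sqrt{n}} \right) &= 24{,}600 \pm \left( 1.96 \times \frac{9876.2341}{\sqrt{20}} \right) \\
&= 24{,}600 \pm (1.96 \times 2208.3931) \\
&= 24{,}600 \pm 4328.4504 \\
&= 20{,}271.55 \text{ up to } 28{,}928.45.
\end{aligned}
\]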

So we are 95% confident that the true value of the mean salary of single males in the U.K. is between £20,271.55 and £28,928.45.
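The confidence interval calculation can also be sketched in Python. This is an illustration rather than the method of the notes: instead of reading z from printed tables, it asks the standard library's `statistics.NormalDist` for the z value, which gives 1.95996... rather than the rounded 1.96, so the limits differ from the hand calculation by a few pence. The function name `confidence_interval` is made up for this example.

```python
from math import sqrt
from statistics import NormalDist


def confidence_interval(mean, sd, n, confidence=0.95):
    """Return (lower, upper) limits of a confidence interval for the population mean."""
    tail_area = (1 - confidence) / 2              # e.g. 0.025 in each tail for 95%
    z = NormalDist().inv_cdf(1 - tail_area)       # z value cutting off the upper tail
    margin = z * sd / sqrt(n)                     # z times the standard error
    return mean - margin, mean + margin


# Salary example: sample mean 24,600, s = 9876.2341, n = 20.
lower, upper = confidence_interval(24600, 9876.2341, 20)
print(f"95% confidence interval: {lower:.2f} up to {upper:.2f}")
```

Changing the `confidence` argument shows how the width of the interval depends on the level of confidence required.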

Now go and do Exercise S10.1 ENDSECTION STARTSECTION=activity_1.htm= SECTION~

Exercise S10.1

What would a 99% confidence interval for the salary of single males in the U.K. be?



Why don't we use a 100% confidence interval?

It is not possible to quote an interval which will contain the true value of the population mean μ with 100% certainty, as the interval is based on incomplete sample information. However, we can quote a confidence interval which contains the true value of μ with high probability. Most usually we use a 95% confidence interval. This means that if we were to take 100 samples, we would expect about 95 of them to produce values for x̄ and s which would result in confidence intervals that did contain the true value of μ, and about 5 of them to result in confidence intervals which didn’t. Therefore, when we quote a confidence interval, there is always a small chance that it does not contain the true value of μ.
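This interpretation can be explored by simulation. The sketch below, in Python, is an illustration under invented assumptions: it pretends that the population is normal with a known mean and standard deviation (values chosen arbitrarily), draws many samples of size 30, builds a 95% confidence interval from each using z = 1.96, and counts how often the interval contains the true mean.

```python
import random
from math import sqrt

random.seed(2)  # fixed seed so that the run is repeatable

MU, SIGMA = 24600, 9876      # hypothetical "true" population mean and standard deviation
N, TRIALS, Z = 30, 2000, 1.96

hits = 0
for _ in range(TRIALS):
    sample = [random.gauss(MU, SIGMA) for _ in range(N)]
    mean = sum(sample) / N
    s = sqrt(sum((x - mean) ** 2 for x in sample) / N)   # sample standard deviation
    margin = Z * s / sqrt(N)
    if mean - margin <= MU <= mean + margin:
        hits += 1

coverage = hits / TRIALS
print(f"{coverage:.1%} of the {TRIALS} intervals contain the true mean")
```

The proportion should be close to, though not exactly, 95%, partly because of random variation and partly because s is only an estimate of σ.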


Final Comment

When calculating the confidence interval we used the formula

\[ s = \sqrt{\frac{\sum x^2}{n} - \bar{x}^2} \quad \text{which is equivalent to} \quad s = \sqrt{\frac{\sum (x - \bar{x})^2}{n}} \]

to calculate the standard deviation for the sample.

If you refer back to the unit on Numerical Summaries of Data 2 I indicated that some people use a different definition for the standard deviation. Instead of dividing by n they divide by n – 1 to give

\[ s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}. \]

When calculating confidence intervals you should use the formula for the standard deviation where you divide by n – 1. This then gives a better estimate of the unknown population standard deviation σ . However, for this module, if you are asked to calculate a standard deviation for a sample you can use the method covered in the Numerical Summaries of Data 2 unit.
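The two definitions of the standard deviation can be compared on the salary data. In the Python sketch below (added for illustration), `statistics.pstdev` divides by n and `statistics.stdev` divides by n – 1; both are standard library functions.

```python
import statistics

# Salaries (in £) of the 20 single men in the sample.
salaries = [12000, 26000, 35000, 24500, 18000, 15500, 28500, 18000, 54000, 43000,
            17500, 22000, 21500, 26000, 16000, 16500, 24500, 27500, 29000, 17000]

s_divide_by_n = statistics.pstdev(salaries)          # divides by n
s_divide_by_n_minus_1 = statistics.stdev(salaries)   # divides by n - 1

print(f"dividing by n:     s = {s_divide_by_n:.4f}")
print(f"dividing by n - 1: s = {s_divide_by_n_minus_1:.4f}")
```

Dividing by n – 1 always gives the slightly larger value, and the difference shrinks as the sample size grows.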




Seminar Questions


A random sample of 100 records of a mail order company for March of this year revealed that the values of individual orders had a mean (x̄) of £65 and a standard deviation (s) of £4.

i) What is the standard error of the mean?

ii) Calculate a 95% confidence interval of the mean value of orders that the company received in March.

iii) Suppose you wanted a 90% confidence interval, what would the value of z be in the formula for the confidence interval?

iv) Calculate the 90% confidence interval of the mean value of orders that the company received in March.

v) In February, the mean value of orders was £68. Does there seem to have been a change from February to March? Explain your answer.



Fast food service companies try to devise wage plans that provide incentive and produce salaries for their managers that are competitive with corresponding positions in competing companies. A random sample of 12 unit managers for one company shows that they earn an average salary of £36,750 with a standard deviation of £3,100.

i) Calculate a 95% confidence interval for the mean salary of the company’s managers.

ii) Do the data suggest that the mean salary earned by the company’s unit managers differs from £38,500, which is the mean salary paid by a competitor firm? Explain your answer.





A random sample of 80 observations had x̄ = 14.1 and s = 2.6.

i) Find a 95% confidence interval for μ .

ii) Find a 99% confidence interval for μ .

iii) Which of these 2 confidence intervals is wider? What does this show about how the level of required accuracy (or confidence) affects the resulting confidence interval?

iv) Suppose the sample contained only 32 observations but x̄ = 14.1 and s = 2.6 were the same. Recalculate the 95% confidence interval with the smaller sample size.

v) What does the calculation in part (iv) show you about how the size of a sample affects the accuracy of the estimates?



If you have understood the content of this unit you should have knowledge of, and be able to answer any questions relating to, the following points:

• you should know that sample results are used to estimate unknown population quantities: typically the sample mean (x̄) is used to estimate the population mean (μ);

• you should be able to discuss why sample estimates are not precise estimates and therefore why we need a measure of how accurate the estimates are;

• you should understand what is meant by a sampling distribution;

• you should know that, because of the Central Limit Theorem, the sampling distribution of the mean x̄ is Normal with mean μ and standard deviation σ/√n;

• you should be able to calculate and interpret a confidence interval for an unknown population mean μ.



Extra Activities







MATHEMATICS SECTION

Original author: Alison Megeney Revisions by: Thomas Bending

Alison Megeney

M1 Financial Mathematics 1 Page 157


CHAPTER=FM1


Financial Mathematics 1 Context

This unit introduces financial mathematics. In it we consider how investments and commodities gain or lose value over time. We will examine different forms of interest and investigate how these affect the worth of an investment over time. This will enable us to determine what return we can expect on an investment, or see how a commodity loses value. In the next unit we will build on these ideas to enable us to determine whether or not the returns of an investment plan make investing in it worthwhile. We come across examples of financial mathematics every day, whether it is the interest we gain on our savings, the interest we pay on a credit card or student loan, or the depreciation (loss in value) of our car. The financial mathematics you will consider in this unit will help you to interpret financial information from a variety of sources that you encounter in everyday life, and make decisions based on your evaluation of the financial small print. So if you are wondering whether your credit card is offering you the most competitive interest rate, or whether you would be better off moving your savings to another bank or building society, then this unit can inform your decision. ENDSECTION STARTSECTION=scope_2.htm= SECTION~

Objectives


• understand the concept of simple and compound interest and be able to perform calculations;

• be aware of the differences between nominal and effective rates of interest and be able to calculate the effective rate (APR);

• understand the concept of straight line depreciation and be able to perform calculations.




Interest and Depreciation

What is interest? It is the amount earned on an investment, or the amount paid on a loan.

Credit card example:

The interest is the amount of extra money added to your credit card bill every month.

What is depreciation?

This is a loss of value of an investment over time.

Value of a car example:

Your car loses value over time. Different cars lose different proportions of their current value each year. Try looking on the web to see how a car of your choice loses value over time.


Useful Definitions

P = Principal amount.

This is the amount initially considered, i.e. the amount of money invested.

For example this is the amount you open your bank account with.

A = Accrued Amount.

This is the updated value of the principal amount after some fixed time.

e.g. This is your current bank balance. So the original amount plus the extra money earned.

i = Interest rate. The proportionate amount of money to be added to the principal amount.

e.g. This is usually a fixed percentage, say 5%=0.05.

n = The number of time periods.

This is the length of time (years, months, days) over which the money has been borrowed or invested.

e.g. This is the total number of times the interest is added on to your account every year.


Simple Interest Investment

The amount of interest added to your investment is calculated based on your principal amount, and not on the accrued amount (current value of your investment).



Simple Interest example

Suppose you have £200 which you invest at a simple interest rate of 10% per year. How much will have accrued after 3 years?

Year   Amount on which interest is calculated   Interest                         Amount accrued
1      £200                                     10% of £200 = 0.1 × 200 = £20    £200 + £20 = £220
2      £200                                     10% of £200 = 0.1 × 200 = £20    £220 + £20 = £240
3      £200                                     10% of £200 = 0.1 × 200 = £20    £240 + £20 = £260

So the amount accrued after 3 years is £260. This can be written in notation as A_3 = £260. ENDSECTION STARTSECTION=content_4.htm= SECTION~

Simple Interest Formula

Calculating the accrued amount after n time periods step by step can be a long process, so you are advised to use the formula

\[ A_n = P \times [1 + (i \times n)] \]

This formula will be provided in the exam and does not need to be memorised.

Simple interest formula example

Suppose we have £200 invested at a simple interest rate of 10% per annum. How much will have accrued after 10 years? We will use

\[ A_n = P \times [1 + (i \times n)] \]

So P = 200, i = 10% = 0.1, and n = 10. This gives

\[ A_{10} = 200 \times [1 + (0.1 \times 10)] = 200 \times [1 + 1] = 200 \times 2 = 400 \]
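The simple interest formula translates directly into a short function. The following Python sketch is for illustration only; the function name `simple_interest_amount` is made up, and the arguments correspond to P, i and n in the formula.

```python
def simple_interest_amount(principal, rate, periods):
    """Accrued amount under simple interest: P x [1 + (i x n)]."""
    return principal * (1 + rate * periods)


# £200 invested at a simple interest rate of 10% per year.
for years in (3, 10):
    amount = simple_interest_amount(200, 0.10, years)
    print(f"amount accrued after {years} years: £{amount:.2f}")
```

Because the interest is always calculated on the principal, the accrued amount grows by the same £20 every year.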

Not sure about how to use brackets?

See Improve your maths by Bancroft and Fletcher for helpful tips. Now go and do Exercise M1.1


Exercise M1.1 - Simple Interest Activity



Suppose you have £500 which you invest at a simple interest rate of 25% per year.

How much will have accrued after 4 years?



1

£ 25% of £ =

2

£

3

£

4

£

What is the amount accrued after 4 years?

Write the amount accrued after 4 years in notation

How much has accrued after 10 years?




Compound interest investment

The amount of interest added to your investment is calculated based on your accrued amount (current value of your investment), and not on the principal amount.

Compound Interest example

Suppose you have £200 which you invest at a compound interest rate of 10% per year. How much will have accrued after 3 years?



Year   Amount on which interest is calculated   Interest                            Amount accrued
1      £200                                     10% of £200 = 0.1 × 200 = £20       £200 + £20 = £220
2      £220                                     10% of £220 = 0.1 × 220 = £22       £220 + £22 = £242
3      £242                                     10% of £242 = 0.1 × 242 = £24.20    £242 + £24.20 = £266.20

So the amount accrued after 3 years is £266.20. This can be written in notation as A_3 = £266.20.




Compound interest formula

It is advised that you use the formula

\[ A_n = P \times [1 + i]^n \]

This formula will be provided in the exam and does not need to be memorised.

Compound interest formula example

Suppose we have £200 invested at a compound interest rate of 10% per annum. How much will have accrued after 10 years? We will use

\[ A_n = P \times [1 + i]^n \]

So P = 200, i = 10% = 0.1, and n = 10. This gives

\[ A_{10} = 200 \times [1 + 0.1]^{10} = 200 \times [1.1]^{10} = 200 \times 2.5937425 = 518.75 \text{ (2dp)} \]
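The compound interest formula can be sketched in the same way. In the illustrative Python below, the function name `compound_interest_amount` is made up, and the loop repeats the year-by-year calculation so that it can be compared with the formula.

```python
def compound_interest_amount(principal, rate, periods):
    """Accrued amount under compound interest: P x [1 + i]^n."""
    return principal * (1 + rate) ** periods


# Year-by-year calculation for £200 at 10% per year, compared with the formula.
amount = 200.0
for year in range(1, 4):
    interest = rate_interest = 0.10 * amount   # interest is based on the accrued amount
    amount = amount + interest
    formula = compound_interest_amount(200, 0.10, year)
    print(f"year {year}: interest £{interest:.2f}, accrued £{amount:.2f}, formula £{formula:.2f}")

print(f"after 10 years: £{compound_interest_amount(200, 0.10, 10):.2f}")
```

Unlike simple interest, the interest added grows each year because it is calculated on the accrued amount.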

Note:

On your calculator you should have one of the following function buttons: x^y, y^x, a^b, b^a or ^. To evaluate 5^6 follow these steps on your calculator:

• press 5

• press the x^y button

• press 6 and then =.

This produces an answer of 15625.


Need more help with powers (exponents)?

See Essential Quantitative Methods for Business, Management and Finance, third edition, Les Oakshott, chapter 1 for further examples.

Now go and do Exercise M1.2


Exercise M1.2 - Calculator Activity

Compute the following;

4^6 =

(1.2)^3 =

(1 + 0.1)^4 =

Now go and do Exercise M1.3




Exercise M1.3 - Compound Interest Activity

Suppose you have £500 which you invest at a compound interest rate of 25% per year.




1

£

2

£

3

£

4

£

What is the amount accrued after 4 years?

Write this in notation.

How much has accrued after 10 years? ENDSECTION STARTSECTION=content_7.htm= SECTION~

Nominal and Effective Interest Rates

Nominal rate

Rates of interest are often expressed or quoted as figures per year (per annum) even though they may be compounded (or act) over time periods less than a year. The given annual rate is known as the Nominal Annual rate.

Nominal rate examples

Credit cards

A store card advertisement states that it has a nominal annual rate of 24% that is compounded (acts) monthly. So it has a compound interest rate of 2% per month.

Is 24% the true percentage rate that acts?



Finance package

A finance company pays for your goods and then charges you a nominal annual rate of interest on the amount you have borrowed.



Effective Rate (or Actual Percentage Rate)

This is the rate of interest that actually acts on your investment, or on the amount that you have borrowed, over a year. It is greater than the quoted nominal rate whenever interest is compounded more than once a year.

APR example

Suppose you buy a cooker which costs £200. The manager offers you a finance package which allows you to pay in one year's time, but the cost of the cooker will be increased by a nominal annual interest rate of 20% compounded quarterly. How much will you be paying for the cooker in 1 year?

The interest rate which acts on the price of the cooker is 5% per quarter (20% annual ÷ 4 quarters = 5% per quarter).



Number of times interest has been added | Amount on which interest is calculated | Interest | Amount accrued
1 | £200 | 5% of £200 = 0.05 × 200 = £10 | £200 + £10 = £210
2 | £210 | 5% of £210 = 0.05 × 210 = £10.50 | £210 + £10.50 = £220.50
3 | £220.50 | 5% of £220.50 = 0.05 × 220.50 = £11.025 | £220.50 + £11.025 = £231.525
4 | £231.525 | 5% of £231.525 = 0.05 × 231.525 = £11.57625 | £231.525 + £11.57625 = £243.10125 = £243.10 (2dp)

So we will pay an extra £43.10 for the cooker if we accept the finance package offered. If we express this increase as a percentage of the original price we will obtain the effective interest rate or Actual Percentage Rate. So,

Effective Rate (or APR) = $\left(\frac{43.10}{200}\right) \times 100\% = 21.55\%$.


Annual Percentage Rate interest formula

It is advised that you use the formula

$A.P.R = \left[1 + \frac{\text{nominal annual rate}}{n}\right]^n - 1 = \left[1 + \frac{i}{n}\right]^n - 1$

where n is the number of compoundings in one year.



APR Formula example

Suppose a credit card charges a nominal annual rate of 24% (0.24), which is compounded (acts) monthly, so n = 12. What is the Actual Percentage Rate?

$A.P.R = \left[1 + \frac{\text{nominal annual rate}}{n}\right]^n - 1 = \left[1 + \frac{0.24}{12}\right]^{12} - 1 = (1 + 0.02)^{12} - 1 = (1.02)^{12} - 1 = 1.2682418 - 1 = 0.2682418 = 26.82\%$ to 2dp.
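As a quick check, the APR formula can be sketched in Python. This is an illustrative sketch only; the function name `effective_rate` is an assumption of ours and does not come from the notes.

```python
def effective_rate(nominal_rate, n):
    """Effective rate (APR) = (1 + nominal_rate / n)**n - 1.

    nominal_rate is the nominal annual rate as a decimal,
    n is the number of compoundings in one year.
    """
    return (1 + nominal_rate / n) ** n - 1


# Credit card example: 24% nominal, compounded monthly.
print(round(effective_rate(0.24, 12) * 100, 2))
# Cooker example: 20% nominal, compounded quarterly.
print(round(effective_rate(0.20, 4) * 100, 2))
```

The second call agrees with the cooker table, where the extra £43.10 on £200 was found period by period.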



Exercise M1.4 - APR Activity 1

Suppose a credit card charges a nominal annual rate of 24%

What is the Actual Percentage Rate if the interest is compounded in the following ways?

Weekly

Daily

Hourly

Now go and do Exercise M1.5


Exercise M1.5 - APR Activity 2

Suppose a finance package has a nominal annual rate of 10% which is compounded (acts) quarterly. What is the Actual Percentage Rate?

$A.P.R = \left[1 + \frac{\text{nominal annual rate}}{n}\right]^n - 1$ =




Depreciation

This is the loss in value of an item over time.


Straight Line Depreciation

This is the opposite of simple interest. A fixed proportion of the initial value is calculated once, and that fixed amount of money is then deducted from the value every year (or month or day).

Straight Line Depreciation example

Suppose you have £200 whose value is depreciating at a rate of 10% per year. What value will be retained after 3 years using straight line depreciation?

Year | Amount on which deduction is calculated | Deduction | Amount retained
1 | £200 | 10% of £200 = 0.1 × 200 = £20 | £200 – £20 = £180
2 | £200 | 10% of £200 = 0.1 × 200 = £20 | £180 – £20 = £160
3 | £200 | 10% of £200 = 0.1 × 200 = £20 | £160 – £20 = £140

So the amount retained after 3 years is £140. This can be written in notation as $D_3$ = £140.




Straight Line Depreciation Formula

Calculating the amount retained after n time periods year by year can take a long time, so it is advised that you use the formula

$D_n = P \times [1 - (i \times n)]$

This formula will be provided in the exam and does not need to be memorised.
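The straight line depreciation formula can be sketched in Python as follows. This is a minimal illustration; the function name `straight_line_value` is hypothetical and not taken from the notes.

```python
def straight_line_value(initial_value, rate, periods):
    """Retained value D_n = P * (1 - i * n) under straight line depreciation."""
    return initial_value * (1 - rate * periods)


# £200 depreciating at 10% per year: value retained after 3 years.
print(straight_line_value(200, 0.1, 3))
```

Note that with a 10% rate the formula reaches zero after 10 periods, since the same fixed amount is deducted each time.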

Straight Line Depreciation formula example

Suppose we have £200 (straight line) depreciating at a rate of 10% per annum. How much will be retained after 10 years?

We will use the formula $D_n = P \times [1 - (i \times n)]$ and let P = 200, i = 10% = 0.1 and n = 10. This gives

$D_{10} = 200 \times [1 - (0.1 \times 10)] = 200 \times [1 - 1] = 200 \times 0 = 0$

Now go and do Exercise M1.6


Exercise M1.6 - Straight Line Depreciation Activity

Suppose you have a watch worth £500 whose value (straight line) depreciates at a rate of 15% per year.

What is the watch’s retained value after 4 years?


Deduction Amount retained

1

£

2

£

3

£

4

£

Write the retained value after 4 years in notation

How much of the watch’s value is retained after 7 years?


Reduced Balance Depreciation (RBD)

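This is the opposite of compound interest. In each time period the deduction is calculated as a fixed percentage of the current (already reduced) value, so the amount deducted gets smaller each period.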



RBD example

Suppose that a car's value is depreciating by 15% per annum. If it initially cost £1000, what is its value after 3 years?

Year | Amount on which deduction is calculated | Deduction | Amount retained
1 | £1000 | 15% of £1000 = 0.15 × 1000 = £150 | £1000 – £150 = £850
2 | £850 | 15% of £850 = 0.15 × 850 = £127.50 | £850 – £127.50 = £722.50
3 | £722.50 | 15% of £722.50 = 0.15 × 722.50 = £108.375 | £722.50 – £108.375 = £614.125

So the depreciated value of the car after 3 years is £614.125. ENDSECTION STARTSECTION=content_14.htm= SECTION~

Reduced Balance Depreciation Formula

It is advised that you use the formula

$D_n = P \times [1 - i]^n$

where $D_n$ is the depreciated value of the item after n time periods. This formula will be provided in the exam and does not need to be memorised.
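The reduced balance formula can be sketched in Python in the same way. Again this is only an illustration, and the function name `reduced_balance_value` is our own.

```python
def reduced_balance_value(initial_value, rate, periods):
    """Retained value D_n = P * (1 - i)**n under reduced balance depreciation."""
    return initial_value * (1 - rate) ** periods


# Car example from the notes: £1000 depreciating at 15% per annum for 3 years.
value = reduced_balance_value(1000, 0.15, 3)
print(f"{value:.3f}")
```

Unlike straight line depreciation, the value here never reaches zero, because each deduction is a percentage of what is left.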



RBD Formula example

If £1000 depreciates at a rate of 15% per annum, then after 3 years we have a retained value of

$D_3 = 1000 \times [1 - 0.15]^3 = 1000 \times [0.85]^3 = 1000 \times 0.614125 = 614.125$, that is £614.13 to 2 decimal places.

Now go and do Exercise M1.7


Exercise M1.7 - RBD Activity

Suppose that an investment’s value is depreciating by 20% per annum.

If its initial value is £2500, what is its value after 3 years?



1

£2500 20% of £2500 =

2

3

What is the depreciated value of the investment after 10 years?




Seminar Questions

Seminar Question M1.1 - Simple Interest

Suppose you have £500 which you invest at a simple interest rate of 20% per year.




1

£ 20% of £ =

2

£

3

£

So the amount accrued after 4 years is £ . Write this in notation.

Use the simple interest formula to calculate how much has accrued after 10 years.




Seminar Question M1.2 - Compound Interest Question

Suppose you have £500 which you invest at a compound interest rate of 20% per year.




1

£ 20% of £ =

2

£

3

£

So the amount accrued after 3 years is £ .

Write this in notation.





Seminar Question M1.3 - APR Question

Suppose a credit card charges a nominal annual rate of 20%

What is the Actual Percentage Rate if the interest is compounded in the following ways?

Quarterly, n = 4

$A.P.R = \left[1 + \frac{\text{nominal annual rate}}{n}\right]^n - 1$ =

Monthly, n = 12

$A.P.R = \left[1 + \frac{\text{nominal annual rate}}{n}\right]^n - 1$ =

Daily

Hourly

Comment on your results.
ENDSECTION STARTSECTION=activity_11.htm= SECTION~

Seminar Question M1.4 - Reduced Balance Depreciation Question

Suppose that an investment with an initial value of £500 is depreciating (by reduced balance) by 10% per annum.






1

£ 10% of £500 =

2

3

What is the depreciated value of the investment after 10 years?


Seminar Question M1.5 - Straight Line Depreciation Question

Suppose that an investment with an initial value of £500 is depreciating (straight line) by 10% per annum.




1

£ 10% of £500 =

2

3

What is the depreciated value of the investment after 10 years? ENDSECTION STARTSECTION=think.htm= SECTION~





If you have understood the content of this unit you should have knowledge of, and be able to answer questions relating to, all of the following points:

• examples of where commodities either gain interest or depreciate in value;
• the differences between compound and simple interest;
• how to calculate accrued amounts using compound and simple interest environments;
• the differences between nominal and effective rates of interest;
• how to calculate an effective rate (or Actual Percentage Rate, APR) from a given nominal annual rate;
• the differences between straight line and reduced balance depreciation;
• how to calculate the depreciated value of an item using straight line or reduced balance depreciation.

Extra Activities

Log on to the STX1110 Oasis page and attempt the quizzes in the extra section. A guide on how to find the STX1110 Oasis quizzes is on page 16 of this book.






CHAPTER=FM2


Financial Maths 2 Context

In the last unit we examined how investments and commodities gained or lost value over time, using simple and compound interest environments and reduced balance and straight line depreciation. In this unit we will introduce a variety of techniques to enable us to determine whether or not the return of an investment plan makes investing in it worthwhile. The techniques introduced in this unit will enable us to evaluate how much money to invest now to get an expected financial return in the future. Perhaps we need this return to cover the cost of a planned luxury holiday in a few years' time, or to cover the instalments of a finance agreement. In addition, these techniques will enable us to choose between investment plans based on their projected returns. ENDSECTION STARTSECTION=scope_2.htm= SECTION~

Objectives

Having worked through this unit, you should:

• understand the concept of present values and be able to perform calculations for a single value and for a set of instalments;
• understand the concept of net present values and be able to perform calculations;
• be able to determine whether or not an investment is worthwhile;
• be able to compare two or more capital investments and determine which, if any, are worthwhile.




Present Values and Investment Appraisals

Suppose we have invested £100 at a compound interest rate of 10% per annum. In one year we will have £100 × (1 + 0.1) = £110.

We could pose a different question and ask how much we need to invest now to get a return of £110 in a year’s time. From the calculation above we see that we must invest £100 now so that, with the accrued interest, our investment will have increased to £110 in a year’s time. This present value tells us how much the future return is worth to us now, in the present. So the present value, P, of a return of £110 in 1 year at an interest rate of 10% is

$P = \frac{110}{1 + 0.1} = 100$, that is, P = £100.

Definitions

P = Present value.

This is the amount of money needed to invest now to receive a return of £A in n time periods. This is our initial investment.

A = Amount payable in n time periods.

This is the return on our investment, that is, our initial investment plus the interest earned on it.

i = Investment rate (discount rate).

This is the interest rate that acts on our investment. It is usually quoted as a percentage, say 10%, but should always be expressed as a decimal when used in calculations.

n = The number of time periods.

This is the length of time that the interest has to act on our investment. So it is the number of times that interest is added to our initial investment.




Present Value Formula

The present value of £A at an interest rate i for n time periods is:

$P = \frac{A}{(1 + i)^n}$

This is equivalent to

$P = A \times \frac{1}{(1 + i)^n} = A \times \text{Discount factor}$

This version of the formula is often used by accountants. They use tables of calculated discount factors for particular values of i.

Present Value example

Suppose we need a return of £1200 in 4 years' time to pay for a planned holiday. If you have a savings account with a compound interest rate of 8% per annum, how much money would you need to invest now so that over the 4 year period it would grow to £1200 and cover the cost of the holiday?

What is the present value of a return of £1200 invested at 8% per annum for 4 years? Let us use the present value formula

$P = \frac{A}{(1 + i)^n}$

where

A = Amount payable in n time periods = £1200
i = Investment rate = 8% = 0.08
n = The number of time periods = 4



Substituting in our values we obtain

$P = \frac{A}{(1 + i)^n} = \frac{1200}{(1 + 0.08)^4} = \frac{1200}{(1.08)^4} = 1200 \div 1.360489 = 882.04$

So this tells us you need to invest £882.04 now at 8% per annum to obtain a return of £1200 on your initial investment in 4 years' time.

Now go and do Exercise M2.1
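The present value calculation in the holiday example can be sketched in Python. This is an illustrative sketch only; the function name `present_value` is our own choice.

```python
def present_value(amount, rate, periods):
    """Present value P = A / (1 + i)**n of a return `amount` due in `periods` periods."""
    return amount / (1 + rate) ** periods


# Holiday example from the notes: £1200 needed in 4 years at 8% per annum.
print(round(present_value(1200, 0.08, 4), 2))
```

Multiplying by the discount factor `1 / (1 + rate) ** periods` instead of dividing gives the same result, which is the accountants' version of the formula.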


Exercise M2.1 - Present Value Activity

Calculate the present value of a return of £2500, after being invested at 15% per annum for 5 years. Use the present value formula below.

$P = \frac{A}{(1 + i)^n}$

where

A =
i =
n =

so, $P = \frac{A}{(1 + i)^n}$ =

So this tells us you need to invest £ now at 15% per annum to obtain a return of £2500 in 5 years' time.


The Present Value of a Set of Instalments

You are interested in buying a home cinema system. It costs a total of £1600. The salesperson offers you the opportunity to spread the cost of the purchase over 3 years. They offer you a deal where you pay £1000 now, followed by 3 subsequent annual payments of £200 at the end of each year.



If you have a savings account with an interest rate of 8%, how much is the home cinema system worth to you now?

How much money do you need to invest now to be able to cover the cost of the instalments as they become due?

What is the present value of the goods?

Present Value of a set of instalments

Time the Payment is Due | Payment Due | Number of time periods (n years) | Present value
Now | £1000 | n = 0 | £1000
End of first year | £200 | n = 1 | $\frac{200}{(1 + 0.08)^1}$ = £185.19
End of second year | £200 | n = 2 | $\frac{200}{(1 + 0.08)^2}$ = £171.47
End of third year | £200 | n = 3 | $\frac{200}{(1 + 0.08)^3}$ = £158.77

Present value of goods = £1000 + £185.19 + £171.47 + £158.77 = £1515.43

Amount paid = £1600

So £1515.43 invested now would pay off the instalments as they became due.

Now go and do Exercise M2.2
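The instalment table above can be reproduced with a short Python sketch. The function names are illustrative assumptions, and each present value is rounded to the nearest penny before adding, as the table does.

```python
def present_value(amount, rate, periods):
    """Present value P = A / (1 + i)**n."""
    return amount / (1 + rate) ** periods


def instalments_present_value(payments, rate):
    """payments[n] is the payment due after n years (n = 0 means now).

    Each term is rounded to pennies before summing, mirroring the table.
    """
    return sum(round(present_value(p, rate, n), 2) for n, p in enumerate(payments))


# Home cinema example: £1000 now, then £200 at the end of each of 3 years, at 8%.
total = instalments_present_value([1000, 200, 200, 200], 0.08)
print(round(total, 2))
```

Because the present value is less than the £1600 actually paid, spreading the payments is worth something to the buyer.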


Exercise M2.2 - Present Value of a Set of Instalments Activity

Suppose that a stereo costs £1800 over a 3 year period. The payment consists of an initial down payment of £1500 and then a single payment of £100 every year for the next 3 years.

If you have an interest rate of 10% what is the present value of goods?

Time Payment Due | Payment Due | Number of time periods (n years) | Present value
Now | £1500 | n = 0 | £1500
End of first year | £100 | n = 1 | $\frac{100}{(1 + 0.10)^1}$ = £
End of second year | £100 | n = 2 | $\frac{100}{(1 + 0.10)^2}$ = £
End of third year | £100 | n = 3 | £
Total =

Present value of goods =

Amount paid = £1800.

So £ invested now would pay off the instalments as they became due. ENDSECTION STARTSECTION=content_4.htm= SECTION~



Capital Investments

A capital investment is a project consisting of an initial cash outlay, followed by estimated inflows and outflows of cash over the life of the project. Discounted cash flow (DCF) techniques can be used to evaluate capital expenditure projects.

Comparing Investments

Two DCF methods are:

1. Net Present Value method (NPV)
2. Internal Rate of Return (IRR)

See Essential Quantitative Methods for Business, Management and Finance, third edition, Les Oakshott, chapter 3, for details and examples of IRR. ENDSECTION STARTSECTION=content_5.htm= SECTION~

Net Present Value (NPV)

To evaluate a project, calculate the present values of all the cash flows associated with it and add them together. The total is the Net Present Value.

Net Present Value example

Suppose a project has the cash flows shown below. We first calculate the net cash flow for each year, then use our discount rate to calculate the corresponding present values, and finally add these together to obtain the NPV. Suppose that the interest rate is 18.5%.

Year | Cash Inflow | Cash Outflow
0 | £0 | £12000
1 | £8000 | £8500
2 | £12000 | £3000
3 | £10000 | £1500
4 | £6500 | £1500

Year | Cash Inflow | Cash Outflow | Net Cash Flow = Inflow - Outflow
0 | £0 | £12000 | -£12000
1 | £8000 | £8500 | -£500
2 | £12000 | £3000 | £9000
3 | £10000 | £1500 | £8500
4 | £6500 | £1500 | £5000

Year | Net Cash Flow | Discount factor | Present Value = NCF × Df
0 | -£12000 | $\frac{1}{(1 + 0.185)^0} = 1$ | -£12000
1 | -£500 | $\frac{1}{(1 + 0.185)^1} = 0.8439$ | -£500 × 0.8439 = -£421.95
2 | £9000 | $\frac{1}{(1 + 0.185)^2} = 0.7121$ | £9000 × 0.7121 = £6408.90
3 | £8500 | $\frac{1}{(1 + 0.185)^3} = 0.6010$ | £8500 × 0.6010 = £5108.50
4 | £5000 | $\frac{1}{(1 + 0.185)^4} = 0.5071$ | £5000 × 0.5071 = £2535.50
Total = £1630.95

So the Net Present Value is -£12000 - £421.95 + £6408.90 + £5108.50 + £2535.50 = £1630.95.

We say that the project makes a profit if NPV > 0.
We say that the project breaks even if NPV = 0.
We say that the project makes a loss if NPV < 0.
ENDSECTION STARTSECTION=content_6.htm= SECTION~

Internal Rate of Return (IRR)

This method of DCF determines the rate of interest, i, at which the NPV=0.

See Essential Quantitative Methods for Business, Management and Finance, third edition, Les Oakshott, chapter 3, for further details and examples.
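The notes refer to the textbook for the IRR method. Purely as an illustration, the sketch below finds a rate at which NPV = 0 by bisection, using the net cash flows from the 18.5% example. Bisection is our own choice of numerical method, not necessarily the one used in the textbook, and the function names and the search interval of 0% to 100% are assumptions.

```python
def npv(rate, net_cash_flows):
    """Sum of net cash flows discounted by 1 / (1 + rate)**year (year 0 first)."""
    return sum(cf / (1 + rate) ** year for year, cf in enumerate(net_cash_flows))


def irr_bisection(net_cash_flows, low=0.0, high=1.0, tol=1e-8):
    """Find a rate in [low, high] where NPV = 0, assuming NPV changes sign there."""
    if npv(low, net_cash_flows) * npv(high, net_cash_flows) > 0:
        raise ValueError("NPV does not change sign on the interval")
    while high - low > tol:
        mid = (low + high) / 2
        # Keep the half-interval on which the NPV still changes sign.
        if npv(low, net_cash_flows) * npv(mid, net_cash_flows) <= 0:
            high = mid
        else:
            low = mid
    return (low + high) / 2


flows = [-12000, -500, 9000, 8500, 5000]
rate = irr_bisection(flows)
print(f"IRR is approximately {rate * 100:.2f}%")
```

Since the project's NPV is positive at 18.5%, the rate found here is higher than 18.5%.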

Comparison of projects



Projects can be compared using NPV and IRR techniques. Usually the situation is more complicated than simply comparing the results for the two projects. For example, one project may involve a larger initial outlay, and this should be taken into consideration.

Net Present Value example

Suppose there are two possible projects A and B. Both projects generate a total of £10000 over 4 years and both cost an initial payment of £6000, which is a single cash outflow. Their estimated inflows are detailed below.

Year Project A Project B

1 £4000 £1000

2 £3000 £2000

3 £2000 £3000

4 £1000 £4000

If the money has no time value the projects are identical.

If the interest rate is 10 % (discount rate is 10 %) which project is best?



Evaluation of Project A using the Net Present Value method

Year | Cash Inflow £ | Cash Outflow £ | Net cash flow £ | Discount factor (4dp) | Present Value £
0 | 0 | 6000 | –6000 | 1 | –6000
1 | 4000 | 0 | 4000 | $\frac{1}{1.1^1} = 0.9091$ | 3636.40
2 | 3000 | 0 | 3000 | $\frac{1}{1.1^2} = 0.8264$ | 2479.20
3 | 2000 | 0 | 2000 | $\frac{1}{1.1^3} = 0.7513$ | 1502.60
4 | 1000 | 0 | 1000 | $\frac{1}{1.1^4} = 0.6830$ | 683.00

Net present value of project A = Total of the present value column = £(–6000 + 3636.40 + 2479.20 + 1502.60 + 683.00) = £2301.20

Project A’s NPV is positive and hence it makes a profit.

Evaluation of Project B using the Net Present Value method

Year | Cash Inflow £ | Cash Outflow £ | Net cash flow £ | Discount factor | Present Value £
0 | 0 | 6000 | –6000 | 1 | –6000
1 | 1000 | 0 | 1000 | $\frac{1}{1.1^1} = 0.9091$ | 909.09
2 | 2000 | 0 | 2000 | $\frac{1}{1.1^2} = 0.8264$ | 1652.89
3 | 3000 | 0 | 3000 | $\frac{1}{1.1^3} = 0.7513$ | 2253.94
4 | 4000 | 0 | 4000 | $\frac{1}{1.1^4} = 0.6830$ | 2732.05

Net present value of project B = £(–6000 + 909.09 + 1652.89 + 2253.94 + 2732.05) = £1547.97

Project B’s NPV is positive and hence it makes a profit.

Which project makes the biggest profit? Which would you invest in?
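The two evaluations can be checked with a short Python sketch. This is an illustration only; the function name `npv` is our own, and because the code does not round the discount factors its totals may differ from the tables by a few pence.

```python
def npv(rate, net_cash_flows):
    """Sum of net cash flows discounted by 1 / (1 + rate)**year (year 0 first)."""
    return sum(cf / (1 + rate) ** year for year, cf in enumerate(net_cash_flows))


# Net cash flows for years 0 to 4: a £6000 outlay followed by the yearly inflows.
project_a = [-6000, 4000, 3000, 2000, 1000]
project_b = [-6000, 1000, 2000, 3000, 4000]

npv_a = npv(0.10, project_a)
npv_b = npv(0.10, project_b)
print(f"NPV A = {npv_a:.2f}, NPV B = {npv_b:.2f}")
```

Setting the rate to 0 gives the same NPV for both projects, which is the sense in which they are identical when money has no time value.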




Seminar Questions

Seminar Question M2.1 - Present Value Question

Calculate the present values of the following:-

a) A return of £1000 payable in 5 years' time with an interest rate of 10%.

b) A return of £5000 payable in 18 months' time if the nominal annual rate is 24% and it is compounded monthly.

c) An initial payment of £500 now, and £100 payable at the end of each year for the next three years. Suppose that you have an interest rate of 10%. What is the present value of the goods?

Amount Number of time periods ( n years)

Time Payment Due Present value

£500 n = 0 Now £

£100 n = 1 End of first year £

£100 n = 2 End of second year £

£100 n = 3 End of third year £

Total =

Present value of goods =

Amount paid =

So £ invested now would pay off the instalments as they became due.




Seminar Question M2.2 - Net Present Value Question

Suppose we have the following cash flow with an associated project, A, and that the discount rate is 15.5%. Calculate the NPV of project A.

Year Cash Inflow Cash Outflow Net Cash Flow = Inflow - Outflow

0 £ 0 £10000 £

1 £ 8000 £ 1000 £

2 £ 8000 £ 1000 £

3 £ 9000 £ 2000 £

4 £ 6500 £ 3000 £

Year | Net Cash Flow | Discount factor $\frac{1}{(1 + i)^n}$ | Present Value = NCF × Df

0

1 £

1

£

2

£

3

£

4

£

Net Present Value of Project A =

If another project has an NPV of £2437 which of the projects would you prefer?





If you have understood the content of this unit you should have knowledge of, and be able to answer questions relating to, all of the following points:

• how to calculate a present value of a single return or for a set of instalments;
• the concept of a present value or worth of an investment;
• what makes up a capital investment;
• methods of comparing capital investments;
• how to evaluate the Net Present Value (NPV) of an investment plan;
• the relationship between the NPV of an investment and profit and loss.

Extra Activities






M3 Index Numbers Page 191


CHAPTER=INDEXNO


Index Numbers Context

In this unit we will introduce index numbers as a method of standardising economic commodities in order to make sensible comparisons. The commodities we wish to compare may be measured in different ways and units. For example, production figures may be measured in different ways for different types of industries. Hence we must standardise values in order to make comparisons between them. The methods we encounter could also help us identify statistical trends. A company may want to investigate the growth in production, or the loss in value of its stock. It would be possible to identify the rate of growth in production per month using a series of index numbers calculated over a period of time. You may have heard the term index before, and know of examples of indices, for example the FTSE 100 index, the retail price index, the output of production index, and index linked investments. In the next unit we will build on the ideas and methods in this unit to enable us to compare economic commodities made up of components.


Objectives

Having worked through this unit you should be able to:

• investigate how an index number reflects a change in prices or quantity;
• compute index numbers;
• interpret index numbers;
• determine which index is most suitable in a given situation and justify your choice;
• discuss their limitations.




Index Numbers

What are index numbers?

What do index numbers enable us to do?

Index numbers are a method of standardising economic commodities, for example the prices of products or wages, so that they can be compared over time. The index number (index relative) measures the percentage change in some economic commodity over time.

What is an Economic commodity?

• Prices
• Wages
• Production figures

Types of index number

There are two main types of index, price and quantity. A price index measures changes in the cost of a commodity from one time period to another. An example is the Retail Price Index, which measures change in the cost of items of expenditure of the average household. A quantity index measures changes in the quantities of a commodity from one time period to another. An example is the productivity index, which measures the change in productivity of a group of workers.

Representation of index numbers

Index numbers are expressed in terms of a base of 100, like percentages. An index number of value 100 represents the original or base value. An index number above 100 represents an increase, and an index number below 100 represents a decrease.

An increase of 50% produces an index number of 150.
A decrease of 10% produces an index number of 90.




ENDSECTION STARTSECTION=activity_1.htm= SECTION~ Exercise M3.1

What index number would represent the following?

An increase of 60%. A decrease of 25%.



What percentage change is represented by the following index numbers:

Index number = 115.

Index number = 83. ENDSECTION STARTSECTION=content_2.htm= SECTION~

Construction of an Index number

Petrol prices example

Petrol costs 49p per litre in Oct 1995 and 52p per litre in Dec 1995.

• Percentage Price Increase =
• Index number =



Index number formula

Index Number (Index Relative) = $\frac{V_n}{V_0} \times 100$

This formula will be provided in the exam and does not need to be memorised.

Petrol prices example

Returning to our petrol price example, we are given that petrol costs 49p per litre in Oct 1995 and 52p per litre in Dec 1995. The base time is usually the earliest time at which we collected data, that is, the starting point of our data values. The base time is Oct 1995, and so $V_0$ = 49p. The other time is Dec 1995, and so $V_n$ = 52p.

$I_{Dec\,(Oct = 100)} = \frac{V_n}{V_0} \times 100 = \frac{52}{49} \times 100 = 106.12$ (2dp)

Notation

This is expressed in the following way:

$I_{Dec\,(Oct = 100)} = 106.12$

Here Dec is the other time, Oct is the base time, and 106.12 is the index number representing the change in prices in Dec relative to prices in Oct.

An alternative notation is

$I_{DEC/OCT} = 106.12$

where the subscript reads other time / base time, and 106.12 is again the index number representing the change in prices in Dec relative to prices in Oct.
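The index number formula can be sketched in Python as follows. This is an illustration only, and the function name `index_number` is our own.

```python
def index_number(base_value, other_value):
    """Index relative: (V_n / V_0) * 100, with V_0 the base value."""
    return other_value / base_value * 100


# Petrol example: 49p per litre in Oct 1995 (base), 52p per litre in Dec 1995.
petrol = index_number(49, 52)
print(f"{petrol:.2f}")
```

Subtracting 100 from the result gives the percentage price increase relative to the base time.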





Exercise M3.3 - Quantity Index Activity

The maths group employed 35 lecturers in April 92 and 32 in April 95.

What is the base time, and what value does it take?

What is the current time, and what value does it take?

Calculate the quantity index that represents this change using the index number formula, $\frac{V_n}{V_0} \times 100$.


Time Series of relatives

How do we investigate the growth of economic commodities over time?

How do the values of an index number (relative) change over time?

Two possible methods are:

1. Fixed base relatives
2. Chain base relatives

ENDSECTION STARTSECTION=content_4.htm= SECTION~

Fixed Base Index Numbers

Each relative is calculated on the same fixed time point (the same base). This is a suitable method for commodities whose nature remains unchanged over time.



Example

The production figures of a bottle factory for a five month period are given below.

Month | Jan | Feb | Mar | Apr | May
Production | 4563 | 4254 | 4841 | 4644 | 5290

[Figure: plot of the monthly production figures against time]

If we plot a graph of the production figures against time we see the production figures generally increase. Let us investigate how the productivity of the factory changes over time by comparing the monthly production figures to a fixed point in time, say March. So March will be the base time. Using this method we compare January’s production to March’s production, and then we compare February’s production to March’s production, and so on. We do this using the index number formula, calculating the production index for our series of data values. As the base time is March we know that the value of V0 is the production figure for March, so V0 = 4841. This value will remain constant throughout our calculation. The value of Vn is the production figure of the month we are comparing to March. Hence,

Index Number (base March) = $\frac{V_n}{V_0} \times 100 = \frac{\text{Monthly production figure}}{4841} \times 100$



Using this we obtain the following.

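Month | Jan | Feb | Mar | Apr | May
Production | 4563 | 4254 | 4841 | 4644 | 5290
Fixed base relative | $\frac{4563}{4841} \times 100$ = | $\frac{4254}{4841} \times 100$ = | $\frac{4841}{4841} \times 100$ = | $\frac{4644}{4841} \times 100$ = | $\frac{5290}{4841} \times 100$ =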
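The fixed base relatives in the table can be computed with a short Python sketch. This is an illustration only; the function name `fixed_base_relatives` and the use of a dictionary keyed by month are our own choices.

```python
production = {"Jan": 4563, "Feb": 4254, "Mar": 4841, "Apr": 4644, "May": 5290}


def fixed_base_relatives(values, base):
    """Index each value against the value at the fixed base time, to 1dp."""
    v0 = values[base]
    return {month: round(v / v0 * 100, 1) for month, v in values.items()}


relatives = fixed_base_relatives(production, "Mar")
for month, relative in relatives.items():
    print(month, relative)
```

Changing the second argument to another month recalculates the whole series against that base.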



Exercise M3.4

Calculate the series of fixed base index numbers that represent the following data:

Jan Feb Mar Apr May

Production 4431 4542 4650 4781 4892

Fixed base relative (April)


Changing the base of fixed base relatives

Suppose you have an index number relative to some base (old base), and you wish to change it to a different base (new base). You could recalculate every index number from the original data, or you could use the formula below, which rescales the index numbers so that they are relative to the new base.

Index number (New base) = $\frac{\text{Old base value}}{\text{New base value}} \times$ Index number (Old base)



Example

Let’s change the base to January.

Index number (Base Jan) = $\frac{4841}{4563} \times$ Index number (Base March) = 1.06 × Index number (Base March)

Month | Jan | Feb | Mar | Apr | May
Production | 4563 | 4254 | 4841 | 4644 | 5290
Fixed base index March (old base) | 94.3 | 87.9 | 100 | 95.9 | 109.3
Fixed base index Jan (new base) | 1.06 × 94.3 = 100 | 1.06 × 87.9 = 93.2 | 1.06 × 100 = 106 | 1.06 × 95.9 = 101.7 | 1.06 × 109.3 = 115.9
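The change of base can be sketched in Python as follows. This is an illustration only; the function name `rebase` and its `factor_dp` parameter are hypothetical. The scaling factor is rounded to 2 decimal places to mirror the 1.06 used above, so the results are slightly less accurate than recalculating from the production figures.

```python
def rebase(relatives, old_base_value, new_base_value, factor_dp=2):
    """Rescale index numbers: (old base value / new base value) * old index.

    The factor is rounded to `factor_dp` places, as in the notes (1.06).
    """
    factor = round(old_base_value / new_base_value, factor_dp)
    return [round(factor * r, 1) for r in relatives]


march_based = [94.3, 87.9, 100, 95.9, 109.3]
jan_based = rebase(march_based, old_base_value=4841, new_base_value=4563)
print(jan_based)
```

Passing a larger `factor_dp` keeps more digits of the factor and brings the results closer to those obtained directly from the original data.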



Exercise M3.5

Change the base to January.

Index number (Base Jan) = $\frac{\text{Old base}}{\text{New base}} \times$ Index number (Base April)

Jan Feb Mar Apr May

Production 4431 4542 4650 4781 4892

Fixed base relative (April)

Fixed base relative (Jan)




Chain base relative

Each relative is calculated with respect to the immediately preceding time point. We use chain base index numbers when the nature of the commodity is rapidly changing.

Example

The sales figures for a mobile phone manufacturer are given below.

Month | Jan | Feb | Mar | Apr | May | June
Sales | 2150 | 2660 | 3324 | 4156 | 5160 | 6300

[Figure: plot of the monthly sales figures against time]

If we plot a graph of the sales figures against time we see that they increase sharply. June’s sales figures are nearly three times those of January. In this case it is more sensible to investigate the rate of growth in sales by comparing each month’s figure with that of the immediately preceding month.

Month | Jan | Feb | Mar | Apr | May | June
Sales | 2150 | 2660 | 3324 | 4156 | 5160 | 6300
Chain base relative | | $\frac{2660}{2150} \times 100$ | $\frac{3324}{2660} \times 100$ | $\frac{4156}{3324} \times 100$ | $\frac{5160}{4156} \times 100$ | $\frac{6300}{5160} \times 100$
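The chain base relatives can be computed with a short Python sketch. This is an illustration only, and the function name `chain_base_relatives` is our own.

```python
sales = [2150, 2660, 3324, 4156, 5160, 6300]  # Jan to June


def chain_base_relatives(values):
    """Index each value against the immediately preceding value.

    The first value has no predecessor, so the result is one item shorter.
    """
    return [current / previous * 100 for previous, current in zip(values, values[1:])]


chain = chain_base_relatives(sales)
print([round(r, 1) for r in chain])
```

Each relative minus 100 is the percentage growth over the previous month, which makes the month-on-month growth rate easy to read off.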





Exercise M3.6

Calculate the set of chain base index numbers for the following data.

Jan Feb Mar Apr May June

Sales 2050 2560 3024 3998 4986 6000

Chain base relative


Comparing Index numbers (fixed base)

The Real Value Index will not be covered during the lecture; however, you will need to read this to complete question M3.3 of the seminar questions.

If you were told that the price of chocolate had increased by 5% you would think that you could afford to buy less chocolate. However, if your wages have increased by 7%, then in real terms the price of chocolate has decreased, because your wages have increased at a higher rate. We will use indicators to help us judge real increases.

Examples of indicators

• Retail price index
• Output of production index

We will use the retail price index as an indicator for the economy in the following examples. ENDSECTION STARTSECTION=content_8.htm= SECTION~



Time Series Deflation

Time series deflation measures the change in the real value of a commodity over time.

Real value index

The real value index is calculated using the following formula:

R.V.I = $\frac{\text{Current value}}{\text{Base value}} \times \frac{\text{Base indicator}}{\text{Current indicator}} \times 100$

Example

The following table contains information relating to the average weekly earnings of a set of factory workers.

Year | 1974 | 1984
Average Earnings | 59.6 | 174.3
RPI | 134.8 | 351.8

Let’s investigate the growth in wages from 1974 to 1984. To calculate the index of earnings relative to 1974 we use $I = \frac{V_n}{V_0} \times 100$.

Table 1 (without incorporating the indicator)

Year | 1974 | 1984
Index of Earnings (base 1974) | 100 | 292

$I = \frac{174.3}{59.6} \times 100 = 292$

So we see a 192% increase in wages over the 10 year period.

Does this represent a real increase?

What is the increase in earnings after incorporating the indicator?

Now we can calculate the real value index of earnings relative to 1974. This will enable us to judge the increase in earnings in real terms.

Table 2 (Incorporating the indicator)

Year | 1974 | 1984
RPI | 134.8 | 351.8
RVI (base 1974) | 100 | 112

R.V.I for 1984 (1974 = 100) = $\frac{174.3}{59.6} \times \frac{134.8}{351.8} \times 100 = 112$

So in real terms earnings increased by about 12% over the period, far less than the 192% suggested by Table 1.
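The real value index calculation can be sketched in Python. This is an illustration only; the function name `real_value_index` and its parameter names are our own.

```python
def real_value_index(current_value, base_value, base_indicator, current_indicator):
    """R.V.I = (current / base value) * (base / current indicator) * 100."""
    return current_value / base_value * base_indicator / current_indicator * 100


# Earnings example: 59.6 in 1974 and 174.3 in 1984; RPI 134.8 and 351.8.
rvi = real_value_index(174.3, 59.6, 134.8, 351.8)
print(round(rvi))
```

If the indicator did not change, the second ratio would be 1 and the real value index would equal the ordinary index of earnings.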



Complete the table:

Year | 1976 | 1986
RPI | 134.8 | 351.8
Price | 18 | 65
Price Index = $\frac{V_n}{V_0} \times 100$ | |
RVI (base 1976) | |




Composite Index Numbers

A composite index number is obtained by combining information from a set of economic commodities called components.

Weighting of Components

For example, the Retail Price Index has components, which consist of food, alcohol, fuel, transport, etc. In calculating a composite index number, each factor will be weighted. The weighting is considered as a measure of the importance of the component. For example, in the Retail Price Index food will be weighted more heavily than alcohol. Food is a necessity; alcohol is a luxury; hence food has a larger weight than alcohol.

Popular Weighting Factors

When calculating a price index for the production of an item consisting of, say, three components, use the quantity required of each component as the weighting factor. When calculating a quantity index, use prices as weights. ENDSECTION STARTSECTION=content_10.htm= SECTION~

Weighted Average Index

Calculate an index relative, $I = \frac{V_n}{V_0} \times 100$, for each component, then obtain a weighted average of these relatives:

Weighted average index = $\frac{\sum (wI)}{\sum w}$

where w = weighting factor, and I = the index relative for each component.



Example

A building contractor buys 8 tonnes of concrete, 14 tonnes of bricks and 2 tonnes of cement in order to complete 1 job. The prices of each component in 1999 and 2000 are displayed in the table below.

Commodity | Weighting w | 1999 Price $V_0$ | 2000 Price $V_n$ | $I = \frac{V_n}{V_0} \times 100$ | wI
Concrete | 8 | 25 | 24 | $\frac{24}{25} \times 100 = 96$ |
Bricks | 14 | 34 | 38 | $\frac{38}{34} \times 100 = 111.8$ |
Cement | 2 | 64 | 80 | $\frac{80}{64} \times 100 = 125$ |
Total | | | | |

Weighted average index = $\frac{\sum (wI)}{\sum w}$ =
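The weighted average index for the building contractor example can be sketched in Python. This is an illustration only; the function name `weighted_average_index` and the tuple layout are our own. The code keeps the index relatives unrounded, so its result may differ slightly from a hand calculation that uses 111.8 for bricks.

```python
# Each row: (commodity, weighting w, 1999 price V_0, 2000 price V_n).
components = [
    ("Concrete", 8, 25, 24),
    ("Bricks", 14, 34, 38),
    ("Cement", 2, 64, 80),
]


def weighted_average_index(rows):
    """Sum of w * I divided by sum of w, where I = (V_n / V_0) * 100."""
    total_weight = sum(w for _, w, _, _ in rows)
    total_weighted = sum(w * (vn / v0 * 100) for _, w, v0, vn in rows)
    return total_weighted / total_weight


print(f"{weighted_average_index(components):.2f}")
```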


ENDSECTION STARTSECTION=activity_8.htm= SECTION~ Exercise M3.8 - Weighted Average Index Activity

Calculate the weighted average index of the labour costs.

Hourly wage rates

Category of worker   Number of workers = w   V0 (1987)   Vn      I = (Vn / V0) × 100   wI
Craftsmen            120                     £4.50       £6.00
Labourer             200                     £3.20       £3.80
Drivers               80                     £3.80       £4.60
Total

$$\text{Weighted average index} = \frac{\sum (wI)}{\sum w} =$$





Weighted Aggregate Index

Perhaps we should compare the current cost with the base cost. This method is called the weighted aggregate index.

$$\text{Weighted aggregate index} = \frac{\sum (wV_n)}{\sum (wV_0)} \times 100$$

where Σ wVn = the total of weight × current price for each component, Σ wV0 = the total of weight × base price for each component.

Commodity   Weighting w   Price V0   Price Vn   wV0   wVn
Concrete     8            25         24         200   192
Bricks      14            34         38         476   532
Cement       2            64         80         128   160
Total                                           804   884

$$\text{Weighted aggregate index} = \frac{\sum (wV_n)}{\sum (wV_0)} \times 100 = \frac{884}{804} \times 100 = 109.95 \approx 110$$

Now go and do Exercise M3.9
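The same calculation can be sketched in Python, which makes the two column totals explicit. The names used are illustrative, not part of the module materials.

```python
# (name, weight w, base price V0, current price Vn) for each component
components = [
    ("Concrete", 8, 25, 24),
    ("Bricks", 14, 34, 38),
    ("Cement", 2, 64, 80),
]

base_cost = sum(w * v0 for _, w, v0, _ in components)      # total of w * V0
current_cost = sum(w * vn for _, w, _, vn in components)   # total of w * Vn
aggregate_index = current_cost / base_cost * 100

print(base_cost, current_cost, round(aggregate_index, 2))  # 804 884 109.95
```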




Exercise M3.9 - Weighted Aggregate Index Activity

Calculate the weighted aggregate index of the labour costs.

Category of worker   Number of workers w   V0      Vn      wV0   wVn
Craftsmen            120                   £4.50   £6.00
Labourer             200                   £3.20   £3.80
Drivers               80                   £3.80   £4.60
Total

$$\text{Weighted aggregate index} = \frac{\sum (wV_n)}{\sum (wV_0)} \times 100 =$$





Seminar Questions

Seminar Question M3.1

a) Calculate the series of fixed base index numbers that represent the following data

                               July   Aug    Sept   Oct    Nov
Production                     2431   2542   2650   2781   2892
Fixed base relative (Sept)

b) Change the base to October.

$$\text{Index number (base October)} = \frac{\text{Old base}}{\text{New base}} \times \text{Index number (base Sept)}$$

                               July   Aug    Sept   Oct    Nov
Production                     2431   2542   2650   2781   2892
Fixed base relative (October)



The average weekly earnings of manual employees in manufacturing industry increased from £132.98 in October 1983 to £164.74 in October 1986. The RPI increased from 340.7 to 388.4 over this period.

i) Calculate the index number that represents the change in earnings from 1983 to 1986.

ii) What was the real value index of earnings over this period? ENDSECTION STARTSECTION=activity_12.htm= SECTION~




Calculate the set of fixed base (Jan) and chain base index numbers for the following data. Which is more appropriate for the data? Explain your answer.

                      Jan    Feb    Mar    Apr    May
Sales                 1050   1560   2024   2998   3986
Fixed base
Chain base relative



A firm uses three materials in its manufacturing processes. The quantities bought in 1980 and 1989 and the prices are as follows:

Material   Units       1980 Prices (£)   1989 Prices (£)   1980 Quantities   1989 Quantities
A          Thousands   100               150                10                16
B          Gallons       1                 2               100               120
C          Metres        2                 5                50                70

i) Calculate the price index and quantity index for each component with base period 1980.

ii) Calculate the total cost of the manufacturing process in 1980 and 1989. Express the percentage change in the cost of the manufacturing process as an index number.





A cleaning product is made up of four components, A, B, C, and D. The table below displays the quantities used to make the cleaning agent and their prices in 1999 and 2000. Calculate the weighted average index.

        Weighting w   1999 Price V0   2000 Price Vn   I = (Vn / V0) × 100   wI
A       4             1.00            1.4
B       3             0.95            1.2
C       2             0.80            1.00
D       2             0.45            0.60
Total

$$\text{Weighted average index} = \frac{\sum (wI)}{\sum w} =$$




Calculate the weighted aggregate index.

        Weighting w   1999 Price V0   2000 Price Vn   wV0   wVn
A       4             1.00            1.4
B       3             0.95            1.2
C       2             0.80            1.00
D       2             0.45            0.60
Total

$$\text{Weighted aggregate index} = \frac{\sum (wV_n)}{\sum (wV_0)} \times 100 =$$







• the calculation, interpretation and representation of index numbers;
• when to use fixed and chain base index numbers on a time series of data;
• how to judge real change;
• the differences between the weighted average and weighted aggregate index.

Extra Activities







We shall next consider two particular types of weighted aggregate indices: the Laspeyre and Paasche indices. ENDSECTION STARTSECTION=content_12.htm= SECTION~

Laspeyre Index

This index always uses base time period weights. Generally it is used with price and quantity indices.

A Laspeyre price index uses base time period quantities as weights.

$$\text{Laspeyre price index} = \frac{\sum (p_n q_0)}{\sum (p_0 q_0)} \times 100$$

A Laspeyre quantity index uses base time period prices as weights.

$$\text{Laspeyre quantity index} = \frac{\sum (q_n p_0)}{\sum (q_0 p_0)} \times 100$$

Laspeyre example

A firm uses three materials in its manufacturing processes. The quantities bought in 1980 and 1989 and the prices are as follows:

Material   1980 Prices p0   1989 Prices pn   1980 Quantities q0   1989 Quantities qn   pn q0               p0 q0
A          100              150               10                   16                  10 × 150 = 1500     10 × 100 = 1000
B            1                2              100                  120                  100 × 2 = 200       100 × 1 = 100
C            2                5               50                   70                  50 × 5 = 250        50 × 2 = 100
Totals                                                                                 1950                1200

$$\text{Laspeyre price index} = \frac{\sum (p_n q_0)}{\sum (p_0 q_0)} \times 100 = \frac{\text{total of } p_n q_0 \text{ column}}{\text{total of } p_0 q_0 \text{ column}} \times 100 = \frac{1950}{1200} \times 100 = 162.5$$

Laspeyre Index

The Laspeyre index assumes that the quantities of goods are held constant from the base year. This is probably not realistic since, as prices go up, consumers tend to buy less. So the Laspeyre index usually over-estimates price increases. ENDSECTION STARTSECTION=content_13.htm= SECTION~

Paasche Index

This index always uses the current time period weights.

$$\text{Paasche price index} = \frac{\sum (p_n q_n)}{\sum (p_0 q_n)} \times 100$$

$$\text{Paasche quantity index} = \frac{\sum (q_n p_n)}{\sum (q_0 p_n)} \times 100$$

Remember a formula sheet is provided in the exam. Do not learn these formulas.



Paasche example

Material   1980 Prices p0   1989 Prices pn   1980 Quantities q0   1989 Quantities qn   pn qn               p0 qn
A          100              150               10                   16                  16 × 150 = 2400     16 × 100 = 1600
B            1                2              100                  120                  120 × 2 = 240       120 × 1 = 120
C            2                5               50                   70                  70 × 5 = 350        70 × 2 = 140
Totals                                                                                 2990                1860

$$\text{Paasche price index} = \frac{\sum (p_n q_n)}{\sum (p_0 q_n)} \times 100 = \frac{\text{total of } p_n q_n \text{ column}}{\text{total of } p_0 q_n \text{ column}} \times 100 = \frac{2990}{1860} \times 100 = 160.8$$
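The Laspeyre and Paasche price indices differ only in which year's quantities are used as weights. The following Python sketch makes that explicit for the three materials in the examples; the function `price_index` and its `use_current_quantities` flag are hypothetical names chosen for illustration.

```python
# material: (p0, pn, q0, qn) = 1980 price, 1989 price, 1980 quantity, 1989 quantity
materials = {
    "A": (100, 150, 10, 16),
    "B": (1, 2, 100, 120),
    "C": (2, 5, 50, 70),
}

def price_index(materials, use_current_quantities):
    """Weighted aggregate price index with quantities as the weights."""
    numerator = denominator = 0
    for p0, pn, q0, qn in materials.values():
        q = qn if use_current_quantities else q0   # Paasche uses qn, Laspeyre uses q0
        numerator += pn * q
        denominator += p0 * q
    return numerator / denominator * 100

laspeyre = price_index(materials, use_current_quantities=False)
paasche = price_index(materials, use_current_quantities=True)
print(round(laspeyre, 1), round(paasche, 1))  # 162.5 160.8
```

For this data the Laspeyre figure is the larger of the two.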

Paasche Index

The Paasche index assumes the current quantities are true for the base period. So the Paasche index tends to under-estimate price increases. Also, the quantities need to be updated each year. Quantities are generally more difficult to determine than prices, so the Laspeyre index is more popular. ENDSECTION ENDCHAPTER

M4 Introduction to Probability Page 215


CHAPTER=PROB


Introduction to Probability Context

In this unit we will introduce the concept of probability, chance and randomness. We will determine how likely events are to occur when we conduct or observe an experiment. We begin by investigating discrete experiments and then build on these ideas in subsequent lectures when we investigate continuous experiments. ENDSECTION STARTSECTION=scope_2.htm= SECTION~

Objectives


• define an experiment and its associated outcome set;
• calculate discrete probabilities;
• express probabilities as decimals, fractions and percentages;
• demonstrate knowledge of the properties of discrete probabilities;
• demonstrate knowledge of the characteristics of the normal distribution.
ENDSECTION STARTSECTION=content_1.htm= SECTION~

Probability and Chance

Probability measures the chance that something will happen. Statements about probability occur in everyday speech. For example the following statements are concerned with chance:

• It is highly likely that I will enjoy STX1110.
• Nine times out of ten I forget to switch off my phone before going into my seminar.
• I am almost certain I will pass the multiple choice test.

Probability gives a structure to the idea of chance and allows us to try and measure the level of uncertainty or chance. This will enable us to evaluate the level of associated risk. This level of risk will inform decisions and choices to be made in the future. For example we could evaluate how risky an investment is and so determine how likely the investment is to make a loss. If I feel that the chance of making a loss is too great then I will not invest. Probability gives a well-defined structure to the idea of chance.

Definitions



An EXPERIMENT, X, is a situation that can be performed (or considered) to gain information. Our experiment could be as simple as picking a STX1110 tutor at random from the module team list. Our experiment may be more complicated, perhaps taking part in a raffle, or playing the lottery.

An OUTCOME SET, Ω, is the set of possible results associated with an experiment. Let X = picking a STX1110 tutor at random from the module team list. Then Ω is the list of names of all STX1110 tutors in the module team. For example,

Ω = { Alison, Cathy, Chris, Emma, Gary, John, Matt, Patricia, Thomas, Zainab }

An EVENT, E, is either a single outcome or a combination of outcomes. Let X = picking a STX1110 tutor at random from the module team list and Ω = { Alison, Cathy, Chris, Emma, Gary, John, Matt, Patricia, Thomas, Zainab }. One event could be picking a STX1110 tutor at random from the list whose name begins with C. This condition is satisfied by two of the outcomes (names) in our outcome set (list), Cathy and Chris. This could be written as E = picking Cathy or Chris.

Examples

Experiment, X: Flipping a coin
Set of Outcomes, Ω: {Heads, Tails}
An Event, E: The coin landing face up.

Experiment, X: Guessing the sex of a baby
Set of Outcomes, Ω: {Boy, Girl}
An Event, E: Giving birth to a girl.

Experiment, X: Simple questionnaire, e.g. 1 question: Do you like STX1110?
Set of Outcomes, Ω: {Yes, No, Do not Know}
An Event, E: A student picked at random liking STX1110.

Experiment, X: Complex questionnaire, e.g. lots of questions asked to the whole of Middlesex University.
    1. At which campus are you studying STX1110? Hendon (HE) or Dubai (D)
    2. Which do you prefer?
       A) The statistics lectures
       B) The mathematics lectures
       C) No preference
Set of Outcomes, Ω: { (HE, A), (HE, B), (HE, C), (D, A), (D, B), (D, C) }
An Event, E: A student picked at random preferring the mathematics lectures.



Example

Suppose an experiment consists of flipping a coin twice. What is the set of outcomes? For simplicity let us denote the coin landing Heads up by H, and Tails up by T. The first time we flip the coin it could land either Heads up or Tails up: 2 results. The second time we flip the coin it could also land either Heads up or Tails up: 2 results. So if the first result is a Heads then the result of second flip will be one of two possible answers, H, T, and if the result of the first flip is Tails then the result of the second flip will also be one of two possible answers. This gives four (2×2) outcomes to list. It is much easier to list these in a table.

                      Second Result
                      H          T
First Result    H     (H, H)     (H, T)
                T     (T, H)     (T, T)

So, Ω = { (H, H), (H, T), (T, H), (T, T) }.
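For experiments with more repetitions, listing the outcome set by hand becomes tedious. A short Python sketch, using the standard library function `itertools.product`, can generate it; the variable names are illustrative.

```python
from itertools import product

faces = ["H", "T"]

# Every ordered pair (first result, second result) for two flips of a coin
outcome_set = list(product(faces, repeat=2))

print(outcome_set)       # [('H', 'H'), ('H', 'T'), ('T', 'H'), ('T', 'T')]
print(len(outcome_set))  # 4, that is 2 × 2
```

Changing `repeat=2` to `repeat=3` would list the 2 × 2 × 2 = 8 outcomes for three flips.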



Properties of Probabilities

Probabilities are measured on a scale between 0 and 1. This is just a scale and is a consequence of how the probability is calculated. Sometimes we refer to the percentage chance. We convert our probability value that is between zero and one to a percentage. Hence our percentage chance will lie between 0% and 100%.

Impossible        50-50 Chance        Certain
0                 0.5                 1

If the probability of an event, P(E), is 0, then the event is impossible. The probability that the following events will occur is 0 since they are all impossible.

• A STX1110 student being 50 metres tall.
• Winning a raffle if I do not have a ticket.

If the probability of an event, P(E), is 1, then the event is certain. The probability that the following events will occur is 1 since they are all certain to occur.

• A person being less than 50 metres tall.
• Winning a raffle if I have all of the tickets.

The sum of all probabilities associated with an experiment is 1.

Probabilities are often presented as percentages. For example,

P(E) = 0.5 or 50%
P(E) = 0.75 or 75%
P(E) = 0.125 or 12.5%
ENDSECTION STARTSECTION=content_2.htm= SECTION~



Evaluating probabilities for discrete variables

Definition

For an event E associated with an experiment X, the probability of observing the event is denoted by P(E) and is defined as follows:

$$P(E) = \frac{\text{Number of ways an event can occur}}{\text{Total number of possible outcomes}}$$

Note we can only use this formula for experiments with a finite number of outcomes which are equally likely.

Example

Suppose that an experiment, X, consists of rolling a six sided fair die, and noting the result. Then the set of outcomes for this experiment is Ω = { 1, 2, 3, 4, 5, 6 }.

What is the probability that the die will result in an even number, P(E) ? Now if we assume that the die is fair, then each face of the die is equally likely to be rolled. In fact each face has a one in six chance of being rolled.

So, P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = $\frac{1}{6}$

Let our event E be obtaining an even number. So E is obtaining 2 or 4 or 6, which is three of our six outcomes. Hence the probability that the die will result in an even number is,

$$P(E) = \frac{\text{Number of ways an event can occur}}{\text{Total number of outcomes for the experiment}} = \frac{3}{6} = \frac{1}{2}$$
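The counting definition can be sketched in Python for any experiment with equally likely outcomes. The helper `probability` is a hypothetical name; `Fraction` from the standard library keeps the answer as an exact fraction.

```python
from fractions import Fraction

def probability(event, outcome_set):
    """Number of outcomes satisfying the event divided by the total number of outcomes.

    Only valid when the outcomes are finite in number and equally likely.
    """
    favourable = [outcome for outcome in outcome_set if event(outcome)]
    return Fraction(len(favourable), len(outcome_set))

die = [1, 2, 3, 4, 5, 6]
p_even = probability(lambda face: face % 2 == 0, die)
print(p_even)         # 1/2
print(float(p_even))  # 0.5
```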





Exercise M4.1

Suppose an experiment consists of tossing a coin twice, and noting the result on each.

Calculate the probability that you obtain 2 heads in the two tosses of the coin.

Calculate the probability that you obtain at least one head in the 2 tosses of the coin.



A bag contains sweets which are either small, medium or large, and either red or yellow, in the following numbers:

          Red   Yellow
Small     4     6
Medium    5     5
Large     8     2

Find the probability that the sweet picked is

i. red

ii. yellow

iii. large and yellow

iv. small, medium, or large

v. extra large.




Estimating probabilities from observed data

Sometimes it will be necessary to estimate probabilities using data we have observed from conducting an experiment.

Example

We could estimate the probability that a student passes the STX1110 exam by finding out how many students have passed during the last 3 years and using this proportion as an estimate of the probability that a student will pass the STX1110 exam this year. The table below details the number of students who have passed the STX1110 exam for the last three years.

Number of students

        Pass   Fail
2002    221    12
2003    550    21
2004    750    26

Based on this information how likely is it for a student to pass the STX1110 exam this year? First we must work out how many students have passed the test in total over the last three years. To do this we add all the entries in the pass column. Then we must find out how many students have taken the test, either passing or failing, in total. So we add all the entries in the table.



This is then expressed as a proportion.

Number of students

          Pass   Fail   Totals
2002      221    12     233
2003      550    21     571
2004      750    26     776
Totals    1521   59     1580

So based on this information the probability that a student will pass the STX1110 exam is,

$$P(\text{Pass the STX1110 exam}) = \frac{1521}{1580} = 0.9627 = 96.3\%$$

Now go and do Exercise M4.3
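The estimate above is simply a proportion of observed results, which can be sketched in Python as follows. The dictionary layout and names are illustrative.

```python
# year: (number passing, number failing)
results = {
    2002: (221, 12),
    2003: (550, 21),
    2004: (750, 26),
}

total_pass = sum(passed for passed, _ in results.values())
total_students = sum(passed + failed for passed, failed in results.values())

# Estimated probability = observed proportion of passes
p_pass = total_pass / total_students
print(total_pass, total_students, round(p_pass, 4))  # 1521 1580 0.9627
```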


A module leader collects information concerning the punctuality of students in different lecture groups. The results are displayed in the following table.

                 Level of Punctuality
                 Early   On time   Late
Group   A        36      19        10
        B        20      116       65
        C        3       35        100

How likely are the following:

i. A student being late?

ii. A student being early and in group B?

iii. A student being in group A?

iv. A student not being late?

ENDSECTION STARTSECTION=content_4.htm= SECTION~ Normal Probabilities



At the beginning of the course we discussed frequency distributions. These are concerned with the number of times each outcome happens and the pattern of the number of occurrences of each outcome.

Example

Suppose we looked at the height distribution of the male students attending STX1110 last semester.

Height      5’6”   5’7”   5’8”   5’9”   5’10”   5’11”   6’0”   6’1”   6’2”
Frequency   1      6      15     22     25      20      16     5      1

A graph of the data is as follows,

[Figure: bar chart of frequency (0 to 25) against height (5’6” to 6’2”)]

What is the probability that a male student picked at random will be less than or equal to 5’8” in height? We estimate this probability by finding the total number of students who are less than or equal to 5’8” in height, and dividing by the total number of students. So we find

$$P(E) = \frac{22}{111} = 0.1982 = 19.82\%$$

We could visualise this probability by saying it is the percentage area on our graph.
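The same estimate can be reproduced from the frequency table. In this Python sketch the heights are stored in inches (so 5’8” is 68), which is a convenience chosen for the example rather than anything in the original data.

```python
# height in inches: frequency (66 inches = 5'6", 74 inches = 6'2")
frequencies = {66: 1, 67: 6, 68: 15, 69: 22, 70: 25, 71: 20, 72: 16, 73: 5, 74: 1}

total = sum(frequencies.values())
at_most_68 = sum(f for height, f in frequencies.items() if height <= 68)

p_short = at_most_68 / total
print(at_most_68, total, round(p_short, 4))  # 22 111 0.1982
```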



[Figure: bar chart of frequency (0 to 25) against height (5’6” to 6’2”)]

We could plot a relative frequency histogram and then the areas would directly represent the probability we want to evaluate. If we could measure the heights with increasing accuracy the width of the bars above would become smaller. As this happened the shape of the distribution graph would tend to a smooth curve. We could then say that probabilities could be estimated by the area under this curve. In the next unit we will use the normal distribution to do this. There are some special distributions in statistics which model the observed outcomes of experiments quite well. This is important, because given a little bit of information we can use the statistical information to calculate any required probability.




The Normal Distribution

This is the most frequently used and important distribution in statistics; it has been shown that it models many things that occur in nature, for example the heights of males, weights of students, or time taken to complete an activity.

Characteristics of the Normal Distribution

It has a very distinctive shape.

The normal distribution is distributed about its mean:

Most of the values cluster about the mean. The variance determines how ‘tightly’. The frequency tapers away either side of the mean and tends to 0.

The total area under the curve is 1.

The two values that characterise the shape of a normal distribution are the mean μ and the standard deviation σ. The mean is a measure of location, and the standard deviation is a measure of dispersion.

Example:

Suppose f1 is the normal density function with mean μ = 0, and standard deviation s1. Suppose also that f2 is the normal density function with mean μ = 0, and standard deviation s2. Both distributions have the same mean, but f2 has a larger standard deviation than f1.

[Figure: the normal density curves f1 and f2]

The distribution for f2 is more widely spread over the range. In effect almost all values (about 99.7%) lie within ± 3 standard deviations of the mean.
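The statement about ± 3 standard deviations can be checked numerically. This sketch uses `NormalDist` from Python's standard `statistics` module (Python 3.8 or later); the second distribution, with mean 160 and standard deviation 5, is just an illustrative choice.

```python
from statistics import NormalDist

standard = NormalDist()              # mean 0, standard deviation 1
within_3 = standard.cdf(3) - standard.cdf(-3)
print(round(within_3, 4))            # 0.9973

# The proportion is the same whatever the mean and standard deviation
other = NormalDist(mu=160, sigma=5)
within_3_other = other.cdf(160 + 3 * 5) - other.cdf(160 - 3 * 5)
print(round(within_3_other, 4))      # 0.9973
```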




Seminar Questions


Suppose that an experiment, X, consists of rolling a six sided fair die, and noting the result.

Write down the set of outcomes and calculate the following probabilities:

Set of outcomes =

a) What is the probability of rolling a 4 or a 5? P(4 or 5) =

b) What is the probability of rolling 4 or more? P(4 or more) =

c) What is the probability of rolling a 7? P(7) =

d) What is the probability of rolling an odd or an even number?

P(odd or even) = ENDSECTION STARTSECTION=activity_5.htm= SECTION~


In Euro-Wisney theme park in Paris, park helpers dressed up as Creepy Crawly characters distribute free gifts from a sack at random.

Before the next toy is picked, one of the park helpers refills the bag with a variety of action figures from the Creepy Crawly range. The number of each size and type is as follows.

Flint Princess

Large action figure 5 10

Medium action figure 6 9

Small action figure 9 6

Calculate the probability that the toy picked out is;

i) A Flint action figure.
ii) A large action figure.
iii) A medium Princess action figure.
iv) Either a Flint or a Princess action figure.





For an experiment we roll two dice together and add their results up. Finish off filling the results table below.

Result on first die   Result on second die   Total
1                     1                      2
1                     2                      3
1                     3                      4
1                     4                      5
1                     5                      6
1                     6                      7
2                     1                      3
2                     2                      4
2                     3                      5
2                     4                      6
2                     5                      7
2                     6                      8
3
3
3
3
3
3
4
4
4
4
4
4
5
5
5
5
5
5
6
6
6
6
6
6

Calculate the probability of obtaining a total of 11. P (total = 11)





The set of heights of male STX1110 students is made up as follows:

Height to the nearest inch   Number of students
5’ 7’’ and below             4
5’ 8’’                       8
5’ 9’’                       15
5’ 10’’                      23
5’ 11’’                      24
6’                           17
6’ 1’’                       7
6’ 2’’                       3
6’3’’ and above              1

i) Graph the above data.

ii) What percentage of male students are six foot or taller?

iii) What is the probability of picking out a male student from the group and him being shorter than six foot?




• the properties of discrete probabilities;
• how to evaluate probabilities and express them as fractions, decimals and percentages;
• the link between probabilities, percentages and percentage areas of the graphed data;
• the characteristics of the normal distribution.



Extra Activities






M5 Standard Normal Distribution

Page 231


CHAPTER=SND


Standard Normal Distribution Context

In the last unit we introduced probability for discrete variables. We then considered how it may be possible to make predictions for continuous variables, such as heights and weights. We introduced the notion of approximating probabilities by evaluating an area under a curve. In this unit we build on these ideas and will introduce the normal distribution. We start by looking at a special case called the standard normal distribution. ENDSECTION STARTSECTION=scope_2.htm= SECTION~

Objectives

Having worked through this unit you should be able to:

• use standard normal distribution tables to evaluate less than or lower tail probabilities;

• manipulate less than probabilities to evaluate greater than (upper tail) and interval (strip) probabilities;

• construct an appropriate z value from tables for a given probability. ENDSECTION STARTSECTION=content_1.htm= SECTION~

Standard Normal Distribution

The standard normal distribution has a mean equal to 0 and a standard deviation equal to 1.

Probabilities associated with the standard normal distribution are tabulated. We use these tables to evaluate probabilities. Tables give us the probability that our experiment (say Z) takes a value less than a number (say z), P(Z ≤ z).




Graphically it is the area under the curve to the left of the number z.

Example

Suppose that our experiment Z is the error in measuring someone’s height in centimetres. We believe that Z is normally distributed with mean = μ = 0 and standard deviation = σ = 1. We denote this by Z ~ N(0,1).

Properties of the standard normal distribution

1. The total area under the curve = 1

2. Due to the symmetry of the graph we see that the area under the curve to the left of zero is half of the total area. This tells us that P(Z<0) is one half.

P( Z<0) = ½





Calculating Standard Normal Probabilities

Less than or lower tail probabilities

Suppose we want to evaluate the probability that our error, Z, is less than 0.41cm. In notation this is written as P(Z < 0.41).

Drawing a diagram helps you visualize the probability that you are calculating. We want P(Z < 0.41), graphically this is;

Now we use our tables. Tables provide the probability that Z<z. We look up the corresponding value using the z-score, the non-zero number z. In our example we must look up 0.41 in our tables. We do this by splitting the value into units, tenths, and hundredths. The units and tenths part of the z-score tells us which row we must look in, and the hundredths identifies the column. For our example we know that the associated probability is in the 0.4 row and 1 column. We follow the row and column until we highlight the appropriate entry, see below.

↓

z 0 1 2

0.0 .5000 .5040 .5080

0.1 .5398 .5438 .5478

0.2 .5793 .5832 .5871

0.3 .6179 .6217 .6255

→ 0.4 .6554 .6591 .6628

0.5 .6915 .6950 .6985

We find the corresponding probability is P(Z < 0.41) = 0.6591. This tells us that 65.91% of our measurement errors are less than 0.41cm. Now go and do Exercise M5.1
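Table values can be cross-checked with software. The sketch below computes P(Z < z) from the error function in Python's standard `math` module; the function name `phi` is a common convention but is an assumption here, not notation used in these notes.

```python
from math import erf, sqrt

def phi(z):
    """P(Z < z) for a standard normal Z, the quantity given in the tables."""
    return 0.5 * (1 + erf(z / sqrt(2)))

print(round(phi(0.41), 4))  # 0.6591, matching the 0.4 row and 1 column
print(phi(0))               # 0.5, half of the total area
```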


Evaluate the probability that our error is less than 1.63cm.

So in notation P(Z < 1.63).




Graphically this is

Now we use our tables.

We find the corresponding probability is

Now go and do Exercise M5.2


Evaluate the probability that our error is less than 2.5cm.

So in notation P(Z < 2.5).

Evaluate the probability that our error is less than –1.05cm.





Greater than or upper tail probabilities

Suppose we want to calculate the probability that Z, our measurement error, is greater than 0.41. Graphically this is

[Figure: the area under the curve to the right of 0.41 = the total area under the curve – the area to the left of 0.41]

This tells us that P(Z > 0.41) = 1 – P(Z < 0.41) = 1 – 0.6591 = 0.3409. So 34.09% of measurement errors are greater than 0.41cm. Now go and do Exercise M5.3





Exercise M5.3

Calculate the probability that Z, our measurement error, is greater than 1.65.

Graphically this is



Calculate the probability that Z, our measurement error, is greater than -0.37.



Calculate the probability that Z, our measurement error, is greater than 1.06.





Interval or strip probabilities

Suppose we need the probability that 0.41 < Z < 1.35. Graphically this is

[Figure: the area between 0.41 and 1.35 = the area to the left of 1.35 – the area to the left of 0.41]

Look up 1.35 and 0.41 in your tables. This gives

P(0.41 < Z < 1.35) = P(Z < 1.35) – P(Z < 0.41) = 0.9115 – 0.6591 = 0.2524

So 25.24% of errors take a value between 0.41cm and 1.35cm. Now go and do Exercise M5.6
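Both the upper tail and the strip probability are built from less than probabilities. The following Python sketch shows the two manipulations using `NormalDist` from the standard `statistics` module (Python 3.8 or later); the variable names are illustrative.

```python
from statistics import NormalDist

Z = NormalDist(mu=0, sigma=1)   # the standard normal distribution

# Upper tail: subtract the less than probability from 1
upper_tail = 1 - Z.cdf(0.41)
print(round(upper_tail, 4))     # 0.3409

# Strip: difference between two less than probabilities
strip = Z.cdf(1.35) - Z.cdf(0.41)
print(round(strip, 4))          # 0.2524
```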





Suppose we need the probability that -0.58 < Z < 1.35. Graphically this is


Recap

So far in this unit we have evaluated less than (lower tail) probabilities using the standard normal distribution tables. We have also manipulated these less than probabilities to give greater than probabilities by subtracting the standard normal distribution table values from 1. Additionally we have calculated interval or strip probabilities by evaluating the difference between two standard normal distribution table values. ENDSECTION STARTSECTION=content_6.htm= SECTION~

Finding a Z value

Up until this point we have been using our Z value to find a probability. This probability is an area under the normal curve which we have to evaluate. Suppose now we know the area, but need to find the z value it comes from.

Example

So suppose we want to find the z value that gives us an upper tail value of 0.05, ( 5% ). Graphically this is,




How do we find the value of z?

We know that P(Z < z)= 0.95. So if we look through the standard normal probability tables and find the entry closest to 0.95 we can find the value of z. The closest table entries are 0.9505 and 0.9495.

↓ ↓

z 0 1 2 3 4 5 6

…

1.4 .9192 .9207 .9222 .9236 .9251 .9265 .9279

1.5 .9332 .9345 .9357 .9370 .9382 .9394 .9406

→ 1.6 .9452 .9463 .9474 .9484 .9495 .9505 .9515

1.7 .9554 .9564 .9573 .9582 .9591 .9599 .9608

1.8 .9641 .9649 .9656 .9664 .9671 .9678 .9686

1.9 .9713 .9719 .9726 .9732 .9738 .9744 .9750

The table entry of 0.9505 is in the 1.6 row and 5 column. So here z = 1.65. The table entry of 0.9495 is in the 1.6 row and 4 column. So here z = 1.64. We can use either of these z values, but to get a better estimate for z we could take their average. So take z = (1.65 + 1.64) ÷ 2 = 1.645.

Example

Find the z value that gives us an upper tail value of 0.1, ( 10% ). Graphically this is,

We know that P(Z < z) = 0.9. Look through the standard normal probability tables and find the entry closest to 0.9.




The closest table entry is 0.8997, it is 0.0003 away from 0.9.

↓

z … 3 4 5 6 7 8 9

…

1.0 .8485 .8508 .8531 .8554 .8577 .8599 .8621

1.1 .8708 .8729 .8749 .8770 .8790 .8810 .8830

→ 1.2 .8907 .8925 .8944 .8962 .8980 .8997 .9015

1.3 .9082 .9099 .9115 .9131 .9147 .9162 .9177

1.4 .9236 .9251 .9265 .9279 .9292 .9306 .9319

1.5 .9370 .9382 .9394 .9406 .9418 .9429 .9441

The table entry of 0.8997 is in the 1.2 row and 8 column. So here z = 1.28.

Example

Find the z value that gives us an upper tail value of 0.025, ( 2.5% ). Graphically this is,

We know that P(Z < z) = 0.975. Look through the standard normal probability tables and find the entry closest to 0.975. The table entry is exactly 0.9750.

Which row and column is it in? This will tell us the value of z.
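Reading the tables "backwards" in this way corresponds to the inverse of the less than probability. As a hedged illustration, Python's `NormalDist.inv_cdf` (standard `statistics` module, Python 3.8 or later) performs this step; the helper `z_for_upper_tail` is a hypothetical name.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal

def z_for_upper_tail(tail_area):
    """z value such that P(Z > z) equals tail_area, i.e. P(Z < z) = 1 - tail_area."""
    return Z.inv_cdf(1 - tail_area)

print(round(z_for_upper_tail(0.05), 3))  # 1.645, the average found from the tables
print(round(z_for_upper_tail(0.10), 2))  # 1.28
```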





Seminar Questions


a) Sketch the shape of the standard normal distribution.

b) What do we know about the mean and variance of the standard normal distribution?

c) Evaluate the following probabilities

i) P(Z < 1.68)

ii) P(Z < -0.96)

iii) P(-0.96 < Z < 1.68) ENDSECTION STARTSECTION=activity_8.htm= SECTION~


a) Find the z value that gives us an upper tail value of 0.01, ( 1% ).

b) Find the z value such that P(Z>z) = 0.1 = 10%. ENDSECTION STARTSECTION=think_1.htm= SECTION~



• how to use standard normal tables;
• the link between less than, greater than and strip probabilities;
• how to estimate a z-value from a given probability.

Extra Activities









M6 General Normal Distribution Page 243


CHAPTER=GND

STARTSECTION=scope_1.htm= SECTION~

General Normal Distribution Context

In the last unit we introduced the normal distribution. We used a special case called the standard normal distribution to make predictions for continuous variables, such as heights and weights. We also used the idea of approximating probabilities by evaluating an area under a curve. In this unit we build on these ideas and will investigate the normal distribution in general, that is, for normal variables that do not have a mean of zero and a variance of one. We will learn how to standardise, and compute probabilities in these situations. ENDSECTION STARTSECTION=scope_2.htm= SECTION~

Objectives


• give the steps for standardisation;
• use these steps to standardise general normal probabilities, and obtain less than (lower tail) probabilities from tables;
• manipulate less than probabilities to evaluate greater than (upper tail) and interval (strip) probabilities for the general normal distribution.


General Normal Distribution

Suppose that the situation that we are interested in follows a normal distribution but it does not have mean 0 and standard deviation 1.



Example

The heights of female STX1110 students are normally distributed with mean 160cm and standard deviation 5cm.

X = the height of a female STX1110 student

We write X ~ N(160, 25).

Suppose we want the probability that the student is shorter than 170cm, so

P(The height of the female student is less than 170cm) = P(X < 170).

We do not have tables for this specific normal distribution, so we convert to a standard normal. We transform our general normal curve into a standard normal curve. To do this we first move it along the x-axis and then change the shape of the curve. Mathematically this is done via the following steps.

Steps for standardising

Subtract the mean. Divide by the standard deviation. Then evaluate the probabilities using standard normal tables.


Standardising the General Normal Distribution

To calculate the probabilities of a general normal distribution we must transform one graph into the standard normal graph.

We do this by first moving the graph to the origin.

The next step is to change the shape of the graph by squashing it.



Let us use the steps for standardising for our example

Steps for standardising

Subtract the mean. Divide by the standard deviation. Then evaluate the probabilities using standard normal tables.



Example

P(The height of a female STX1110 student is less than 170cm) = P(X < 170)

$$P(X < 170) = P\left(\frac{X - 160}{5} < \frac{170 - 160}{5}\right) = P\left(Z < \frac{170 - 160}{5}\right) = P\left(Z < \frac{10}{5}\right) = P(Z < 2)$$

We can use the tables to find the tail area required. Graphically this is,

So the required probability is the area under the curve to the left of 2.

The value $\frac{170 - 160}{5}$ is called a Z-score.

More formally, a Z-score for any x value from a normal distribution with mean μ and standard deviation σ is defined to be

$$z = \frac{x - \mu}{\sigma}$$
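The two steps for standardising translate directly into code. This Python sketch applies them to the height example; the function name `z_score` is illustrative, and `NormalDist().cdf` from the standard `statistics` module stands in for the table lookup.

```python
from statistics import NormalDist

def z_score(x, mean, sd):
    """Subtract the mean, then divide by the standard deviation."""
    return (x - mean) / sd

z = z_score(170, mean=160, sd=5)
print(z)                          # 2.0

# Less than probability for the standardised value, as read from the tables
p = NormalDist().cdf(z)
print(round(p, 4))                # 0.9772
```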

Example

Suppose that X is the height of a female STX1110 student, and let X be normally distributed with mean 160cm and standard deviation 5cm (so variance = 25). In notation this is X ~ N(160, 25).

Calculate P(Height of STX1110 student is less than 155cm). In notation this is P(X < 155). To evaluate this probability we must use the steps for standardisation: subtract the mean, divide by the standard deviation, then evaluate the probabilities using standard normal tables. So this gives,

$$P(X < 155) = P\left(\frac{X - 160}{5} < \frac{155 - 160}{5}\right) = P(Z < -1)$$

We can now evaluate this probability by looking up –1.00 in our tables.


Further Worked Examples

The time taken to type up each set of STX1110 notes is normally distributed with mean 6.4 hours and standard deviation 1.2 hours. Calculate the probability that it takes



a) Less than 6 hours.
b) Greater than 7 hours.
c) Between 6 and 7 hours.

Let X = Length of time to type up the notes. Then $X \sim N(6.4, (1.2)^2)$.

Example a)

We need to calculate P(X < 6). Once we apply the steps for standardisation we have,

$$P(X < 6) = P\left(Z < \frac{6 - 6.4}{1.2}\right) = P(Z < -0.33)$$

where Z ~ N(0, 1). Graphically this is,

Using the standard normal tables we find P(Z < –0.33) = 0.3707. So 37% of the time the notes will take less than 6 hours to type up.

Example b) We need to calculate P(X > 7). Once we apply the steps for standardisation we have,

$$P(X > 7) = P\left(Z > \frac{7 - 6.4}{1.2}\right) = P(Z > 0.5)$$

Graphically this is,

Now we can use our tables to evaluate the required probability. Remember as this is an upper tail probability we must subtract the table value from one. This gives, P(X > 7) = P(Z > 0.5) = 1 – P(Z < 0.5 ) = 1 – 0.6915 = 0.3085. So, roughly 31% of the time it will take more than 7 hours to type up the notes.



Example c) P(6 < X < 7) = P(X < 7) – P(X < 6) = 0.6915 – 0.3707 = 0.3208.

Graphically, the strip area between 6 and 7 equals the area to the left of 7 minus the area to the left of 6.
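The three worked examples can be reproduced with a short script. This is a sketch under stated assumptions: `phi` is an illustrative helper built on Python's `math.erf` rather than the printed tables, and each Z-score is rounded to two decimal places to mimic a table look-up.

```python
import math

def phi(z: float) -> float:
    """P(Z < z) for the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

mean, sd = 6.4, 1.2                  # X ~ N(6.4, 1.2^2), typing time in hours

# Z-scores rounded to two decimal places, as when reading the tables
z6 = round((6 - mean) / sd, 2)       # -0.33
z7 = round((7 - mean) / sd, 2)       # 0.5

p_less_6 = phi(z6)                   # a) lower tail
p_more_7 = 1 - phi(z7)               # b) upper tail: subtract from one
p_between = phi(z7) - phi(z6)        # c) strip: difference of two lower tails

print(round(p_less_6, 4), round(p_more_7, 4), round(p_between, 4))
```

Without the rounding step the answers to a) and c) shift slightly, because the unrounded Z-score for 6 hours is –0.333... rather than –0.33.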




Seminar Questions


Suppose the weight of a bag of potatoes in pounds is N(5, (0.2)²).

X = Weight of a bag in pounds. X ~ N(5, (0.2)²).

i) Calculate the probability a bag weighs less than 5.5 pounds.

P(X< 5.5) =

ii) Calculate the probability a bag weighs more than 5.5 pounds.

P(X > 5.5) =

iii) Calculate the probability a bag weighs between 4.5 and 5.5 pounds.

P(4.5 < X < 5.5) =



Suppose X ~ N(6.4, (1.2)²). Calculate the following probabilities.

i) P(X < 6.5) =

ii) P(X > 7.5) =

iii) P(5.8 < X < 6.7) =


The standard normal distribution

Each entry gives P(Z < z). The row gives z to one decimal place and the column heading gives the second decimal place.

Negative values of z

z      0     1     2     3     4     5     6     7     8     9
-3.0 .0013 .0013 .0013 .0012 .0012 .0011 .0011 .0011 .0010 .0010
-2.9 .0019 .0018 .0018 .0017 .0016 .0016 .0015 .0015 .0014 .0014
-2.8 .0026 .0025 .0024 .0023 .0023 .0022 .0021 .0021 .0020 .0019
-2.7 .0035 .0034 .0033 .0032 .0031 .0030 .0029 .0028 .0027 .0026
-2.6 .0047 .0045 .0044 .0043 .0041 .0040 .0039 .0038 .0037 .0036
-2.5 .0062 .0060 .0059 .0057 .0055 .0054 .0052 .0051 .0049 .0048
-2.4 .0082 .0080 .0078 .0075 .0073 .0071 .0069 .0068 .0066 .0064
-2.3 .0107 .0104 .0102 .0099 .0096 .0094 .0091 .0089 .0087 .0084
-2.2 .0139 .0136 .0132 .0129 .0125 .0122 .0119 .0116 .0113 .0110
-2.1 .0179 .0174 .0170 .0166 .0162 .0158 .0154 .0150 .0146 .0143
-2.0 .0228 .0222 .0217 .0212 .0207 .0202 .0197 .0192 .0188 .0183
-1.9 .0287 .0281 .0274 .0268 .0262 .0256 .0250 .0244 .0239 .0233
-1.8 .0359 .0351 .0344 .0336 .0329 .0322 .0314 .0307 .0301 .0294
-1.7 .0446 .0436 .0427 .0418 .0409 .0401 .0392 .0384 .0375 .0367
-1.6 .0548 .0537 .0526 .0516 .0505 .0495 .0485 .0475 .0465 .0455
-1.5 .0668 .0655 .0643 .0630 .0618 .0606 .0594 .0582 .0571 .0559
-1.4 .0808 .0793 .0778 .0764 .0749 .0735 .0721 .0708 .0694 .0681
-1.3 .0968 .0951 .0934 .0918 .0901 .0885 .0869 .0853 .0838 .0823
-1.2 .1151 .1131 .1112 .1093 .1075 .1056 .1038 .1020 .1003 .0985
-1.1 .1357 .1335 .1314 .1292 .1271 .1251 .1230 .1210 .1190 .1170
-1.0 .1587 .1562 .1539 .1515 .1492 .1469 .1446 .1423 .1401 .1379
-0.9 .1841 .1814 .1788 .1762 .1736 .1711 .1685 .1660 .1635 .1611
-0.8 .2119 .2090 .2061 .2033 .2005 .1977 .1949 .1922 .1894 .1867
-0.7 .2420 .2389 .2358 .2327 .2296 .2266 .2236 .2206 .2177 .2148
-0.6 .2743 .2709 .2676 .2643 .2611 .2578 .2546 .2514 .2483 .2451
-0.5 .3085 .3050 .3015 .2981 .2946 .2912 .2877 .2843 .2810 .2776
-0.4 .3446 .3409 .3372 .3336 .3300 .3264 .3228 .3192 .3156 .3121
-0.3 .3821 .3783 .3745 .3707 .3669 .3632 .3594 .3557 .3520 .3483
-0.2 .4207 .4168 .4129 .4090 .4052 .4013 .3974 .3936 .3897 .3859
-0.1 .4602 .4562 .4522 .4483 .4443 .4404 .4364 .4325 .4286 .4247
-0.0 .5000 .4960 .4920 .4880 .4840 .4801 .4761 .4721 .4681 .4641

Positive values of z

z      0     1     2     3     4     5     6     7     8     9
0.0  .5000 .5040 .5080 .5120 .5160 .5199 .5239 .5279 .5319 .5359
0.1  .5398 .5438 .5478 .5517 .5557 .5596 .5636 .5675 .5714 .5753
0.2  .5793 .5832 .5871 .5910 .5948 .5987 .6026 .6064 .6103 .6141
0.3  .6179 .6217 .6255 .6293 .6331 .6368 .6406 .6443 .6480 .6517
0.4  .6554 .6591 .6628 .6664 .6700 .6736 .6772 .6808 .6844 .6879
0.5  .6915 .6950 .6985 .7019 .7054 .7088 .7123 .7157 .7190 .7224
0.6  .7257 .7291 .7324 .7357 .7389 .7422 .7454 .7486 .7517 .7549
0.7  .7580 .7611 .7642 .7673 .7704 .7734 .7764 .7794 .7823 .7852
0.8  .7881 .7910 .7939 .7967 .7995 .8023 .8051 .8078 .8106 .8133
0.9  .8159 .8186 .8212 .8238 .8264 .8289 .8315 .8340 .8365 .8389
1.0  .8413 .8438 .8461 .8485 .8508 .8531 .8554 .8577 .8599 .8621
1.1  .8643 .8665 .8686 .8708 .8729 .8749 .8770 .8790 .8810 .8830
1.2  .8849 .8869 .8888 .8907 .8925 .8944 .8962 .8980 .8997 .9015
1.3  .9032 .9049 .9066 .9082 .9099 .9115 .9131 .9147 .9162 .9177
1.4  .9192 .9207 .9222 .9236 .9251 .9265 .9279 .9292 .9306 .9319
1.5  .9332 .9345 .9357 .9370 .9382 .9394 .9406 .9418 .9429 .9441
1.6  .9452 .9463 .9474 .9484 .9495 .9505 .9515 .9525 .9535 .9545
1.7  .9554 .9564 .9573 .9582 .9591 .9599 .9608 .9616 .9625 .9633
1.8  .9641 .9649 .9656 .9664 .9671 .9678 .9686 .9693 .9699 .9706
1.9  .9713 .9719 .9726 .9732 .9738 .9744 .9750 .9756 .9761 .9767
2.0  .9772 .9778 .9783 .9788 .9793 .9798 .9803 .9808 .9812 .9817
2.1  .9821 .9826 .9830 .9834 .9838 .9842 .9846 .9850 .9854 .9857
2.2  .9861 .9864 .9868 .9871 .9875 .9878 .9881 .9884 .9887 .9890
2.3  .9893 .9896 .9898 .9901 .9904 .9906 .9909 .9911 .9913 .9916
2.4  .9918 .9920 .9922 .9925 .9927 .9929 .9931 .9932 .9934 .9936
2.5  .9938 .9940 .9941 .9943 .9945 .9946 .9948 .9949 .9951 .9952
2.6  .9953 .9955 .9956 .9957 .9959 .9960 .9961 .9962 .9963 .9964
2.7  .9965 .9966 .9967 .9968 .9969 .9970 .9971 .9972 .9973 .9974
2.8  .9974 .9975 .9976 .9977 .9977 .9978 .9979 .9979 .9980 .9981
2.9  .9981 .9982 .9982 .9983 .9984 .9984 .9985 .9985 .9986 .9986
3.0  .9987 .9987 .9987 .9988 .9988 .9989 .9989 .9989 .9990 .9990





• how the steps for standardisation transform a general normal variable into a standard normal variable;

• how to use the steps for standardisation to find general normal probabilities from the standard normal tables;

• the link between less than, greater than and strip probabilities for a general normal variable.

Extra Activities







M7 Linear Equations


Linear Equations

Context

This unit is designed to enable you to top up the mathematical skills required to formulate and solve linear programming problems. We will revise the skills required to generate and manipulate linear equations. The techniques that we revise will provide us with the basic mathematical tools to help us find the optimal allocation of resources such as time, materials and money, to achieve the best solution to a business problem. For example, to maximise profit we may need to find the optimal number of each type of product to produce within the constraints that we have.

Objectives


• formulate linear equations;
• manipulate linear equations;
• construct linear graphs;
• identify regions within a plot.


Linear Equations

Solutions to equations can be thought of as lines or curves. All the equations we will be working with are linear equations, that is, equations whose solutions can be thought of as straight lines. These equations do not contain powers of x other than 1, so they do not have x², x³, x⁴, etc., but will have a multiple of x. Linear equations are those of the form

y = a + bx

where a is where the line cuts the y axis, and b is the gradient (slope) of the line.



[Figure: graph of the line y = 10 + 20x for x from –3 to 3]

You may have seen this written as y = mx + c, where c is where the line cuts the y axis, and m is the gradient (slope) of the line. It really doesn’t matter what letters you use to represent the slope and intercept. We will use a and b as we will see this form later in the course during the regression and time series units.

The equation y = a + bx tells us what the y-value is for any given x-value. So we can see that y depends on x. Often we see y called the dependent variable, and x the independent variable. Another way to think of this is that if we change the value of x then the value of y will move in response. So when y changes we know that it can be explained by x moving. Often we see y called the response variable and x the explanatory variable.

The numerical value of b tells us how y responds when we increase x by +1. So if y = 10 + 20x, then we know that the line representing this equation crosses the y-axis at 10 (when x = 0), and that every time you increase x by +1, y increases by +20. See the example below.



Example y = 10 + 20x

If we calculate the corresponding values of y for x between –3 and 3, we can see that as x increases by +1, y increases by +20.

x               –3    –2    –1     0     1     2     3
y = 10 + 20x   –50   –30   –10    10    30    50    70

For example, when x = –3, y = 10 + 20(–3) = 10 – 60 = –50, and when x = –2, y = 10 + 20(–2) = 10 – 40 = –30. Each step of +1 in x produces a step of +20 in y.

Using these points we can plot the graph for y = 10 + 20x. The line slopes upwards, telling us that there is a positive relationship between x and y: as x increases, y increases.

[Figure: graph of y = 10 + 20x, a straight line sloping upwards]
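The table of values for this example can be generated with a few lines of code. This is only an illustrative sketch; the function name `line_values` is an assumption and not part of the module.

```python
def line_values(a, b, xs):
    """Return (x, y) pairs on the line y = a + b*x for each x in xs."""
    return [(x, a + b * x) for x in xs]

points = line_values(10, 20, range(-3, 4))      # y = 10 + 20x for x = -3, ..., 3
ys = [y for _, y in points]

# Change in y for each +1 step in x: this is the gradient b
steps = [y2 - y1 for y1, y2 in zip(ys, ys[1:])]

for x, y in points:
    print(x, y)
print("steps in y:", steps)
```

Calling `line_values(10, -20, range(-3, 4))` instead gives a line with a negative gradient, where each +1 step in x lowers y by 20.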



Example y = 10 – 20x

Suppose that our equation has a negative b value, so a negative slope. If we calculate the corresponding values of y for x between –3 and 3, we can see that as x increases by +1, y decreases by 20.

x               –3    –2    –1     0     1     2     3
y = 10 – 20x    70    50    30    10   –10   –30   –50

For example, when x = –3, y = 10 – 20(–3) = 10 – (–60) = 10 + 60 = 70, and when x = –2, y = 10 – 20(–2) = 10 + 40 = 50. Each step of +1 in x produces a step of –20 in y.

Using these points we can plot the graph for y = 10 – 20x. The line slopes downwards, telling us that there is a negative relationship between x and y: as x increases, y decreases.

[Figure: graph of y = 10 – 20x, a straight line sloping downwards]



Manipulating linear equations

Often we see equations in the form

P = cx + dy

Examples
21 = 15x + 3y
100 = 20x + 10y

We can rearrange these equations by first taking cx from both sides, and then dividing by d. So for the equation 100 = 20x + 10y we first take 20x from both sides, and then divide by 10. So,

100 – 20x = 20x + 10y – 20x

The – 20x and the + 20x terms cancel out leaving 100 – 20x = 10y. Dividing by 10 then gives y = 10 – 2x.

So although the equations look different at first sight, they are just different arrangements of each other.

Now go and do Exercise M7.1


Rearrange the following into the form y = a + bx.

i) 600 = 6x + 10y

ii) 700 = 10x + 7y

iii) 18 = 3x + 2y


Drawing Graphs

All we need to draw a line is two points which the line passes through. We say that two points fix a line. Any two points will do. The easiest two points to calculate are the points where the line crosses the axes. The equation of the line below is 600 = 6x + 10y or y = 60 – 0.6x.

[Figure: graph of y = 60 – 0.6x, with the y-axis crossing (x = 0, y = ?) and the x-axis crossing (y = 0, x = ?) marked]

STEP 1

Find where the line cuts the x-axis (where y = 0). Let y = 0 in your equation, and then rearrange it to find the x value.

Example Suppose our linear equation is 600 = 6x + 10y. Let y = 0. Then 600 = 6x + 10(0) = 6x + 0 = 6x. Divide by 6 to find x: x = 600 ÷ 6 = 100. This gives us the co-ordinates of our first point: x = 100, y = 0. Remembering that co-ordinates are written in the form (x, y), this is (100, 0). So the line crosses the x-axis at (100, 0).



STEP 2

Find where the line cuts the y-axis (where x = 0). Let x = 0 in your equation, and then rearrange it to find the y value.

Example Our linear equation is 600 = 6x + 10y. Let x = 0. Then 600 = 6(0) + 10y = 0 + 10y = 10y. Divide by 10 to find y: y = 600 ÷ 10 = 60. This gives us the co-ordinates of our second point: x = 0, y = 60, which is (0, 60). So the line crosses the y-axis at (0, 60).

STEP 3

Mark these two points on the axes of your graph and join them up with a straight line.

[Figure: graph of y = 60 – 0.6x, drawn through the points (0, 60) and (100, 0)]
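Steps 1 and 2 can be sketched in code as follows. The function name `axis_crossings` is illustrative, and the sketch assumes an equation of the form P = cx + dy with both c and d non-zero.

```python
from fractions import Fraction

def axis_crossings(P, c, d):
    """For the line P = c*x + d*y, return the points where it cuts the axes.

    Assumes c and d are both non-zero.
    """
    x_axis_point = (Fraction(P, c), Fraction(0))   # STEP 1: set y = 0, so x = P / c
    y_axis_point = (Fraction(0), Fraction(P, d))   # STEP 2: set x = 0, so y = P / d
    return x_axis_point, y_axis_point

x_cross, y_cross = axis_crossings(600, 6, 10)       # the line 600 = 6x + 10y
print(x_cross, y_cross)
```

Exact fractions are used so that crossings such as 20/3 are not rounded before plotting.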



Find two points that satisfy 2x + 3y = 18, and draw the graph of this linear equation.

Use the steps above to draw the straight line graph of 2x + 3y = 18.


Finding the intersection of two straight lines.



After drawing the lines representing two linear equations on the same graph you may find that they intersect as above. To find the co-ordinates of the point where the lines meet we substitute one equation into the other and then solve.

Simultaneous Equations

If there are two unknown variables that we want to find, we need two equations, known as simultaneous equations, in order to do so. There are two methods of solving simultaneous equations: elimination or substitution.

Solving simultaneous equations by elimination.

Equation 1: 2x + 5y = 20
Equation 2: 3x – 2y = 11

STEP 1

Eliminate x by making the coefficients of x the same in both equations. In this example, we can multiply the first equation by 3 and the second by 2:

3(2x + 5y) = 3 × 20 so 6x + 15y = 60
2(3x – 2y) = 2 × 11 so 6x – 4y = 22

STEP 2

Subtract one equation from the other to eliminate x:

(6x + 15y) – (6x – 4y) = 60 – 22

The terms involving x cancel out and so we have 19y = 38. Dividing by 19, y = 38/19 = 2.

STEP 3

Substitute this value of y into one of the equations: 2x + 5y = 20, so substituting in y = 2 gives

2x + 5(2) = 20
2x + 10 = 20 ⇒ 2x = 10 ⇒ x = 5.
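The three elimination steps can be written as a short function. This is a sketch rather than a general-purpose solver: it assumes each equation is given as a hypothetical triple (a, b, r) meaning ax + by = r, and that the x coefficient of the first equation is non-zero.

```python
from fractions import Fraction

def solve_by_elimination(eq1, eq2):
    """Solve a1*x + b1*y = r1 and a2*x + b2*y = r2 by eliminating x."""
    a1, b1, r1 = (Fraction(v) for v in eq1)
    a2, b2, r2 = (Fraction(v) for v in eq2)

    # STEP 1: multiply each equation by the other's x coefficient
    s1 = (a1 * a2, b1 * a2, r1 * a2)
    s2 = (a2 * a1, b2 * a1, r2 * a1)

    # STEP 2: subtract so that the x terms cancel, then solve for y
    y_coefficient = s1[1] - s2[1]
    if y_coefficient == 0:
        raise ValueError("the lines are parallel or identical")
    y = (s1[2] - s2[2]) / y_coefficient

    # STEP 3: substitute y back into the first equation to find x
    x = (r1 - b1 * y) / a1
    return x, y

x, y = solve_by_elimination((2, 5, 20), (3, -2, 11))
print(x, y)
```

For the example above the scaled equations are 6x + 15y = 60 and 6x – 4y = 22, matching Step 1.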



Try solving the equations above again, but eliminate the variable y first.




Solving Simultaneous Equations by Substitution

600 = 6x + 10y Label this (1)
500 = 10x + 5y Label this (2)

STEP 1

Rearrange (1) into the form y = a + bx.

Divide by 10:

$$\frac{600}{10} = \frac{6x}{10} + \frac{10y}{10}$$

which is 60 = 0.6x + y. Take 0.6x from both sides: 60 – 0.6x = 0.6x + y – 0.6x = y. So

60 – 0.6x = y. Label this (3)

STEP 2

Substitute (3) into (2) for y and solve.

500 = 10x + 5y = 10x + 5(60 – 0.6x) = 10x + 300 – 3x = 7x + 300

$$x = \frac{500 - 300}{7} = \frac{200}{7} = 28\tfrac{4}{7}$$

STEP 3

Substitute this value for x into (2) to find the corresponding y value.

500 = 10(28 4/7) + 5y
100 = 2(28 4/7) + y (dividing by 5)

Rearranging gives y = 100 – 2(28 4/7) = 42 6/7.
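The substitution steps for this pair of equations can be checked with exact fractions. This sketch follows the same three steps; the only difference is that the final y value is found from the rearranged equation y = 60 – 0.6x, which is equivalent to substituting into (2).

```python
from fractions import Fraction

# (1) 600 = 6x + 10y      (2) 500 = 10x + 5y

# STEP 1: rearrange (1) into y = a + b*x by dividing through by 10
a = Fraction(600, 10)          # 60
b = Fraction(-6, 10)           # -0.6

# STEP 2: substitute y = a + b*x into (2): 500 = 10x + 5(a + b*x)
#         so (10 + 5b)x = 500 - 5a
x = (500 - 5 * a) / (10 + 5 * b)

# STEP 3: use the x value to find the corresponding y value
y = a + b * x

print(x, y)                    # exact fractions
print(float(x), float(y))      # decimal approximations
```

The decimal approximations are what you would expect to read, roughly, from a carefully drawn graph.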



Estimating points of intersection using the graphical method

Finding the co-ordinates by solving the simultaneous equations is exact, but not always straightforward. An alternative method is to find the co-ordinates of the intersection approximately by reading them off the graph as accurately as possible. This method is not exact, but is simpler. The accuracy of the co-ordinates depends on how well you have drawn your graph.

[Figure: the lines 500 = 10x + 5y and 600 = 6x + 10y plotted on the same axes]

From the graph, the intersection has co-ordinates of approximately (28, 42).

Inequalities

We will be using statements like

y < a + bx where < means “less than”
y ≤ a + bx where ≤ means “less than or equal to”
y > a + bx where > means “greater than”
y ≥ a + bx where ≥ means “greater than or equal to”

Linear inequalities define regions on a graph.



Example 1 The shaded area below represents the region 2x + 3y ≤18:

[Figure: the line 2x + 3y = 18 with the region below and to the left of it shaded, extending across both axes]

We shade all points (x, y) for which 2x + 3y ≤ 18. This region is below and to the left of the line. The shading crosses the axes as there are points with negative co-ordinates that satisfy the inequality, e.g. (–1, –1), since 2(–1) + 3(–1) = –5, which is less than 18.

Example 2 The shaded area below represents the region 2x + 3y ≤ 18, with x and y positive.

[Figure: the line 2x + 3y = 18 with the region below and to the left of it shaded, bounded by the x and y axes]

We shade all positive points (x, y) for which 2x + 3y ≤ 18. This region is below and to the left of the line, but it does not cross the axes.

Example 3 The shaded area below represents the region 2x + 3y ≤ 18, with x ≤ 3.



[Figure: the lines 2x + 3y = 18 and x = 3, with the region below the first line and to the left of the second shaded]

We shade all points (x, y) for which 2x + 3y ≤ 18, where x ≤ 3. This region is below the line 2x + 3y = 18 and to the left of the line x = 3.
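Whether a particular point lies in a shaded region can be tested by checking each inequality in turn. The sketch below is illustrative only: the helper `in_region` is an assumption, and "x and y positive" is interpreted as x ≥ 0 and y ≥ 0 so that points on the axes count as part of the region.

```python
def in_region(x, y, constraints):
    """True if the point (x, y) satisfies every inequality in the list."""
    return all(check(x, y) for check in constraints)

# Example 1: 2x + 3y <= 18 only
example_1 = [lambda x, y: 2 * x + 3 * y <= 18]

# Example 2: the same inequality with x and y not negative (assumed to mean >= 0)
example_2 = example_1 + [lambda x, y: x >= 0, lambda x, y: y >= 0]

# Example 3: the same inequality with x <= 3
example_3 = example_1 + [lambda x, y: x <= 3]

print(in_region(-1, -1, example_1))   # the point used in Example 1
print(in_region(-1, -1, example_2))
print(in_region(4, 1, example_3))
```

A point belongs to the region only if every condition holds at once, which is why adding conditions can only shrink the shaded area.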




Seminar Questions


Draw the graphs for the following linear equations:

i) 5x + 12y = 2400

ii) 3x + 4y = 1200

iii) x + 5y = 800

Also find the points where the equations above meet. You can use simultaneous equations or estimate the points from your graph.



Shade the following regions:

i) 5x + 12y ≤ 2400

ii) 3x + 4y ≤ 1200

iii) x + 5y ≤ 800



Shade the regions (from Seminar Question M7.2) with the additional condition that x and y are both positive.



Shade the region that satisfies the following conditions simultaneously:

5x + 12y ≤ 2400, 3x + 4y ≤ 1200, x + 5y ≤ 800, and x & y positive.






• how to find the co-ordinates of the points where a linear equation crosses the axes;
• plot linear equations;
• evaluate points of intersection for two linear equations by solving simultaneous equations or by estimating these graphically;
• how to identify regions that satisfy inequalities.

Extra Activities






M8 Linear Programming and Optimisation


Linear Programming and Optimisation

Context

In this unit we investigate methods for solving linear programming problems. We will find optimal solutions, either maximising profit or minimising cost, by applying the mathematical skills from the previous unit to linear business problems. We will consider optimisation problems involving two variables. Often, linear programming is used to investigate the optimal allocation of certain resources to maximise profit. For example, a plastics manufacturer makes washing up bowls and bins and wants to find how many of each type should be made to maximise the profit given various constraining factors such as the amount of plastic available. By the end of this unit you will be able to identify the number of each product to make to produce the maximum profit.

Objectives


• formulate linear programming problems;
• graph the constraint inequalities and identify the feasible region;
• find an optimal solution to the problem;
• evaluate the utilisation of resources for points within the feasible region.

Profit Lines

Suppose a company is providing two products; eg kitchen bins and washing up bowls. The profit on a kitchen bin is £3 and the profit on a washing up bowl is £2. Let x = number of bins produced, and y = number of bowls produced, then the Profit function is, P = 3x + 2y. How do we graph this?



a) Use three axes. It is difficult to show a 3 dimensional picture on 2 dimensional paper.

Example P = 3x + 2y

[Figure: three-dimensional plot of P = 3x + 2y against x and y]

b) Use 2 axes, x and y, and draw the profit function for particular values of P.

Example If P = £20 our linear equation is 20 = 3x + 2y, which passes through the points (0, 10) and (6.67, 0). If P = £60 our linear equation is 60 = 3x + 2y, which passes through the points (0, 30) and (20, 0).

[Figure: the profit lines 20 = 3x + 2y and 60 = 3x + 2y plotted on the same axes]

The profit lines are all parallel and as the profit increases the line moves further up the graph.

Gradient of a General Profit Line

Let c and d be constants, then our profit function is P = cx + dy. Taking cx from both sides we have dy = –cx + P. Then

$$y = -\frac{c}{d}x + \frac{P}{d}$$

The gradient of the profit line is $-\frac{c}{d}$, so for our example, P = 3x + 2y, the gradient is –3/2.
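A short sketch can confirm that profit lines for different values of P share the same gradient and differ only in where they cut the y axis. The function name `profit_line` is illustrative, and d is assumed to be non-zero.

```python
from fractions import Fraction

def profit_line(P, c, d):
    """For P = c*x + d*y, return (gradient, y_intercept) of y = -(c/d)x + P/d."""
    gradient = Fraction(-c, d)
    y_intercept = Fraction(P, d)
    return gradient, y_intercept

gradient_20, intercept_20 = profit_line(20, 3, 2)   # the line 20 = 3x + 2y
gradient_60, intercept_60 = profit_line(60, 3, 2)   # the line 60 = 3x + 2y

print(gradient_20, intercept_20)
print(gradient_60, intercept_60)
```

Equal gradients mean the lines are parallel, and the larger intercept for P = 60 shows that line sitting further up the graph.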




Linear Programming

Linear programming is concerned with maximising (or minimising) some linear objective function (e.g. profit function), subject to some constraints on x and y. This means finding values for x and y which maximise the objective function and which satisfy the constraints. There are several ways of solving this problem.

Example: Maximising Profit

A manufacturer makes two products, X and Y. Each X requires 5 hours in the assembly department, 3 hours in the spraying department and 1 hour in the finishing department. For Y, the time required in each of these departments is 12 hours, 4 hours and 5 hours respectively. The total weekly hours available in each department are 2400, 1200 and 800. If the profits are £30 on each X and £100 on each Y, what is the maximum profit output?



             X (time in hrs)   Y (time in hrs)   Total time available (hrs)
ASSEMBLY           5                 12                  2400
SPRAYING           3                  4                  1200
FINISHING          1                  5                   800
PROFIT (£)        30                100

How do we express this as a linear programme?

Linear Programme

We want to maximise profit, P = 30x + 100y, subject to our constraints:

Assembly 5x + 12y ≤ 2400
Spraying 3x + 4y ≤ 1200
Finishing x + 5y ≤ 800

We have the additional constraints that x and y are positive (x ≥ 0, y ≥ 0).

Each constraint tells us how resources are used within the maximum resources we have available. In the Assembly department the time used for a given level of output (x, y) is 5x + 12y. So we know that for any solution to our problem the amount used is less than or equal to the amount available, 2400. Any difference between what we have used and what we have available is spare. We will evaluate this later.

How do we find the optimal profit point?

So how do we find the values for x and y that maximise P?



STEP 1

Find the region that satisfies our constraints by drawing a graph. So we must plot each constraint equation onto our graph.

Assembly: 5x + 12y = 2400 This line passes through (0, 200) and (480, 0).
Spraying: 3x + 4y = 1200 This line passes through (0, 300) and (400, 0).
Finishing: x + 5y = 800 This line passes through (0, 160) and (800, 0).

STEP 2

We now shade the area on our graph which satisfies our constraint inequalities simultaneously. This produces our feasible region, the set of possible answers to our linear programme.

[Figure: the Finishing, Spraying and Assembly constraint lines with the feasible region shaded]

The Optimum point is the point (x, y) that is in the feasible region and which maximises the objective function.



STEP 3

To find the optimum point you can either compare slopes of the edges of the feasible region with the slope of the objective function, or find the co-ordinates of the points A, B, C, and D, then calculate the profit at each of these points to see which is the maximum value.

[Figure: the Finishing, Assembly and Spraying constraint lines with the corners of the feasible region labelled A, B, C and D]




Method of Slopes

The slopes of the objective function and the constraints are calculated and compared to determine the optimum point. The slope or gradient of an equation of the form P = cx + dy is given by $-\frac{c}{d}$. So if we calculate the slopes of the constraints (finishing, assembly and spraying) and compare them to the slope of the profit function we will be able to identify the optimum point.

Profit function P = 30x + 100y Slope = –30/100 = –0.3.
Finishing: x + 5y = 800 Slope = –1/5 = –0.2.
Assembly: 5x + 12y = 2400 Slope = –5/12 = –0.42.
Spraying: 3x + 4y = 1200 Slope = –3/4 = –0.75.

Now –0.3 is between –0.2 and –0.42, and so the optimum point is the intersection of the Finishing and Assembly lines, the point marked B on our graph. Next we must find the co-ordinates of B. We can either estimate the co-ordinates from the graph or use simultaneous equations for an exact solution.

So the Optimum Point = $\left(184\tfrac{8}{13},\ 123\tfrac{1}{13}\right)$ = (184.6, 123.1), which produces a maximum profit of 30(184.6) + 100(123.1) = £17,846.15 (calculated from the unrounded co-ordinates).



Calculating Profit Method

First find the co-ordinates of A, B, C, D. We know A= (0, 160) and D = (400, 0) from the graph. We then have to solve two sets of simultaneous equations to find the co-ordinates of B and C, or estimate their co-ordinates by reading them from the graph. Once we have the co-ordinates we substitute them into the profit function and find the point that produces the maximum profit.

Point   Co-ordinates (x, y)   Profit = P = 30x + 100y
A       (0, 160)              30(0) + 100(160) = £16,000
B       (184.6, 123.1)        30(184.6) + 100(123.1) = £17,846.15
C       (300, 75)             30(300) + 100(75) = £16,500
D       (400, 0)              30(400) + 100(0) = £12,000

We clearly see that point B is the optimal point, since it produces the maximum profit.
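The calculating profit method can be sketched in code by finding every intersection of two constraint lines, discarding those outside the feasible region, and evaluating the profit at the corners that remain. This is an illustrative sketch only: the (a, b, limit) representation of a constraint ax + by ≤ limit and the helper names are assumptions, and it also finds the origin as a fifth corner alongside A, B, C and D.

```python
from fractions import Fraction
from itertools import combinations

# Each constraint is (a, b, limit), meaning a*x + b*y <= limit
constraints = [
    (5, 12, 2400),   # Assembly
    (3, 4, 1200),    # Spraying
    (1, 5, 800),     # Finishing
    (-1, 0, 0),      # x >= 0, written as -x <= 0
    (0, -1, 0),      # y >= 0, written as -y <= 0
]

def intersection(c1, c2):
    """Point where the two boundary lines meet, or None if they are parallel."""
    (a1, b1, r1), (a2, b2, r2) = c1, c2
    det = a1 * b2 - a2 * b1
    if det == 0:
        return None
    return Fraction(r1 * b2 - r2 * b1, det), Fraction(a1 * r2 - a2 * r1, det)

def feasible(point):
    x, y = point
    return all(a * x + b * y <= limit for a, b, limit in constraints)

def profit(point):
    x, y = point
    return 30 * x + 100 * y

corners = set()
for c1, c2 in combinations(constraints, 2):
    point = intersection(c1, c2)
    if point is not None and feasible(point):
        corners.add(point)

best = max(corners, key=profit)
print(best, float(profit(best)))
```

Exact fractions avoid the small discrepancy that arises when the rounded co-ordinates (184.6, 123.1) are substituted into the profit function.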

Optimal Solution

Using either method we find that point B is the optimal solution to our linear programme and that we should produce 184 of product X and 123 of product Y to maximise profit.




Utilisation of Resources

Once we have an optimal solution we can assess how fully we are using the resources available at each corner of the feasible region.

Point (x, y)       Resource    Used                                             Available   Spare = Available – Used
A (0, 160)         Finishing   x + 5y = 0 + 5(160) = 800 (fully used)              800      800 – 800 = 0
                   Assembly    5x + 12y = (5×0) + (12×160) = 1920                 2400      2400 – 1920 = 480
                   Spraying    3x + 4y = (3×0) + (4×160) = 640                    1200      1200 – 640 = 560
B (184.6, 123.1)   Finishing   Fully used                                          800      0
                   Assembly    Fully used                                         2400      0
                   Spraying    3x + 4y = (3×184.6) + (4×123.1) = 1046.2           1200      1200 – 1046.2 = 153.8
C (300, 75)        Finishing   x + 5y = 300 + (5×75) = 675                         800      800 – 675 = 125
                   Assembly    Fully used                                         2400      0
                   Spraying    Fully used                                         1200      0
D (400, 0)         Finishing   x + 5y = 400 + (5×0) = 400                          800      800 – 400 = 400
                   Assembly    5x + 12y = (5×400) + (12×0) = 2000                 2400      2400 – 2000 = 400
                   Spraying    Fully used                                         1200      0
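The utilisation calculation is the same at every point, so it can be sketched as one small function. The dictionary layout and the name `utilisation` are illustrative assumptions.

```python
# Hours needed per unit of X, per unit of Y, and hours available, by department
resources = {
    "Finishing": (1, 5, 800),
    "Assembly": (5, 12, 2400),
    "Spraying": (3, 4, 1200),
}

def utilisation(x, y):
    """Return {department: (used, available, spare)} at the output level (x, y)."""
    table = {}
    for name, (per_x, per_y, available) in resources.items():
        used = per_x * x + per_y * y
        table[name] = (used, available, available - used)
    return table

at_A = utilisation(0, 160)
at_C = utilisation(300, 75)

for name, (used, available, spare) in at_C.items():
    print(name, used, available, spare)
```

A spare value of zero marks a resource that is fully used at that point.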



Example: Minimising Cost

A health food manufacturer wishes to blend two kinds of food so that the package can claim that a 100g portion will contain enough of two particular vitamins to meet the daily requirement for good health.

The requirements are:

Vitamin A at least 250 units and Vitamin B at least 225 units.

The vitamin content (in units) of each g of the two foods is shown below.

         Food 1   Food 2
Vit A      2        5
Vit B      3        2

If food 1 costs 0.5p per g, and food 2 costs 0.4p per g, how much of each food should be used in each 100g portion to give enough of vitamins A and B at minimum cost?

How do we express this as a linear programme?

Let the manufacturer blend x g of Food 1 with y g of Food 2. This will help us construct our cost function and constraint inequalities.

We want to minimise the cost of the food portion. Since Food 1 costs 0.5p per g and Food 2 costs 0.4p per g, and as our portion is made up of x g of Food 1 and y g of Food 2, the cost of a portion (in pence) is

Cost = C = 0.5x + 0.4y

We know that each portion's weight must be at least 100g. So the weight of Food 1 (x) plus the weight of Food 2 (y) must be at least 100g. This produces

x + y ≥ 100

Also we know that the Vitamin A content of the portion must be at least 250 units. So the Vitamin A content of Food 1 plus the Vitamin A content of Food 2 must be at least 250 units. This produces

2x + 5y ≥ 250

Similarly the Vitamin B content of the portion must be at least 225 units. So the Vitamin B content of Food 1 plus the Vitamin B content of Food 2 must be at least 225 units. This produces

3x + 2y ≥ 225

Common sense tells us that the quantities we use of each food cannot be negative. This gives x ≥ 0, y ≥ 0.

Linear Programme

We want to minimise Cost = 0.5x + 0.4y subject to the following constraints,

x + y ≥ 100 (1) 100g portion
2x + 5y ≥ 250 (2) Vitamin A requirement
3x + 2y ≥ 225 (3) Vitamin B requirement
x ≥ 0, y ≥ 0

To find the region that satisfies our constraints we draw the graph of our constraints. So we must plot the boundary line of each constraint onto our graph.

Constraint (1) x + y = 100 This line passes through (0, 100) and (100, 0).
Constraint (2) 2x + 5y = 250 This line passes through (0, 50) and (125, 0).
Constraint (3) 3x + 2y = 225 This line passes through (0, 112.5) and (75, 0).



Method of Slopes

The slopes of the objective function and the constraints are calculated and compared to determine the optimum point. The slope or gradient of an equation of the form C = cx + dy is given by $-\frac{c}{d}$. So if we calculate the slopes of constraints (1), (2) and (3) and compare them to the slope of the cost function we will be able to identify the optimum point.

Cost = C = 0.5x + 0.4y Slope = –0.5/0.4 = –1.25
Constraint (1) x + y = 100 Slope = –1/1 = –1
Constraint (2) 2x + 5y = 250 Slope = –2/5 = –0.4
Constraint (3) 3x + 2y = 225 Slope = –3/2 = –1.5

Now –1.25 is between –1 and –1.5, and so the optimum point is the intersection of (1) and (3), point B. We can either estimate the co-ordinates from the graph or use simultaneous equations for an exact solution. So the Optimum Point = (25, 75), which produces a minimum cost of 0.5(25) + 0.4(75) = 42.5p. So the manufacturer should blend 25 g of Food 1 with 75 g of Food 2 to achieve a minimum cost of 42.5p per 100g.



Calculating the cost method

First find the co-ordinates of A, B, C, D. Once we have the co-ordinates we substitute them into the cost function and find the point that produces the minimum cost.

Point   Co-ordinates (x, y)   Cost = C = 0.5x + 0.4y
A       (0, 112.5)            0.5(0) + 0.4(112.5) = 45p
B       (25, 75)              0.5(25) + 0.4(75) = 42.5p
C       (83⅓, 16⅔)            0.5(83⅓) + 0.4(16⅔) = 48.33p
D       (125, 0)              0.5(125) + 0.4(0) = 62.5p

We clearly see that point B is the optimal point, since it produces the minimum cost.

Optimal Solution

Using either method we find that point B is the optimal solution to our linear programme. So the manufacturer should blend 25 g of Food 1 with 75 g of Food 2 to achieve a minimum cost of 42.5p per 100g, subject to our constraints.
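The calculating the cost method can be checked with a short script that evaluates the cost function at each corner and keeps the cheapest feasible one. This is a sketch under stated assumptions: the corner co-ordinates are taken as given in the notes, the constraints are those listed in the Linear Programme, and the names used are illustrative.

```python
from fractions import Fraction

# Corner points of the feasible region, as (x, y) = (g of Food 1, g of Food 2)
corners = {
    "A": (Fraction(0), Fraction(225, 2)),
    "B": (Fraction(25), Fraction(75)),
    "C": (Fraction(250, 3), Fraction(50, 3)),
    "D": (Fraction(125), Fraction(0)),
}

def cost(x, y):
    """Cost of the portion in pence: 0.5p per g of Food 1, 0.4p per g of Food 2."""
    return Fraction(1, 2) * x + Fraction(2, 5) * y

def feasible(x, y):
    """Check constraints (1), (2), (3) and the non-negativity conditions."""
    return (x + y >= 100 and 2 * x + 5 * y >= 250
            and 3 * x + 2 * y >= 225 and x >= 0 and y >= 0)

costs = {name: cost(x, y) for name, (x, y) in corners.items() if feasible(x, y)}
best = min(costs, key=costs.get)

for name, value in costs.items():
    print(name, float(value))
print("cheapest corner:", best)
```

Because this is a minimisation problem, the script takes the smallest cost rather than the largest.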

Exercise M8.1

Try calculating the utilisation of resources at the optimal point for the above example.




Seminar Questions


The table below gives information about a bed manufacturer, Sleepy Nights Ltd, who make and distribute single and double beds. Each manufactured item goes through 3 departments, assembly, testing and distribution.

Time for activity in mins

               Singles (x)   Doubles (y)   Total time available
Assembly            9            12              7200
Testing            10            10              6500
Distribution       12             6              6000
Profit            150           100

a) Formulate the above information as a linear programming problem, and write down your objective function.

b) Draw a graph of the constraint inequalities.

c) Find the optimum profit point, indicate this on your graph and evaluate the maximum profit.

d) Describe the utilisation of resources at the optimum point.


The table below gives information about a car manufacturer, Citrus Cars Ltd, for two models the AM and the CM. Each car must go through 3 departments, assembly, testing and distribution.

Time for activity in mins

               AM (x)   CM (y)   Total time available
Assembly         50       30           3000
Testing         100      100           7000
Distribution     60      110           6600
Profit         2500     3500



a) Formulate the above information as a linear programming problem, and write down your objective function.

b) Draw a graph of the constraint inequalities.

c) Find the optimum profit point, indicate this on your graph and evaluate the maximum profit.

d) Describe the utilisation of resources at the optimum point.

e) If the profit of the AM is increased to £4500 how would this affect the optimum point?




• the formulation of the profit function from the information provided;
• the formulation of the constraint inequalities from the information provided;
• graph the constraint inequalities and identify the feasible region;
• identification of the optimal point, and calculate the maximum profit achieved;
• how to evaluate the utilisation of resources.

Extra Activities






M9 Time Series Analysis


Time Series Analysis

Context

At the beginning of the course we examined methods used to collect data. We then represented our results graphically, and summarised the data. We also used regression analysis to describe any linear relationships between variables. The regression methods used are appropriate when considering a causal relationship between variables. Having estimated the regression model it can be used to estimate/forecast/predict a value of the response variable, y, for a known value of the explanatory variable, x. Thus, we could use such models to forecast the sales of a commodity given we know the advertising expenditure etc.

In this unit we will look at how the relationship between a variable (such as sales, demand, price of materials, labour costs) and time can be analysed so as to predict future values of the variable. The generation of reliable estimates of future values is crucial for planning purposes in many businesses. Time series analysis involves consideration of historical data to obtain estimates or forecasts of future values based on past values.

Objectives


• know what a time series is;
• construct a time series graph;
• describe the components of a time series;
• estimate the trend using regression and moving average techniques;
• find seasonal variations using the additive and multiplicative models;
• forecast by extrapolating a trend and adjusting for seasonal variation.




What is a Time Series?

A series of values taken over a time period is referred to as a time series. For our purposes we will assume the data was recorded at regular time intervals. The following are examples of time series.

• Financial Times Index recorded daily for the last 5 years
• Daily air pollution levels in London for the last month
• Number of road deaths in the U.K. recorded weekly for the last 10 years
• Monthly sales of a company over the last 2 years
• Total annual costs of production for a company for the last 10 years.

Now go and do Exercise M9.1


How many observations are there in each of the above time series?


Aims of Time Series Analysis

• To identify patterns in the data. E.g. electricity bills will be high in winter and low in summer

• To gain an understanding of the variation in the data, both in the long and the short term. E.g. perhaps in the short term the number of road deaths fluctuates due to the weather conditions, but in the long term the number of road deaths is increasing since the number of cars on the road is increasing.

Time Series Plot

A graph of a time series is produced by plotting our variables of interest against time. The horizontal axis represents time and the vertical axis represents the values of the data recorded. The graph is very useful in identifying characteristics of the data.



Consider the following time series.

Year 1998 1999 2000 2001 2002 2003 2004

Sales (£000s) 20 21 24 23 27 30 28

A graph of this data is as follows:

[Figure: time series plot of Sales (£000s) against Year.]


Components of a Time Series

In order to consider the behaviour of such time series it is useful to separate the values into a number of components.

Trend (T)

The trend is the underlying long term movement over time in the value of the time series data.



Example

In the following three time series there are three types of trend which are immediately apparent in the time series graphs.

Year   Series A: Output per labour hour (units)   Series B: Cost per unit (£)   Series C: Number of employees

1999 30 1.00 100

2000 24 1.08 103

2001 26 1.20 96

2002 22 1.15 102

2003 21 1.18 103

2004 17 1.25 98

Time series A (output)

[Figure: time series plot of Series A, output per labour hour (units), 1999 to 2004.]

There is a downward trend in the output per labour hour. Output per labour hour did not fall every year (it went up between 2000 and 2001), but the long term movement (trend) is clearly a downward one.



Time series B (cost)

[Figure: time series plot of Series B, cost per unit (£), 1999 to 2004.]

There is an upward trend in the cost per unit. Although costs went down in 2002 from a higher level in 2001, the basic movement over time is one of rising costs.

Time series C (number of employees)

[Figure: time series plot of Series C, number of employees, 1999 to 2004.]

There is no clear movement up or down, and the number of employees remained fairly constant around 100. The trend is therefore a static or level one.



Seasonal Variations (S)

These are short-term fluctuations in recorded values due to circumstances which affect results at different times. Seasonal is a term which may appear to refer to seasons of the year but its meaning in time series analysis is somewhat broader as the following examples show.

• Daily seasons: the data could be the number of patients recorded daily in a casualty department. There would be more patients on the weekends.

• Monthly seasons: there would be more cold / flu cases in the winter months than the summer months.

• Quarterly seasons: electricity bills arrive quarterly and the winter quarters tend to have higher bills.

Other seasonal examples

• Sales of ice cream will be higher in the summer than in the winter, and sales of overcoats will be higher in the autumn than in the spring.

• The telephone network may be heavily used at certain times of the day and much less at other times.

Cyclical Variations

Cyclical variations are medium term changes in results caused by circumstances which repeat in cycles. These variations could cause the data to be below (or above) the trend line for periods of longer than one year. In business, cyclical variations are commonly associated with economic cycles, successive booms and slumps in the economy. Cyclical variations are longer term than seasonal variations.

Residual Variation (R)

These are other factors causing variation which cannot be explained by the trend or seasonal components: for example, measurement error. ENDSECTION STARTSECTION=content_4.htm= SECTION~



Finding the Trend (T)

There are three principal methods of finding a trend.

Inspection

The trend can be drawn by eye on a graph in such a way that it appears to lie evenly between the recorded data points.

Regression Analysis

This method makes the assumption that the trend line, whether up or down, is a straight line. Periods of time (such as years for the data in the trend examples A, B and C) are numbered commonly from 1 and the regression line of the data on these period numbers is found. That line is then taken to be the trend.

Moving Averages

This method attempts to remove seasonal variations by a process of averaging.

Example: Associate Petroleum Inc

The table below shows the volume of heating oil sold by Associate Petroleum Inc in the Eastern European sector over the period 1994-1997. The figures give the number of barrels of oil, in thousands, sold during each 4 month period during these years.

Sales of heating oil (1000 barrels)

Year Jan – Apr May – Aug Sep – Dec

1994 35 15 42

1995 36 19 44

1996 41 22 47

1997 45 26 52



A graph of this data is as follows:

[Figure: time series plot of Sales (y), in 1000 barrels, for each period from Jan – Apr '94 to Sep – Dec '97.]

There is a clear upwards trend in the data. Furthermore, there is a seasonal effect: the May – Aug values each year are below the trend and are lower than the Jan – Apr and Sep – Dec values.




Regression Estimate of the Trend

Setting up the Data

To estimate the trend using regression analysis, the time index (e.g. 1994, Jan – Apr) needs to be replaced by a number. Remember that the variable of interest, sales, is the response y. This gives the following data set.

Year Period Time point (x) Sales (y)

1994 Jan – Apr 1 35

May – Aug 2 15

Sep – Dec 3 42

1995 Jan – Apr 4 36

May – Aug 5 19

Sep – Dec 6 44

1996 Jan – Apr 7 41

May – Aug 8 22

Sep – Dec 9 47

1997 Jan – Apr 10 45

May – Aug 11 26

Sep – Dec 12 52



Show that

∑x = 78, ∑x² = 650, ∑y = 424, ∑y² = 16586, ∑xy = 2940.

Hence show that the regression line of sales (y) on time (x) is

y = 26.97 + 1.2867x.
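The sums and the regression line can be checked with a short calculation. The following is a minimal sketch in Python, assuming the usual least-squares formulae for the slope and intercept from the Correlation and Regression units; the function name regression_line is our own choice for illustration.

```python
# Sales (1000 barrels) for the 12 four-month periods, 1994 to 1997
sales = [35, 15, 42, 36, 19, 44, 41, 22, 47, 45, 26, 52]
time_points = list(range(1, len(sales) + 1))  # x = 1, 2, ..., 12


def regression_line(x, y):
    """Return (intercept a, slope b) of the least-squares line y = a + b*x."""
    n = len(x)
    sum_x = sum(x)
    sum_y = sum(y)
    sum_x2 = sum(xi * xi for xi in x)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    # Slope: b = (n*Sxy - Sx*Sy) / (n*Sxx - Sx^2); intercept: a = mean(y) - b*mean(x)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * sum_x / n
    return a, b


a, b = regression_line(time_points, sales)
print(f"y = {a:.2f} + {b:.4f}x")  # prints "y = 26.97 + 1.2867x"
```

Working from the sums rather than a library routine mirrors the hand calculation asked for above.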





• Forecast the value of the trend for the May – Aug period of 1998.

Hint: what would the value of x be for the May – Aug period of 1998?

• What is the problem with using the trend forecast for the May – Aug period of 1998 as the forecast of sales for that period?

The graph clearly shows that the data has a seasonal component. This means that the data varies about the trend line in a way that can be described by the time period, or season, that the data point relates to. Jan – Apr and Sep – Dec values are above the trend line and May – Aug values are below the trend line. Therefore, the forecast must also take into account this seasonal aspect of the data.




Moving Average Method

Moving averages for the values of a time series are arithmetic means of successive and overlapping values taken n at a time. The number of values, n, used to calculate the moving average is called the period of the moving average.

Example n = 3

Each moving total is the sum of the value in that row and the values immediately before and after it. For example, 35 + 15 + 42 = 92 and 92 ÷ 3 = 30.7.

Year Period Sales (y) Moving total Moving average

1994 Jan-Apr 35

May-Aug 15 92 30.7

Sep-Dec 42 93 31.0

1995 Jan-Apr 36 97 32.3

May-Aug 19 99 33.0

Sep-Dec 44 104 34.7

1996 Jan-Apr 41 107 35.7

May-Aug 22 110 36.7

Sep-Dec 47 114 38.0

1997 Jan-Apr 45 118 39.3

May-Aug 26 123 41.0

Sep-Dec 52
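The moving average calculation is easy to automate. Below is a minimal sketch in Python, assuming the Associate Petroleum Inc sales figures from the table earlier in this unit; the function name moving_averages is hypothetical and chosen only for illustration.

```python
def moving_averages(values, n):
    """Return the n-period moving averages of a list of values.

    The k-th average covers values[k], ..., values[k + n - 1], so for odd n
    it belongs to the middle time point of that window.
    """
    averages = []
    for start in range(len(values) - n + 1):
        window = values[start:start + n]
        averages.append(sum(window) / n)
    return averages


# Sales (1000 barrels), Jan - Apr 1994 to Sep - Dec 1997
sales = [35, 15, 42, 36, 19, 44, 41, 22, 47, 45, 26, 52]

trend = moving_averages(sales, 3)
# The first average belongs to the second period (May - Aug 1994)
for period, value in enumerate(trend, start=2):
    print(f"time point {period}: {value:.1f}")
```

A series of 12 values gives 12 – 3 + 1 = 10 moving averages, so the first and last periods have no trend value.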

Note that the period of the moving average, n, must coincide with the length of the natural cycle of the series.

Example

Quarterly data n = 4 4 period moving average

Monthly data n = 12 12 period moving average

Student shop data n = 5 5 period moving average

The time series graph with this trend line drawn on it is

[Figure: time series plot of sales (1000 barrels), Jan – Apr '94 to Sep – Dec '97, with the moving average trend line.]

Centring

If the period of the moving average is odd then the moving average is automatically centred. This means that each moving average falls at a time point at which an actual data value in the series was recorded. If n is even the moving average is not centred automatically: it falls midway between two recorded time points. As the moving average needs to coincide with the times at which the actual time series values were recorded, an even period moving average needs to be centred by you.

The quarterly sales figures for a company are given below.



Quarter

Year First Second Third Fourth

1998 588 495 400 707

1999 612 515 416 735

2000 636 535 432 763

2001 660 555 448 791



First we calculate the moving average using n=4, and then we average again to obtain our centred trend values.

Year and Quarter   Sales y   Moving total n=4   Moving average   Trend

1998 First         588
     Second        495
                             2190               547.5
     Third         400                                           (547.5+553.5)÷2 = 550.5
                             2214               553.5
     Fourth        707                                           (553.5+558.5)÷2 = 556.0
                             2234               558.5
1999 First         612                                           560.5
                             2250               562.5
     Second        515                                           566.0
                             2278               569.5
     Third         416                                           572.5
                             2302               575.5
     Fourth        735                                           578.0
                             2322               580.5
2000 First         636                                           582.5
                             2338               584.5
     Second        535                                           588.0
                             2366               591.5
     Third         432                                           594.5
                             2390               597.5
     Fourth        763                                           600.0
                             2410               602.5
2001 First         660                                           604.5
                             2426               606.5
     Second        555                                           610.0
                             2454               613.5
     Third         448
     Fourth        791
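The two-stage calculation (moving average, then centring) can be sketched in Python as follows. This is an illustration only, assuming the quarterly sales are listed in time order from 1998 First quarter to 2001 Fourth quarter; the function name centred_moving_average is our own.

```python
def centred_moving_average(values, n):
    """Centred trend for an even period n.

    First take the n-period moving averages, then average each pair of
    adjacent moving averages so the result lines up with a recorded time point.
    """
    averages = [sum(values[i:i + n]) / n for i in range(len(values) - n + 1)]
    return [(first + second) / 2 for first, second in zip(averages, averages[1:])]


# Quarterly sales, 1998 First quarter to 2001 Fourth quarter
quarterly_sales = [588, 495, 400, 707, 612, 515, 416, 735,
                   636, 535, 432, 763, 660, 555, 448, 791]

trend = centred_moving_average(quarterly_sales, 4)
# The first trend value belongs to 1998 Third quarter, the last to 2001 Second quarter
print(trend)
```

With 16 quarters and n = 4 there are 13 moving averages and therefore 12 centred trend values, which is why the first two and last two quarters have no trend value.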

The graph below displays our time series for sales with the moving average trend also indicated.

[Figure: time series plot of Sales y and Trend against Time, 1998 First quarter to 2001 Fourth quarter.]




Finding the Seasonal Variation

Once a trend has been established, by whatever method, we can find the seasonal variations. When isolating seasonal variations we need to establish if we are using an additive or multiplicative time series model.

Additive Model

This is used when the seasonal elements are relatively constant over the complete time period being analysed. So in the following graph, the peaks of the time series graph are all the same size, a1 = a2 = a3. In such a case the time series value can be expressed as the sum of a trend and seasonal component. The standard expression describing this type of model would be:

y = T + S + R.

Example

[Figure: Additive Model. Quarterly sales, 1998 Q1 to 2001 Q4, with equal peak sizes a1, a2 and a3 marked.]

Multiplicative Model

This type of model is used when the seasonal elements change in proportion to the trend values over the complete time period being analysed. So the peaks in the following graph for the multiplicative model get bigger as the trend increases, a1 < a2 < a3. In this case the time series can be expressed as the product of a trend and a seasonal component and can be expressed as:

y = T × S × R.

Example

[Figure: Multiplicative Model. Quarterly sales, 1998 Q1 to 2001 Q4, with increasing peak sizes a1, a2 and a3 marked.]




The Additive Model, y = T + S + R.

We will illustrate finding the seasonal variations by referring to the Associate Petroleum Inc example from earlier. The data set used to fit the regression trend is

Year Period Time point (x) Sales (y)

1994 Jan – Apr 1 35

May – Aug 2 15

Sep – Dec 3 42

1995 Jan – Apr 4 36

May – Aug 5 19

Sep – Dec 6 44

1996 Jan – Apr 7 41

May – Aug 8 22

Sep – Dec 9 47

1997 Jan – Apr 10 45

May – Aug 11 26

Sep – Dec 12 52

For each of the time points we can estimate the trend value using the fitted regression line, T = 26.97 + 1.2867x.

The value of x for the first time point is 1. So the estimate of the trend for the first time point is given by substituting x = 1 in the regression equation. This gives

Trend estimate T(1) = 26.97 + (1.2867×1) = 28.2567

The value of x for the second time point is 2. So the estimate of the trend for the second time point is given by substituting x = 2 in the regression equation. This gives

Trend estimate T(2) = 26.97 + (1.2867×2) = 29.5434



This can be done for each point giving the following table.

Year Period Time point (x) Sales (y) Trend T = 26.97 + 1.2867x

1994 Jan-Apr 1 35 28.2567

May-Aug 2 15 29.5434

Sep-Dec 3 42 30.8301

1995 Jan-Apr 4 36 32.1168

May-Aug 5 19 33.4035

Sep-Dec 6 44 34.6902

1996 Jan-Apr 7 41 35.9769

May-Aug 8 22 37.2636

Sep-Dec 9 47 38.5503

1997 Jan-Apr 10 45 39.837

May-Aug 11 26 41.1237

Sep-Dec 12 52 42.4104

The additive model for time series is y = T + S + R. We can therefore write y – T = S + R. In other words, if we deduct the trend values from the time series values, we will be left with the seasonal and residual components of the time series. If we assume that the residual component is very small and hence negligible, the seasonal component can be found as S = y – T, the de-trended series.



Year Period Sales (y) Trend T = 26.97 + 1.2867x S = y – T

1994 Jan – Apr 35 28.2567 35 – 28.2567 = 6.7433

May – Aug 15 29.5434 15 – 29.5434 = –14.5434

Sep – Dec 42 30.8301 42 – 30.8301 = 11.1699

1995 Jan – Apr 36 32.1168 3.8832

May – Aug 19 33.4035 –14.4035

Sep – Dec 44 34.6902 9.3098

1996 Jan – Apr 41 35.9769 5.0231

May – Aug 22 37.2636 –15.2636

Sep – Dec 47 38.5503 8.4497

1997 Jan – Apr 45 39.837 5.163

May – Aug 26 41.1237 –15.1237

Sep – Dec 52 42.4104 9.5896

You will notice that the difference between the actual time series value and the trend value for any one time period is not the same from year to year. That is, the May – Aug seasonal effects each year are not the same. This is because y – T contains not only seasonal variations but random variations as well. So to evaluate a seasonal estimate for each time period, we average the seasonal estimates for each year corresponding to that time period. The May – Aug seasonal effect is then

S(May – Aug) = [(–14.5434) + (–14.4035) + (–15.2636) + (–15.1237)] ÷ 4 = –59.3342 ÷ 4 = –14.8336.
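The de-trending and averaging steps can be sketched in Python as below. This is an illustration under the assumptions of this example: the trend is the fitted line T = 26.97 + 1.2867x and there are three periods in each year. The names trend_value and seasonal_effect are hypothetical.

```python
# Sales (1000 barrels), time points x = 1, ..., 12
sales = [35, 15, 42, 36, 19, 44, 41, 22, 47, 45, 26, 52]
periods_per_year = 3  # Jan - Apr, May - Aug, Sep - Dec


def trend_value(x):
    """Regression trend from the notes: T = 26.97 + 1.2867x."""
    return 26.97 + 1.2867 * x


# De-trended series S = y - T for every time point
detrended = [y - trend_value(x) for x, y in enumerate(sales, start=1)]


def seasonal_effect(detrended_values, season, cycle_length):
    """Average the de-trended values for one season.

    season is the 0-based position within the year (0 = Jan - Apr,
    1 = May - Aug, 2 = Sep - Dec in this example).
    """
    same_season = detrended_values[season::cycle_length]
    return sum(same_season) / len(same_season)


may_aug_effect = seasonal_effect(detrended, 1, periods_per_year)
print(f"May - Aug seasonal effect: {may_aug_effect:.2f}")
```

Calling seasonal_effect with season 0 or 2 applies the same averaging to the other two periods.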



Exercise M9.4

Find the seasonal effects for Jan – Apr and Sep – Dec for the above example.




Forecasting with the Additive Model

Forecasting is an essential but difficult task in business. There are several mathematical techniques for producing forecasts. They will not necessarily provide reliable forecasts but they can help in making future plans. The technique we will use here consists of extrapolating a trend and then adjusting this trend for seasonal variations.

Example

Returning again to the Associate Petroleum Inc example, how would we produce a forecast for the sales of oil in May – Aug 1998?

Step 1: Estimate the trend, T.

To estimate the trend we would need to know the value of x at this future point in time to substitute into the regression equation T = 26.97 + 1.2867x. If Sep – Dec 1997 has a value of 12 for x, then May – Aug 1998 has a value of 12 + 2 = 14 as May – Aug 1998 is two time periods into the future from the last value of the given time series. This means the trend forecast for May – Aug 1998 is Trend estimate T = 26.97 + (1.2867×14) = 44.9838.

Step 2: Evaluate the seasonal component, S.

Assuming the existing pattern in the data continues, we have just calculated the seasonal component for May – Aug to be –14.8336.

Step 3: Combine the trend and seasonal components.

In this case we are using the additive model y = T + S + R. We have a forecast for the trend, T, and the seasonal component, S. We cannot isolate R but hopefully this is small. Hence the forecast is constructed by adding the trend and seasonal components together.

Forecast for the sales of oil in May – Aug 1998: y = 44.9838 + (–14.8336) = 30.1502. ENDSECTION STARTSECTION=content_10.htm= SECTION~
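The three forecasting steps for the additive model can be put together in a few lines of Python. This is a minimal sketch, assuming the regression trend T = 26.97 + 1.2867x and the May – Aug seasonal effect of –14.8336 found above; the function name additive_forecast and its parameter names are hypothetical.

```python
def additive_forecast(x, seasonal_effect, intercept=26.97, slope=1.2867):
    """Forecast y = T + S, treating the residual R as negligible.

    x is the time point of the period being forecast and seasonal_effect
    is the averaged seasonal effect S for that period.
    """
    trend = intercept + slope * x  # Step 1: extrapolate the trend
    return trend + seasonal_effect  # Steps 2 and 3: add the seasonal effect


# May - Aug 1998 is two periods after the last observation (x = 12), so x = 14
forecast = additive_forecast(14, -14.8336)
print(f"Forecast for May - Aug 1998: {forecast:.4f} thousand barrels")
```

Keeping the trend coefficients as parameters means the same function could be reused for another period by changing x and the seasonal effect.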



The Multiplicative Model, y = T×S×R

The multiplicative model for time series is y = T × S × R. We can therefore write y ÷ T = S × R. In other words, if we divide the time series values by the trend values, we will be left with the seasonal and residual components of the time series. If we assume that the residual component is very small and hence negligible, the seasonal component can be found as S = y ÷ T, the de-trended series.

Year and Quarter Sales y. Trend T y ÷ T

1998 First 588

Second 495

Third 400 550.5 400 ÷ 550.5 = 0.7266

Fourth 707 556.0 1.2716

1999 First 612 560.5 1.0919

Second 515 566.0 0.9099

Third 416 572.5 0.7266

Fourth 735 578.0 1.2716

2000 First 636 582.5 1.0918

Second 535 588.0 0.9099

Third 432 594.5 0.7267

Fourth 763 600 1.2717

2001 First 660 604.5 1.0918

Second 555 610 0.9098

Third 448

Fourth 791

You will notice that the ratio of the actual time series value to the trend value for any one quarter is not exactly the same from year to year. That is, the seasonal effects for each quarter are not the same each year. This is because y ÷ T contains not only seasonal variations but random variations as well. So to come up with a seasonal estimate for each quarter, we average the seasonal estimates for each year corresponding to that quarter.

Seasonal Effect for the First Quarter

Average all y ÷ T values corresponding to first quarter

S1 = (1.0919 + 1.0918 + 1.0918) ÷ 3 = 1.0918.

Seasonal Effect for the Second Quarter

Average all y ÷ T values corresponding to second quarter

S2 = (0.9099 + 0.9099 + 0.9098) ÷ 3 = 0.9099.

Seasonal Effect for the Third Quarter

Average all y ÷ T values corresponding to third quarter

S3 = (0.7266 + 0.7266 + 0.7267) ÷ 3 = 0.7266.

Seasonal Effect for the Fourth Quarter

Average all y ÷ T values corresponding to fourth quarter

S4 = (1.2716 + 1.2716 + 1.2717) ÷ 3 = 1.2716.

Note: The seasonal effects should add up to 4, the period of the moving average. Here the total is 3.9999. This is close enough not to worry about. You can adjust each S slightly to make the sum 4 if needed. ENDSECTION STARTSECTION=content_11.htm= SECTION~
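The seasonal effects for the multiplicative model can be computed in the same way as for the additive model, with division in place of subtraction. The following Python sketch assumes the quarterly sales and the centred moving average trend from the tables above; the function name seasonal_indices and the rescaling step (multiplying each effect by 4 ÷ total) are one possible way of making the effects sum to 4, not the only one.

```python
# Quarterly sales, 1998 First quarter to 2001 Fourth quarter
quarterly_sales = [588, 495, 400, 707, 612, 515, 416, 735,
                   636, 535, 432, 763, 660, 555, 448, 791]
# Centred trend values; the first belongs to 1998 Third quarter
trend = [550.5, 556.0, 560.5, 566.0, 572.5, 578.0,
         582.5, 588.0, 594.5, 600.0, 604.5, 610.0]
first_trend_position = 2  # 0-based position of 1998 Third quarter in the series


def seasonal_indices(values, trend_values, offset, cycle_length):
    """Average the y / T ratios season by season (seasons numbered from 1)."""
    ratios_by_season = {}
    for i, t in enumerate(trend_values):
        position = i + offset
        season = position % cycle_length + 1
        ratios_by_season.setdefault(season, []).append(values[position] / t)
    return {s: sum(r) / len(r) for s, r in sorted(ratios_by_season.items())}


indices = seasonal_indices(quarterly_sales, trend, first_trend_position, 4)
total = sum(indices.values())
# Optional adjustment so that the four effects add up to exactly 4
adjusted = {s: v * 4 / total for s, v in indices.items()}

for season in indices:
    print(f"S{season} = {indices[season]:.4f} (adjusted {adjusted[season]:.4f})")
```

Because the ratios are averaged without intermediate rounding, the last decimal place may differ slightly from a hand calculation that rounds each ratio to four places first.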



Forecasting with the Multiplicative Model

Returning again to the quarterly sales example, how would we produce a forecast for the sales for the second quarter of 2002?

Step 1: Estimate the trend, T.

To estimate the trend we use the graph. Extrapolating the trend line, the trend estimate for 2002 quarter 2 is 628.

[Figure: time series plot of Sales y and Trend against Time, with the trend extrapolated to the second quarter of 2002.]

Step 2: Evaluate the seasonal component, S.

Assuming the existing pattern in the data continues, we have calculated the seasonal component for the second quarter to be S2 = 0.9099.

Step 3: Combine the trend and seasonal components.

In this case we are using the multiplicative model y = T × S × R. We have a forecast for the trend, T, and the seasonal component, S. We cannot isolate R but hopefully this is small. Hence the forecast is constructed by multiplying the trend and seasonal component together.

Forecast for the sales for the second quarter of 2002: y = 628 × 0.9099 ≈ 571.4.
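The final step can be written as a one-line calculation. The sketch below is a simple illustration in Python, assuming the trend estimate of 628 read from the graph and the second quarter seasonal effect S2 = 0.9099; the function name multiplicative_forecast is hypothetical.

```python
def multiplicative_forecast(trend_estimate, seasonal_index):
    """Forecast y = T × S, treating the residual R as negligible."""
    return trend_estimate * seasonal_index


# Second quarter of 2002: trend read from the graph, S2 from the averages above
forecast = multiplicative_forecast(628, 0.9099)
print(f"Forecast for 2002 quarter 2: {forecast:.1f}")  # prints 571.4
```

Since the trend estimate is read from a graph, quoting the forecast to more than one decimal place would suggest more accuracy than the method can give.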




Seminar Questions


The table below shows the total export orders for a company during 1993–6. The figures are given in £ millions.

Total exports (£ millions)

Jan – Apr May – Aug Sep – Dec

1993 4.5 5.6 4.9

1994 5.1 5.9 5.2

1995 5.4 6.8 5.8

1996 6.0 6.8 6.1

i) Calculate a regression trend estimate for this time series.

ii) Estimate the seasonal variations and thus forecast the value of exports for the three time periods in 1997 using the multiplicative model.



The figures below give the total newspaper sales of a company based in Canada in each quarter during the years 1994–7. The figures show the average daily circulation over each quarter in 100,000s.

Daily newspaper sales

1994 1995 1996 1997

1st quarter 2.2 2.6 2.9 3.2

2nd quarter 2.9 3.2 3.4 3.6

3rd quarter 3.3 3.6 3.9 4.2

4th quarter 2.4 2.7 2.8 3.1

i) Plot these values onto a graph.



ii) Calculate the moving average trend (n = 4) and the seasonal variations, and then estimate the circulation figures in each quarter during 1998 using an additive model.

iii) Would an additive or multiplicative model be appropriate in this example? Forecast the same range of figures using the multiplicative model and compare your results from the two methods.




• what components make up a time series;

• the characteristics of additive and multiplicative models;

• under what conditions you need to use regression analysis or moving averages to evaluate a trend;

• what steps are needed to forecast values.

Extra Activities




