
Statistics


Histograms

In addition to grouping data, we often graph them to better visualize any patterns in the data.

Seeing data displayed graphically can significantly deepen our understanding of a data set and the

situation it describes.

Outliers

In many data sets, there are occasional values that fall far from the rest of the data. For example, if

we graph the age distribution of students in a college course, we might see a data point at 75 years.

Data points like this one that fall far from the rest of the data are known as outliers. How do we

interpret them?

Summary

With any data set we encounter, we must find ways to allow the data to tell their story. Ordering

and graphing data sets often expose patterns and trends, thus helping us to learn more about the

data and the underlying situation. If data can provide insight into a situation, they can help us to

make the right decisions.

Creating Histograms

Note: Unless you have installed the Excel Data Analysis ToolPak add-in, you will not be able to

create histograms using the Histogram tool. However, we suggest you read through the

instructions to learn how Excel creates histograms so you can construct them in the future when

you do have access to the Data Analysis ToolPak.

To check if the ToolPak is installed on your computer, go to the Data tab of the Ribbon in Excel 2007. If "Data Analysis" appears in the Ribbon, the ToolPak has already been installed. If not, click the Office Button in the top left and select "Excel Options." Choose "Add-Ins," highlight the "Analysis ToolPak" in the list, and click "Go." Check the box next to Analysis ToolPak and click "OK." Excel will then walk you through a setup process to install the ToolPak.

Central Values for Data

Graphs are very useful for gaining insight into data. However, sometimes we would like to

summarize the data in a concise way with a single number.

The mean

Often, we'd like to summarize a set of data with a single number. We'd like that summary value to

describe the data as well as possible. But how do we do this? Which single value best represents

an entire set of data? That depends on the data we're investigating and the type of questions we'd

like the data to answer.

The median

Let's look at the revenues of the top 100 companies in the US. The mean revenue of these

companies is about $42 billion. How should we interpret this number? How well does this average

represent the revenues of these companies?


The mode

A third statistic to represent the "center" of a data set is its mode: the data set's most frequently

occurring value. We might use the mode to represent data when knowing the average value isn't as

important as knowing the most common value.

Summary

To summarize a data set using a single value, we can choose one of three values: the mean, the

median, or the mode. They are often called summary statistics or descriptive statistics. All three

give a sense of the "center" or "central tendency" of the data set, but we need to understand how

they differ before using them.

Finding the mean in Excel

To find the mean of a data set entered in Excel, we use the AVERAGE function.

Excel can find the median, even if a data set is unordered, using the MEDIAN function.

Excel can also find the most common value of a data set, the mode, using the MODE function.
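For readers working outside Excel, the same three calculations can be sketched with Python's standard statistics module; the data values below are invented for illustration.

```python
# A minimal sketch of the three central values; data are invented.
import statistics

data = [2, 3, 3, 5, 7, 10]

print(statistics.mean(data))    # 5.0  (Excel: =AVERAGE(range))
print(statistics.median(data))  # 4.0  (Excel: =MEDIAN(range))
print(statistics.mode(data))    # 3    (Excel: =MODE(range))
```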

Variability

The mean, median and mode give you a sense of the center of the data, but none of these indicate

how far the data are spread around the center. "Two sets of data could have the same mean and

median, and yet be distributed completely differently around the center value," Alice tells you.

"We need a way to measure variation in the data."

The Standard Deviation

It's often critical to have a sense of how much data vary. Do the data cluster close to the center, or

are the values widely dispersed?

Calculating

A hotel manager has to staff the front reception desk in her lobby. She initially focuses on a

staffing plan for Saturdays, typically a heavy traffic day. In the hospitality industry, like many

service industries, proper staffing can make the difference between unhappy guests and satisfied

customers who want to return.

Interpreting

What does a standard deviation of 25.2 requests tell us? Suppose the standard deviation had been

50 requests.

Summary

The standard deviation measures how much data vary about their mean value.


Finding the Standard Deviation in Excel

Excel's STDEV function calculates the standard deviation.
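Note that STDEV computes the sample standard deviation, dividing by n - 1. Here is a minimal sketch of the same calculation in Python; the front-desk request counts are invented stand-ins for the hotel example.

```python
# A hedged sketch of the sample standard deviation (what Excel's STDEV
# returns). The Saturday request counts are invented illustration data.
import statistics

requests = [120, 145, 160, 132, 151, 148, 170, 139]

print(statistics.mean(requests))   # center of the data
print(statistics.stdev(requests))  # spread around that center (divides by n - 1)
```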

The Coefficient of Variation

The standard deviation measures how much a data set varies from its mean. But the standard

deviation only tells you so much. How can you compare the variability in different data sets?

Summary

The coefficient of variation expresses the standard deviation as a fraction of the mean. We can use

it to compare variation in different data sets of different scales or units.
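As a small sketch of that idea (with invented data): the two series below are on very different scales, yet their coefficients of variation are directly comparable.

```python
# A hedged sketch of the coefficient of variation: stdev / mean.
# Both data sets are invented; note their very different units and scales.
import statistics

daily_guests = [310, 295, 330, 280, 305]             # people
daily_revenue = [61000, 58500, 66000, 55000, 60500]  # dollars

for name, data in [("guests", daily_guests), ("revenue", daily_revenue)]:
    cv = statistics.stdev(data) / statistics.mean(data)
    print(f"{name}: CV = {cv:.3f}")  # similar CVs despite a 200x scale difference
```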

Applying Data Analysis

After a good night's sleep, you meet Alice for breakfast.

"It's time to get started on Leo's assignments. Could you get those price quotes from diving

schools and prepare a presentation for Leo? We'll want to present our findings as neatly and

concisely as possible. Use graphs and summary statistics wherever appropriate. Meanwhile, I'll

start working on Leo's hotel occupancy problem."

Pricing the scuba school

In addition to the school Leo is currently using, you find 20 other scuba services in the phone

book. You call those 20 and get price quotes on how much they would charge the Kahana per

guest for a Scuba Certification Course.

Exercise 1

After a company completes its initial public offering, how is the ownership of common stock

distributed among individuals in the firm, often termed "named insiders"?

Exercise 3

Two Variables

We use histograms to help us answer questions about one variable. How do we start to investigate

patterns and trends with two variables?

Sometimes, we are not as interested in the relationship between two variables as we are in the

behavior of a single variable over time. In such cases, we can consider time as our second variable.

Suppose we are planning the purchase of a large amount of high-speed computer memory from an

electronics distributor. Experience tells us these components have high price volatility. Should we

make the purchase now? Or wait?


Assuming we have price data collected over time, we can plot the memory prices against the dates on which they were observed, pairing each price with its date. Because time is one of the variables, we call this graph a time series.
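As a quick illustration, here is a hedged sketch of a time series plot in Python with matplotlib; the monthly memory prices are invented.

```python
# A minimal sketch of a time series: time on the horizontal axis.
# The monthly prices are invented illustration data.
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
prices = [48, 46, 51, 44, 41, 39]  # hypothetical price per module, in dollars

plt.plot(months, prices, marker="o")
plt.title("Memory Price over Time")
plt.xlabel("Month")
plt.ylabel("Price ($)")
plt.show()
```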

Let's look at two data sets: heights and weights of athletes. What can we say about the two data

sets? Is there a relationship between the two?

Our intuition tells us that height and weight should be related. How can we use the data to inform

that intuition? How can we let the data tell their story about the strength and nature of that

relationship?

As always, one of our first steps is to try to visualize the data.

Because we know that each height and weight belong to a specific athlete, we first pair the two

variables, with one height-weight pair for each athlete.

Plotting these data pairs on axes of height and weight — one data point for each athlete in our data

set — we can see a relationship between height and weight. This type of graph is called a "scatter

diagram."

Scatter diagrams provide a visual summary of the relationship between two variables. They are

extremely helpful in recognizing patterns in a relationship. The more data points we have, the

more apparent the relationship becomes.

In our scatter diagram, there's a clear general trend: taller athletes tend to be heavier.

We need to be careful not to draw conclusions about causality when we see these types of

relationships.

Growing taller might make us a bit heavier, but height certainly doesn't tell the whole story about

our weights.

Assuming causality in the other direction would be just plain wrong. Although we may wish

otherwise, growing heavier certainly doesn't make us taller!

The direction and extent of causality might be easy to understand with the height and weight

example, but in business situations, these issues can be quite subtle.

Managers who use data to make decisions without firm understanding of the underlying situation

often make blunders that in hindsight can appear as ludicrous as assuming that gaining weight can

make us taller.

Why don't we try graphing another pair of data sets to see if we can identify a relationship? On a

scatter diagram, we plot for each day the number of massages purchased at a spa resort versus the

total number of guests visiting the resort.

We can see a relationship between the number of guests and the number of massages. The more

guests that stay at the resort, the more massages purchased — to a point, where massages level off.

Why does the number of massages reach a plateau? We should investigate further. Perhaps there

are limited numbers of massage rooms at the spa. Scatter plots can give us insights that prompt us

to ask good questions, those that deepen our understanding of the underlying context from which

the data are drawn.

Time series are extremely useful because they put data points in temporal order and show how


data change over time. Have prices been steadily declining or rising? Or have prices been erratic

over time? Are there seasonal patterns, with prices in some months consistently higher than in

others?

Time series will help us recognize seasonal patterns and yearly trends. But we must be careful: we

shouldn't rely only on visual analysis when looking for relationships and patterns.

Our intuition tells us that pairs of variables with a strong relationship on a scatter plot must be

related to each other. But we must be careful: human intuition isn't foolproof and often we infer

relationships where there are none. We must be careful to avoid some of these common pitfalls

Let's look at an example. For US presidents of the last 150 years, there seems to be a connection

between being elected in a year that is a multiple of 20 (1900, 1920, 1940, etc.) and dying in

office. Abraham Lincoln (elected in 1860) was the first victim of this unfortunate relationship.

James Garfield (elected 1880) was assassinated in office the year after his election, and William McKinley (1900), Warren Harding (1920), Franklin Roosevelt (1940), and

John F. Kennedy (1960) all died in office.

Ronald Reagan (elected 1980) only narrowly survived an assassination attempt. What do the data

suggest about the president elected in 2020?

Probably nothing. Unless we have a reasonable theory about the connection between the two

variables, the relationship is no more than an interesting coincidence.

Hidden variables

Even when two data sets seem to be directly related, we may need to investigate further to

understand the reason for the relationship.

We may find that the reason is not due to any fundamental connection between the two variables

themselves, but that they are instead mutually related to another underlying factor.

Suppose we're examining sales of ice-hockey pucks and baseballs at a sporting goods store.

The sales of the two products form a relationship on a scatter plot: when puck sales slump,

baseball sales jump. But are the two data sets actually related? If so, why?

A third, hidden factor probably drives both data sets: the season. In winter, people play ice hockey.

In spring and summer, people play baseball.

If we had simply plotted puck and baseball sales without thinking further, we might not have

considered the time of year at all. We could have neglected a critical variable driving the sales of

both products.

In many business contexts, hidden variables can complicate the investigation of a relationship

between almost any two variables.


A final point: Keep in mind that scatter plots don't prove anything about causality. They never

prove that one variable causes the other, but simply illustrate how the data behave.

Summary

Plotting two variables helps us see relationships between two data sets. But even when

relationships exist, we still need to be skeptical: is the relationship plausible? An apparent

relationship between two variables may simply be coincidental, or may stem from a relationship

each variable has with a third, often hidden variable.

1 Plotting two variables on a scatter diagram can help illustrate the relationship between them.

2 When one of the variables is time, the plot is known as a time series.

3 A relationship is not proof of causality.

4 Be alert to the possibility of hidden variables.

To create a scatter diagram in Excel with two data sets, we need to first prepare the data, and then

use Excel's built-in chart tools to plot the data.

To prepare our data, we need to be sure that each data point in the first set is aligned with its

corresponding value in the other set. The sets don't need to be contiguous, but it's easier if the data

are aligned side by side in two columns.

If the data sets are next to each other, simply select both sets.

Next, on the Insert tab of the Ribbon, select Scatter in the Charts group, and

choose the first type: Scatter with Only Markers.

Excel will insert a basic scatter plot into the worksheet, with the first column of data

represented on the X-axis and the second column of data on the Y-axis.

We can include a chart title and label the axes by selecting Quick Layout from the Ribbon and

choosing Layout 1.

Then we can add the chart title and label the axes by selecting and editing the text.

Finally, our scatter diagram is complete. You can explore more of Excel's new Chart Tools to edit

and design elements of your chart.
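Outside Excel, the same chart can be sketched in Python; the height and weight values below are invented stand-ins for the athlete data.

```python
# A hedged sketch of a scatter diagram with matplotlib; invented data.
import matplotlib.pyplot as plt

heights = [170, 175, 180, 185, 190, 195]  # cm
weights = [65, 72, 78, 84, 92, 99]        # kg

plt.scatter(heights, weights)
plt.title("Athlete Height vs. Weight")
plt.xlabel("Height (cm)")
plt.ylabel("Weight (kg)")
plt.show()
```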

Correlation

By plotting two variables on a scatter plot, we can examine their relationship. But can we measure

the strength of that relationship? Can we describe the relationship in a standardized way?

Humans have an uncanny ability to discern patterns in visual displays of data. We "know" when

the relationship between two variables looks strong ...

... or weak ...

... linear ...


... or nonlinear ...

... positive (when one variable increases, the other tends to increase) ...

... or negative (when one variable increases, the other tends to decrease).

Suppose we are trying to discern if there is a linear relationship between two variables. Intuitively,

we notice when data points are close to an imaginary line running through a scatter plot.

Logically, the closer the data points are to that line, the more confidently we can say there is a

linear relationship between the two variables.

However, it is useful to have a simple measure to quantify and communicate to others what we so

readily perceive visually. The correlation coefficient is such a measure: it quantifies the extent to

which there is a linear relationship between two variables.

To describe the strength of a linear relationship, the correlation coefficient takes on values

between -1 and +1. Here's a strong positive correlation (about 0.85) ...

and here's a strong negative correlation (about -0.90).

If every point falls exactly on a line with a negative slope, the correlation coefficient is exactly -1.

At the extremes of the correlation coefficient, we see relationships that are perfectly linear, but

what happens in the middle?

Even when the correlation coefficient is 0, a relationship might exist — just not a linear

relationship. As we've seen, scatter plots can reveal patterns and help us better understand the

business context the data describe.

To reinforce our understanding of how our intuition about the strength of a linear relationship

between variables translates into a correlation coefficient, let's revisit the examples we analyzed

visually earlier.

In some cases, the correlation coefficient may not tell the whole story. Managers want to

understand the attendance patterns of their employees. For example, do workers' absence rates

vary by time of year?

Suppose a manager suspects that his employees skip work to enjoy the good life more often as the

temperature rises. After pairing absences with daily temperature data, he finds the correlation

coefficient to be 0.466.

While not a strong linear relationship, a coefficient of 0.466 does indicate a positive relationship

— suggesting that the weather might indeed be the culprit.


But look at the data — besides a few outliers, there isn't a clear relationship. Seeing the scatter

plot, the manager might realize that the three outliers correspond to a late-summer, three-day

transportation strike that kept some workers homebound the previous year.

If we don't look at the data ourselves, the correlation coefficient can lead us down false paths. If we exclude

the outliers, the relationship disappears, and the correlation essentially drops to zero, quieting any

suspicion of weather. Why do the outliers influence our measure of linearity so much?

As a summary statistic for the data, the correlation coefficient is calculated numerically,

incorporating the value of every data point. Just as it does with the mean, this inclusiveness can

get us into trouble...

Because measures like correlation give more weight to points distant from the center of the data,

outliers can strongly influence the correlation coefficient of the entire set. In these situations, our

intuition and the measure we use to quantify our intuition can be quite different. We should always

attempt to reconcile those differences by returning to the data.
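To see the effect numerically, here is a hedged sketch in Python (statistics.correlation requires Python 3.10 or later); the temperature and absence values are invented to mimic the strike-day pattern described above.

```python
# A sketch of how a few outliers can dominate the correlation coefficient.
# Invented data: eight patternless days plus three strike-day outliers.
import statistics

temps    = [18, 20, 22, 19, 21, 23, 20, 22, 30, 31, 32]
absences = [ 4,  2,  5,  3,  4,  2,  5,  3, 14, 16, 15]

print(statistics.correlation(temps, absences))          # strongly positive
print(statistics.correlation(temps[:8], absences[:8]))  # near zero without the outliers
```

Dropping the three outlying points takes the coefficient from strongly positive to nearly zero, just as the manager discovered by returning to the scatter plot.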

Summary

The correlation coefficient characterizes the strength and direction of a linear relationship between

two data sets. The value of the correlation coefficient ranges between -1 and +1.

1 A correlation coefficient near +1 or -1 indicates that the two variables have a strong positive or

negative linear relationship, respectively.

2 A correlation coefficient near zero indicates a weak or nonexistent linear relationship.

3 A coefficient near zero does not prove there is no relationship between the two variables; it indicates only that any relationship that does exist is not linear.

4 Outliers can unduly influence the calculation of the correlation coefficient, making the

correlation much higher or lower than what it would be without the outlying points.

Excel's CORREL function calculates the correlation coefficient for two variables. Let's return to

our data on athletes' height and weight.

Enter the data set into the spreadsheet as two paired columns. We must make sure that each data

point in the first set is aligned with its corresponding value in the other set.

To compute the correlation, simply enter the two variables' ranges, separated by a comma, into the

CORREL function as shown below.

The order in which the two data sets are selected does not matter, as long as the data "pairs" are

maintained. With height and weight, both values certainly need to refer to the same person!
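For comparison, here is a hedged sketch of the same calculation in Python (statistics.correlation requires Python 3.10 or later), using the invented height/weight pairs from the scatter example.

```python
# The same calculation as Excel's =CORREL(range1, range2); invented data.
import statistics

heights = [170, 175, 180, 185, 190, 195]
weights = [65, 72, 78, 84, 92, 99]

# The argument order doesn't matter, as long as the pairs stay aligned.
print(statistics.correlation(heights, weights))
print(statistics.correlation(weights, heights))  # same value
```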

Occupancy and Arrivals

Alice is eager to move forward: "With your new understanding of scatter diagrams and

correlation, you'll be able to help me with Leo's hotel occupancy problem."


In the hotel industry, one of the most important management performance measures is room

occupancy rate, the percentage of available rooms occupied by guests.

Alice suggests that the monthly occupancy rate might be related to the number of visitors arriving

on the island each month. On a geographically isolated location like Hawaii, visitors almost all

arrive by airplane or cruise ship, so state agencies can gather very precise data on arrivals.

Alice asks you to investigate the relationship between room occupancy rates and the influx of

visitors, as measured by the average number of visitors arriving to Kauai per day in a given

month. She wants a graphical overview of this relationship, and a measure of its strength.

Leo's folders include data on the number of arrivals on Kauai, and on average hotel occupancy

rates in Kauai, as tracked by the Hawaii Department of Business, Economic Development, and

Tourism.

The best way to graphically represent the relationship between arrivals and occupancy is:

A histogram

A scatter diagram

A time series

A series of concentric burning wheels

You generate the scatter diagram using the data file and Excel's Chart Wizard. The relationship can

be characterized as:

Weakly negative and linear

Strongly negative and non-linear

Strongly positive and linear

Strongly positive and non-linear

This is the best answer. The relationship is positive. Higher levels of occupancy generally

correspond to higher numbers of arrivals. The trend appears to be reasonably linear.

You calculate the correlation coefficient. Enter the correlation coefficient in decimal notation with

2 digits to the right of the decimal (e.g., enter "5" as "5.00"). Round if necessary.

To find the correlation coefficient, open the Kahana Data file. In any empty cell, type

=CORREL(B2:B37,C2:C37). When you hit enter, the correct answer, 0.71, will appear.

Together with Alice, you compile your findings and present them to Leo.


I see. The relationship between the number of people arriving on Kauai and the island's hotel

occupancy rate follows a general trend, but not a precise pattern. Look at this: in two months with

nearly the same average number of daily arrivals, the occupancy rates were very different — 68%

in one month and 82% in the other.

But why should they be so different? When people arrive on the island, they have to sleep

somewhere. Do more campers come to Kauai in one month, and more hotel patrons in the other?

Well, that might be one explanation. There could be differences in the type of tourists arriving.

The vacation preferences of the arrivals would be what we call a hidden variable.

Another hidden variable might be the average length of stay. If the length of stay varies month to

month, then so will hotel occupancy. When 50 arrivals check into a hotel, the occupancy rate will

be higher if they spend 10 days each at the hotel than if they spend only 3 days.

I'm following you, but I'm beginning to see that the occupancy issue is more complex than I

expected. Let's get back to it at a later time. The scuba school contract is more pressing at the

moment.

Exercise 1

As online retailing expands, many companies are interested in knowing how effective search

engines are in helping consumers find goods online.

Computer scientists study the effectiveness of such search engines and compare how many results

search engines recall and the precision with which they recall them. "Precision" is another way of

saying that the search found its target, for example a page containing both the phrases "winter

parka" and "Eddie Bauer."

What could you say about the relationship between the Precision and the number of Results

Recalled?

The amount of information a search engine recalls decreases over time.

An increase in precision causes the amount retrieved to decrease.

Recall and precision seem to be related: a large number of results typically pairs with low

precision.

This is the best answer. From the scatter plot, we can see that the variables demonstrate a

relationship, but maybe not a linear one. However, even when we recognize a clear relationship,

we cannot conclude that greater precision causes the amount of information recalled to decrease.

Exercise 2

Is an education a good investment in your future? Some very successful business executives are

college dropouts, but is there a relationship in the general population between income and


education level?

Consider the following scatter plot, which lists the income and years of formal education for 18

people. Is the correlation:

Strongly positive

Weakly positive

Weakly negative

This is the best answer. The level of income is strongly associated with the number of years of

education for our data.

Though we should always calculate the correlation coefficient if we want to have a precise

measure, it's good to have a rough feel for the correlation between two variables we see plotted on

a scatter diagram. For the income-education data, the coefficient is nearest to:

0.1

-0.5

0.9

This is the best answer. A fairly strong linear relationship has a correlation coefficient closer to

1.0, making 0.9 a reasonable guess for what we see occurring between income and education

level.

Sampling & Estimation

The scuba problem

Leo asks you to help him evaluate the Kahana's contract with the scuba school.

Scuba diving lessons are an ideal way for our guests to enjoy their vacation or take a break from

their business activities. We have an excellent coral reef, and scuba diving is becoming very

popular among vacationers and business travelers.

We started our year-round diving program last year, contracting a local diving school to do a scuba

certification course. The one-year trial contract is now up for renewal.

Maintaining the scuba offerings on-site isn't cheap. We have to staff the scuba desk seven days a

week, and we subsidize the costs associated with each course. So I want to get a good handle on

how satisfied the guests are with the lessons before I decide whether or not to renew the contract.

The hotel has a database with information about which guests took scuba lessons and when. Feel

free to take a look at it, but I can't spend a fortune figuring this out. And I need to know as soon as

possible, since our contract expires at the end of the month.


Alice convinces you to do some field research and join her for a scuba diving lesson. You return

late that afternoon exhausted but exhilarated. Alice is especially enthusiastic.

"Well, I certainly give the lessons two thumbs up. And we haven't even been out to sea yet!

"But our opinions alone can't decide the matter. We shouldn't infer from our experience that Leo's

clientele as a whole enjoyed the scuba certification course. After all, we may have caught the

instructor on his best day this year."

Alice suggests creating a survey to find out how satisfied guests are with the scuba diving school.

Random Samples

Naturally, you can't ask the opinion of every guest who took scuba lessons over the past year. You

have to survey a few guests, and from their opinions draw conclusions about hotel guests in

general. The guests you choose to survey must be representative of all of the guests who have

taken the scuba course at the resort. But how can you be sure you get a good sample?

As managers, we often need to know something about a large group of people or products. For

example, how many defective parts does a large plant produce each year? What are the average

annual earnings of a Wall Street investment banker? How many people in our industry plan to

attend the annual conference?

When it is too costly to gather the information we want to know about every person or every thing

in an entire group, we often ask the question of a subset, or sample of the group. We then try to

use that information to draw conclusions about the whole group.

To take a sample, we first select elements from the entire group, or "population," at random. We

then analyze that sample and try to infer something about the total population we're interested in.

For example, we could select a sample of people in our industry, ask them if they plan to attend

the annual conference, and then infer from their answers how many people in the entire industry

plan to attend.

For example, if 10% of the people in our sample say they will attend, we might feel quite

confident saying that between 7% and 13% of our entire population will attend.

This is the general structure of all the problems we'll address in this unit — we'll work out the

details as we go forward. We want to know something about a population large enough to make

examining every population member impractical.

We first select elements from the population at random...

...then analyze that sample...

...and then draw an inference about the total population we're interested in.


Taking a Random Sample

The first trick to sampling is to make sure we select a sample that broadly represents the entire

group we're interested in. For example, we couldn't just ask the conference organizers if they

wanted to attend. They would not be representative of the whole group — they would be biased in

favor of attending the conference!

To get a good sample, we must make sure we select the sample "at random" from the full

population. This means that every person or thing in the population is equally likely to be selected.

If there are 15,000 people in the industry, and we are choosing a sample of 1,000, then every

person needs to have the same chance — 1 out of 15 — of being selected.

Selecting a random sample sounds easy, but actually doing it can be quite challenging. In this

section, we'll see examples of some major mistakes people have made while trying to select a

random sample, and provide some advice about how to avoid the most common types of sampling

errors.

In some cases, selecting a random sample can be fairly easy. If we have a complete list of each

member of the group in a database, we can just assign a unique number to each member of the

group. We then let a computer draw random numbers from the list. This would ensure that each

element of the population has an equal likelihood of being selected.
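As a sketch of that procedure in Python, using the conference example's numbers (15,000 people in the industry, a sample of 1,000); the member labels are hypothetical.

```python
# A minimal sketch of a simple random sample drawn from a complete list.
# Labels are hypothetical; 15,000 and 1,000 match the conference example.
import random

population = [f"member_{i}" for i in range(1, 15001)]

sample = random.sample(population, k=1000)  # each member has a 1-in-15 chance
print(len(sample), sample[:3])
```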

If the population about which we need to obtain information is not listed in an easy-to-access

database, the task of selecting a sample at random becomes more difficult. In these cases, we have

to be extremely careful not to introduce a bias in the way we select the sample.

For example, if we want to know something about the opinions of an entire company, we cannot

just pick employees from one department. We have to make sure that each employee has an equal

chance of being included in the sample. A department as a whole might be biased in favor of one

opinion.

Once we have decided how to select a sample, we have to ask how large our sample needs to be.

How many members of the group do we need to study to get a good estimate about what we want

to know about the entire population?

The answer is: It depends on how "accurate" we want our estimate to be. We might expect that the

larger the population, the larger the sample size needed to achieve a given level of accuracy, but

this is not true.

A sample size of 1,000 randomly selected individuals can often give a satisfactory estimate of the underlying population, as long as the sample is representative of the whole population.

This is true regardless of whether the population consists of thousands of employees or millions of


factory parts.

Sometimes, a sample size of 100 or even 50 might be enough when we are not that concerned

about the accuracy of our estimate. Other times, we might need to sample thousands to obtain the

accuracy we require.

Later in this unit, we will find out how to calculate a good sample size. For now, it's important to

understand that the sample size depends on the level of accuracy we require, not on the size of the

population.

Learning About a Sample

Once we select our sample, we need to make sure we obtain accurate information about each

member of the sample. For example, if we want to learn about the number of defects a plant

produces, we must carefully measure each item in the sample.

When we want to learn something about a group of people and don't have any existing data, we

often use a survey to learn about an issue of interest. Conducting a survey raises problems that can

be surprisingly tricky to resolve.

First, how do we phrase our questions? Is there a bias in any questions that might lead participants

to answer them in a certain way? Are any questions worded ambiguously? If some of the people in

the sample interpret a question one way, and others interpret it differently, our results will be

meaningless!

Second, how do we best conduct the survey? Should we send the survey in the mail, or conduct it

over the phone? Should we interview survey participants in person, or distribute handouts at a

meeting?

There are advantages and disadvantages to all methods. A survey sent through the mail may be

relatively inexpensive, but might have a very low response rate. This is a major problem if those

who respond have a different opinion than those who don't respond. After all, the sample is meant

to learn about the entire population, not just those with strong opinions!

A telephone survey raises other issues: When do we call people? Who is home during

regular business hours? Most likely not working professionals. On the other hand, if we call

household numbers in the evening the "happy hour crowd" might not be available.

When we decide to conduct a survey in person, we have to consider whether the presence of the

person asking the questions might influence the survey results. Are the survey participants likely

to conceal certain information out of embarrassment? Are they likely to exaggerate?

Clearly, every survey will have different issues that we need to confront before going into the field

to collect the data.


With any type of survey, we must pay close attention to the response rate. We have to be sure that

those who respond to the survey answer questions in much the same way as those who don't

respond would answer them. Otherwise, we will have a biased view of what the whole population

thinks.

Surveys with low response rates are particularly susceptible to bias. If we get a low response rate,

we must try to follow up with the people who did not respond the first time. We either need to

increase the response rate by getting answers from those who originally did not respond, or we

must demonstrate that the non-respondents' opinions do not differ from those of the respondents

on the issue of interest.

Low response rate → contact non-respondents → either raise the response rate or show that the non-respondents do not differ from the respondents.

Tracking down everyone in a sample and getting their response can be costly and time consuming.

When our resources are limited, it is often better to take a small sample and relentlessly pursue a

high response rate than to take a larger sample and settle for a low response rate.

Summary

Often it makes sense to infer facts about a large population from a smaller sample. To make sound

inferences:

1. Make sure the sample is representative of the population: pick elements at random, so that every member of the population is equally likely to be selected.

2. Choose an appropriate sample size: for large populations, the sample size does not depend on the population size; it depends on the desired accuracy.

3. Avoid biased results: phrase questions neutrally; pursue high response rates (it is better to have a smaller sample with a high response rate than a larger one with a low rate); and understand the incentives and motivations of respondents and pollsters.

To understand the importance of representative samples, let's go back in history and look at some

mistakes made in the Literary Digest poll of 1936.

The Literary Digest, a popular magazine in the 1930s, had correctly predicted the outcome of U.S.

presidential elections from 1916 to 1932. When the results of the 1936 poll were announced, the


public paid attention. Who would become the next president?

Newscaster: "Once again, the Literary Digest sent out a survey to the American public, asking,

"Whom will you vote for in this year's presidential election?" This may well be the largest poll in

American history."

Newscaster: "The Digest sent the survey to over 10 million Americans and over two million

responded!"

Newscaster: "And the survey results predict: Alf Landon will beat Franklin D. Roosevelt by a

large margin and become President of the United States."

As it turned out, Alf Landon did not become President of the United States. Instead, Franklin D.

Roosevelt was re-elected to a second term in office in the largest landslide victory recorded to that

date. This was a devastating blow to the Digest's reputation. What went wrong? How could such a

large survey be so far off the mark?

The Literary Digest made two mistakes that led it to predict the wrong election outcome. First, it

mailed the survey to people on three different lists: the magazine's subscribers, car owners, and

people listed in telephone directories. What was wrong with choosing a sample from these lists?

The sample was not representative of the American public. Most lower-income people did not

subscribe to the Digest and did not own phones or cars back in 1936. This led the poll to be biased

towards higher-income households and greatly distorted the poll's results. Lower-income

households were more likely to vote for the Democrat, Roosevelt, but they were not included in

the poll.

Second, the magazine relied on people to voluntarily send their responses back to the magazine.

Out of the ten million voters who were sent a poll, over two million responded. Two million is a

huge number of people. What was wrong with this survey?

Mistakes:

Unrepresentative sample

Low response rate

Biased respondents

Biased questions

The mistake was simple: Republicans, who wanted political change, felt more strongly about the

election than Democrats. Democrats, who were generally happy with Roosevelt's policies, were

less interested in returning the survey. Among those who received the survey, a disproportionate

number of Republicans responded, and the results became even more biased.

The Digest had put an unprecedented effort into the poll and had staked its reputation on

predicting the outcome of the election. Its reputation wounded, the Digest went out of business

soon thereafter.


During the same election year, a little-known psychologist named George Gallup correctly

predicted what the Digest missed: Roosevelt's victory. What did Gallup do that the Literary Digest

did not? Did he create an even bigger sample?

Surprisingly, George Gallup used a much smaller sample. He knew that large samples were no

guarantee of accurate results if they weren't randomly selected from the population.

Gallup's team interviewed only 3,000 people, but made sure that the people they selected were

truly representative of the US population. He also instructed his team to be persistent in asking the

opinion of each person in the sample, which generated a high response rate.

Gallup's correct prediction of the 1936 election winner boosted his reputation and Gallup's method

of polling soon became a standard for public opinion polls.

Today's polls usually consist of a sample of around a thousand randomly selected people who are

truly representative of the underlying populations. For example, look at a poll reported in a leading

newspaper: the sample size will likely be around a thousand.

Another common survey mistake is phrasing the questions in a way that leads to a biased

response. Let's take a look at a recent example of a biased question.

In 1992, Ross Perot, an independent contender for the US Presidential election, conducted a mail-

in survey to show that the public supported his desire to abolish special interest groups. This is the

question he asked:

“Should laws be passed to eliminate all possibilities of special interests giving huge sums of

money to candidates?”

In Perot's mail-in survey, 99 percent of respondents said "yes" to that question. It seemed as if

everyone in America agreed with Perot's stance.

Soon after Perot's survey, Yankelovich Partners, an independent market research firm, conducted

two interesting follow-up surveys. In the first survey, it used the same question that Perot asked

and found that 80 percent of the population favored passing the law. Yankelovich attributed the difference to

the fact that it was able to create a more representative sample than Perot.

Interestingly, Yankelovich then conducted a similar survey, but rephrased the question in the

following way:

“Should laws be passed to prohibit interest groups from contributing to campaigns, or do groups

have a right to contribute to the candidates they support?”

The response to this question was strikingly different. Only 40 percent of the sampled population

agreed to prohibit contributions. As it turned out, the results of the survey all came down to the

way the question was phrased.

For any survey we conduct, it's critical to phrase the question in the most neutral way possible to

avoid bias in the sample results.


The real lesson of these two examples is this: How data are collected is at least as important as

how data are analyzed. A sample that is unrepresentative, biased, or not drawn at random can give

highly misleading results.

Knowing that

sample data need to be representative and unbiased, you conduct a survey of the hotel guests.

How can you best determine if hotel guests are enjoying the scuba course? By searching the hotel

database, you determine that 2,804 hotel guests took scuba trips in the past year. The scuba

certification course was offered year-round. The database includes each guest's name, address,

phone number, age, date of arrival, length of stay, and room number.

Your first step is deciding what type of survey to conduct that will be inexpensive, quick, and will

provide a good sample of all the guests who took scuba lessons.

Should you mail a survey to the whole list of guests who took scuba lessons, expecting that a

small percentage will respond, or conduct a telephone survey, which would likely provide a higher

response rate, but cost more per guest contacted?

To ensure a good response rate — and because Leo wants an answer quickly — you choose to

contact customers by phone. Alice warns that to keep costs low, you can only contact 50 hotel

guests, and reminds you to create a random, representative sample.

You open up the list of names in the hotel database. The names were entered as guests arrived. To

make things simple, you randomly select a date and then record the first 50 guests arriving after

that date who took the course. You ask the hotel operator to call them for you, and tell him to be

persistent. Eventually he is able to contact 45 of the guests on the list. He asks the guests to rate

their scuba experience on a 1 to 6 scale and reports the results back to you. Click the link below to

view your sample.

Enter the average satisfaction level as a decimal number with one digit to the right of the decimal

point (e.g., enter "5" as "5.0"). Round if necessary.

You compute the average satisfaction level and find that it is 2.5. You give Leo the news. He

explodes.

Two point five! That's impossible! I know for sure that it must be higher than that! You'd better go

over your data again.

Back in your room, you look over your list of data. What should you tell Leo?

You should have mailed out your survey.


Your survey is not representative of the guests who took the scuba course.

Your survey is unbiased and representative, and Leo should accept the survey results as true.

Your observation is correct. Although mailing out the survey might have changed your result, that

was not the main problem with your survey.

What factor is biasing your results?

By bothering people at home, you got negative responses.

The income levels of the customers you phoned were not representative of the scuba-diving

guests.

The dates that the surveyed customers visited the resort were not representative of the scuba-

diving guests.

Correct! Since you chose guests only from the month of April, any unusual event that happened in

that period could bias your results. In addition, your sample would be biased if more of a certain

type of guests (for example business travelers versus tourists) visited during April than during the

rest of the year.

When you report this news to Leo, he begins to laugh.

We were hit with a hurricane at the beginning of April. Half the scuba classes were cancelled, and

the ones that did meet had to deal with choppy water and bad visibility. Even the weeks following

the hurricane were bad. Usually guests see a manta ray every week, and the guests in April could

barely see the underwater coral. No wonder they weren't happy.

You assure Leo you will conduct the survey again with a more representative sample. This time,

you make sure that the guests are truly randomly selected. Later, you have new data in your hands

from 45 randomly chosen guests that show the average satisfaction rate to be 4.4 on a 1 to 6 scale.

The standard deviation of the sample is 1.54.

Mr. Gavin Collins is the Chief Operating Officer of Bell Computers, a market leader in personal

computers. This morning, he opened the latest issue of Business 4.0, a business journal, and

noticed an article on Bell Computers.

The article praised the high quality and low cost of the PCs made by Bell. However, it also

included some negative comments about Bell's customer service.

Currently, customer service is only available to customers of Bell Computers over the phone.

Collins wants to understand more fully what customers think of Bell's customer service. His

marketing department designs a survey that asks customers to rate Bell's customer service from 1

to 10.

How should he conduct the survey?


Bell Computers should mail a survey to every customer in Bell's database asking them to write

Bell about their experiences with the customer service department.

Bell's sales peak during the holidays, when people give gifts, including computers. Bell should

send a mail survey along with each of its outbound computer shipments in December.

Bell is located in the Southern United States. 55% of Bell's customers are also located in the

South. Bell should conduct a phone survey in one of the major Southern cities.

Every month, on a random day and time, Bell should conduct a phone survey immediately after a

Customer Service Representative has spoken to a customer. New answers should be added to a

rolling average.

This is the best answer. Conducting a phone survey immediately after a randomly chosen

customer service session will create a random sample that is representative of all of Bell's

customers.

Wave" is a company that manufactures laundry detergent in several countries around the world. In

India, the competition among laundry detergents is fierce.

The sales per month of Wave have been constant for the past five years. Wave CEO Mr. Sharma

instructed his marketing team to come up with a strong advertising campaign stressing Wave's

superiority over other competitors. Wave conducted a survey in the month of June.

They asked the following questions: "Have you heard of Wave?" "Do you think Wave is a good

product?" "Do you notice a difference in the color of your clothes after using Wave?" Then, citing

the results of their survey, Wave aired a major television campaign claiming that 75% of the

population thought that Wave was a good product.

You are a new associate at Madison Consulting. With your partner, Ms. Mehta, you have been

asked to conduct a study for Wave's main competitor, the Coral Reef Detergent Company, about

whether Wave's claims hold water. Coral Reef wonders how the Wave results are possible,

considering that Coral Reef holds over 45% of the current market share.

Ms. Mehta has been going through the survey methodology, and she tells you, "This sample is

obviously neither representative nor unbiased. Coral Reef can dispute Wave's claim!" What has Ms.

Mehta noticed?

You have been asked to conduct a survey to determine the percentage of flights arriving at a small

airport that were filled to capacity that morning. You decide to stand outside the airport's single

exit door and ask a sample of 60 passengers leaving the airport how full their flight was.

Your first thought is to just ask the first 60 passengers departing the airport how full their flight

was, but you quickly realize that that could be a highly biased sample. Any 60 people leaving at

the same time would likely have come from only a couple of flights, and you want to get a good

sense of what percent of all flights arriving that morning were filled to capacity. Thus, you decide


to randomly select 60 people from all the passengers departing the building that morning.

After conducting your survey, you tally the results: 10 people decline to answer, 30 people tell you

that their flight was filled to capacity, and 20 people tell you that their flight was not filled to

capacity. What can you conclude from your survey results so far?

The best estimate is that 60% of the flights were filled to capacity.

The best estimate is that 50% of the flights were filled to capacity.

There is a problem with the survey approach.

This is the correct answer. There is a problem with your survey.

What is the problem with your survey?

A sample of 60 passengers is not large enough to provide a good estimate.

Only those passengers that feel most strongly about the issue are likely to respond.

Passengers from full planes are likely to be selected more frequently than passengers from

relatively empty planes.

This is the correct answer. There is a systematic bias in your sample: When you sample passengers

at the exit door of an airport, you will, on average, select more people from full planes, simply

because when a plane is full, there are more passengers on it, and hence more leaving the airport,

than when a plane is relatively empty.

To see this, imagine that 10 planes have arrived that morning — five of which were full (having

100 passengers each) and five of which had only a single passenger on the plane. In this case, half

of the planes were full. However, almost all of the passengers (500 of the total 505) departing

from the airport would report (correctly!) that they had been on a full plane. Since people from a

full plane are more likely to be selected, there is a systematic bias in your response.
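A quick simulation makes the bias concrete. The sketch below is ours, not part of the survey itself, and simulates the ten-plane scenario just described.

```python
# Simulating the 10-plane example: 5 full planes of 100 passengers each,
# and 5 planes carrying a single passenger. Half the planes are full, but
# sampling passengers at random yields a size-biased picture.
import random

passengers = ["full"] * 500 + ["empty"] * 5  # 505 passengers in total

surveyed = random.sample(passengers, k=60)
share_full = surveyed.count("full") / len(surveyed)
print(f"{share_full:.0%} of surveyed passengers report a full flight")
# Typically around 99%, even though only 50% of the planes were full.
```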

It is important, in every survey, to try to make your sample as representative as possible. In this

case, your sample was not representative of the planes arriving to the airport.

A better approach might be to ask the people you select what their flight number was, and then ask

them how full their flight was. Make sure you have at least one passenger from every plane. Then

count the responses of only one person from each flight. By including only one person per flight in

your sample, you ensure that your sample is an accurate prediction of how many planes are filled

to capacity.

Sampling is complicated, and it is important to think through all the factors that might influence

your results. In this case, the mistake is that you are trying to estimate a population of planes by

sampling a population of passengers. This makes the sample unrepresentative of the underlying

population. By randomly sampling the passengers rather than the flights, each flight is not equally

likely to be selected, and the sample is biased.


The Scuba Problem (Part II)

You report the results of your survey, the sample mean, and its standard deviation to Leo.

A sample mean of 4.4 makes more sense to me, but I'm still a bit uneasy about your survey result.

After all, you've only collected 45 responses.

If you'd chosen different people, they likely would have given different responses. What if — just

by chance — these 45 people loved the scuba course, and no one else did?

You have a good point there, Leo. Our intuition is that the average satisfaction rate for all guests

isn't too far from 4.4, but at this point we're not sure exactly how far away it might be. Without

more calculations, all we can say is that 4.4 is the best estimate we have. That is why...

Wait a minute! This is very unsatisfying. Are you telling me that there's no way to gauge the

accuracy of this survey result?

If the results are a little off, that's not a problem. But you have to tell me how far off they might

be. What if you're off by two whole points, and the true satisfaction of my hotel guests is 2.4, not

4.4? In that case, my decision would be completely different.

I need to know how accurately this sample reflects the opinions of all the hotel guests who went

scuba diving!

The sample mean is the best point estimate of the population mean, but it cannot tell you how

accurately the sample reflects the population.

Alice suggests giving Leo a range of values that is almost certain to contain the population mean.

"We may not be able to pin down mean satisfaction precisely. But confining it to a range of likely

values will provide Leo with enough information to make a sound business decision."

That sounds like a good idea, but you wonder how to actually do it.

Using Confidence Intervals

The sample mean is the best estimate of our population mean. However, it is only a point estimate.

It does not give us a sense of how accurately the sample mean estimates the population mean.

Think about it. If we know only the sample mean, what can we really say about the population

mean? In the case of our scuba school, what can we say about the average satisfaction rate of all

scuba-diving hotel guests? Could it be 4.3? 4.0? 4.7? 2.0?

To make decisions as a manager, we need to have more than just a good point estimate. We need

to have a sense of how close or far away the true population mean might be from our estimate.

We can indicate the most likely values of the true population mean by creating a range, or interval,

around the sample mean. If we construct it correctly, this range will very likely contain the true


population mean.

For example, by constructing a range, we might be able to tell Leo that we are very confident that

the true average customer satisfaction for all scuba guests falls between 4.2 and 4.6.

Knowing that the true average is almost certainly between 4.2 and 4.6, Leo is better equipped to

make a decision than if he simply knew the estimated average of 4.4.

Creating a range around the sample mean is quite easy. First, we need to know three statistics of

the sample: the mean x-bar, the standard deviation s, and the sample size n.

We also need to know how "confident" we'd like to be that the range contains the true mean of the

population. For any level of "confidence", there is a value we'll call z to put into the formula. We'll

learn later in this unit exactly what we mean by "confidence," and how to compute z. For now, just

keep in mind that for higher levels of confidence, we'll need to put in a larger value of z.

Using these numbers, we can create a range around the sample mean according to the following formula: x-bar ± z * s / sqrt(n).

Before we actually use the formula, let's try to develop our intuition about the range we're

creating. Where should the range be centered? How wide must the range be to make us confident

that it contains the true population mean? What factors would lead us to need a wider or narrower

range?

Let's see how the statistics of the sample influence the location and width of the range. Let's start

with the sample mean.

The sample mean is our best estimate of the population mean. This suggests that the sample mean

should always be the center of the range. Move the slider bar to see how the sample mean affects

the range.

Second, the width of the range depends on the standard deviation of the sample. When the sample

standard deviation is large, we have greater uncertainty about the accuracy of the sample mean as

an estimate of the population mean. Thus, we have to create a wider range to be confident that it

includes the true population mean.

On the other hand, if the sample standard deviation is small, we feel more confident that our sample mean is an accurate predictor of the true population mean. In this case, we can draw a narrower range.

The larger the standard deviation, the wider the range must be. Move the slider bar to see how the

sample standard deviation affects the range.

Third, the width of the range depends on the sample size. With a very small sample, it's quite


possible that one or two atypical points in the sample could throw the sample mean off

considerably from the true population mean. So with a small sample, we need to create a wide

range to feel comfortable that the true mean is likely to be inside it.

The larger the sample, the more certain we can be that the sample mean represents the population

mean. With a large sample, even if our sample includes a few atypical points, there are likely to be

many more typical points in the sample to compensate for the outliers. Thus, with a large sample,

we can feel comfortable with a small range.

Move the slider bar to see how the sample size influences the range.

Finally, the width of the range depends on our desired level of confidence. The level of confidence

states how certain we want to be that the range contains the mean of the population. The more

confident we want to be that the range contains the true population mean, the wider we have to

make the range.

If our desired level of confidence is fairly low, we can draw a narrower range.

In the language of statistics, we indicate our level of confidence by saying, for example, that we

are "95% confident" that the range contains the true population mean. This means there is a 95%

chance that the range contains the true population mean.

Move the slider bar to see how the confidence level affects the range.


These variables determine the size of the range that we want to construct. We will learn exactly

how to construct this range in a later section.

For now, all we have to understand is that the population mean can best be estimated by a range of

values and that the range depends on three sample statistics as well as the level of confidence that

we want to assign to the range.

Summary

The sample mean is our best initial estimate of the population mean. To indicate how accurate this

estimate is, we construct a range around the sample mean that likely contains the population mean.

The width of the range is determined by the sample size, sample standard deviation, and the level

of confidence. The confidence level measures how certain we are that the range we construct

contains the true population mean.

1. Construct a range around the sample mean to estimate the population mean

2. Larger standard deviation => wider range

3. Large sample => smaller range

4. Greater confidence => wider range

The normal distribution

Alice recommends taking a step back from sampling and learning about the normal distribution.

The normal distribution helps us create a range around a sample mean that is likely to contain the

true population mean. You can use the normal distribution to turn the intuitive notion of

"confidence in your estimate" into a precisely defined concept. Understanding the normal

distribution will also give you deeper insight into how sampling works.

The normal distribution is a probability distribution that is centered at the mean. It is shaped like a

bell, and is sometimes called the "bell curve."

The z-statistic

The unique shape of the normal curve allows us to translate any normal distribution into a

standard normal curve, as we did with women's heights simply by re-labeling the x-axis. To do

this more formally, we use something called the z-statistic.

The normal distribution has a unique symmetrical shape whose center and width are completely

determined by its mean and its standard deviation. For every normal distribution, the probability

of being within a specified number of standard deviations of the mean is the same. The distance

from the mean, as measured in standard deviations, is known as the z-value. Using the properties

of the normal distribution, we can calculate a probability associated with any range of values.

Unique, symmetrical bell shape

Center at mean; width determined by standard deviation

Probability within 2 standard deviations of the mean = 95%

Probability within 1 standard deviation of the mean = 68%

z-value = (x - mean) / sigma

Using Excel's normal functions

To find the cumulative probability associated with a given z-value for a standard normal curve, we

use the Excel function NORMSDIST. Note the S between the M and the D. It indicates we are

working with a 'standard' normal curve with mean zero and standard deviation one.
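For readers working outside Excel, here is a minimal sketch of the same two calculations in Python, assuming the scipy library is available; norm.cdf plays the role of NORMSDIST, and the x, mean, and sigma values are purely hypothetical:

    # Sketch: a z-value and its cumulative probability (assumes scipy).
    from scipy.stats import norm

    x, mean, sigma = 66.0, 63.5, 2.5    # hypothetical height example
    z = (x - mean) / sigma              # distance from mean in standard deviations
    print(z)                            # 1.0
    print(norm.cdf(z))                  # ~0.84: probability of a value <= x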

The previous clip shows us how to use software programs like Excel to calculate z-values and

cumulative probabilities for the normal curve. Another way to find z-values and cumulative


probabilities is to use a z-table. Using z-tables is a bit more cumbersome than using Excel, but it

helps reinforce the concepts.

Find the cumulative probability associated with the z-value 2.

Enter your answer in decimal notation with 2 digits to the right of the decimal, (e.g., enter "5" as

"5.00"). Round if necessary.

Find the cumulative probability associated with the z-value 2.36.

Enter your answer in decimal notation with 2 digits to the right of the decimal, (e.g., enter "5" as

"5.00"). Round if necessary.

Find the cumulative probability associated with the z-value -1.

Enter your answer in decimal notation with 2 digits to the right of the decimal, (e.g., enter "5" as

"5.00"). Round if necessary.

Find the cumulative probability associated with the z-value 1.645.

Enter your answer in decimal notation with 2 digits to the right of the decimal, (e.g., enter "5" as

"5.00"). Round if necessary.

Find the cumulative probability associated with the z-value -1.645.

Enter your answer in decimal notation with 2 digits to the right of the decimal, (e.g., enter "5" as

"5.00"). Round if necessary.

For a normal curve with mean 100 and standard deviation 10, find the cumulative probability

associated with the value 115.

Enter your answer in decimal notation with 2 digits to the right of the decimal, (e.g., enter "5" as

"5.00"). Round if necessary.

For a normal curve with mean 100 and standard deviation 10, find the cumulative probability

associated with the value 80.

Enter your answer in decimal notation with 2 digits to the right of the decimal, (e.g., enter "5" as

"5.00"). Round if necessary.

For a normal curve with mean 100 and standard deviation 10, find the probability of obtaining a

value greater than 80 but less than 115.

Enter your answer in decimal notation with 2 digits to the right of the decimal, (e.g., enter "5" as

"5.00"). Round if necessary

For a normal curve with mean 80 and standard deviation 5, find the probability of obtaining a


value greater than 85 but less than 95.

Enter your answer in decimal notation with 2 digits to the right of the decimal, (e.g., enter "5" as

"5.00"). Round if necessary.

For a normal curve with mean 47 and standard deviation 6, find the probability of obtaining a

value greater than 45.

Enter your answer in decimal notation with 3 digits to the right of the decimal, (e.g., enter "5" as

"5.000"). Round if necessary.


For a normal curve with mean 47 and standard deviation 6, find the probability of obtaining a

value greater than 38 but less than 45.

Enter your answer in decimal notation with 2 digits to the right of the decimal, (e.g., enter "5" as

"5.00"). Round if necessary.

Find the z-value associated with the cumulative probability of 60%.

Enter your answer in decimal notation with 2 digits to the right of the decimal, (e.g., enter "5" as

"5.00"). Round if necessary.

Find the z-value associated with the cumulative probability of 40%.

Enter your answer in decimal notation with 2 digits to the right of the decimal, (e.g., enter "5" as

"5.00"). Round if necessary.

Find the z-value associated with the cumulative probability of 2.5%.

Enter your answer in decimal notation with 2 digits to the right of the decimal, (e.g., enter "5" as

"5.00"). Round if necessary.

For a normal curve with mean 222 and standard deviation 17, find the value associated with the

cumulative probability of 88%.

Enter your answer as an integer (e.g., "5"). Round if necessary.

For a normal curve with mean 222 and standard deviation 17, find the value associated with the

cumulative probability of 28%.

Enter your answer as an integer (e.g., "5"). Round if necessary.
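If you want to check your answers to exercises of this kind without a z-table, the following sketch shows the corresponding Python calls, again assuming scipy; norm.ppf is the inverse of norm.cdf and corresponds to Excel's NORMSINV:

    # Sketch: cumulative probabilities and inverse lookups (assumes scipy).
    from scipy.stats import norm

    print(norm.cdf(2.0))      # ~0.98: cumulative probability of z = 2
    print(norm.ppf(0.60))     # ~0.25: z-value for cumulative probability 60%
    print(norm.ppf(0.025))    # ~-1.96: z-value for cumulative probability 2.5%

    # For a non-standard curve, rescale with the mean and standard deviation,
    # e.g., mean 222 and standard deviation 17:
    z = norm.ppf(0.88)        # ~1.17
    print(222 + z * 17)       # ~242: value at cumulative probability 88%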


The Central Limit Theorem

How can the normal distribution help you sample Leo's hotel guests?

How do the unique properties of the normal distribution help us when we use a random sample to

infer something about the underlying population?

After all, when we sample a population, we usually have no idea whether or not the population is

normally distributed. We're typically sampling because we don't even know the mean of the

population! If the normal distribution is such a great tool, when can we use it?

It turns out that even if a population is not normally distributed, the properties of the normal

distribution are very helpful to us in sampling. To see why, let's first learn about a well-established

statistical fact known as the "Central Limit Theorem".

Definition

Roughly speaking, the Central Limit Theorem says that if we took many random samples from a

population and plotted the means of each sample, then — assuming the samples we take are

sufficiently large — the resulting plot of the sample means would look normally distributed.

Furthermore, if we took enough of these samples, the mean of the resulting distribution of sample

means would be equal to the true mean of the population.

To repeat: no matter what type of distribution the population has — uniform, skewed, bi-modal, or

completely bizarre — if we took enough samples, and the samples were sufficiently large, then the

means of those samples would form a normal distribution centered around the true mean of the

population.

The Central Limit Theorem is one of the subtlest aspects of basic statistics. It may seem odd to be

drawing a distribution of the means of many samples, but that is exactly what we are doing. We'll

call this distribution the Distribution of Sample Means. (Statisticians also often call it the

Sampling Distribution of the Mean).

Let's walk through this step-by-step. If we have a population — any population — we can take a

random sample. This sample has a mean.

We can plot that mean on a graph.

Then we take another sample. That sample also has a mean, which we also plot on the graph.

Now, if we plot a lot of sample means in this way, they will start to form a normal distribution


around the population's mean.

The more samples we take, the more the graph of the sample means would look like a normal

distribution. Eventually, the graph of the sample means — the Distribution of the Sample Means

— would form a nearly perfect replica of a normal distribution.

Now, nobody would actually take a lot of samples, calculate all of the sample means, and then

construct a normal distribution with them. We're taking a lot of samples here just to let you see

that graphing the means of many samples would give you a normal curve.

In the real world, we take a single sample and squeeze it for all the information it's worth. But

what does the Central Limit Theorem allow us to say based on that single sample?

The Central Limit Theorem tells us that the mean of that one sample is part of a normal

distribution. More specifically, we know that the sample mean falls somewhere in a normal

Distribution of Sample Means that is centered at the true population mean.

The Central Limit Theorem is so powerful for sampling and estimation because it allows us to

ignore the underlying distribution of the population we want to learn about. Since we know the

Distribution of Sample Means is normally distributed and centered at the true population mean,

we can completely disregard the underlying distribution of the population.

As we'll see shortly, because we know so much about the normal distribution, we can use the

information about the Distribution of Sample Means to draw conclusions about the likelihood of

different values of the actual population mean.

SUMMARY

The Central Limit Theorem states that for any population distribution, the means of samples from

that population are distributed approximately normally. The more samples, and the larger the

sample size, the closer the Distribution of Sample Means fits a normal curve. The mean of a single

sample lies on this normal curve, so we can use the normal curve's special properties to extract

more information from a single sample mean.

1. Sample means are distributed approximately normally, regardless of the distribution of the underlying population.

2. More samples, larger sample size => better approximation to the normal distribution

3. Mean of Distribution of Sample Means = mean of population distribution

4. Use properties of the normal distribution to extract information from a sample

Let's see how the Central Limit Theorem works using a graphical illustration. The three icons are marked "Uniform," "Bimodal," and "Skewed"; clicking on each displays a different kind of distribution.

Clicking on "Uniform" will display a distribution that is uniform in shape, i.e., a distribution for which all values in a specified range are equally likely to occur. Clicking on "Bimodal" will display a distribution that has two separate areas where values are more likely to occur than elsewhere. Clicking on "Skewed" will display a distribution that is not symmetrical — values are more likely to fall above the mean than below.

Uniform

The population distribution is on the top half of the page. Let's take a sample of it. This sample has

a mean.

Let's start building a distribution of the sample means on the bottom half of the page by placing

each sample mean on a graph. We repeat this process several times to create our distribution.

Take a sample. Find its mean. Record it in the sample mean histogram. This histogram

approximates the distribution of the sample means.

As we can see, the shape of the original distribution doesn't matter. The distribution of the sample

means will always form a normal distribution. This is what the Central Limit Theorem predicts.

Bimodal

The population distribution is on the top half of the page. Let's take a sample of it. This sample has

a mean.

Let's start building a distribution of the sample means on the bottom half of the page by placing

each sample mean on a graph. We repeat this process several times to create our distribution.

Take a sample. Find its mean. Record it in the sample mean histogram. This histogram

approximates the distribution of the sample means.

As we can see, the shape of the original distribution doesn't matter. The distribution of the sample

means will always form a normal distribution. This is what the Central Limit Theorem predicts.

Skewed

The population distribution is on the top half of the page. Let's take a sample of it. This sample has

a mean.

Let's start building a distribution of the sample means on the bottom half of the page by placing

each sample mean on a graph. We repeat this process several times to create our distribution.

Take a sample. Find its mean. Record it in the sample mean histogram. This histogram

approximates the distribution of the sample means.

As we can see, the shape of the original distribution doesn't matter. The distribution of the sample

means will always form a normal distribution. This is what the Central Limit Theorem predicts.

The Central Limit Theorem states that the means of sufficiently large samples are always normally

distributed, a key insight that will allow you to estimate the population mean from a sample.
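The same demonstration can be reproduced numerically. Below is a minimal simulation sketch in Python, assuming numpy is available; the exponential population is simply one convenient choice of a skewed distribution:

    # Sketch: illustrating the Central Limit Theorem by simulation (assumes numpy).
    import numpy as np

    rng = np.random.default_rng(0)
    population = rng.exponential(scale=2.0, size=100_000)  # a skewed population

    n = 50                                   # size of each sample
    sample_means = [rng.choice(population, size=n).mean()
                    for _ in range(5_000)]   # one mean per sample

    print(population.mean())                 # true population mean, ~2.0
    print(np.mean(sample_means))             # mean of the sample means, also ~2.0
    print(np.std(sample_means))              # spread of the sample means...
    print(population.std() / np.sqrt(n))     # ...close to sigma / sqrt(n), ~0.28

A histogram of sample_means would come out bell-shaped even though the population itself is strongly skewed, which is exactly what the theorem predicts.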

Confidence intervals

Using the properties of the normal distribution and the Central Limit Theorem, you can construct a

range of values that is almost certain to contain the population mean.

For a normal distribution, we know that if we select a value at random, it will be within two


standard deviations of the distribution's mean 95% of the time.

The Central Limit Theorem offers us two additional insights. First, we know that the means of

sufficiently large samples are normally distributed, regardless of the distribution of the underlying

population.

Second, we know that the mean of the Distribution of Sample Means is equal to the true

population mean.

Combining these facts can give us a measure of how accurately the mean of a sample estimates

the population mean.

Specifically, we can now conclude that if we take a sufficiently large sample — let's say at least 30

points — from a population, there is a 95% chance that the mean of that sample falls within two

standard deviations of the true population mean.

Let's build this up step by step to make sure we understand the logic.

First, we take a sample from a population and compute its mean. We know that the mean of that

sample is a point on a normal distribution — the Distribution of Sample Means.

Since the mean of our sample is a value randomly obtained from a normal distribution, there is a

95% chance that the sample mean is within two standard deviations of the mean of the

distribution.

The Central Limit Theorem tells us that the mean of that distribution is the same as the true

population mean. Thus, we can conclude that there is a 95% chance that the sample mean is within

two standard deviations of the population mean.

We have argued that 95% of our samples will have a mean within the range shown around the true

population mean.

Next we'll turn this around and look at intervals around sample means, because that's exactly what

a confidence interval is.

Let's look at intervals around the means of two different types of samples: those whose sample

means fall within the 2 standard deviation range around the population mean (which should be the

case for 95% of all samples) and those whose sample means fall outside the 2 standard deviation

range around the population mean (which should be the case for 5% of all samples).

First, let's look at a sample whose mean falls outside the 2 standard deviation range shown around

the population mean.

Since this sample mean is outside the range, it must be more than 2 standard deviations away from

the population mean. Since the population mean is more than 2 standard deviations away from this


sample mean, an interval of width 2 standard deviations around this sample mean could not

contain the true population mean.

We know that 5% of all samples should have sample means outside the 2 standard deviation range

around the population mean. Therefore 5% of all samples we obtain will have intervals that do not

contain the population mean.

Now let's think about the remaining 95% of samples whose means do fall within the 2 standard

deviation range around the population mean.

If we draw an interval of width 2 standard deviations around any one of these sample means, the

interval would contain the true population mean. Thus, 95% of all samples we obtain will have

intervals that contain the population mean.

We've just shown how to go from any sample mean — a point estimate — to a range around the

sample mean — a 95% confidence interval. We've also argued that 95% of confidence intervals

obtained in this way should contain the true population mean.

It's important to emphasize: We are not saying that 95% of the time our sample mean is the

population mean, but we are saying that 95% of the time a range that is two standard deviations

wide centered around the sample mean contains the population mean.

To visualize the general concept of a confidence interval, imagine taking 20 different samples

from a population and drawing a confidence interval around each. As the diagram shows, on

average 95% of these intervals — or 19 out of 20 — would actually contain the population mean.

What does this insight mean for us as managers? When we set a confidence level of 95%, we are

agreeing to an approach that 1 out of 20 times will give us an interval that does not contain the

true population mean. If we aren't comfortable with those odds, we should raise the confidence

level.

If we increase the confidence level to 98%, we have only a 1 out of 50 chance of obtaining

an interval that does not contain the true population mean. However, this higher confidence comes

at a cost. If we keep the same sample size, then the confidence interval will widen, thereby

decreasing the accuracy of our estimate. Alternatively, to keep the same interval width, we can

increase our sample size.

How do we know if an interval is too wide? Typically, if we would make a different decision for

different values within an interval, that interval is too wide. Let's look at an example.

To estimate the percent of people in our industry who will attend the annual conference, we might construct a confidence interval that ranges from 7% to 13%. If we would select a different conference venue when the true percentage is 7% than when it is 13%, we need to tighten our range.


Now, before we are ready to actually create our own confidence intervals, there is a technical point

we need to be acquainted with. We need to know that the standard deviation of the Distribution of

Sample Means is σ, the standard deviation of the underlying population, divided by the square

root of n, the sample size.

We won't prove this fact here, but simply note that it is true, and that it should confirm our general

intuition about the Distribution of Sample Means. For example, if we have huge samples, we'd

expect the means of those large samples to be tightly clustered around the true population mean,

and thereby form a narrow distribution.

A confidence interval is an estimate for the mean of a population. It specifies a range that is likely

to contain the population mean. A confidence interval is centered at the mean of a sample

randomly drawn from the population under study. With a confidence level of 95%, we expect the intervals constructed around 95 out of 100 such sample means to contain the population mean.

1. The confidence interval is centered at the sample mean.

2. 95% confidence => confidence intervals around 95% of all sample means contain the population mean.

3. Weigh the odds of a confidence interval not containing the population mean against the costs of a higher confidence level.

4. The confidence interval should be narrow enough that the management decision will not change for different values in the interval.

5. The standard deviation of the Distribution of Sample Means is sigma / sqrt(n) (SEE NOTE BOOK P90-1).

Finding a confidence interval

You understand the theory behind a confidence interval. But how do you actually construct one?

We can now translate the previous discussion into a simple method for finding a confidence

interval for the mean of any population. First, we randomly select a sample of at least 30 points from the population.

We then compute the mean and standard deviation of the sample.

Next, we assign the sample mean as the center of the confidence interval.

To find the width of the interval, we must know the level of confidence we want to assign to the interval. If we want a 95% confidence interval, the interval should extend 2 times the standard deviation of the population divided by the square root of n, the sample size, on either side of the sample mean.

Since we typically don't know the standard deviation of the population, we substitute the best

estimate that we do have — the standard deviation of the sample.


Here's what the equation looks like for our example: x-bar ± 2 * s / sqrt(n).

If we want a level of confidence other than 95%, instead of multiplying s/sqrt(n) by 2, we multiply

by the z-value corresponding to the desired level of confidence.

We can use this formula to compute any confidence interval. There is one restriction: in order for

it to work, the sample size has to be at least 30.

Let's walk through an example. Wine Lover's Magazine's managers have asked us to help them

estimate the average age of their subscribers so they can better target potential advertisers.

We tell them we plan to survey a sample of their subscribers. They say they're comfortable with

our working with a sample, but emphasize that they want to be 95% confident that the range we

give them contains the true average age of its full set of subscribers.

We obtain survey results from 60 randomly-chosen subscribers and determine that the sample has

a mean of 52 and a standard deviation of 40.

To find an appropriate confidence interval, we incorporate information about the sample into the formula: 52 ± z * 40 / sqrt(60).

The z-value for a 95% confidence interval is about 2, or more accurately, about 1.96. This tells us

that a 95% confidence interval would begin at 52 minus 10.12, or 41.88, and end at the mean plus

10.12, or 62.12. (SEE NOTE BOOK P90-2)

We give management the range from 41.88 to 62.12 as an estimate of the average age of its

subscribers, telling them they can be 95% confident that the true population mean falls between

these values.
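As a quick check on the arithmetic above, here is the same calculation as a short Python sketch (scipy assumed):

    # Sketch: 95% confidence interval for the Wine Lover's Magazine survey.
    from math import sqrt
    from scipy.stats import norm

    x_bar, s, n = 52.0, 40.0, 60                   # sample mean, std. deviation, size
    z = norm.ppf(0.975)                            # ~1.96 for a 95% interval
    half_width = z * s / sqrt(n)                   # ~10.12
    print(x_bar - half_width, x_bar + half_width)  # ~41.88 to ~62.12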

What if we want a confidence level other than 95%? We can use the sample mean, standard

deviation, and size from the sample data, but how do we obtain the right z-value?

The z-value for 95% confidence is well known to be about 2, but how do we find a z-value for a

less common confidence interval? To be 98% confident that our interval contains the population

mean, how do we obtain the appropriate z-value?

(SEE NOTE BOOK P90-3)

To find the z-value for 98% confidence level, we are essentially asking: How far to the left and

right of the standard normal curve's mean do we have to go to capture 98% of the area?

Capturing 98% of the area centered at the mean of the normal curve leaves two areas at the tails,

each covering 1% of the area under the curve. The z-value of the right boundary is the z-value

associated with a cumulative probability of 99% — the sum of the central 98% and the 1% in the

left tail.


Converting the desired confidence level into the corresponding cumulative probability on the

standard normal curve is essential because Excel's NORMSINV function and the z-table work

with cumulative probabilities.

To find the z-value associated with a cumulative probability of 99%, enter into Excel

=NORMSINV(0.99), which returns the z-value 2.33. Or, look in the z table and find the cell that

contains a cumulative probability closest to 0.9900. The z-value is 2.33, the sum of the row-value

2.3 and the column-value 0.03.
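This conversion from confidence level to cumulative probability is easy to wrap in a small helper. A sketch in Python, with scipy's norm.ppf standing in for NORMSINV; the helper name is our own:

    # Sketch: z-value for any confidence level (assumes scipy).
    from scipy.stats import norm

    def z_for_confidence(conf):
        # Exclude (1 - conf) in total from the two tails, half in each,
        # so the right boundary sits at cumulative probability conf + tail.
        tail = (1.0 - conf) / 2.0
        return norm.ppf(conf + tail)

    print(z_for_confidence(0.95))   # ~1.96
    print(z_for_confidence(0.98))   # ~2.33
    print(z_for_confidence(0.99))   # ~2.58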

Try finding a z-value yourself. Find the z-value associated with a 99.5% confidence level using

the appropriate normal distribution function in Excel or using the Standard Normal Table (z-table)

in your briefcase.

The correct z-value for a confidence level of 99.5% is:

2.81

Our first step is to convert the confidence level of 99.5% into the corresponding cumulative

probability on the standard normal curve. To do this, note that to have 99.5% probability in the

middle of the standard normal curve, we must exclude a total area of 1 - 99.5% = 0.5% from the

curve. That area is divided into two equal parts in the distribution's tails: 0.25% in each tail.

(SEE NOTE BOOK P90-4)

We can now see that the cumulative probability associated with a confidence level of 99.5% is 1 - 0.25% = 99.75%. Thus, the z-value for a confidence level of 99.5% is the same as the z-value of a

cumulative probability of 99.75%. We find the z-value in Excel by entering

=NORMSINV(0.9975), which returns the value 2.81. Alternatively, we could find the z-value in

the z-table by looking up the probability 0.9975.

Summary

To calculate a confidence interval, we take a sample, compute its mean and standard deviation,

and then build a range around the sample mean with a specified level of confidence. The

confidence level indicates how confident we are that the interval we constructed contains the true population mean.

(SEE NOTE BOOK P90-5)

Using Small Samples

We assumed in our confidence limit calculations that the sample size was at least 30. What if it

isn't? What if we have only a small sample? Let's consider a different survey, one that concerns a

delicate matter.

The business manager of a large ocean liner, the Demiurgos, asks for our help. She wants us to find

out the value of her guests' belongings. She needs this value to determine the correct insurance


protection in case guest belongings disappear from their cabins, are destroyed in a fire, or sink

with the ship.

She has no idea how valuable her guests' belongings are, but she feels uneasy asking them for this

information. She is willing to ask only 16 guests to estimate the total value of the belongings in

their cabins. From this sample, we need to prepare an estimate.

With a sample size less than 30, we cannot calculate confidence intervals in the same way as with

a large sample size. A small sample increases our uncertainty about two important aspects of our

estimate of the population mean.

First, with a small sample, the consequences of the Central Limit Theorem are not assured, so we

cannot be sure that the sample means follow a normal distribution.

Second, with a small sample, we can't be sure that the sample standard deviation is a good

estimate of the population standard deviation.

Due to these additional uncertainties, we cannot use z-values to construct confidence intervals.

Using a z-value would overstate our confidence in our estimate.

Can we still create a confidence interval? Is there a way to estimate the population mean even if

we have only a handful of data points?

It depends: if we don't know anything about the underlying population, we cannot create a

confidence interval with fewer than 30 data points. However, if the underlying population is

normally distributed — or even roughly normally distributed — we can use a confidence interval

to estimate the population mean.

In practice, as long as we are sure the underlying population is not highly skewed or extremely

bimodal, we can construct a confidence interval, even when we have a small sample. However, we

do need to modify our approach slightly.

To estimate the population mean with a small sample, we use a t-distribution, which was

discovered in the early 20th century at the Guinness Brewing Company in Ireland.

A t-distribution gives us t-values in much the same way as a normal distribution gives us z-values.

What is the difference between the normal distribution and the t-distribution?

A t-distribution looks similar to a normal distribution, but is not as tall in the center and has

thicker tails, because it is more likely than the normal distribution to have values fall farther away

from the mean.

Therefore, the normal distribution's "rules of thumb" for 68% and 95% probabilities no longer

hold. For example, we must go more than 2 standard deviations on either side of the mean to

capture 95% of the probability for a t-distribution.


Thus, to achieve the same level of confidence, a confidence interval based on a t-distribution will

be wider than one based on a normal distribution. This reinforces our intuition: we have less

certainty about our estimate with a smaller sample, so we need a wider interval to achieve a given

level of confidence.

The t-distribution is also different because it varies with the sample size: For each sample size,

there is a different t-value associated with a given level of confidence. The smaller the sample size

n, the shorter the height and the thicker the tails of the t-distribution curve, and the farther we have

to go from the mean to reach a given level of confidence.

On the other hand, as the sample size increases, the shape of the t-distribution becomes more and

more like the shape of a normal distribution. Once we reach a sample size of 30, the t-distribution

becomes virtually identical to the z-distribution, so t-values and z-values can be used

interchangeably.

Incidentally, we can use the t-distribution even for sample sizes larger than 30. However, most

people use the z-distribution for larger samples, partially out of habit and partially because it's

easier, since the z-value doesn't vary based on the sample size.

To find the right t-value, we first have to identify the t-distribution that corresponds to our sample

size. We do this by finding the number of "degrees of freedom" of the sample, which for our

purposes is simply the sample size minus one. If our sample size is 16, we have 15 degrees of

freedom, and so on.

Excel provides a simple function for finding the appropriate t-value for a confidence interval. If

we enter 1 minus the level of confidence we want and the degrees of freedom into the Excel

function TINV, Excel gives us the appropriate t-value.

For example, for a 95% confidence interval and a sample size of n = 16, the Excel function

TINV(0.05,15) would return the value 2.131.

Once we find the t-value, we use it just like we used the z-value to find a confidence interval. For example, for t = 2.131, the appropriate confidence interval is x-bar ± 2.131 * s / sqrt(16).
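Outside Excel, the same lookup can be done with scipy's t distribution; note that t.ppf takes a cumulative probability (97.5% for a two-tailed 95% interval) and the degrees of freedom, rather than TINV's two-tailed probability. A sketch, with the sample standard deviation s = 8 as a purely hypothetical value:

    # Sketch: t-value for a 95% confidence interval with n = 16 (assumes scipy).
    from math import sqrt
    from scipy.stats import t

    n = 16
    df = n - 1                   # degrees of freedom = sample size - 1
    t_val = t.ppf(0.975, df)     # ~2.131, matching Excel's TINV(0.05, 15)
    print(t_val)

    s = 8.0                      # hypothetical sample standard deviation
    print(t_val * s / sqrt(n))   # ~4.26: the interval's half-width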

If we don't have Excel handy, we can use a t-distribution table to find the t-value associated with

the degrees of freedom and the confidence level we specify. When using different t-value tables,

we need to be careful to note which probability the table depicts.

Some tables report values associated with the confidence level, like 0.95. Others report values

based on the area in the tails, which would be 0.05 for a 95% confidence interval. Our t-table, like

many others, reports values associated with a cumulative probability, so for a 95% level of

confidence, we would have to look at a cumulative probability of 97.5%.


Returning to the good ship Demiurgos, let's determine an estimate of the average value of

passengers' belongings. The manager samples 16 guests, and reports that they have an average of

$10,200 worth of clothing, jewelry, and personal effects in their cabins. From her survey numbers,

we calculate a standard deviation of $4,800.

We need to double check that the distribution isn't too skewed, which we might expect, since some

of the passengers are quite wealthy. The manager explains that the insurance policy has a limited

liability clause that limits a passenger's maximum claim to $20,000.

Above $20,000, passengers' own homeowners' policies must cover any losses. Thus, in the survey,

if a guest reported values above $20,000, the manager simply reported $20,000 as the value to be

covered for our data set.

We sketch a graph of the 16 values that confirms that the distribution is not too asymmetric, so we

feel comfortable using the t-distribution.

Since we have a sample of 16 passengers, there are 15 degrees of freedom. The Excel function

=TINV(0.05,15) tells us that the appropriate t-value is 2.131.

Using the confidence interval formula, the guests' valuables are worth $10,200 plus or minus

2.131 times $4,800 over the square root of 16. Thus, the width of the confidence interval is

2.131*4,800/4 = $2,557, and we can report that we are 95% confident that the average value of

passengers' belongings is between $7,643 and $12,757.

What if the Demiurgos' manager thinks this interval is too large?

She will have to survey more guests. Increasing the sample size causes the t-value to decrease, and

also increases the size of the denominator (the square root of n). Both factors narrow the

confidence interval.

For example, if she asks 10 more guests, and the standard deviation of the sample does not

change, the t-value would drop to 2.06 and the square root of n in the denominator would increase.

The width of the interval would decrease significantly, from $2,557 to $1,939.
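Here is the Demiurgos calculation written out as a minimal Python sketch (scipy assumed), reproducing both the original interval and the effect of the larger sample:

    # Sketch: small-sample confidence interval for the Demiurgos survey.
    from math import sqrt
    from scipy.stats import t

    x_bar, s, n = 10_200.0, 4_800.0, 16
    half = t.ppf(0.975, n - 1) * s / sqrt(n)   # 2.131 * 4800 / 4 = ~2557
    print(x_bar - half, x_bar + half)          # ~7643 to ~12757

    # With 10 more guests (n = 26) and the same standard deviation:
    n2 = 26
    half2 = t.ppf(0.975, n2 - 1) * s / sqrt(n2)
    print(half2)                               # ~1939: a noticeably tighter interval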

SUMMARY

Confidence intervals can be constructed even with a sample size of less than 30, as long as the

population is roughly normally distributed (or, at least not too skewed or bimodal). To find a

confidence interval with a small sample, use a t-distribution. T-distributions are a set of

distributions that resemble the normal distribution, but with shorter heights near the mean and

thicker tails. To find a confidence interval for a small sample size, place the appropriate t-value

into the confidence interval formula.


(SEE NOTE BOOK P90-6)

When we take a survey, we often want a specific level of accuracy in our estimate of the

population mean. For example, when estimating car owners' average spending on car repairs each

year, we might want to be 95% confident that our estimate is within $50 of the true mean.

We know that the sample size of our survey directly affects the accuracy of our estimate. The

larger the sample size, the tighter the confidence interval and the more accurate our estimate. A

sample of size n gives us a confidence interval that extends a distance of d on either side of the

mean: d = z * sigma / sqrt(n).

To find the sample size necessary to give us a specified distance d from the mean, we must have

an estimate of sigma, the standard deviation of spending. If we do not have an estimate based on

past data or some other source, we might take a preliminary survey to obtain a rough estimate of

sigma.

In this example, we estimate sigma to be $300 based on past experience. Since we want a 95%

level of confidence, we set z = 1.96. To ensure our desired accuracy — that d is no more than $50

— we must randomly sample at least 139 people.

In general, to ensure a confidence interval extends a distance of at most d on either side of the

mean, we choose a sample size n that satisfies n >= (z * sigma / d)^2. We can solve for n with simple algebra, or by using the attached Excel utility.

When estimating a population mean, we can ensure that our confidence interval extends a distance

of at most d on either side of the mean by choosing an appropriate sample size.

(SEE NOTE BOOK P90-7)
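Solving that expression for n is easily scripted. A minimal sketch in Python for the car-repair example, with z = 1.96 and the $300 estimate of sigma; the helper name is our own:

    # Sketch: smallest sample size for a desired interval half-width d.
    from math import ceil

    def sample_size(z, sigma, d):
        # From d = z * sigma / sqrt(n): n = (z * sigma / d)**2, rounded up.
        return ceil((z * sigma / d) ** 2)

    print(sample_size(1.96, 300.0, 50.0))   # 139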

Here is a step-by-step process for creating a confidence interval:

First, we choose a level of confidence and a sample size n appropriate to the decision context.

Second, we take a random sample and find the sample mean. This is our best estimate for the

population mean.

Third, we find the sample's standard deviation.

Fourth, we find the z-value or t-value associated with the proper confidence level. If our sample size is at least 30, we find the z-value for our confidence level. If not, we find the t-value for our confidence level, with degrees of freedom = sample size - 1.

Fifth, we calculate the end points of the confidence interval: the sample mean plus or minus the z-value (or t-value) times s/sqrt(n), as sketched in the code below.
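Here is one way the five steps might be collected into a single helper, as a minimal sketch in Python; the function name and the use of scipy are our own choices:

    # Sketch: the five-step confidence interval procedure.
    from math import sqrt
    from statistics import mean, stdev
    from scipy.stats import norm, t

    def confidence_interval(sample, conf=0.95):   # step 1: confidence level chosen
        n = len(sample)
        x_bar = mean(sample)                      # step 2: sample mean
        s = stdev(sample)                         # step 3: sample std. deviation
        cum = conf + (1.0 - conf) / 2.0           # tail-adjusted cumulative prob.
        if n >= 30:
            mult = norm.ppf(cum)                  # step 4: z-value
        else:
            mult = t.ppf(cum, n - 1)              # step 4: t-value, df = n - 1
        half = mult * s / sqrt(n)                 # step 5: end points
        return x_bar - half, x_bar + half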

SUMMARY

Construct confidence intervals using the steps outlined below. With a confidence interval derived

from an unbiased random sample, we can say that the true population mean falls within the

interval with the corresponding level of confidence.

(SEE NOTE BOOK P90-8)


Click here to open an Excel utility that allows you to create confidence intervals by providing the

sample mean, standard deviation, size, and desired level of confidence. You should enter data only

in the yellow input areas of the utility. To ensure you are using the utility correctly, try to

reproduce the results for the Wine Lover's Magazine and the Demiurgos examples.

The sample you collected earlier has all the data you need to create a confidence interval for Leo's

problem.

You take another look at the survey you created earlier for Leo: you sampled 45 guests, and

calculated that the average satisfaction rate of the sample was 4.4, with a standard deviation of

1.54. Using this information, you decide to create a 95% confidence interval for Leo.

Your calculations show the following:

We can be 95% sure that the population mean falls between 3.95 and 4.85.

To create a 95% confidence interval, you take the mean of the sample and add/subtract the z-value

multiplied by the sample standard deviation divided by the square root of the sample size. Using

the numbers given, you obtain a 95% confidence interval by going 0.45 points above and below

the sample mean of 4.4, which translates into a confidence interval from 3.95 to 4.85.

You meet with Leo and tell him that you can be 95% certain that the population mean falls

between 3.95 and 4.85. Leo looks at your numbers and shakes his head.

That's just not accurate enough for me to make a decision. If the mean is close to 4.85, I'd be

happy, but if it's closer to 4, I'm concerned. Can we narrow the range at all?

Looking over your notes, you think you can give Leo some options.

We can survey a larger group of people.

This is the best answer. By increasing the sample size, you can narrow your confidence interval

even if the standard deviation stays constant.

Why don't you create a larger sample and report the results back to me?

You select another 40 guests at random and ask the hotel operator to conduct the survey for you

again. He is able to reach 25 guests. You combine the two samples, which gives a new sample size

of 70.

For the combined sample, you find that the new sample mean is 4.5 and the new sample standard

deviation is 1.2. Armed with more data, you create another confidence interval.


We can be 95% certain that the average satisfaction of all hotel guests with the scuba school is

between: 4.22 and 4.78

To create this 95% confidence interval, you take the mean of the sample and add/subtract the z-

value multiplied by the sample standard deviation divided by the square root of the sample size.

Using the numbers given, you obtain a 95% confidence interval by going 0.28 points above and

below the sample mean of 4.5, which translates into a confidence interval from 4.22 to 4.78.

Thank you. I am much happier with this result. I have enough information now to decide whether

to keep the current scuba diving school.

Exercise 1

Toshi Matsumoto is the Chief Operating Officer of a consumer electronics retailer with over 150

stores spread throughout Japan. For over a year, the sales of high-end VCRs have lagged, due to a

shift towards DVD players.

Just today, Toshi heard that Veetek, a large South Asian electronics retailer, is looking to purchase

a bulk shipment of high-end VCRs.

This would be a perfect opportunity for Toshi to liquidate slow-moving inventory currently

languishing on the shelves of his stores. Before he calls Veetek, he wants to know how many high-

end VCRs he can promise. After two days of furious phone calls, his deputy has gathered data

from 36 representative outlets in his retail chain.

The mean high-end VCR inventory in each store polled was 500 units. The standard deviation was

180. Toshi needs you to find a 95% confidence interval for the average VCR inventory per store.

The interval is:

From 441 to 559

Exercise 2

Paul Segal manages the pig-farming division of the agricultural company Bowman-Lyons-

Centerville. A rumored outbreak of Pulluscular Pig Disorder (PPD) in one of Paul's herds is on the

verge of causing a public relations disaster.

The main symptom of PPD is a shrinking brain, and the only certain way to diagnose PPD is by

measuring brain size post-mortem.

Paul needs to know if his herd is affected by PPD, but he does not want to have to slaughter

hundreds of swine to find out. At the preliminary stage, he can offer no more than 5 prime porkers

to be slaughtered and diagnosed.

For the pigs slaughtered, the mean brain weight was 0.18 lbs, with a standard deviation of 0.06

lbs. With 95% confidence, in what range does the herd's average brain weight lie?


[0.106 lbs, 0.254 lbs]
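Both exercise answers can be reproduced with the same machinery. A quick verification sketch in Python (scipy assumed):

    # Sketch: checking Exercises 1 and 2.
    from math import sqrt
    from scipy.stats import norm, t

    # Exercise 1: n = 36 (z applies), mean 500, std. deviation 180, 95%.
    half1 = norm.ppf(0.975) * 180 / sqrt(36)
    print(500 - half1, 500 + half1)           # ~441 to ~559

    # Exercise 2: n = 5 (t applies), mean 0.18, std. deviation 0.06, 95%.
    half2 = t.ppf(0.975, 4) * 0.06 / sqrt(5)
    print(0.18 - half2, 0.18 + half2)         # ~0.106 to ~0.254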

Proportions

The next morning, you and Alice are about to head off to the hotel pool when Leo calls you.

I'm sorry to disturb you, but I have another problem, and I think you might be able to help.

The Kahana is a very popular resort during the summer tourist season. But the number of leisure

visitors drops significantly during the off-season, from September through February and then

April through May.

We usually have quite a few room vacancies during that period of time. We expect to have about

200 rooms vacant for weeklong periods during the slow season this year.

I've developed a new program that rewards our best guests with a special discount if they book a

weeklong stay during our slow period. They won't have complete date flexibility of course, but the

steep discount should make the offer attractive for them.

To see how many of our past guests would accept such an offer, I sent promotional brochures to

100 of them. The deadline by which they had to respond to the offer has passed. Ten guests

responded with the required room deposit prior to the deadline — that's a solid 10 percent.

I figure if we send out 2,000 promotions, we'll get about 200 responses.

This is a nice idea, Leo, but I'm concerned it could backfire. If more than 10% respond to this offer,

you might end up disappointing some of the very guests you're trying to reward. Or, if too many

respond and you give them all the discount, you'll have to turn away customers willing to pay full

price.

That is exactly my concern. I wonder how accurate the 10% response rate is. Just because it held

for 100 guests, will it hold for 2,000? What if 11% actually respond to the promotions?

Imagine what would happen if 220 guests responded. I don't want to anger 20 loyal customers by

telling them the offer is not valid, but I also don't want to turn away full paying guests to

accommodate the extra 20 guests at a discount.

I'm willing to reserve 200 rooms for these discount weeklong stays during the slow season. To how many return guests can I safely send the discount offer and be confident that no more than 200 will respond?

You can tell that Leo is growing quite comfortable with relying on your statistical methods. He

seems almost as interested in them as he is in your results.


Sometimes, the question we pose to members of a sample calls for a yes or no answer.

We might survey people in a target market and ask if they plan to buy a new car this year. Or

survey voters and ask if they plan to vote for the incumbent candidate for office. Or we might take

a sample of the products our plant produced yesterday and count how many are defective.

Even though our question has only two answers, we still have to address an inherent uncertainty:

We know what values our data can take — yes or no — but we don't know how often each

response will be given.

In these cases, we usually convey the survey results by reporting the percentage of yes responses

as a proportion, p-bar. This is our best estimate of p, the true percentage of "yes" responses in the

underlying population. NOTE 1001

Suppose, for example, that we have posted advertisements in the subway cars on Boston's "Red

Line," and want to know what percentage of all passengers remembers seeing our ad.

We create a proper survey, and ask randomly selected Red Line passengers if they remember

seeing our ad. 300 passengers respond to our survey, of which 100 passengers report remembering

the ad.

Then p-bar is simply 33%, which is the number of people that remember the ad, 100, divided by

the number of respondents, 300.

The remaining 200 passengers, or 67% of the sample, report not remembering the ad. The two

proportions always add up to 1 because survey respondents report either remembering the ad or

not.

Once we know the proportion of the sample, we can draw conclusions about all Red Line

passengers. Our best estimate, or point estimate, for p, the percentage of all passengers who

remember seeing our ad, is 33%.

As managers, we typically want more than this simple point estimate — we want to know how

accurate the estimate is. How far from 33% might the true percentage be? Can we say confidently

that it is between 30% and 36%, for example?

When we work with proportions, how do we find a confidence interval around our point estimate?

The process for creating a confidence interval around a proportion is nearly identical to the

process we've used before. The only difference is that we can approximate the standard deviation


of the population with a simple formula rather than calculating it directly from the raw data.

Based on our sample, our best estimate of the true population proportion is p-bar, the percentage

of "yes" responses in our survey. Statistical theory tells us that our best estimate of the standard

deviation of the true population proportion is the square root of (p-bar) * (1 - p-bar). We can use

this approximate standard deviation to determine a confidence interval for the proportion.

NOTE 1002

For our Red Line ad, we approximate the standard deviation with the square root of 0.33 times

0.67, or 0.47. A 95% confidence interval is 0.33 plus or minus 1.96 times 0.47 divided by the

square root of 300. This is equal to 0.33 plus or minus 0.053, or 27.7% to 38.3%.
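Written out in Python, the Red Line calculation looks like this (a minimal sketch, scipy assumed, with the proportion rounded to 0.33 as in the text):

    # Sketch: confidence interval for a proportion (Red Line ad example).
    from math import sqrt
    from scipy.stats import norm

    p_bar, n = 0.33, 300                     # sample proportion and sample size
    sd = sqrt(p_bar * (1 - p_bar))           # ~0.47: estimated std. deviation
    half = norm.ppf(0.975) * sd / sqrt(n)    # ~0.053
    print(p_bar - half, p_bar + half)        # ~0.277 to ~0.383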

Unfortunately, there is one catch when we calculate confidence intervals around proportions...

NOTE 1003

Sample size

Sample size matters, particularly when dealing with very small or very large proportions. Suppose

we are sampling New Yorkers for Amyotrophic Lateral Sclerosis, commonly known as Lou

Gehrig's Disease. In the U.S., the odds of having the disease are less than 1 in 10,000. Would our

sample be useful if we surveyed 100 people?

No. We probably wouldn't find a single person with the disease in our sample. Since the true

proportion is very small, we need to have a large enough sample to make sure we find at least a

few people with the disease. Otherwise, we will not have enough data to get a good estimate of the

true proportion.

There is a guideline we must meet to make sure that our sample is large enough when estimating

proportions. Two conditions must be met: First, the product of the sample size and the proportion

must be at least 5. Second, the product of the sample size and 1 minus the proportion must also be

at least 5. NOTE 1004

If both these requirements are met, we can use the sample. Essentially, this guideline guarantees

that our sample contains a reasonable number of "yes" and a reasonable number of "no" answers.

Our sample will not be useful otherwise.

To avoid an invalid sample, we need to create a large enough sample size to satisfy the

requirements. However, since we don't know the proportion p-bar before sampling, we don't know

if the two conditions are met before setting the sample size. How can we get around this problem?

We can obtain a preliminary estimate of p-bar using either of two methods: first, we can use past

experience. For example, to estimate the rate of Lou Gehrig's disease, we can research the rate of

occurrence in the general population. This is a reasonable first estimate for p-bar.

In many cases, however, we are sampling for the first time. Without past experience, we don't


know what p-bar might be. In this case, it may well be worth our time to take a small test sample

to estimate the proportion, p-bar.

For example, if the proportion of yes answers in our small test sample is 3%, then we can use 3%

as our preliminary estimate of p-bar.

Substituting 3% for p-bar in our two requirements, n(p-bar) ≥ 5 and n(1 - (p-bar)) ≥ 5, tells us

that n must satisfy n*0.03 ≥ 5 and n*0.97 ≥ 5. Thus the sample size we need for our real sample

must be at least 167.

We would then use a real sample — with at least 167 respondents — to find an actual sample

value of p-bar to create a confidence interval for the population proportion.

NOTE 1005
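The two rules of thumb translate directly into a sample-size check. A minimal sketch in plain Python; the helper name is our own:

    # Sketch: smallest n satisfying n*p >= 5 and n*(1 - p) >= 5.
    from math import ceil

    def min_sample_size(p_est):
        # Whichever of p and (1 - p) is smaller drives the requirement.
        return ceil(5 / min(p_est, 1 - p_est))

    print(min_sample_size(0.03))   # 167, matching the test-sample example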

Summary

Proportions are often used to indicate the frequency of some characteristic in a population. The

sample proportion p-bar is the number of occurrences of the characteristic in the sample divided

by the number of respondents, the sample size. It is our best estimate of the true proportion in the

population. We can construct a confidence interval for the population proportion. Two guidelines

for the sample size must be met for a valid confidence interval: n(p-bar) and n(1 - (p-bar)) must

each be at least five.

NOTE 1006

Creating confidence intervals around proportions is not much different from creating them around

means. Finding the right number of Leo's promotional brochures to mail should be easy.

Leo needs to know how accurate the 10 percent response rate of his 100-customer sample is. Will

this response rate hold for 2,000 guests? To how many guests can he send the discount offer for

his 200 rooms?

First, you calculate a 95% confidence interval for the response rate.

Enter the lower bound as a decimal number with two digits to the right of the decimal, (e.g., enter

"5" as "5.00"). Round if necessary.

The 95% confidence interval for the proportion estimate is 0.0412 to 0.1588, or 4.12% and

15.88%. You obtain that answer by using the sample data and applying the familiar formula: p-bar ± z * sqrt(p-bar * (1 - p-bar)) / sqrt(n).

NOTE 1007

Then after giving Leo's questions some thought, you recommend to him that he send the mailing

to a specific number of guests.


Enter the number of guests as an integer, (e.g., "5"). Round if necessary.

Based on the confidence interval for the proportion, the maximum percentage of people who are

likely to respond to the discount offer (at the 95% confidence level) is 15.88%. So, if 15.88% of recipients were to respond and only 200 rooms are available, to how many people can Leo send the offer? Simply divide 200 by 0.1588: Leo can send the mailing to at most 1,259 past customers.
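Leo's two numbers follow from the proportion formula. A short verification sketch in Python (scipy assumed):

    # Sketch: Leo's response-rate interval and the safe mailing size.
    from math import floor, sqrt
    from scipy.stats import norm

    p_bar, n = 10 / 100, 100                  # 10 responses out of 100
    half = norm.ppf(0.975) * sqrt(p_bar * (1 - p_bar)) / sqrt(n)
    upper = p_bar + half
    print(p_bar - half, upper)                # ~0.0412 to ~0.1588

    print(floor(200 / upper))                 # 1259 guests at most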

Leo is pleased with your work. He tells you to relax and enjoy the resort.

Exercise 1

GMW is a German auto manufacturer that has regional sales subsidiaries throughout the world.

Arturo Lopez heads the Mexican sales division of the company's Latin American subsidiary.

GMW earns additional profit when customers choose to finance their car purchase with a GMW

financing package. Arturo has been asked to submit a report to the GMW CEO in Germany about

the percentage of GMW customers who opt for financing.

Arturo has asked you, a new member of the division sales team, to devise a way to estimate this

percentage. You take a random sample of 64 cars sold in the Mexican sales division, and find that

13 of them, or about 20.3%, opted for GMW financing.

If you want to be 95% confident in your report to Mr. Lopez, you should tell him that the percentage

of all Mexican customers opting for GMW financing falls in the range:

from 10.4% to 30.2%

Exercise 2

Kayleigh Marlon is the Chief Buyer at Tar-Mart, a company that operates a chain of superstores

selling discount merchandise. Tar-Mart has a huge national presence, and manufacturers compete

fiercely to get their products onto Tar-Mart's shelves.

Crown Toothpaste, a new entrant in the toothpaste market, is one of them. Kayleigh agreed to

stock Crown for 4 weeks and display it prominently. After that period, she will stop stocking

Crown unless 5% of Tar-Mart's customers bought Crown or were considering buying Crown

within the next month.

The trial period is now over. Kayleigh has asked you to take a sample of customers to see if Tar-

Mart should continue stocking Crown. She would like you to be at least 95% confident in your

answer.

The first step is to decide how large a sample size to choose. Kayleigh tells you that, in the past,

when Tar-Mart introduced a new product, the percentage of people who expressed interest ranged

between 2% and 10%. What sample size should you use?

250. This is the best answer. This sample size will satisfy the two rules of thumb (n(p-bar) ≥ 5 and n(1 - (p-bar)) ≥ 5) for all proportions falling in the range 2% to 10%. The binding case is the smallest proportion, 2%: n must be at least 5/0.02 = 250.

You choose a sample size of 250. After conducting the survey, you find that 10 out of 250 people

surveyed had bought Crown or were considering buying Crown within the next month. What is

the 95% confidence interval for the population proportion?

From 1.6% to 6.4%

First, you find the sample proportion: 10 out of 250 is a proportion of 4%. You verify that n(p-bar)

= 250*0.04 = 10 ≥ 5 and n(1 - (p-bar)) = 250*0.96= 240 ≥5. Then, using the formula, you find

the confidence interval around the sample proportion. The endpoints of that interval are 1.6% and

6.4%.
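
The arithmetic: 0.04 ± 1.96 * sqrt(0.04 * 0.96/250) = 0.04 ± 0.024, or 1.6% to 6.4%.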

EXERCISE3

OO-P-S is a small-package delivery service with worldwide operations. Celine Bedex, VP

Marketing, has heard increasing complaints about late deliveries, and wants to know how many of

the shipments are late by one day or more.

Celine would like an estimate of the percentage of late deliveries. In a sample of 256 shipments, 2

were delivered late, a proportion of about 0.008, or 0.8%. If Celine wants to be 99% confident in

the result of a confidence interval calculation, the interval is:

No valid inferences can be drawn from these data.

This is the best answer. One of the rules of thumb for the sample size is not being satisfied: n(p-

bar) = (256)(0.008) = 2 is less than 5.

Celine collects a new sample, this time of 729 shipments. Of these, 8 were late. Celine can be 99%

confident that the population proportion of late packages is between:

0.1% and 2.1%

This is the correct answer. The new sample size is sufficiently large to investigate a population

proportion of 0.011.

First, calculate the sample proportion for the new sample: 8/729 = 0.011. Then, verify that the new

sample size satisfies the rules of thumb. Both n(p-bar) and n(1 - (p-bar)) are greater than 5.

Using the new sample size and sample proportion, calculate the confidence interval: [0.1%, 2.1%].
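
A short Python sketch of this calculation, deriving the 99% z-value (about 2.58) from the standard library:

    import math
    from statistics import NormalDist

    p_bar, n = 8 / 729, 729
    z = NormalDist().inv_cdf(0.995)  # two-sided 99% confidence level
    margin = z * math.sqrt(p_bar * (1 - p_bar) / n)
    print(p_bar - margin, p_bar + margin)  # approximately 0.001 and 0.021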


Hypothesis Testing

After finishing the sampling assignments, you and Alice decide to take some time off to enjoy the

beach. Just as you are gathering your beach gear, Leo gives you another call.

Hi there! Don't let me keep you from enjoying the beach. I just wanted to let you know what I'd

like you to help me with next. I've been working on ideas to increase the Kahana's profits.


Is it possible to increase profits by raising the room prices? That would be an easy solution.

I wish it were that easy. Room prices are extremely competitive and are often the first thing

potential guests take into consideration. So if we increase room prices, I'm afraid we'll have fewer

guests. That might put us back where we started from with profits — or even worse.

What other factors influence your profits?

The two major ones are room occupancy rates and discretionary spending. "Discretionary

spending" is the money guests spend on non-room amenities. You know, food, drinks, spa

services, sports activities, and so on.

As a manager I can affect a variety of factors that influence discretionary spending: the quality of

the restaurant, for example, or the types of amenities offered.

And you'd like us to help you understand your guests' discretionary spending patterns better.

Right. Then I can explore new ways to increase profits on non-room amenities. I can also see if

some of my recent efforts to increase guest spending have paid off.

I'm particularly interested in restaurant operations. I've made some changes to the restaurants

recently. For example, I hired a new executive chef last year. I'd like to know if restaurant

revenues per person have changed since then.

I'd also like to find out if the renovation of our premier cocktail lounge has resulted in higher

spending on beverages.

Finally, I've been wondering if discretionary spending patterns are different for leisure and

business guests. If so, I might change our marketing campaigns to better suit each of those market

segments.

What records do you have for us to work with?

We don't have a consolidated report for this year yet, so we'll need to conduct some surveys and

analyze the results.

You're really getting into these statistical methods, aren't you, Leo?

Leo made some important changes to his business and he has some ideas of what the impact of

these changes has been. How do you put his ideas to the test?

As managers, we often need to put our claims, ideas, or theories to the test before we make

important decisions. Based on whether or not our claim is statistically supported, we may wish to

take managerial action.


Hypothesis testing is a statistical method for testing such claims. A hypothesis is simply a claim

that we want to substantiate. To begin, we will learn how to test hypotheses about population

means.

For instance, suppose we know that the historical average number of defects in a production

process is 3 defects per 1,000 units produced. We have a hunch that a certain change to the process

— a new machine, say — has changed this number. The hypothesis we wish to substantiate is that

the average defect rate has changed — that it is no longer 3 per 1,000.

How do we conduct a hypothesis test? First, we collect a random sample of units produced by the

process. Then, we see whether or not what we learn about the sample supports our hypothesis that

the defect rate has changed.

Suppose our sample has an average defect rate of 2.7 defects per 1,000. Based on this sample, can

we confidently say that the defect rate has changed?

That depends. To find out, we construct a range around the historical defect rate of 3 — the

population mean that has been cast in doubt. We construct the range so that if the mean defect rate

in the population is still 3, it is very likely for the mean of a sample taken from the population to

fall within that range.

The outcome of our test will depend on whether 2.7, the mean of the sample we have taken, falls

within the range or not.

If the sample mean of 2.7 falls outside of the range, we feel comfortable rejecting the hypothesis

that the defect rate is still 3.

However, if the sample mean falls within the range, we don't have enough evidence to support the

claim that the defect rate has changed.

This example captures the essence of hypothesis testing, but we need to formalize our intuition

about the example and define our new statistical technique more precisely.

To conduct a hypothesis test, we formulate two hypotheses: the so-called null hypothesis and the

alternative hypothesis.

Based on experience or conventional wisdom, we have an initial value of the population mean in

mind. The null hypothesis states that the population mean is equal to that initial value: in our

example, the null hypothesis states that the current population mean is 3 defects per 1,000. We use

the Greek letter mu to represent the population mean, in this case the current average defect rate.

In symbols, the null hypothesis is µ = 3 defects per 1,000 units.

The alternative hypothesis is the claim we are trying to substantiate. Here, the alternative

hypothesis is that the average defect rate has changed. Note that the alternative hypothesis states

that the null hypothesis does not hold.


As the example suggests, in a hypothesis test, we test the null hypothesis. Based on evidence we

gather from a sample, there are only two possible conclusions we can draw from a hypothesis test:

either we reject the null hypothesis or we do not reject it.

Since the alternative hypothesis states the opposite of the null hypothesis, by "rejecting" the null

hypothesis we necessarily "accept" the alternative hypothesis.

In our example, the evidence from our sample will help us determine whether or not we should

reject the null hypothesis that the defect rate is still 3 in favor of the alternative hypothesis that the

defect rate has changed.

Based on our sample evidence, which conclusion should we draw? We reject the null hypothesis if

it is highly unlikely that our sample mean would come from a population with the mean stated by

the null hypothesis.

For example, if the sample we drew had a defect rate of 14 per 1,000, we would reject the null

hypothesis. Drawing a sample with 14 defects from a population with an average defect rate of 3

would be very unlikely.

"We cannot reject the null hypothesis if it is reasonably likely that our sample mean would come

from a population with the mean stated by the null hypothesis. The null hypothesis may or may

not be true: we simply don't have enough evidence to draw a definite conclusion."

For example, if the sample we drew had a defect rate of 3.05 per 1,000, we could not reject the

null hypothesis, since it wouldn't be unusual to randomly draw a sample with 3.05 defects from a

population with an average defect rate of 3.

Note that having the sample's average defect rate very close to 3 does not "prove" that the mean is

3. Thus we never say that we "accept" the null hypothesis — we simply don't reject it.

It is because we can never "accept" the null hypothesis that we do not pose the claim that we

actually want to substantiate as the null hypothesis — such a test would never allow us to "accept"

our claim! The only way we can substantiate our claim is to state it as the opposite of the null

hypothesis, and then reject the null hypothesis based on the evidence.

It is important that we understand exactly how to interpret the results of a hypothesis test. Let's

illustrate the two types of conclusions with an analogy: a US jury trial.

In the US judicial system, the accused is considered innocent until proven guilty. So, the null

hypothesis is that the accused is innocent. The alternative hypothesis is that the accused is guilty:

this is the claim that the prosecution is trying to prove.

The two possible outcomes of a jury trial are "guilty" or "not guilty." The jury does not convict the

accused unless it is certain beyond reasonable doubt that the accused is guilty. With insufficient


evidence, the jury cannot conclude that the accused truly is innocent. The jury simply declares that

the accused is "not guilty."

Similarly, in a hypothesis test, if our evidence is not strong enough to reject the null hypothesis,

then that does not prove that the null hypothesis is true. We simply have failed to show it is false,

and thus cannot reject it.

A hypothesis is a claim or assertion that can be tested. On the basis of a hypothesis test we either

reject or leave unchallenged a particular statement: the null hypothesis.

Alice promises Leo that the two of you will drop by his office first thing in the morning to test if

Leo's survey results support his claims that food and beverage spending patterns have changed.

SUMMARY

We use hypothesis tests to substantiate a claim about a population mean. The null hypothesis states

that the population mean is equal to an initial value that is based on our experience or

conventional wisdom. We test the null hypothesis to learn if we should reject it in favor of our

claim, the alternative hypothesis, which states that the null hypothesis does not hold.

Use hypothesis tests to substantiate theories and claims about population means.

The Null Hypothesis: Expresses conventional wisdom about the mean.

The Alternative Hypothesis: Is the claim we wish to substantiate. Is the opposite of the null hypothesis.

To conduct a hypothesis test: Collect a sample. Ask: Is the sample highly unlikely if the null hypothesis is true? If yes, reject the null hypothesis. If no, do not reject the null hypothesis.

Never "accept" the null hypothesis.

Single population means

The next morning, Leo explains the measures he has undertaken to increase customer spending on

food and beverages. "I'd like to see if they've had a discernible impact on my guests' restaurant-

related spending patterns."

Last year, I made two major changes to restaurant operations: I brought in a new executive chef

and renovated the main cocktail lounge.

The chef introduced a new menu: a fusion of traditional Hawaiian and French cuisine. She put

some elaborate items on the menu, like that mango and brie tart I recommended to you. She also

has offerings that cater to simpler tastes. But the question is, have restaurant profits been affected

by the new chef?

Since we set our food margins as a fixed percentage of food revenue, I know that if revenues have

increased, profits have increased too. Based on last year's consolidated reports, the average

spending on food per person per day was $55. I'm curious to see if that has changed.

In addition, I renovated the cocktail lounge. The old bar was designed poorly and used space


inefficiently. Now more guests can be seated in the lounge, and more seats have good views of the

ocean.

I also invested in a large machine that makes a wide variety of frozen drinks. Frozen pina coladas

are very, very popular.

I hope my investments in the bar are paying off in terms of higher guest spending on drinks.

Beverages have high margins, but I'm not sure if beverage sales have increased enough to cover

the investments.

Can we say, for beverages, as for food, that "changes in revenues" are a good proxy for "changes

in profits?"

Absolutely. I set my profit margins as a fixed percentage of revenues for beverages as well. Last

year, the average spending on beverages per guest per day was $21.

Isn't that high?

Well, we have some very nice wines in our restaurants.

We don't have the consolidated report yet, but I've already had my staff choose a random sample

of guests.

We pulled the restaurant and lounge receipts for the guests in the sample and noted three items:

total food revenues, total beverage revenues, and number of guests at the table. Using this

information, we should be able to estimate the daily spending on food and beverages per guest.

You look at Leo's data and wonder how you can discern whether Leo's changes — the new chef

and the bar renovations — have influenced the resort's profits.

Leo has prepared data for you. How are you going to put it to use?

Our first type of hypothesis test is used to study population means. Let's walk through an example

of this type of test.

Suppose the manager of a movie theater implemented a new strategy at the beginning of the year:

he started showing old classics instead of recent releases.

He knows that prior to the change in strategy, average customer satisfaction was 6.7 out of a

possible 10 points. He would like to know if average customer satisfaction has changed since he

altered his theater's artistic focus.

The manager's null hypothesis states that the current mean satisfaction has not changed; it is still

6.7. We use the Greek letter mu to represent the current mean satisfaction rating of the theater's

entire film-going population.

His alternative hypothesis is the opposite of the null hypothesis: it states that average customer

satisfaction is now different (in symbols, µ ≠ 6.7).

To substantiate his claim that the mean has changed, the manager takes a random sample of 196


moviegoers. He is careful to sample across movies, show times, and dates. The mean satisfaction

rating for the sample is 7.3, with a standard deviation of 2.8.

Does the fact that the random sample's mean of 7.3 is higher than the historical mean of 6.7

indicate that this year's moviegoers really are more satisfied?

Or, is the mean still the same, and the manager "just happened" to pick a sample with an unusually

high average satisfaction rating? This is equivalent to asking the question: If the null hypothesis is

true — the average satisfaction is still 6.7 — would we be likely to randomly draw the sample that

we did, with average satisfaction 7.3?

To answer this question, we have to first define what we mean by "likely." As in sampling and

estimation, we typically use 95% as our threshold level of likelihood.

We then construct a range around the population mean specified by our null hypothesis. The range

should be drawn so that if the null hypothesis is true, 95% of all samples drawn from the

population would fall in that range. In other words, we create a range of likely sample means.

The central limit theorem tells us that the distribution of sample means follows a normal curve, so

we can use its familiar properties to find probabilities. Moreover, the distribution of sample means

is centered at our assumed population mean, mu, and has standard deviation sigma/sqrt(n). We

don't know sigma, the underlying population standard deviation, so we use the sample standard

deviation as our best estimate.

As we do when constructing 95% confidence intervals, we create a range with width z*s/sqrt(n) =

1.96*s/sqrt(n) on either side of the mean. However, when we conduct a hypothesis test, we center

the range around the mean specified in the null hypothesis because we always start a hypothesis

test by assuming the null hypothesis is true.

In our example, the null hypothesis is that the population mean is 6.7, n is 196, and s is 2.8. Our

95% confidence level translates into a z-value of 1.96. We construct the range of likely sample

means: 6.7 ± 1.96 * 2.8/sqrt(196) = 6.7 ± 0.39, giving the range 6.3 to 7.1.

This tells us that if the population mean is 6.7, there is a 95% chance that the mean of a randomly

selected sample will fall between 6.3 and 7.1.
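
The same test, sketched in Python (the helper name two_sided_range is ours):

    import math

    def two_sided_range(mu_0, s, n, z=1.96):
        # Range of likely sample means around the null hypothesis mean
        half_width = z * s / math.sqrt(n)
        return mu_0 - half_width, mu_0 + half_width

    low, high = two_sided_range(6.7, 2.8, 196)
    print(low, high)                # approximately 6.31 and 7.09
    print(7.3 < low or 7.3 > high)  # True: 7.3 falls in the rejection region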

Now, if we take a sample, and the mean does not fall within the range around 6.7, we can reject

the null hypothesis. Why? Because if the population mean were 6.7, it would be unlikely to collect

a sample whose mean falls outside this range.

The region outside the range of likely sample means is called the "rejection region," since we

reject the null hypothesis if our sample mean falls into it. In the movie theater example, the

rejection region contains all values less than 6.3 and all values greater than 7.1.

In this example, the sample mean, 7.3, falls in the rejection region, so we reject the null


hypothesis. Whenever we reject the null hypothesis, we in effect accept the alternative hypothesis.

We conclude that customer satisfaction has indeed changed from the historical mean value of 6.7.

If our sample mean had fallen within the range around 6.7, we could not make a definite statement

about moviegoers' satisfaction. We would not have enough evidence to state that things have

changed, but we can never claim that they have definitely remained the same.

Unless we poll every customer, we'll never know for sure if customer satisfaction has truly

changed. Working only with sample data, there is always a chance that we'll draw the wrong

conclusion about the population. We can go wrong in two ways: rejecting a null hypothesis that is

in fact true or failing to reject a null hypothesis that is in fact false. Let's look at the first of these:

the null hypothesis is true, but we reject it.

We choose the confidence level so it is unlikely — but not impossible — for the sample mean to

fall in the rejection region when the null hypothesis is true. In this case, we are using a 95%

confidence level, so by unlikely we mean a 5% chance. However, 5% of all samples from a

population with the null hypothesis mean would fall in the rejection region, so when we reject a

null hypothesis, there is a 5% chance we will do so erroneously.

Therefore, when the sample mean falls in the rejection region, we can only be 95% confident that

we are justified in rejecting the null hypothesis. Hence we continue to speak of a confidence level

of 95%.

A hypothesis test with a 95% confidence level is said to have a 5% level of significance. A 5%

significance level says that there is a 5% chance of a sample mean falling in the rejection region

when the null hypothesis is true. This is what people mean when they say that something is

"statistically significant at a 5% significance level.

If we increase our confidence level, we widen the range around the null hypothesis mean. At a

99% confidence level, our range captures 99% of all sample means. This reduces to 1% our

chance of rejecting the null hypothesis erroneously. But doing this has a downside: by decreasing

the chance of one type of error, we increase the chance of the other type.

The higher the confidence level, the smaller the rejection region, and the less likely it is that we

can reject the null hypothesis when it is in fact false. This decreases our chance of being able to

substantiate the alternative hypothesis when it is true. As managers, we need to choose the

confidence level of our test based on the relative costs of making each type of error.

The range of likely sample means should not be confused with a confidence interval. Confidence

intervals are always constructed around sample means, never around population means. When we

construct a confidence interval, we don't even have an initial estimate of the population mean.

Constructing a confidence interval is a process for estimating the population mean, not for testing

particular claims about that mean.


Summary

In a hypothesis test for population means, we assume that the null hypothesis is true. Then, we

construct a range of likely sample means around the null hypothesis mean. If the sample mean we

collect falls in the rejection region, we reject the null hypothesis. Otherwise, we cannot reject the

null hypothesis. The confidence level measures how confident we are that we are justified in

rejecting the null hypothesis.

One-sided Hypothesis tests

The movie theater manager did not have a strong conviction about the direction of change for

customer satisfaction prior to performing the hypothesis test.

He wanted to test for change in both directions — up or down — and thus he used a two-sided

hypothesis test. The null hypothesis — that no change has taken place — could have been wrong

in either of two ways: Customer satisfaction may have increased or decreased. The two-tailed

nature of the test was reflected in the two-sided range we drew around the population mean.

Sometimes, we may want to know if the actual population mean differs from our initial value of

the population mean in a specific direction. For instance, if the theater manager were quite sure

that satisfaction had not decreased, he wouldn't have to test in that direction; rather, he'd only have

to test for positive change.

In these cases, our alternative hypothesis should clearly state which direction of change we want

to test for. These kinds of tests are called one-sided hypothesis tests. Here, we substantiate the

claim that the mean has increased only if the sample mean is sufficiently higher than 6.7, so our

rejection region extends only to the right.

Let's outline how to formulate one- and two-sided tests. For a two-sided test we have an initial

understanding of the population: the population mean is equal to a specified initial value.

If we want to substantiate the claim that a population mean has changed, the null hypothesis

should state that the mean still equals that initial value. The alternative hypothesis should state that

the mean does not equal that initial value.

If we want to know that the actual population mean is greater than the initial value — the null

hypothesis mean — then the null hypothesis should state that the population mean has at most that

value. The alternative hypothesis states that the mean is greater than the null hypothesis mean.

Likewise, if we want to substantiate the claim that a population mean is less than the initial value,

the null hypothesis should state that the mean is at least that initial value. The alternative

hypothesis should state that the mean is less than the null hypothesis mean, and the rejection

region extends only to the left.

When we conduct a one-sided hypothesis test, we need to create a one-sided range of likely

sample means. Suppose the theater manager claims that satisfaction improved. As usual, he states


the claim he wants to substantiate as his alternative hypothesis.

The 196-person sample has mean 7.3 and standard deviation 2.8. Does this sample provide

sufficient evidence to substantiate the claim that mean satisfaction increased? To find out, the

manager creates a one-sided range: he assumes the population mean is the null hypothesis mean,

6.7, and finds the range that contains the lower 95% of all sample means.

To find this range, all he needs to do is calculate its upper bound. For what value would 95% of all

sample means be less than that value?

To find out, we use what we know about the cumulative probability under the normal curve: a

cumulative probability of 95% corresponds to a z-value of 1.645.

Why is this different from the z-value for a two-sided test with a 95% confidence level? For a two-

sided test, the z-value corresponds to a 97.5% cumulative probability, since 2.5% of the

probability is excluded from each tail. For a one-sided test, the z-value corresponds to a 95%

cumulative probability, since 5% of the probability is excluded from the upper tail.
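
You can confirm both z-values with Python's standard library:

    from statistics import NormalDist

    print(NormalDist().inv_cdf(0.95))   # about 1.645, for a one-sided 95% test
    print(NormalDist().inv_cdf(0.975))  # about 1.96, for a two-sided 95% test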


We now have all the information we need to find the upper bound on the range of likely sample means: 6.7 + 1.645 * 2.8/sqrt(196) = 6.7 + 0.33 ≈ 7.0.

The rejection region is everything above the value 7.0. The sample mean falls in the rejection

region, so the manager rejects the null hypothesis. He is confident that customer satisfaction is

higher.

Summary

When we want to test for change in a specific direction, we use a one-sided test. Instead of finding

a range containing 95% of all sample means centered at the null hypothesis mean, we find a one-

sided range. We calculate its endpoint using the cumulative probability under the normal curve.


The Excel Utility link below allows you to perform hypothesis tests for single populations.

Make sure you do at least one example by hand to ensure you thoroughly understand the basic

concepts before using the utility. You should enter data only in the yellow input areas of the utility.

To ensure you are using the utility correctly, try to reproduce the results for the theater manager's

example.

A single-population hypothesis test tests a claim using a sample from a single population. With a

plan in mind, you take a look at Leo's sample data.

You are ready to analyze the impact of the changes Leo has made to his restaurant operations. You

draw a table to organize the data from your sample on daily guest spending on restaurant food.


One change Leo made to his restaurant operations was to hire a new chef. He wants to know

whether average restaurant spending per guest has changed since she took over the menu and the

kitchen. This is a clear case for a hypothesis test.

Last year's average spending on food per person was $55; this gives you an initial value for the

mean.

Leo wants to know if mean spending has changed, so you use a two-sided test. You jot down your

null hypothesis, which states that the average revenue per guest is still $55.

If the null hypothesis is true, the difference between the sample mean of $64 and the initial value

of $55 can be accounted for by chance.

You add the alternative hypothesis to your notes.

Next, you assume that the null hypothesis is true: the population mean is $55. Now you need to

construct a range of likely sample means around $55 and ask: does the sample mean of $64 fall

within that range? Or does it fall in the rejection region?

Leo didn't specify what level of confidence he wanted for your results. You call him for

clarification.

I suppose a 95% confidence level is okay. I'd like to be more confident, of course.

After you point out that higher confidence would reduce his chances of being able to substantiate a

change in spending if a change has taken place, he agrees to 95%. You pull out your trusty

calculator and get ready to compute a range around the null hypothesis mean of $55. Consulting

your notes, you find the correct formula: the range of likely sample means runs from the null hypothesis mean minus 1.96 * s/sqrt(n) to the null hypothesis mean plus 1.96 * s/sqrt(n).

You find the range containing 95% of all sample means. Its endpoints are:

[$49.12; $60.88]

This is the correct answer. The z-value for 95% confidence in a two-sided test is 1.96.

You pause for a moment to reflect on the interpretation of this range. Suppose the null hypothesis

is true. Then 19 out of 20 samples of this size from the population of hotel guests would have

means that would fall in the calculated range.

The sample mean of $64 falls outside of this range. You and Alice report your results to Leo.

Looks like hiring that chef was a good decision. The evidence suggests that mean spending per

person has increased.


I'm glad to hear it. Now what about renovating the bar? Can you run a similar test to see if that has

affected average beverage spending?

Leo emphasizes that he can't imagine that his investments in the bar could have reduced average

beverage spending per guest. He wants to know if spending has gone up. You decide to do a one-

sided test.

First, you write down all of Leo's data, along with the hypotheses:

You need to find an upper bound such that 95% of all sample means are smaller than it. To do so,

you use a z-value of 1.645. The upper bound is $24.29.

What is the correct interpretation of this number? Given that the null hypothesis is true, if the population mean is $21, 19 out of 20 samples have means LESS than $24.29.

This is the correct answer. $24.29 is an upper bound: 95% of all sample means collected from a

population with the null hypothesis mean fall below $24.29.

The range of likely sample means contains the collected sample mean of $24. This tells you that:

The null hypothesis should NOT be rejected.

This is the correct answer. The difference between the sample means and the population mean

may well be due to the randomness of the sample. There is not enough evidence to reject the null

hypothesis.

When you present your full report to Leo, he appears confused and disappointed.

How is this possible? Why hasn't renovating the bar increased revenues? Even if the frozen drink

machine didn't pay off, shouldn't the increase in seats have helped?

First of all, we haven't concluded that average revenue has not increased. We just can't be sure that

it has. The fact that our sample mean is $24 vs. $21 last year does not allow us to say anything

definitive about the change in average beverage revenue.

Remember, we set out to substantiate our hypothesis that spending has improved. Based just on

this sample, we are unable to conclude that spending has increased.

You added seats and now more people can be seated in your lounge. But a greater number of

guests does not necessarily translate into more spending per person.


That does make a lot of sense.

Your overall revenues may have actually increased, because more guests can be seated in the

lounge.

Gosh, I'm glad to hear that. For a moment there, I thought I had made a really bad investment. I'm

quite optimistic I'll see a jump in total beverage revenues in the consolidated report at the end of

the year.

Why don't we go fill three of those new seats right now?

Exercise1

Blanche McCarthy is the marketing director of Oma's Own snack food company. Oma's makes

toasted pretzel snacks, and advertises that these pretzels contain an average of 112 calories per

serving.

In a recent test, an independent consumer research organization conducted an experiment to see if

this claim was true. Somewhat to their surprise, the researchers found that the average calorie

content in a sample of 32 bags was 102 calories per serving. The standard deviation of the sample

was 19.

Blanche would like to know if the calorie content of Oma's pretzels really has changed, so she can

market them appropriately. With 99% confidence, do these data indicate that the pretzels' calorie

content has changed?

Yes.

This is the correct answer. The data indicate that the null hypothesis should be rejected. The

calorie content has probably changed.

You begin any hypothesis test by formulating a null and an alternative hypothesis. The null

hypothesis states that the population mean is equal to the initial value. In this problem, the null

hypothesis is that the caloric content in the actual population is what Oma's has always advertised.

The alternative hypothesis should contradict the null hypothesis. For a two-sided test, the

alternative hypothesis simply states that the mean does not equal the initial value. A two-sided test

is more appropriate in this problem, since Blanche only wants to know if the mean calorie content

has changed.

You assume that the null hypothesis is true and construct a range of likely sample means around

the population mean. Using the data and the appropriate formula, you find the range [103; 121].

The sample mean of 102 falls outside of that range, so you can reject the null hypothesis. Blanche


can be 99% confident that the population mean is not 112.

Why might Blanche have chosen a 99% confidence level rather than the more typical 95% level

for her test?

She feels that it would be very costly to change her marketing campaign if there is in fact no

change in the average number of calories.

This is the correct answer. A high confidence level decreases our chance of erroneously rejecting

the null hypothesis. In this case, Blanche wants to minimize the chance of saying that the caloric

content has changed if it really is still 112 calories per serving.

Exercise2

The Clearwater Power Company produces electrical power from coal. A local environmental

group claims that Clearwater's emissions have raised sulfur dioxide levels above permissible

standards in Blue Sky, the town downwind of the plant.

According to Environmental Protection Agency standards, an acceptable average sulfur dioxide

level is 30 parts per billion (ppb). As Clearwater's PR consultant, you want to defend the company,

and you try to anticipate the environmentalist's argument.

The environmental group collects 36 samples on randomly selected days over the course of a year.

It finds a mean sulfur dioxide content of 35 ppb with a standard deviation of 24 ppb.

The environmentalist group will use a hypothesis test to back up its claim that the sulfur dioxide

levels are higher than permitted. Which of the following is an appropriate null hypothesis for this

problem?

The average sulfur dioxide level is no higher than 30 ppb, the EPA's standard of acceptability.

This is the best answer. The null hypothesis states the conventional wisdom: that the population mean under investigation, the sulfur dioxide concentration of the air in Blue Sky, is less than or equal to 30 ppb, the level at which the EPA does not require a remedy.

remedy. The environmentalists will pose as the alternative hypothesis the claim they are trying to

substantiate: that Blue Sky's levels exceed the acceptable standard.

The environmentalists' claim is that sulfur dioxide levels are higher, so they will want to run a

one-sided test. The alternative hypothesis states that the sulfur dioxide levels are above the

accepted standard. We assume they will choose a 95% confidence level.

What is the range of likely sample means?

All values below 36.58 ppb.

They calculate the one-sided range around the null hypothesis mean that contains 95% of all

samples. The z-value for a one-sided 95% range is 1.645. The upper bound on the range of likely


sample means is 30 + 1.645 * 24/sqrt(36) = 36.58 ppb.

Based on your calculations, you should:

Do not reject the null hypothesis.

This is the correct answer. 35 ppb falls within the range of likely sample means. At a 95%

confidence level, these sample data do not provide enough evidence to reject the null hypothesis.
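
A minimal Python check of this one-sided test (variable names are ours):

    import math

    mu_0, s, n, z = 30, 24, 36, 1.645  # null mean, sample sd, sample size, one-sided 95% z
    upper_bound = mu_0 + z * s / math.sqrt(n)
    print(upper_bound)       # 36.58
    print(35 > upper_bound)  # False: do not reject the null hypothesis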

Exercise3

You are the plant manager of a Neshey's chocolate factory. The shop was flooded during the recent

storms. The machine that wraps Neshey's popular chocolate confection, Smooches, still works, but

you are afraid it may not be working at its former capacity.

If the machine isn't working at top capacity, you will need to have it replaced.

Which type of hypothesis test is most appropriate for this problem?

One-sided test

This is the best answer. You want to know if the machine's performance has been impaired, not

simply if the performance has changed.

The hourly output of the machine is normally distributed. Before the flood, the machine wrapped

an average of 340 Smooches per hour. Over the first week after the flood, you counted wrapped

Smooches during 32 randomly selected one-hour periods. The machine averaged 318 Smooches

per hour, with a standard deviation of 44.

You conduct a one-sided hypothesis test using a 95% confidence level. According to your

calculations, you should:

Have the machine replaced.

This is the correct answer. The sample mean falls below the lower bound of the one-sided range of

likely sample means around the null hypothesis mean. You can be 95% confident that the

machine's performance has been impaired.

The null hypothesis is that µ ≥ 340. The alternative hypothesis is that µ < 340, since you are using a one-tail test and the claim you want to substantiate is that the new population mean is lower than the population mean before the flood.


Identify the relevant values. The sample size n=32. The standard deviation s=44. The appropriate

z-value is 1.645 if you want to capture 95% of all sample means in a one-sided range around the

null hypothesis mean.

Use the formula and calculate the lower bound, 327. The sample mean of 318 falls well outside of

the calculated range of likely sample means. You accept this as strong evidence against the null

hypothesis, substantiating the alternative hypothesis that the mean output rate has dropped. You

should replace the machine.
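
The same steps in Python:

    import math

    mu_0, s, n, z = 340, 44, 32, 1.645
    lower_bound = mu_0 - z * s / math.sqrt(n)
    print(lower_bound)        # about 327.2
    print(318 < lower_bound)  # True: reject the null hypothesis; output has dropped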

Single Population Proportions

Happy with your work on restaurant spending, Leo jumps right into the next problem. "It's not just

the revenue of the restaurants that I care about," Leo says, "It's also my guests' satisfaction with

their restaurant experience."

When I go out to eat, I expect more than just excellent food. The whole dining experience is

essential — everything from the service, to the décor, to the design and quality of the silverware.

And it's not just that all of these factors must be excellent individually — they have to fit together.

The restaurant has to have ambiance! I'm sure my guests have similar expectations, and I want to

be sure my restaurant meets them.

Since my new chef introduced more sophisticated cuisine, I made some changes to the décor that I

think have improved the ambiance.

It took me a long time and a substantial amount of money to get everything right, but I'm pleased

with the result: the restaurants are elegant and distinctly Hawaiian. Just like the new chef's cuisine.

In the past, I've contracted a local market research firm to conduct surveys, asking guests to rate

the Kahana's restaurants' ambiance on a scale of one to five.

Historically, the percentage of people that rated ambiance the top score of 5 gave me a good idea

of how well we were doing. That percentage has been very high: 72%.

I've collected this year's data for you. Can you figure out if my guests are happier with my

restaurants' ambiance?

Alice tells you that testing Leo's claim about a proportion will be very similar to testing a mean.

Often the summary statistic we want to make a claim about is a proportion. How do we test a

hypothesis about a population proportion instead of a population mean?

We know from our work with confidence intervals that the processes for estimating population

proportions and population means are virtually identical. Similarly, hypothesis tests for

proportions are much like hypothesis tests for means.

Because we are examining a population proportion instead of a population mean, we use slightly


different notation: we use a lower case p to represent the population proportion in place of µ for a

population mean. We construct a hypothesis test to test a claim about the value of p.

Again, we formulate null and alternative hypotheses. Based on conventional wisdom or past

experience, we have an initial understanding of the population proportion. The null hypothesis for

a proportion test states the initial understanding. For example, in a two-sided test, the null

hypothesis asserts that the population proportion, p, is equal to the initial value we had in mind.

The alternative hypothesis is the claim we are using the hypothesis test to substantiate. The

alternative hypothesis typically states the opposite of the null hypothesis: it states that our initial

understanding is incorrect.

As with population means, we collect a random sample and calculate the sample proportion, "p

bar." However, for a hypothesis test about a population proportion, we don't need to calculate a

standard deviation from the sample.

Statistical theory tells us that σ, the standard deviation of the population proportion, is the square

root of [p*(1 - p)]. Since we always start the test assuming the null hypothesis is true, we will

calculate σ using the null hypothesis proportion.

Analogously to population mean tests, we create a range of likely sample proportions around the

null hypothesis proportion. To create the range, we substitute σ = sqrt(p*(1 - p)) into the familiar formula: the range is p ± z * sqrt(p*(1 - p)/n).

If our sample proportion falls outside the range of likely sample proportions, we reject the null

hypothesis. Otherwise, we cannot reject the null hypothesis.

SUMMARY

In a hypothesis test for population proportions, we assume that the null hypothesis is true. Then,

we construct a range of likely sample proportions around the null hypothesis proportion. If the

sample proportion we collect falls in the rejection region, we reject the null hypothesis. Otherwise,

we cannot reject the null hypothesis.

Once you understand hypothesis testing for means, using the same techniques on proportions is

easy.

By now, you're familiar with the concept of testing a hypothesis. You recognize that Leo's

restaurant ambiance problem calls for a hypothesis test for a population proportion.

Leo wants you to find out if the proportion of his guests that rate restaurant ambiance "excellent"

has increased. Historically, that population proportion has been 0.72. Since Leo wants to see if

there has been positive change, you do a one-sided test.

The appropriate pair of hypotheses is:


Null hypothesis p ≤ 0.72, alternative hypothesis: p > 0.72

You are doing a one-sided test to see if the proportion of guests rating the restaurant "excellent"

has increased. The alternative hypothesis states that the proportion has increased, and the null

hypothesis states that it has not increased.

You look at Leo's data. The sample proportion is 0.81 and the sample size is 126.

But what about the standard deviation?

You have enough information to calculate the standard deviation.

This is the correct answer. For proportions, you can calculate the standard deviation using the null

hypothesis proportion.

Here's how you find the standard deviation for a proportion problem: σ = sqrt(p*(1 - p)) = sqrt(0.72 * 0.28) ≈ 0.45.

Leo wanted you to use a 95% confidence level. Now you're ready to construct a range of likely

sample proportions around the null hypothesis value of the population proportion: 0.72.

Find the range of likely sample proportions around the null hypothesis proportion, and formulate a

short answer for Leo.

The evidence supports Leo's claim that the proportion of guests rating the restaurant ambiance

"excellent" has increased.

A one-sided test calls for a one-sided range of likely sample proportions. You need to find the

upper bound for this range such that the range captures the lower 95% of the sample proportions.

The z-value for a one-sided 95% confidence level is 1.645. Substitute the null hypothesis

proportion, 0.72, for p. The upper bound for the range containing the lower 95% of all sample

proportions is 0.78.

Since the sample proportion 0.81 falls in the rejection region, you reject the null hypothesis. The

data provide sufficient evidence that the population proportion has, in fact, increased.
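
In Python, the whole test takes a few lines (variable names are ours):

    import math

    p_0, n, z = 0.72, 126, 1.645        # null proportion, sample size, one-sided 95% z
    sigma = math.sqrt(p_0 * (1 - p_0))  # about 0.45
    upper_bound = p_0 + z * sigma / math.sqrt(n)
    print(upper_bound)         # about 0.786
    print(0.81 > upper_bound)  # True: reject the null hypothesis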

Alice presents your findings to Leo, telling him that with 95% confidence, the data you collected

indicate that the difference between the historical population proportion and the proportion of the

random sample is not due to chance.

The proportion of your guests that rate the restaurants' ambiance as "excellent" has increased.

Exercise1

Luther Lenya, the new product guru of The Ventura Automotive Insurance Company, is


considering marketing a special insurance package to members of certain professional groups.

In particular, Luther wants to create a special package for health professionals.

To find out what rate to charge for this package, Luther conducts a preliminary study to see if

health professionals are less likely to be involved in car accidents than the rest of his customer

base.

If the data indicate that health professionals are less likely to be involved in car accidents, then

Ventura can offer health professionals a lower, more competitive rate.

In the past 5 years, 8.3% of Ventura's customers have been involved in accidents. Which of the

following is the correct pair of hypotheses for solving Luther's problem?

Null hypothesis p ≥ 8.3%; Alternative hypothesis p < 8.3%

This is the correct answer. Luther wants a one-sided test, because he wants to know if medical

professionals are better drivers. The alternative hypothesis should state that medical professionals

are less likely to be in accidents.


A sample of 240 customers in the health profession reveals that 12 (5.0%) have had accidents.

If he uses a 95% confidence level, which of the following is the best conclusion Luther can come


to?

The evidence suggests that health professionals are less likely to be involved with car accidents.


This is the best answer. The range of likely sample proportions around the null hypothesis

proportion does not contain the sample proportion, so we can reject the null hypothesis. With 95%

confidence, the proportion of health professionals involved in car accidents is lower than the

proportion of Ventura's population of drivers.

You need to find a range of likely sample proportions. To find this range, you calculate a standard

deviation. The standard deviation is 0.28.

For a one-sided test, a confidence level of 95% corresponds to a z-value of 1.645. The lower

bound of this range is 0.054 = 5.4%.

The range of likely sample proportions does not contain 5.0%, so you should reject the null

hypothesis. With 95% confidence, the proportion of health professionals involved in car accidents

is lower than the proportion of the overall population of drivers.
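
The corresponding Python check:

    import math

    p_0, n, z = 0.083, 240, 1.645
    lower_bound = p_0 - z * math.sqrt(p_0 * (1 - p_0) / n)
    print(lower_bound)         # about 0.054
    print(0.05 < lower_bound)  # True: reject the null hypothesis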

P-Values

After sleeping on your analysis of restaurant operations, Leo seems unsatisfied.

Don't get me wrong, I appreciate your hard work. But look here: these hypothesis tests result in a

"reject/don't reject" decision. If I understand you correctly, it doesn't matter how close to the

border of the rejection region our sample statistic falls: "reject" is "reject."

But can't you tell me more? I want to know how strong the evidence against the null hypothesis is,

not just if it is strong enough.

I'm glad you brought that issue up, Leo. We have a second method of doing hypothesis tests, one

that provides a measure of the strength of the evidence.

The evening before, Alice had acquainted you with p-values: "We can use the p-value method of

hypothesis testing to make 'reject/not reject' decisions in the same way we have been doing all

along. But the p-value also measures the strength of evidence against a null hypothesis."

In hypothesis tests we've done so far, we first chose the confidence level of the test. The

confidence level tells us the significance level of the test, which is simply 1 minus the confidence

level.

Typically, we chose a 5% significance level — a 95% confidence level — as our threshold value

for rejection. Assuming that the null hypothesis is true, we reasoned that certain sample mean

values are less likely to appear than others. If the mean of the sample we collected was sufficiently

unlikely to appear (that is, less than 5% likely) we considered the null hypothesis implausible and

rejected it.


Now, rather than simply checking whether the likelihood of collecting our sample is above or

below our chosen threshold, we'll ask: if the null hypothesis is true, how likely is it to choose a

sample with a mean at least as far from the null hypothesis mean as the sample mean we

collected?

The "p-value" measures this likelihood: it tells us how likely it is to collect a sample mean that

falls at least a certain distance from the null hypothesis mean.

In the familiar hypothesis testing procedure, if the p-value is less than our threshold of 5%, we

reject our null hypothesis.

The p-value does more than simply answer the question of whether or not we can reject the

hypothesis. It also indicates the strength of the evidence for rejecting the null hypothesis. For

example, if the p-value is 0.049, we barely have enough evidence to reject the null hypothesis at

the 0.05 level of significance; if it is 0.001, we have strong evidence for rejecting the hypothesis.

Let's look at an example. Recall the movie theater manager who wanted to know if the average

satisfaction rate for his clientele had changed from its historical rate of 6.7.

To find out, we constructed the range, 6.3 to 7.1, which would have contained 95% of the sample

means if the null hypothesis mean had still been true. Since the mean of the sample of current

moviegoers we collected, 7.3, fell outside of that range, we rejected the null hypothesis.

Because 7.3 fell in the rejection region, we know that the likelihood of collecting a sample mean

as extreme as 7.3 is less than 5% if the null hypothesis is true. Now let's find out exactly how

unlikely it is by calculating the p-value.

Calculating the p-value is a little tricky, but we have all the tools we need to do it. Recall that for

samples of sufficient size, the sample means of any population are distributed normally.

To calculate the likelihood of a certain range of sample mean values — in our example, sample

mean values greater than 7.3 or less than 6.1 — we just need to find the appropriate area under the

distribution curve of the sample means.

To calculate the p-value for this two-sided test, we want to find the area under the normal curve to

the right of 7.3 and to the left of 6.1. The standard deviation in this example is 2.8, and the sample

size is 196.

We can calculate this probability by first calculating the z-value associated with the value 7.3.

That z-value is 3.

Then, we find the probability of having a z-value less than -3 or greater than 3.

The area to the left of the z-value of -3 is 0.00135. The area to the right of the z-value of +3 is the


same size, so the total area is 0.0027. That is our p-value. These areas and the p-value can be

found in Excel using the NORMSDIST(-3) function, in the z-table, or with the Excel utility

provided.
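
In Python, the standard library gives the same number:

    from statistics import NormalDist

    p_value = 2 * NormalDist().cdf(-3)  # both tails of the two-sided test
    print(p_value)  # about 0.0027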

Our p-value calculation tells us that the probability of collecting a sample mean at least as far from

6.7 as 7.3 is 0.0027. The p-value is lower than 0.05. Thus, at a significance level of 0.05, we

would reject the null hypothesis and conclude that moviegoers' average satisfaction rating is no

longer 6.7.

But the p-value 0.0027 is much smaller than 0.05. Thus, we can reject the null hypothesis at

0.0027, a much lower significance level. In other words, we can reject the null hypothesis with

99.73% confidence. In general, the lower the p-value, the higher our confidence in rejecting the

null hypothesis.

One-sided hypothesis tests are also easily conducted with p-values. For one-sided tests, the p-

value is the area under one side of the curve. In our movie theater example, if the alternative

hypothesis states that the population mean is larger than 6.7, the p-value is the area under the

normal curve to the right of the sample mean of 7.3.

Summary

The p-value measures the strength of the evidence against the null hypothesis. It is the likelihood,

assuming that the null hypothesis is true, of collecting a sample mean at least as far from the null

hypothesis mean as the sample actually collected. We compare the p-value to the threshold

significance level to make a reject/not reject decision. The p-value also tells us how comfortable

we can be with that decision.

Now Alice explains the basics of p-values to Leo, so you can present the results of your restaurant

revenue hypothesis test again. This time, you'll be able to give Leo an idea of how strong the

statistical evidence is.

Leo wants you to complete the p-value hypothesis test right there in his office. You're a little

nervous — you've never had a client peering over your shoulder when you work. But you oblige

him, because you're growing more confident of your statistical skills.

Looking back at your notes on the problem, you find the data and the hypotheses. You make a

mental note that you are doing a two-sided test to see whether or not average spending on food has

changed from its historical level of $55.

An eager Leo interrupts your thought process:

When you ran the hypothesis test earlier I had you use a 95% confidence level. That corresponds

to a significance level of 0.5, right?

You politely respond:

I'm sorry, but I don't think that's right.

Good choice. Leo is still a little confused, but you bring him up to speed.


To find the significance level corresponding to a confidence level of 95%, simply subtract 95%

from 100%, and convert into decimal notation: 0.05.

After you clarify Leo's mistake, he sits back and lets you finish your analysis without further

interruption. First, you find the appropriate z-value.

The correct z-value is 3.00, corresponding to a right-tail probability of 0.00135. You:

Double that probability to find the p-value.

This is the correct answer. For a two-sided test, you calculate both tail probabilities.

Your sample has a mean (x-bar = $64) that is $9 higher than the assumed population mean, $55.

You want to calculate the likelihood of getting a sample mean that is at least as far from the

population mean as x-bar.

That likelihood is not just the tail probability to the right of the sample mean. Sample means on

the other side of the normal curve are just as far from the population mean as x-bar. They must be

included, too, when you calculate the p-value for a two-sided hypothesis test.

Doubling the right-tail probability gives you the correct p-value: 0.0027.

Alice summarizes your results for Leo.

All we have to do is compare the p-value to the significance level. The p-value 0.0027 is less than

the significance level 0.05. Our data are statistically significant at the 0.05 level.

Just as we calculated earlier by constructing a range around the null hypothesis mean, the p-value

method suggests that we reject the null hypothesis. With 95% confidence, average food spending

per guest has changed.

But now we can also see that the evidence is very strong, because the p-value is much lower than

the significance level. We can claim that food spending has changed at the 0.0027 level of

significance. Thanks, you two. I feel much more comfortable concluding that average guest

spending in my restaurant has changed.

In the following exercise you will revisit an earlier problem, this time solving it with the p-value

method.

Blanche McCarthy is the marketing director of Oma's Own snack food company. Oma's makes

toasted pretzel snacks. Each bag of pretzels contains one serving, and Oma's advertises that the

pretzel snacks contain an average of 112 calories per serving.

In a recent test, an independent consumer research organization conducted an experiment to see if

this claim was true. The researchers found that the average calorie content in a sample of 32 bags

was 102 calories per serving. The standard deviation of the sample was 19.


Blanche would like to know if the calorie content of Oma's pretzels has really changed, so she can

market them appropriately. At the significance level 0.01, do these data indicate that the pretzels'

calorie content has changed?

Yes.

This is the best answer. The data indicate that the null hypothesis should be rejected. The calorie

content has probably changed.

In this problem, the null hypothesis is that the actual population mean is what Oma's has always

advertised. A two-sided test is more appropriate in this problem, since Blanche only wants to know

if the mean calorie content has changed.

Assuming that the null hypothesis is true, you find a z-value for the sample mean of 102 using the

appropriate formula. The z-value is -2.98.

Using the Excel NORMSDIST function or the Standard Normal Table, you can find the

corresponding left-tail probability of 0.0014. For a two-sided test, you double this number to find

the p-value, in this case 0.0028.

Since this p-value is less than the significance level, you can reject the null hypothesis. Moreover,

you now can say that you are rejecting the null hypothesis at the 0.0028 level of significance. You

can recommend to Blanche that she have the labeling changed on the pretzel bags, and adjust her

marketing accordingly.
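
If you want to reproduce this result yourself, here is a minimal Python sketch of the same calculation, again using scipy for the normal CDF; all inputs come from the problem statement above.

from math import sqrt
from scipy.stats import norm

mu_0 = 112    # advertised calories per serving (null hypothesis mean)
x_bar = 102   # sample mean
s = 19        # sample standard deviation
n = 32        # sample size

z = (x_bar - mu_0) / (s / sqrt(n))   # approximately -2.98
p_value = 2 * norm.cdf(-abs(z))      # approximately 0.0029
print(z, p_value)

(The exact two-sided p-value is about 0.0029; the 0.0028 above comes from rounding the one-tail probability to 0.0014 before doubling.)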

Comparing two populations

Now satisfied with your analysis of the restaurant, Leo asks you to compare the discretionary

spending habits of two categories of guests: leisure and business.

Every hotel manager wrestles with the problem of stretching limited marketing resources. I want

to make sure that I'm wisely allocating each marketing dollar.

Leisure guests, such as tourists and honeymooners, are especially attracted to Hawaii. Also, many

professional associations like to have their conventions here, so our islands attract business

travelers, who mix business and pleasure.

Business travelers pay lower room prices because conferences book rooms in bulk. Bulk

reservations are good for me because they keep my occupancy levels high. However, I don't have

a good sense of whether the discretionary spending of my business guests is different from that of

my leisure guests: they may take fewer scuba lessons but use the spa services more, for example.

Can you help me figure out whether there is any significant difference between leisure and

business travelers' discretionary spending habits? Your conclusions might influence my marketing

efforts.

I collected two random samples: one of leisure guests and one of business guests. Not including

room, meal, and beverage charges, leisure travelers spent an average of $75 a day, compared to $64


a day for the business travelers.

I knew that the difference between the averages of the two samples could be due to chance, so

I thought I'd have you do a hypothesis test to find out.

When I was compiling the data for you, I realized that my samples were of different sizes. I was

able to get 85 leisure guests to respond, but only 76 business guests returned my survey.

Which figure will you use as the sample size? Or will you add them together?

I also realized that with these data, you'd have to calculate two sample standard deviations, one for

each sample. How do you go about solving a problem like this?

How do you test whether two populations have different means?

So far, we've used hypothesis tests to study the mean or proportion of a single population. Often,

managers want to compare the means or proportions of two different populations: in this case, we

use a two-population hypothesis test. Let's clarify when we use each type of test.

We conduct single-population tests when we have an initial value for a population mean and want

to test to see if it is correct. Single population tests are especially useful when we suspect that the

population mean has changed. For example, we use a single-population test when we know the

historical average of a population and want to test whether that historical average has changed.

We conduct two-population tests to compare a characteristic of two groups for which we have

access to sample data for each group. For example, we'd use a two-population test to study which

of two educational software packages better prepares students for the GMAT. Do the students

using package 1 perform better on the GMAT than the students using package 2?

In two-population tests, we take two samples, one from each population. For each sample, we

calculate the sample mean, standard deviation, and sample size.

We can then use the two sets of sample data to test claims about differences between the two

populations. For example, when we want to know whether two populations have different means,

we formulate a null hypothesis stating that the means are not different: the first population mean is

equal to the second.

Let's look at the GMAT software package example more closely. The manager of one educational

software company might wonder if the average GMAT score of students using her software is

different from the average GMAT score of students using the competitor's software.

Since the manager only wants to test if the average GMAT scores are different, she conducts a

two-sided hypothesis test for two populations. The null hypothesis states that there is no difference

between the average GMAT scores of the students who use the two companies' software.


The alternative hypothesis states that the average GMAT scores of the students who use the two

companies' software are different.

We denote the average scores of the two populations by the Greek letter mu and distinguish them

with subscripts. Our hypotheses are:

H0: mu1 = mu2 (the two average GMAT scores are equal)

Ha: mu1 ≠ mu2 (the two average GMAT scores are different)

To be 95% confident in the result of the test, we use a significance level of 0.05.

We collect two samples, one from each population. We denote the sample means with the familiar

x-bar, which we again distinguish with subscripts.

We are able to collect the GMAT scores of 45 people who used the company's software, and 36

people who used the competitor's software. As we will see shortly, the different sample sizes will

not pose a problem.

The respective sample means are 650 and 630, and the standard deviations are 60 and 50.

Could the two random samples we picked just happen to have different means by chance but

really have come from populations that have the same population means?

The null hypothesis states that there is no difference in the two population means. As with single-

population tests, we test the null hypothesis by asking how likely it would be to produce the

sample results if the null hypothesis is in fact true.

That is, if the average GMAT scores for students using the two different software packages

actually are the same, what is the chance that two samples we collect would have sample means as

different as 650 and 630?

Our intuition tells us that the greater the difference between the means of the two samples, the

more likely it is that the samples came from different populations. But how do we know when the

numerical difference is large enough to be statistically significant? When do we have enough

evidence to actually conclude that the two populations must be different?

We use p-values to answer this question. First we calculate a z-value for the difference of the

sample means, incorporating the data from both populations. It looks a bit complicated:

z = (x-bar1 - x-bar2) / sqrt(s1^2/n1 + s2^2/n2)

Let's compute the z-value for our example. Since we assume that the null hypothesis is true, we

have: mu1 - mu2 = 0.

Using the formula, we find that the z-value is 1.64.

For a two-sided test, a z-value of 1.64 translates into a probability in one tail of 0.05, and thus a p-

value of 0.10.


Since this p-value is greater than the significance level of 0.05, we cannot reject the null

hypothesis.

In other words, the high p-value tells us that there is insufficient evidence from the two samples to

conclude that the average GMAT score of the students who use the company's software is different

from the average GMAT score of students who use the competitor's software.

Two-population hypothesis tests can be performed using the formula shown above, or you can

click here to access the Excel utility for hypothesis testing.
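
As a cross-check on the GMAT example, here is a minimal Python sketch of the two-population test for means, using the sample statistics reported above.

from math import sqrt
from scipy.stats import norm

x_bar_1, s_1, n_1 = 650, 60, 45   # students using the company's software
x_bar_2, s_2, n_2 = 630, 50, 36   # students using the competitor's software

# z-value for the difference in sample means, assuming mu_1 - mu_2 = 0
z = (x_bar_1 - x_bar_2) / sqrt(s_1**2 / n_1 + s_2**2 / n_2)
p_value = 2 * norm.cdf(-abs(z))
print(z, p_value)   # approximately 1.64 and 0.10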

Summary

In a hypothesis test for two population means, we assume a null hypothesis: that the two

population means are equal. We collect a sample from each population and calculate its sample

statistics. We calculate a p-value for the difference between the two samples. If the p-value is less

than the significance level, we reject the null hypothesis.

Often, managers want to know if two population proportions are equal. For example, a marketing

manager of a packaged snack foods company might want to compare the snack food habits of

different states in the US.

The marketing manager might think that the proportion of consumers who favor potato chips in

Texas is different from the proportion of consumers who favor potato chips in Oklahoma.

Comparing two population proportions is similar to comparing two population means. We have

two populations: the null hypothesis states that their proportions are the same; the alternative

hypothesis states that they are different.

We collect a sample from each population and calculate its sample size and sample proportion. As

in the single population proportion test, we don't need to find the sample standard deviation, since we know that the population standard deviation is the square root of p*(1 - p).

Similarly to the hypothesis tests for comparing two population means, we calculate a z-value for

the difference between the proportions using the formula below:

z = (p-bar1 - p-bar2) / sqrt(p-bar1*(1 - p-bar1)/n1 + p-bar2*(1 - p-bar2)/n2)

We translate the z-value into a p-value just as we would for any other type of hypothesis test. If

the p-value is less than our significance level, we reject the null hypothesis and conclude that the

proportions are different. If the p-value is greater than the significance level, we do not reject the

null hypothesis.

Let's take a closer look at the study of snacking habits in Texas and Oklahoma.

The manager does not wish to test for a particular direction of difference; he just wants to know if

the proportions are different. Thus, he should use a two-sided test.


The marketing manager wants to be 95% confident in the result of this test, so the significance

level is 0.05.

Suppose we collect responses from 400 people in Texas and 225 people in Oklahoma. The sample

proportions are 45% and 35%, respectively.

Could the two random samples we picked just happen to have different sample proportions? That

is, if the true proportions of Texans and Oklahomans favoring potato chips actually are the same,

what would be the chance that the sample proportions are 45% and 35% respectively?

We use p-values to answer this question. First, we calculate a z-value for the difference of the

sample proportions that incorporates the data from both populations. The null hypothesis states

that the population proportions are equal, so their difference is 0.

The z-value is 2.48.

For a two-sided test, a z-value of 2.48 translates into a probability in one tail of 0.0065 and hence

a p-value of 0.013.

Since this p-value is less than the significance level of 0.05, we can reject the null hypothesis.

In other words, the low p-value tells us that there is sufficient evidence from the samples to

conclude that there is a difference between the proportions of Texan and Oklahoman potato chip

lovers. We can make this claim at a 0.013 level of significance.

Two-population hypothesis tests for population proportions can be performed using the formula

shown above, or you can click here to access the Excel utility for hypothesis testing.
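
Here is the analogous Python sketch for the potato chip example. Note that it computes the standard error from each sample proportion separately, which reproduces the z-value of 2.48 reported above.

from math import sqrt
from scipy.stats import norm

p_1, n_1 = 0.45, 400   # Texas sample proportion and size
p_2, n_2 = 0.35, 225   # Oklahoma sample proportion and size

# standard error of the difference between the two sample proportions
se = sqrt(p_1 * (1 - p_1) / n_1 + p_2 * (1 - p_2) / n_2)
z = (p_1 - p_2) / se              # approximately 2.48
p_value = 2 * norm.cdf(-abs(z))   # approximately 0.013
print(z, p_value)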

In a hypothesis test for two population proportions, we assume a null hypothesis: the two

population proportions are equal. We collect two samples and calculate the sample proportions.

We calculate a p-value for the difference between the sample proportions. If the p-value is less

than the significance level, we reject the null hypothesis.

Make sure you do at least one example by hand to ensure you thoroughly understand the basic

concepts before using the utility. You should enter data only in the yellow input areas of the utility.

To ensure you are using the utility correctly, try to reproduce the results for the GMAT and potato

chip examples.

Two-population hypothesis tests help you determine whether two populations have different

means. You use a two-population test to solve Leo's problem.

You have to find out if leisure guests' average daily discretionary spending is different from

business guests' average daily discretionary spending.

Leo has provided these data:

Now it's time to state the null hypothesis. The best formulation is:


There is no difference between business and leisure guests' mean spending.

This is the best answer. You want to know if two means are different, not if they differ in one

particular direction. If Leo had asked you to conduct a test to learn only if business guests'

spending was greater than that of leisure guests, the second answer would be correct.

Regression Basics

Introduction

As you relax in your room during a brief afternoon downpour, your phone rings.

Leo just called. He wants us to come to his office immediately. He sounds a little angry. We'd

better not keep him waiting.

I'm sorry if I was short on the phone. I'm very upset. We just had a little incident down in the

restaurant. A server spilled a tureen of crab bisque on one of our most "favored" guests, Mr. Pitt.


The Kahana's occupancy this year has been higher than I expected, and I had to hire extra help

from a staffing agency. Those staffing agencies charge a fortune, which is especially irritating

considering that the employees they refer to us are often poorly suited to customer service in an

upscale hotel.

Really, this is my fault for not having a more effective staffing process. I just wish I could predict

my needs better. Sometimes, when demand is lower than I expected, I'm overstaffed. Then I lose

money paying idle bellhops. If I had a good sense of my staffing needs at least a month in

advance, I could avoid hiring workers at the last minute and having idle staff.

I had been thinking that the number of advance reservations would give me a good idea of how

high my occupancy would be a month down the road. But clearly advance reservations don't tell

me the whole story. I've been making way too many false predictions.

Is there anything you can do to help me here? What predictions about occupancy can I make based

on advance bookings? And how much can I trust them?

We'll take a look at the data on advance bookings and occupancy and let you know what we find

out.

Alice seems confident that the two of you can offer useful advice on Leo's staffing problem:

"This will be a great opportunity for you to learn regression. It's a powerful statistical tool used all

the time in business: in finance, demand forecasting, market research, to name just a few areas. I'm

sure you'll use it in your MBA program. And it's a great chance to review what you've learned so

far: sampling, confidence intervals, and hypothesis testing all play a part in regression."


As we have seen, it is often useful to examine the relationship between two variables. Using

scatter diagrams, we can visualize such relationships.

We can learn more about the relationship by finding the correlation coefficient, which measures

the strength of the linear relationship on a scale from -1 to 1.

Regression is a statistical tool that goes even further: it can help us understand and characterize the

specific structure of the relationship between two variables.

Let's look at an example. Julius Tabin owns a small food processing company that produces the

spreadable lunchmeat product EasyMeat. Julius is trying to understand the relationship between

his firm's advertising and its sales.

Total sales in the spreadable meat industry have been fairly flat over the last decade, and Julius'

competitors' actions have been quite stable. Julius believes that his advertising levels influence his

firm's sales positively, but he doesn't have a clear understanding of what the relationship looks

like.

Let's have a look at data on his firm's advertising and sales over the last 10 years. Click on the

Excel link to create the scatter diagram yourself from an Excel spreadsheet.

Year Advertising ($) Actual Sales ($)

1992 35,000 1,100,000

1993 45,000 2,105,000

1994 55,000 3,000,000

1995 55,000 2,000,000

1996 65,000 3,200,200

1997 60,000 2,699,500

1998 70,000 3,100,000

1999 75,000 2,900,000

2000 80,000 4,007,000

2001 95,000 4,300,000

Plotting annual sales against annual advertising expenditures gives us a visual sense of the

relationship between the two variables. Looking at the graph, we can see that as advertising has

gone up, sales have generally increased. The relationship looks reasonably linear.

The correlation coefficient for the two variables is 0.93, indicating a strong linear relationship

between advertising and sales.

What if we were to draw a line that characterizes this relationship? Which line would best fit the

data? Our mind's eye already sees how the two variables are related, but how can we formalize our

visual impression?


Before we start any calculations, let's look at several lines that could describe the relationship.

One of these lines most accurately describes the relationship between the two variables: the "best-

fit" or regression line.

In our example, the best-fit line is Sales = -333,831 + 50*Advertising. For this line, the y-intercept is -333,831 and the slope is 50.
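
If you would like to reproduce these coefficients without Excel, here is a minimal Python sketch using numpy (an assumption of this sketch; Excel's regression tool, described later, gives the same numbers). The arrays hold Julius' ten years of data from the table above.

import numpy as np

advertising = np.array([35000, 45000, 55000, 55000, 65000,
                        60000, 70000, 75000, 80000, 95000])
sales = np.array([1100000, 2105000, 3000000, 2000000, 3200200,
                  2699500, 3100000, 2900000, 4007000, 4300000])

slope, intercept = np.polyfit(advertising, sales, 1)   # best-fit (regression) line
r = np.corrcoef(advertising, sales)[0, 1]              # correlation coefficient

print(intercept, slope, r)   # approximately -333,831, 50, and 0.93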

In general, a regression line can be described by a simple linear equation, y = a + bx, with y-

intercept a and slope b.

In this equation, the y-variable, sales, is called the dependent variable, to suggest that we think

Julius' sales depend to some degree on his advertising. The x-variable, advertising, is called the

independent variable, or the explanatory variable.

When we observe that a change in the independent variable (here advertising) is typically

accompanied by a proportional change in the dependent variable (here sales), regression analysis

can identify and formalize that relationship.

Summary

Regression analysis helps us find the mathematical relationship between two variables. We can use

regression to describe a linear relationship: one that can be represented by a straight line and

characterized by an equation of the form y = a + bx.

Plot the behavior of two variables on a scatter diagram to observe patterns in their relationship.

Use regression analysis to identify the linear relationship that best fits the data.

The linear relationship has the form y = a + bx.

a is the y-intercept of the line.

b is the slope of the line.

y is called the dependent variable and x is called the independent, or explanatory, variable.

What kinds of questions can regression analysis help answer?

How does regression help us as managers? It can help in two ways: first, it helps us forecast. For

example, we can make predictions about future values of sales based on possible future values of

advertising.

Second, it helps us deepen our understanding of the structure of the relationship between two

variables by expressing the relationship mathematically.


Let's talk first about how managers can use regression to forecast. In our example, regression can

help Julius predict his company's sales for a specified level of advertising.

For example, if he plans to spend $65,000 in advertising next year, what might we expect sales to

be?

If we didn't know anything about the relationship, but only had the historical data, we might

simply note that the last time Julius spent $65,000 on advertising, his sales were $3,200,200. But

is this the best prediction we can make?

Not at all. Regression analysis brings the entire data set to bear on our prediction. In general, this

will allow us to make more accurate predictions than if we infer the future value of sales from a

single observation of advertising and sales. Having identified the relationship between the two

variables from the full data set, we can apply our understanding of that relationship to our forecast.

Using regression analysis, we found the regression line to be Sales = -333,831 + 50*Advertising.

If Julius plans to spend $65,000 in advertising, what would we predict sales to be?

Around $2,900,000.

This is the best answer.

The point on the line shows us what level of sales to expect. In this case, we would expect sales of

$2,916,169.
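
The forecast is just the regression equation evaluated at the planned advertising level, as this one-line sketch shows:

predicted_sales = -333831 + 50 * 65000   # regression line at $65,000 of advertising
print(predicted_sales)                   # 2,916,169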

With regression, we can forecast sales for any advertising level within the range of advertising

levels we've seen historically. For example, even if Julius has never spent exactly $50,000 on

advertising, we can still forecast a corresponding level of sales.

We must be extremely cautious about forecasting sales for values of advertising beyond the range

of values we have already observed. The further we are from the historical values of advertising,

the more we should question the reliability of our forecast.

For example, we might feel comfortable forecasting sales for advertising levels a bit above the

observed range, perhaps as high as $100,000 or $105,000. But we shouldn't infer that if Julius

spent $10 million on advertising, he would achieve $500 million in sales. The total market for

spreadable meat is probably much less than $500 million annually!

Likewise, we might feel comfortable forecasting sales for advertising levels just below the

observed range. But we certainly shouldn't report that if Julius spent $0 on advertising he would

have negative sales!

If we try to use our regression equation to forecast sales for advertising levels outside of the

historical range, we are implicitly assuming that the relationship between advertising and sales


continues to be linear outside of the historical range.

In reality, although the relationship may be quite linear for the range of values we've observed, the

curve may well level off for advertising values much lower or much higher than those we've

observed. With no observations outside the historical data range, we simply don't have evidence

about what the relationship looks like there.

Another critical caveat to keep in mind is that whenever we use historical data to predict future

values, we are assuming that the past is a reasonable predictor of the future. Thus, we should only

use regression to predict the future if the general circumstances that held in the past, such as

competition, industry dynamics, and economic environment, are expected to hold in the future.

Regression can be used to deepen our understanding of the structural relationship between two

variables. If we think about it, many business decisions are about increasing or decreasing one

variable — investments or advertising, for example — to affect some other variable —

productivity, brand recognition, or profits, for example. Regression can reveal the structure of

relationships of this type.

Our regression analysis stipulates a linear relationship between sales and advertising.

Understanding "the structure" of this relationship translates into finding and interpreting the

coefficients of the regression equation.

As we've noted above, the constant term -333,831 may have no real managerial significance; it

just "anchors" the regression line by telling us the y-intercept. We've never seen advertising levels

close to $0, so we cannot infer that spending no money on advertising will lead to sales of -$333,831!

The more important term is the advertising coefficient, 50, which gives us the slope of the line.

The advertising coefficient tells us how sales have changed on average as advertising has

increased.

In the past, when advertising has increased by $10,000, what has been the average corresponding

change in sales?

Sales have increased by $500,000.

Assuming that the relationship between sales and advertising is linear, each $1 increase in

advertising should be accompanied by the same average increase in sales. In our example, for

every incremental $1 in advertising, sales increase on average by $50. Thus, for every incremental

$10,000 in advertising, sales increase on average by $500,000.

The regression line gives us insight into how two variables are related. As one variable increases,

by how much does the other variable typically change? How much growth in sales can we


anticipate from an incremental increase in advertising expenditures? Regression analysis helps

managers answer questions like these.

Summary

We use regression analysis for two primary purposes: forecasting and studying the structure of the

relationship between two variables. We can use regression to predict the value of the dependent

variable for a specified value of the independent variable. The regression equation also tells us

how the dependent variable has typically changed with changes in the independent variable.

Use regression analysis to understand the structure of the relationship between two variables.

Structural Relationship: y = a + bx

Use regression analysis to forecast y for a value of x within the historically observed range of x-

values.

Be cautious about using regression to forecast for values beyond the historically observed range of

x-values.

Exercise 1

Per-capita consumption of soft drink beverages is related to per-capita gross domestic product

(GDP). Generally, the higher the GDP of a country, the more soda its citizens consume. Soft drink

consumption is measured in number of 8-oz servings.

Based on data from 12 countries, the relationship can be expressed mathematically as:

(Per-capita soft drink consumption) = 130 + 0.018*(per-capita GDP)

Based on this relationship, you can expect that, on average, for each additional $1,000 of per-

capita GDP, a country's soda consumption increases by: 18 servings.

The regression equation tells us that in our data set, average soda consumption increases by 0.018

servings for every additional $1 of per-capita GDP. So, for an additional $1,000, average

consumption increases by ($1,000)(0.018 servings/$) = 18 servings.

The per-capita GDP in the Netherlands is $25,034. What do you predict is the average number of

servings of soda consumed in the Netherlands per year?

Enter predicted average soda consumption (in servings) as an integer (e.g., "5"). Round if

necessary.

The regression equation tells us that average soda consumption = 130 + 0.018*(per-capita GDP).

Therefore, we anticipate the Netherlands' average soda consumption to be 580.6 servings.
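
The arithmetic behind this prediction, as a one-line sketch:

servings = 130 + 0.018 * 25034   # regression equation at the Netherlands' per-capita GDP
print(servings)                  # 580.612, or about 581 servings per person per year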

Although the regression predicts a soda consumption of around 581 servings per person for the

Netherlands, the actual measured number of servings consumed is much lower: 362. The

discrepancy in the actual and predicted consumption reinforces that per-capita GDP alone is not a


perfect predictor of soda consumption.

A regression line helps you understand the relationship between two variables and forecast future

values of the dependent variable. Alice points out to you that these two features of regression

analysis make it a powerful tool for managers who make important decisions in the uncertain

world of business.

But how do you generate a regression line from observed data? Of all the straight lines that you

could draw through a scatter diagram, which one is the regression line?

Let's return to Julius Tabin's sales and advertising data. As we can see from the graph, no straight

line could be drawn that would pass through every point in the data set.

This is not surprising. Typically, advertising is not a perfect predictor of sales, so we don't expect

every data point to fall in a perfect line. The regression line depicts the best linear relationship

between the two variables. We attribute the difference between the actual data points and the line

to the influence that other variables have on sales, or to chance alone.

Since the regression line does not pass through every point, the line does not fit the data perfectly.

How accurately does the regression line represent the data?

To measure the accuracy of a line, we'll quantify the dispersion of the data around the line. Let's

look at one line we could draw through our data set.

Let's consider a second line. Click on the line that more closely fits the ten data points.

Although in this example we can see which of two lines is more accurate, it is useful to have a

precise measure of a line's accuracy.

To quantify how accurately a line fits a data set, we measure the vertical distance between each

data point and the line.

Why don't we measure the shortest distance between the point and the line — the distance

perpendicular to the line? Why do we measure vertically?

We measure vertical distance because we are interested in how well the line predicts the value of

the dependent variable. The dependent variable — in our case, sales — is measured on the vertical

axis. For each data point, we want to know: how close is the value of sales predicted by the line to

the historically observed value of sales?


From now on we will refer to this vertical distance between a data point and the line as the error in

prediction or the residual error, or simply the error. The error is the difference between the

observed value and the line's prediction for our dependent variable. This difference may be due to

the influence of other variables or to plain chance.

Going forward, we will refer to the value of the dependent variable predicted by the line as y-hat

and to the actual value of the dependent variable as y. Then the error is y - (y-hat), the difference

between the actual and predicted values of the dependent variable.

The complete mathematical description of the relationship between the dependent and

independent variables is y = a + bx + error. The y-value of any data point is exactly defined by

these terms: the value y-hat given by the regression line plus the error, y - (y-hat).

Collectively, the errors in prediction for all the data points measure how accurately a line fits a set

of data.

To quantify the total size of the errors, we cannot just sum each of the vertical distances. If we did,

positive and negative distances would cancel each other out.

Instead, we take the square of each distance and then sum all the squares, similarly to what we do

when we calculate variance.

This measure, called the Sum of Squared Errors (SSE), or the Residual Sum of Squares, gives us a

good measure of how accurately a line describes a set of data.

The less well the line fits the data, the larger the errors, and the higher the Sum of Squared Errors.
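
To make the measure concrete, here is a minimal Python sketch that computes the Sum of Squared Errors for any candidate line through Julius' data; you can compare different intercept and slope values to see which line fits more accurately.

import numpy as np

advertising = np.array([35000, 45000, 55000, 55000, 65000,
                        60000, 70000, 75000, 80000, 95000])
sales = np.array([1100000, 2105000, 3000000, 2000000, 3200200,
                  2699500, 3100000, 2900000, 4007000, 4300000])

def sum_of_squared_errors(intercept, slope):
    predicted = intercept + slope * advertising   # y-hat for each data point
    errors = sales - predicted                    # vertical distances
    return np.sum(errors ** 2)

print(sum_of_squared_errors(-333831, 50))   # roughly 1.13 trillion for the regression line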

Summary

To find the line that best fits a data set, we first need a measure of the accuracy of a line's fit: the

Sum of Squared Errors. To find the Sum of Squared Errors, we calculate the vertical distances

from the data points to the line, square the distances, and sum the squares.

Error = vertical distance from data point to line

= actual value - predicted value

= y - (y-hat)

Measure of accuracy = Sum of Squared Errors

Now that you have a way to measure how well a line fits a set of data, you need a way to

identify the line that "best fits" the data: the regression line.

We can calculate the Sum of Squared Errors for any line that passes through the data. Of

course, different lines will give us different Sums of Squared Errors. The line we are looking for


— the regression line — is the one with the smallest Sum of Squared Errors.

Let's look at several lines that could describe the relationship between advertising and sales

in our example. Our intuition tells us that the middle line is a much better fit than line a or line b.

Let's check our intuition. For each line, we can calculate the Sum of Squared Errors to

determine its accuracy.

The lower the Sum of Squared Errors, the more precisely the line fits the data, and the higher

the line's accuracy.

The line that most accurately describes the relationship between advertising and sales — the

regression line — is the line that minimizes the sum of squares. Finding the regression line for a

set of data is a calculation-intensive process best left to statistical software.

Summary

The line that most accurately fits the data — the regression line — is the line for which the

Sum of Squared Errors is minimized.

Lower SSE: higher accuracy

Lowest SSE: the regression line

Note: Unless you have installed the Excel Data Analysis ToolPak add-in, you will not be able

to do regression analysis using the regression tool. However, we suggest you read through the

following instructions to learn how Excel's regression tool works, so you can run regressions in

the future, when you do have access to the Data Analysis Toolpak.

Performing regression analysis by hand is a time-consuming process. Fortunately, statistical

software packages and major spreadsheet programs — Excel, for example — can do the necessary

calculations for you in a matter of seconds. Click on the Excel link to access the data file so you

can practice doing the analysis in Excel as you read through the instructions.

Let's go through the process step by step. We start with data entered in two columns in an

Excel spreadsheet. Each column contains values of a variable. To perform regression analysis,

there must be an equal number of entries in each column.

Under the Data tab in the toolbar we select the Data Analysis option.

A window pops open containing an alphabetical list of statistical tools. We select

"Regression" and click "OK".

A new window opens offering several options for regression analysis.

In the regression window, we see a prompt field titled "Input Y Range." In it, we enter

C1:C11, the range of cells containing the column label (C1) and the data (C2:C11) for the


dependent variable: Sales ($).

We repeat this for the prompt field titled "Input X Range," entering B1:B11 to include both

the column label (B1) and the data (B2:B11) for the independent variable: Advertising ($).

Since we included the column labels in row 1 in our ranges, we must check the "Labels" box.

Including labels is helpful because Excel uses the labels to identify the variable coefficients in the

output sheet. If you do not include the labels in your ranges, do not check the label box, or Excel

will treat the first row of data as labels, excluding those entries from the regression.

Finally, we select the output option "New Worksheet Ply:", enter the name for the new

worksheet, and click "OK."

Excel opens a new worksheet with the name we specified. In it, we see an intimidating array

of data.

For the moment, we are mainly interested in the entries in the cells labeled "Coefficients",

which specify the intercept and slope of the regression line.

Note that the label "Advertising ($)" has been carried over from the original data column.

The coefficient in the "Advertising ($)" row is the slope of the regression line.

For the exercises in this unit, we strongly recommend you find the relevant data in an Excel

spreadsheet and perform the regression analyses yourself. If you do not have the Analysis

Toolpak, you can open a file containing the relevant regression output.
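
If you do not have the Analysis ToolPak, one alternative (not part of the original course materials) is to run the same regression in Python with scipy's linregress; its output maps directly onto the Excel cells described above.

import numpy as np
from scipy.stats import linregress

advertising = np.array([35000, 45000, 55000, 55000, 65000,
                        60000, 70000, 75000, 80000, 95000])
sales = np.array([1100000, 2105000, 3000000, 2000000, 3200200,
                  2699500, 3100000, 2900000, 4007000, 4300000])

result = linregress(advertising, sales)
print(result.intercept)   # the "Intercept" coefficient in Excel's output
print(result.slope)       # the "Advertising ($)" coefficient, i.e., the slope
print(result.rvalue)      # "Multiple R" in Excel's output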

Exercise 1

To practice using Excel's regression tool, run a regression using the world soft drink

consumption data from an earlier exercise. Use soft drink consumption for the dependent variable

and per capita GDP for the independent variable.

What is the slope of the regression line? Enter the slope as a decimal number with 3 digits to the

right of the decimal point (e.g., enter "5" as "5.000"). Round if necessary.

We run the regression by selecting range C1:C13 for the Y-range, the dependent variable

consumption, and B1:B13 for the X-range, the independent variable GDP per capita. We check the

label box, and see the output below. The slope of the regression line is the coefficient of the

independent variable, GDP per capita: here, 0.018.

What is the intercept of the regression line? Enter the intercept as an integer (e.g., "5"). Round if

necessary.

The intercept of the line is the coefficient labeled "Intercept": here, 130.


Deeper Into Regression

Equipped with the basic tools needed to find and interpret the regression line, you feel ready

to tackle Leo's assignment. But Alice cautions you not to be hasty and urges you to consider some

tricky questions: "How well does the regression line actually characterize the relationship in the

data? Is a straight line even a good descriptor of the relationship?"

How much does the relationship between advertising and sales help us understand and

predict sales? We'd like to be able to quantify the predictive power of the relationship in

determining sales levels. How much more do we know about sales thanks to the advertising data?

To answer this question we need a benchmark telling us how much we know about the

behavior of sales without the advertising data. Only then does it make sense to ask how much

more information the advertising data give us.

Without the advertising data, we have the sales data alone to work with. Using no

information other than the sales data, the best predictor for future sales is simply the mean of

previous sales. Thus, we use mean sales as our benchmark, and draw a "mean sales line" through

the data.

Let's compare the accuracy of the regression line and the mean sales line. We already have a

measure of how accurately an individual line fits a set of data: the Sum of Squared Errors about

the line. Now we want a measure of how much more accurate the regression line is than the mean

line.

To obtain such a measure, we'll calculate the Sum of Squared Errors for each of the two lines,

and see how much smaller the error is around the regression line than around the mean line.

The Sum of Squared Errors for the mean sales line measures the total variation in the sales

data. In fact, it is the same measure of variation we use to derive the standard deviation of sales.

We call the Sum of Squared Errors for our benchmark — the mean sales line — the Total Sum of

Squares. Here, the Total Sum of Squares is 8.01 trillion.

The difference between the Total Sum of Squares and the Residual Sum of Squares, 6.88

trillion in this case, is called the Regression Sum of Squares. The Regression Sum of Squares

measures the variation in sales "explained" by the regression line.

Excel's regression output reports all three of these terms.

A standardized measure of the regression line's explanatory power is called R-squared. R-

squared is the fraction of the total variation in the dependent variable that is explained by the

regression line.


R-squared will always be between 0 and 1 — at worst, the regression line explains none of

the variation in sales; at best it explains all of it.

R-squared is presented either as a fraction, a percentage, or a decimal. We find that in the

advertising and sales example, the R-squared value is 6.88 trillion/8.01 trillion = 0.859 =

85.9%.

Equivalently, we can divide the Residual Sum of Squares by the Total Sum of Squares to find the fraction of unexplained variation, and then subtract that fraction from 1 to obtain R-squared.

Fortunately, we don't need to calculate R-squared ourselves — Excel computes R-squared

and includes it in the standard regression output.
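
As a check on the arithmetic, here is a minimal Python sketch that computes all three sums of squares and R-squared for the advertising example, using the data and regression coefficients reported above.

import numpy as np

advertising = np.array([35000, 45000, 55000, 55000, 65000,
                        60000, 70000, 75000, 80000, 95000])
sales = np.array([1100000, 2105000, 3000000, 2000000, 3200200,
                  2699500, 3100000, 2900000, 4007000, 4300000])

predicted = -333831 + 50 * advertising              # y-hat from the regression line

total_ss = np.sum((sales - sales.mean()) ** 2)      # about 8.01 trillion
residual_ss = np.sum((sales - predicted) ** 2)      # about 1.13 trillion
regression_ss = total_ss - residual_ss              # about 6.88 trillion

r_squared = regression_ss / total_ss
print(r_squared)   # about 0.859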

In a regression that has only one independent variable, R-squared is closely related to the

correlation coefficient between the independent and dependent variables: the correlation

coefficient is simply the positive or negative square root of R-squared; positive if the slope of the

regression line is positive and negative if the slope of the regression line is negative.

Excel's regression output always computes the square root of R-squared, which it labels

"Multiple R." NOTE1032

Summary

R-squared measures how well the behavior of the independent variable explains the behavior

of the dependent variable. R-squared is the ratio of the Regression Sum of Squares to the Total

Sum of Squares. As such, it tells us what proportion of the total variation in the dependent

variable is explained by its linear relationship with the independent variable.
