7/24/2019 0000 Stats Practice
Yale SOM
MGT 403: Statistics
Practice Problem Set - P1-1
Introduction. You have been hired to study the evolution of executive compensation over time. Specifically, you will study how CEOs' salaries vary between different sectors and how they are related to a company's sales in the early 1990s. You receive data on a random sample of CEOs, which is contained in ceosalary1.dta. Type describe to see the contents of this data set.
Question 1
(a) There are two hypotheses concerning CEO compensation in the early 1990s. One is that average CEO salaries were at most $1,000,000. Another concerns the default belief that average CEO salaries were actually $1,200,000. You want to test these two hypotheses. Note that the data to test them is contained in the variable salary (which measures CEO salary in $1000). Can you reject the null hypotheses, at the 5% level, implied by these tests?
To answer this question, write down the following steps for each test:
1. The null hypothesis
2. The alternative hypothesis
3. The formula for the realization of the test statistic
4. The rejection region: for which values of the test statistic you reject the null hypothesis
Now use the data to carry out the two tests.
First, do it manually by typing summarize salary or by using the User Menu, Summarize and Describe Data, Simple Summary Statistics (summarize) to input this command. Use the result to calculate the realization (or outcome) of the test statistics.
Can you reject the null hypotheses? Why or why not?
Second, check your answer by conducting a test of means in Stata. You can use Simple Test of Association, Test of Means in the Stata User menu.
What are the p-values for each of these two tests? Based on the p-value, can you reject the null hypothesis at the 5% level for each test? Explain why.
(b) Compute the 95% confidence interval for the unknown population mean of CEO salaries, by
writing down the formula for the confidence interval
using the results from the command summarize salary to compute the confidence interval.
Does 1200 fall in the confidence interval?
Yale SOM
MGT 403: Statistics
Practice Problem Set - P1-2
The Internet portal Yahoo may allow its members to customize their start pages
(homepages). As part of a short survey regarding likes and dislikes, users were asked
about their interests in options such as QuickTime movie clips with daily news and
sports events on their pages. Yahoo hopes that QuickTime will entice users to follow
a larger number of hyperlinks so that it can attract more advertisers.
The newly customized page option was made available to 100 Internet users who
were randomly sampled from the target population. The benchmark for Yahoo is 6
non-Yahoo content links clicked by all its customers on average prior to the
availability of the QuickTime option (during any one-week period).
After one week of access to the customized homepage option, Yahoo observes
the (average) number of non-Yahoo links for each customer. For the sample of 100
customers, the average is 7.8 links and the standard deviation is 9.5 links.
1. Test the two-tailed null hypothesis that the customization with QuickTime
does not alter the true average (benchmark) number of links at the 5% level
(critical value is 1.96). Specify null and alternative hypotheses, compute the
value for the test statistic and state whether you can reject or not the null
hypothesis and why.
2. Construct a 95 percent confidence interval for the true but unknown population
parameter. Interpret the resulting interval statistically and managerially.
3. Which of these two procedures is more informative, the test of the null
hypothesis or the confidence interval? Explain.
Yale SOM
MGT 403: Statistics
Practice Problem Set P1-1 Answers
Question 1
(a) To test the research hypothesis that the mean of salary is at most (less than or equal to) 1000, we have
1. The null hypothesis: \( H_0 : \mu > 1000 \)
2. The alternative hypothesis: \( H_a : \mu \le 1000 \)
3. The formula for the realization of the test statistic: \( t = \dfrac{\bar{x} - 1000}{\hat{\sigma}/\sqrt{N}} \)
4. The rejection region: reject if \( t < -1.65 \).
To test if the mean of salary is equal to 1200:
1. The null hypothesis: \( H_0 : \mu = 1200 \)
2. The alternative hypothesis: \( H_a : \mu \neq 1200 \)
3. The formula for the realization of the test statistic: \( t = \dfrac{\bar{x} - 1200}{\hat{\sigma}/\sqrt{N}} \)
4. The rejection region: reject if \( |t| > 1.96 \) (this is the same as saying that the rejection region is \( t < -1.96 \) or \( t > 1.96 \)).
For the manual calculation of the realization of the test statistic we need the mean of salaries in the sample, the standard deviation, and the number of observations. We get all of these from Stata's summarize command.
. summarize salary
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
salary | 206 1141.063 611.193 223 4143
Hence the realization or value for the test statistic for the first test equals
\[ t = \frac{1141.063 - 1000}{611.193/\sqrt{206}} = 3.313. \]
The realization or value for the test statistic for the second test equals
\[ t = \frac{1141.063 - 1200}{611.193/\sqrt{206}} = -1.384. \]
For the first test, since the value for t is 3.3, which is not smaller than -1.65, we cannot reject the null hypothesis in favor of the alternative that the mean of salaries in the population of CEOs is at most $1,000,000 at the 5% level. For the second test, since the value for the test statistic t of -1.384 is not in the rejection region of t < -1.96 or t > 1.96, we also cannot reject the null hypothesis that the mean of salaries in the population of CEOs is equal to $1,200,000 at the 5% level.
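The two realizations and the resulting decisions can be reproduced from the summary statistics alone. The following is a minimal sketch in Python rather than Stata, so that it runs without ceosalary1.dta; the function name one_sample_t is an illustrative choice and not part of the problem set.

```python
import math

def one_sample_t(mean, sd, n, mu0):
    """t statistic for the null hypothesis that the population mean equals mu0."""
    standard_error = sd / math.sqrt(n)
    return (mean - mu0) / standard_error

# Values copied from the summarize salary output (salary is in $1000).
t_first = one_sample_t(1141.063, 611.193, 206, 1000)
t_second = one_sample_t(1141.063, 611.193, 206, 1200)

# Rejection regions stated in the answer: t < -1.65 and |t| > 1.96.
reject_first = t_first < -1.65
reject_second = abs(t_second) > 1.96

print(round(t_first, 3), reject_first)
print(round(t_second, 3), reject_second)
```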
We get the same results using the ttest command in Stata. Note that when Stata sets as a default "95% confidence level", it is just asking you if you would like to see the 95% confidence interval for the unknown population mean of CEO salaries together with the value for t.
. ttest salary == 1000
One-sample t test
------------------------------------------------------------------------------
Variable | Obs Mean Std. Err. Std. Dev. [95% Conf. Interval]
---------+--------------------------------------------------------------------
salary | 206 1141.063 42.58383 611.193 1057.105 1225.022
------------------------------------------------------------------------------
mean = mean(salary) t = 3.3126
Ho: mean = 1000 degrees of freedom = 205
Ha: mean < 1000 Ha: mean != 1000 Ha: mean > 1000
Pr(T < t) = 0.9995 Pr(|T| > |t|) = 0.0011 Pr(T > t) = 0.0005
. ttest salary == 1200
One-sample t test
------------------------------------------------------------------------------
Variable | Obs Mean Std. Err. Std. Dev. [95% Conf. Interval]
---------+--------------------------------------------------------------------
salary | 206 1141.063 42.58383 611.193 1057.105 1225.022
------------------------------------------------------------------------------
mean = mean(salary) t = -1.3840
Ho: mean = 1200 degrees of freedom = 205
Ha: mean < 1200 Ha: mean != 1200 Ha: mean > 1200
Pr(T < t) = 0.0839 Pr(|T| > |t|) = 0.1679 Pr(T > t) = 0.9161
The p-value for the first test is 0.9995. Since the p-value is greater than 5% we cannot reject the null hypothesis in favor of the alternative that average salaries in the population of CEOs are, at most, $1,000,000 at the 5% level. The p-value for the second test is 0.1679. Since the p-value is greater than 5% we also cannot reject the null hypothesis that average salaries in the population of CEOs are $1,200,000, at the 5% level.
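As a rough cross-check of the two p-values, the sketch below uses the standard normal distribution from the Python standard library in place of the t distribution with 205 degrees of freedom that Stata uses. That substitution is an assumption made only to keep the example self-contained, so the results are close to, but not identical to, the Stata figures.

```python
from statistics import NormalDist

# Standard normal, used here as a large-sample stand-in for the t distribution.
z = NormalDist()

# t values taken from the two ttest outputs.
t_first, t_second = 3.3126, -1.3840

# One-sided p-value for Ha: mean < 1000 (compare with Pr(T < t)).
p_left_first = z.cdf(t_first)

# Two-sided p-value for Ha: mean != 1200 (compare with Pr(|T| > |t|)).
p_two_sided_second = 2 * z.cdf(-abs(t_second))

print(round(p_left_first, 4), round(p_two_sided_second, 4))
```

Both values are well above 0.05, which matches the decision not to reject in either test.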
(b) The formula for the 95% confidence interval for the mean of salary in our CEO population is
\[ \left[ \bar{x} - 1.96\,\frac{\hat{\sigma}}{\sqrt{N}},\; \bar{x} + 1.96\,\frac{\hat{\sigma}}{\sqrt{N}} \right]. \]
With the results from summarize salary above we get
\[ \left[ 1141.063 - 1.96\,\frac{611.193}{\sqrt{206}},\; 1141.063 + 1.96\,\frac{611.193}{\sqrt{206}} \right] = [1057.6, 1224.5]. \]
Thus we are 95% confident that the true mean of CEO salaries is between 1057.6 and 1224.5 (in $1000). Note that the interval contains 1200 and that it is almost equal to the interval given in the results for the command ttest salary == 1200. The small difference arises because Stata uses the critical value of the t distribution with 205 degrees of freedom (about 1.97) rather than 1.96.
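The interval can be recomputed directly from the summary statistics. This is a small Python sketch under the same large-sample assumption as the formula (critical value 1.96); mean_ci is a hypothetical helper name introduced here for illustration.

```python
import math

def mean_ci(mean, sd, n, critical=1.96):
    """Large-sample confidence interval for a population mean."""
    half_width = critical * sd / math.sqrt(n)
    return mean - half_width, mean + half_width

# Values copied from the summarize salary output (salary is in $1000).
lower, upper = mean_ci(1141.063, 611.193, 206)
contains_1200 = lower <= 1200 <= upper

print(round(lower, 1), round(upper, 1), contains_1200)
```

Passing a different critical value (for example one taken from a t table) shows how sensitive the end points are to that choice.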
Yale SOM
MGT 403: Statistics
Practice Problem Set - P1-2- Answers
The Internet portal Yahoo may allow its members to customize their start pages
(homepages). As part of a short survey regarding likes and dislikes, users were asked
about their interests in options such as QuickTime movie clips with daily news and
sports events on their pages. Yahoo hopes that QuickTime will entice users to follow
a larger number of hyperlinks so that it can attract more advertisers.
The newly customized page option was made available to 100 Internet users who
were randomly sampled from the target population. The benchmark for Yahoo is 6
non-Yahoo content links clicked by all its customers on average prior to the
availability of the QuickTime option (during any one-week period).
After one week of access to the customized homepage option, Yahoo observes
the (average) number of non-Yahoo links for each customer. For the sample of 100
customers, the average is 7.8 links and the standard deviation is 9.5 links.
1. Null hypothesis: \( H_0 : \mu = 6 \)
Alternative hypothesis: \( H_a : \mu \neq 6 \)
Computing the value for the test statistic:
\[ t = \frac{\bar{x} - \mu_0}{\sqrt{\hat{\sigma}^2/N}} = \frac{7.8 - 6}{\sqrt{9.5^2/100}} = 1.89 \]
Given that t = 1.89 is not in the rejection region for a two-sided test at
the 5% level (t > 1.96 or t < -1.96), we cannot reject the null hypothesis.
2. The 95% confidence interval for the true but unknown mean of the number of
links is:
\[ 7.8 \pm 1.96 \times \frac{9.5}{10} = 7.8 \pm 1.9 = (5.9, 9.7) \]
Therefore, we can be quite sure or confident (95 percent) that the true but
unknown population mean is between 5.9 and 9.7 links.
3. Neither is more informative than the other; it depends on the type of question
one is asking. The confidence interval gives us a range within which we are 95%
confident that the population mean falls. It is useful when we want to get
a sense of what the population mean could be. A hypothesis test, in contrast,
allows us to answer a different question: whether a specific hypothesis about
the population is true (supported by the data) or not.
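The test in part 1 and the interval in part 2 use the same standard error, so they can be checked together. The sketch below is in Python and works only from the summary figures given in the problem; the variable names are illustrative.

```python
import math

# Summary figures from the problem statement.
n, sample_mean, sample_sd, benchmark = 100, 7.8, 9.5, 6

standard_error = sample_sd / math.sqrt(n)

# Part 1: two-sided test of H0: mu = 6 at the 5% level.
t = (sample_mean - benchmark) / standard_error
reject = abs(t) > 1.96

# Part 2: 95% confidence interval for the population mean.
lower = sample_mean - 1.96 * standard_error
upper = sample_mean + 1.96 * standard_error

print(round(t, 2), reject)
print(round(lower, 1), round(upper, 1))
```

The benchmark of 6 lies inside the interval, which is the same statement as not rejecting the null hypothesis at the 5% level.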
Yale SOM
MGT 403: Statistics
Practice Problem Set P2-1
Introduction. You have been asked to analyze the relationship between research and development (R&D) spending and sales of firms in the chemical and telecommunications industries. You receive data on a random sample of firms contained in the data set rd.dta. Type describe to see the contents of the data set. The binary (dummy) variable chem is equal to one if the firm is in the chemical industry and equal to zero if the firm is in the telecommunications industry.
Question 1
1. Run the regression of sales as a function of R&D.
2. Does the estimated coefficient suggest that sales and R&D spending are positively or negatively correlated?
3. By how much do sales increase or decrease on average when R&D spending increases by one million dollars?
4. Is this effect significantly different from zero at the 5% level and why?
5. What is the interpretation of the estimate for the intercept parameter \( \beta_0 \)?
6. How much of the variation in sales does the variation in R&D spending explain?
Question 2
After analyzing the relationship between prices you are asked how the returns of the DJIA and GE are related: when one increases, does the other decrease or vice versa? Or when one increases does the other also increase? To investigate this question, you first have to generate the returns using the User menu command Manipulate Variables and Obs, Generate New Variable and the formula
\[ \text{return}_t = 100 \times \frac{\text{price}_t - \text{price}_{t-1}}{\text{price}_{t-1}}. \]
Since each observation in our data set represents one date and the observations are chronologically sorted, we can implement this formula in Stata by 100 * (close_DJIA - close_DJIA[_n-1]) / close_DJIA[_n-1], for example, for DJIA. Here [_n-1] means that we are taking the observation from the previous period. Do the same for GE returns.
1. Plot the relationship between the return for the GE stock and the return for DJIA. Is the relationship increasing or decreasing?
2. We can define the beta of a given stock as \( \beta_a = \dfrac{\operatorname{cov}(\text{return}_a, \text{return}_p)}{\operatorname{var}(\text{return}_p)} \), where \( \text{return}_a \) and \( \text{return}_p \) are the returns of the stock in question and the stock market index, respectively, and the risk free rate is constant over time. Given the previous plot, should the beta of the GE stock be positive or negative? Explain why.
Yale SOM
MGT 403: Statistics
Practice Problem set P2-1 Answers
Question 1
1. The Stata command for the regression is regress sales rd, robust, which yields the following output:
Linear regression Number of obs = 61
F( 1, 59) = 42.28
Prob > F = 0.0000
R-squared = 0.7971
Root MSE = 3542
------------------------------------------------------------------------------
| Robust
sales | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
rd | 18.00799 2.769413 6.50 0.000 12.46641 23.54957
_cons | 1040.966 397.866 2.62 0.011 244.8379 1837.094
------------------------------------------------------------------------------
2. The estimated parameter \( \hat{\beta}_1 \) equals about 18, a positive number, which suggests that R&D spending and sales are positively correlated.
3. When R&D spending increases by one million dollars, predicted sales increase by 18 million dollars.
4. This effect is significantly different from zero at the 5% level since the value of the t-statistic reported in the regression output is equal to 6.5, which is in the rejection region of t < -1.96 or t > 1.96. Accordingly, the p-value is nearly zero, well below 5%.
5. The estimate for the intercept, \( \hat{\beta}_0 \), is 1041, which implies that predicted sales are 1041 million dollars when R&D spending equals zero.
6. The R-squared value tells us that variation in R&D spending explains about 80% of the variation in sales. The fit is quite good.
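The answers to parts 3 to 5 follow mechanically from the coefficient table. The Python sketch below copies the estimates from the regression output; predicted_sales is a hypothetical helper name, and units are millions of dollars as in the answers.

```python
# Estimates copied from the regress sales rd, robust output.
coef_rd, se_rd = 18.00799, 2.769413
intercept = 1040.966

def predicted_sales(rd):
    """Fitted sales for a given level of R&D spending."""
    return intercept + coef_rd * rd

# Part 3: effect of one additional unit of R&D spending on predicted sales.
marginal_effect = predicted_sales(1) - predicted_sales(0)

# Part 4: t statistic for H0: slope = 0, compared with the 1.96 critical value.
t_rd = coef_rd / se_rd
significant = abs(t_rd) > 1.96

# Part 5: predicted sales when R&D spending is zero.
sales_at_zero_rd = predicted_sales(0)

print(round(marginal_effect, 2), round(t_rd, 2), significant, sales_at_zero_rd)
```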
Question 2
First generate the returns for DJIA and GE using the following commands:
generate return_DJIA = 100 * (close_DJIA - close_DJIA[_n-1]) / close_DJIA[_n-1]
generate return_GE = 100 * (close_GE - close_GE[_n-1]) / close_GE[_n-1]
1. We plot the returns of GE against the returns of DJIA using the Stata command twoway (scatter return_GE return_DJIA). See Figure 1.
Figure 1: Question I.4
2. From the graph we see that the relationship between GE returns and DJIA returns is positive: higher DJIA returns are associated with higher GE returns. Thus the covariance in this sample between GE and DJIA returns is positive, so the numerator for beta, which measures the covariance between GE returns and DJIA returns, is positive. A variance is never negative (recall that a variance involves sums of squared terms, and squared terms are always nonnegative), so the denominator for beta is positive. This implies that the beta of the GE stock in this sample is positive.
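The return formula and the definition of beta can be illustrated on a toy example. The Python sketch below uses made-up closing prices that are purely hypothetical (they are not the DJIA or GE data), and the helper names pct_returns and beta are introduced here for illustration. Unlike Stata's generate, which leaves the first observation missing, the sketch simply drops it.

```python
def pct_returns(prices):
    """Percentage return 100 * (p_t - p_{t-1}) / p_{t-1} for each date after the first."""
    return [100 * (cur - prev) / prev for prev, cur in zip(prices, prices[1:])]

def beta(stock_returns, market_returns):
    """Sample covariance of stock and market returns divided by the sample variance of market returns."""
    n = len(market_returns)
    mean_s = sum(stock_returns) / n
    mean_m = sum(market_returns) / n
    cov = sum((s - mean_s) * (m - mean_m)
              for s, m in zip(stock_returns, market_returns)) / (n - 1)
    var = sum((m - mean_m) ** 2 for m in market_returns) / (n - 1)
    return cov / var

# Hypothetical closing prices, invented for this example only.
index_prices = [100.0, 102.0, 101.0, 105.0]
stock_prices = [50.0, 52.0, 51.0, 54.0]

market_returns = pct_returns(index_prices)
stock_returns = pct_returns(stock_prices)
stock_beta = beta(stock_returns, market_returns)

print(market_returns, stock_returns, stock_beta > 0)
```

Because the toy stock moves in the same direction as the toy index on every date, the covariance and therefore the beta are positive, mirroring the argument made for GE.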
Yale SOM
MGT 403: Statistics
Practice Problem Set P2-2
A consulting firm wants to get a better understanding of its cost structure based
on data on costs incurred for projects in the past so as to improve its bidding process
for projects. Experience suggests that there are two main components of costs in a
project: (1) variable costs that are directly related to the size of the project, which
is reasonably proxied by the number of person-hours for the project, and (2) fixed
costs, which are incurred irrespective of the size of the project.
A regression of the total costs (in $) against the number of person-hours based
on data on 42 projects gave the following results:
Linear regression Number of obs = 42
F( 1, 40) = 157.8
Prob > F = 0.000
R-squared = 0.87
Root MSE = 2979
------------------------------------------------------------------------------
| Robust
totalcost | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
Person-hours | 372.15 29.629 12.6 0.000 311.0 433.3
_cons | 3209.76 1387.962 2.31 0.030 345.1 6074.4
------------------------------------------------------------------------------
1. Test the null hypothesis that the slope parameter is zero. State the hypotheses
in appropriate symbols, state the p-value, and interpret the result.
2. Define the 95 percent confidence interval for the true slope parameter and
interpret this interval.
3. Assuming that the equation is a reasonable approximation of the nature of
project costs, interpret the slope coefficient precisely in a manner
understandable by a layperson.
4. What is the best estimate of fixed costs?
5. What is the predicted total cost for a project that will employ 1,000 person-hours?
Yale SOM
MGT 403: Statistics
Practice Problem Set P2-2-Answers
1. \( H_0 : \beta_1 = 0 \)
\( H_a : \beta_1 \neq 0 \)
The p-value for the test is 0.000.
Since the p-value is less than 5% we can reject the null hypothesis at the
5% level.
We can therefore state that there is a relationship between total costs
of a project and the number of person-hours required to complete the
project in our sample and that this relationship is very significant (it
is not likely that, given our data, the relationship does not exist)
2. The 95% confidence interval for the slope parameter is (311.0, 433.3). We
can state with 95% confidence that each additional person-hour increases the
total cost of a project by an amount between $311.0 and $433.3.
3. If the number of person-hours for a project increases by an hour, the total cost
of the project is expected to increase by $372.
4. The best estimate of fixed costs is given by the intercept: the cost of the
project when the number of person-hours is zero. This cost is $3,210.
5. The predicted total cost for a project that will employ 1,000 person-hours is
3,210 + 372 × 1,000 = $375,210
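Parts 4 and 5 amount to evaluating the fitted line. The Python sketch below uses the rounded coefficients from the answers, plus the unrounded slope and its standard error from the output for the t statistic; predicted_total_cost is a hypothetical helper name.

```python
# Rounded estimates as used in the answers (dollars).
fixed_cost = 3210       # intercept
cost_per_hour = 372     # slope on person-hours

def predicted_total_cost(person_hours):
    """Fitted total cost for a project of the given size."""
    return fixed_cost + cost_per_hour * person_hours

# t statistic for H0: slope = 0, from the coefficient and robust standard error in the output.
t_slope = 372.15 / 29.629

print(predicted_total_cost(0), predicted_total_cost(1000), round(t_slope, 1))
```

Using the unrounded coefficients instead would change the prediction slightly, so the rounding should be stated when reporting the figure.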
Yale SOM
MGT 403: Statistics
Practice Problem Set P3-1
Jim Douglas, the manager of Colonial Furniture, has been reviewing weekly
advertising expenditures. All of his advertising thus far has been focused on radio.
He is interested in learning how the effect of advertising might differ across different
media. He recorded the following variables:
Sales: Number of customers in each week (individuals visiting an outlet)
# Ads: The number of ads in the week
Medium (1=radio, 2=television).
1. Jim recalled from a class he had taken, that regression analysis could be used to
estimate the effects of the different media. He proposed the following regression
model:
\( \text{Sales} = \beta_0 + \beta_1 \text{Ads} + \beta_2 \text{Medium} \)
and he seeks your advice. Would you propose an alternative model? If so,
explain the problem with Jim's model. Write out your proposed model explicitly
in the form of an equation.
2. Jim then created one indicator variable: Radio (1 if radio, 0 otherwise, that is,
television). He then ran a regression of Sales against #Ads and Radio. The
results are reported below:
Linear regression Number of obs = 52
F( 2, 49) = 14.91
Prob > F = 0.0000
R-squared = 0.69
Root MSE = 44.87
--------------------------------------------------------------------------
| Robust
sales | Coef. Std. Err. t P>|t| [95% Conf. Inter
-------------+------------------------------------------------------------
Ads | 25 3.98 6.34 0.00 17.23 33.2
Radio | -47 16.44 -2.83 0.01 -79.64 -13.5
_cons | 283 17.46 16.19 0.00 247.50 317.6
--------------------------------------------------------------------------
Interpret the effect for Radio precisely.
3. What is the effect of a 1 unit increase in the number of ads? State this result
precisely.
4. What is the 95% prediction interval for sales when the company airs 50 ads on
television?
Yale SOM
MGT 403: Statistics
Practice Problem Set P3-1-Answers
1. Medium is a categorical variable (with values 1 or 2). So I would not include
it directly. I would create a dummy variable for the Medium category. For
example, Radio (1 if radio; 0 otherwise, that is, TV)
I would then estimate the model (treating TV as the base category):
\( \text{Sales} = \beta_0 + \beta_1 \text{Ads} + \beta_2 \text{Radio} \)
2. For any given level of advertising (number of ads), radio ads are expected to
produce 47 fewer customers relative to ads shown on TV.
3. An increase in the number of ads per week by 1 is expected to increase the
number of customers per week by 25, holding the medium through which the
ads are transmitted constant.
4. To answer this question, we first need to compute the predicted sales value
when airing 50 ads on television.
Sales = 283 + 25*50 + (-47)*0 = 1,533 customers
Then the 95% prediction interval for sales is:
1,533 ± 1.96 × 44.87 = (1,445.1; 1,620.9)
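The dummy coding from part 1 and the prediction from part 4 can be put together in a few lines. The Python sketch below uses the coefficients and root MSE from the regression output; the medium list is hypothetical example data, and the interval is the same approximation used in the answer (point prediction plus or minus 1.96 times the root MSE), which ignores estimation uncertainty in the coefficients.

```python
# Hypothetical weeks coded as in the Medium variable: 1 = radio, 2 = television.
medium = [1, 2, 2, 1]

# Recode into the indicator used in the regression: 1 if radio, 0 if television.
radio = [1 if m == 1 else 0 for m in medium]

# Estimates copied from the regression output.
b_cons, b_ads, b_radio = 283, 25, -47
root_mse = 44.87

def predicted_customers(ads, radio_dummy):
    """Fitted number of customers per week."""
    return b_cons + b_ads * ads + b_radio * radio_dummy

# 50 ads on television means the Radio dummy is 0.
point = predicted_customers(50, 0)
lower = point - 1.96 * root_mse
upper = point + 1.96 * root_mse

print(radio, point, round(lower, 1), round(upper, 1))
```

Calling predicted_customers(50, 1) gives the radio counterpart, which is lower by exactly the Radio coefficient.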
Yale SOM
MGT 403: Statistics
Practice Problem Set P3-2
Data Set and Questions
You have been hired to investigate the relationship between individuals' physical
attractiveness and their wage. You receive the data set beauty3.dta, which contains
data on the wage and other characteristics, such as education and years of experience,
for a random sample of individuals.
The data set also contains the variable looks, which measures a given individual's
subjective physical attractiveness. The variable looks encompasses five categories,
where 5 denotes the highest level of attractiveness and 1 denotes the lowest level of
attractiveness. The binary zero-one variable belavg is derived from looks: belavg
is equal to 1 if looks equals 1 or 2 and 0 otherwise.
1. By how much more/less do individuals with below average looks earn per hour,
on average, relative to individuals with average/above average looks? Run the
appropriate regression and answer the question.
2. Is the above estimate significant at the 5% level?
3. Does experience attenuate the "looks" advantage? Run the appropriate
regression and answer the question.
4. How much does the variation in looks and experience explain the variation in
wages?
Yale SOM
MGT 403: Statistics
Statistics Practice PS 3-2 Answers
1. Regression: The Stata command is regress wage belavg, robust, which
yields the following output:
Linear regression Number of obs =
F( 1, 1257) = 1
Prob > F = 0.
R-squared = 0.
Root MSE = 4.
--------------------------------------------------------------------------
| Robust
wage | Coef. Std. Err. t P>|t| [95% Conf. Inter
-------------+------------------------------------------------------------
belavg | -1.118143 .3120741 -3.58 0.000 -1.730386 -.505
_cons | 6.387627 .128631 49.66 0.000 6.135272 6.63
--------------------------------------------------------------------------
Individuals with below-average looks earn $1.12 less per hour.
2. The coefficient on belavg of -1.12 is significant at the 5% level since the value
for the t-statistic (-3.58) is lower than the critical value of -1.96, which implies
that one can reject the null hypothesis at the 5% level. Recall that the rejection
region for a large sample two-sided test that each of the regression coefficients
is equal to zero, at the 5% level, is t < -1.96 or t > 1.96.
3. regress wage belavg exper, robust
Linear regression Number of obs =
F( 2, 1256) = 5
Prob > F = 0.
R-squared = 0.
Root MSE = 4.
--------------------------------------------------------------------------
| Robust
wage | Coef. Std. Err. t P>|t| [95% Conf. Inter
-------------+------------------------------------------------------------
belavg | -1.270895 .29984 -4.24 0.000 -1.859137 -.682
exper | .0966187 .0093563 10.33 0.000 .078263 .114
_cons | 4.646653 .1693257 27.44 0.000 4.314461 4.97
--------------------------------------------------------------------------
It seems that experience does not attenuate the advantage of looks. Holding
experience constant, those with below-average looks earn 1.27 dollars less per hour
than those with average or above-average looks. Further, this coefficient is statistically
significant at the 5% level as the p-value of 0.000 is less than 5%.
4. The variation in looks and experience explains only 8% of the variation in wages.
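The numbers behind parts 1 to 3 can be read off the two coefficient tables. The Python sketch below copies the estimates from the outputs; the variable names are illustrative. With a single dummy regressor, the intercept is the mean wage of the base group (average or above-average looks) and the intercept plus the coefficient is the mean wage of the below-average group.

```python
# First regression: regress wage belavg, robust.
coef_belavg, se_belavg, cons = -1.118143, 0.3120741, 6.387627

mean_wage_base = cons                  # belavg == 0
mean_wage_below = cons + coef_belavg   # belavg == 1

# t statistic for H0: coefficient on belavg = 0.
t_belavg = coef_belavg / se_belavg
significant = abs(t_belavg) > 1.96

# Second regression adds exper; the belavg coefficient becomes -1.270895.
coef_belavg_with_exper = -1.270895
gap_shrinks = abs(coef_belavg_with_exper) < abs(coef_belavg)

print(round(mean_wage_below, 2), round(t_belavg, 2), significant, gap_shrinks)
```

The estimated wage gap does not shrink once experience is held constant, which is the basis for the statement that experience does not attenuate the advantage of looks.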
Yale SOM
MGT 403: Statistics
Sample Exam Questions
Administrative Details
This final is open book. You can consult your class notes, problem set solutions
and other materials. But you cannot discuss the exam with anyone; doing so
constitutes a violation of the honor code. Show all your work, including all the Stata
output relevant to answer the questions.[1]
Sample Exam Question 1
The Nielsen Media organization conducts tests of commercials in its laboratories.
The firm regularly invites members of identified target markets to its premises.
Attendees are shown one or more television programs in which commercials are
embedded, and asked questions about products and other aspects both before and after
they view programs.
Each study is typically sponsored by a single company such as Procter & Gamble
(P&G). On November 29, 2002, Nielsen Media Research did a study on a brand
that was not performing well in the market. P&G was interested in whether new
commercials it proposes to air might change target members' preferences for the
brand.
A total of 32 consumers participated in the study. They first provided preference
and perception data on multiple brands. Then they watched two TV programs
with a standard number of commercials. Thereafter they provided preference and
perception data on some of the same brands and other brands. (Researchers also
obtained brain scanner analyses based on principles of neuromarketing but those
data are ignored here.)
[1] Though all the sample questions already show the Stata output, you will have to create your
own Stata output when answering the questions in the exam.
The data of interest pertain to brand X on which consumer preferences were
obtained both before and after the TV programs (with relevant commercials on brand
X as part of the TV program). The rating was on a 5 point scale, where 5=great
and 1=lousy. The sample data about the preferences of the brand are summarized
below:
1. You have taken a regression course and want to use this "promising analytical
technique". You create a dependent variable with 32 "before" and 32 "after"
preference scores for brand X. The regression includes only one dummy
(indicator) variable, AFTER, to distinguish the two categories of observations:
AFTER = 1 if after, 0 if before.
Linear regression Number of obs = 64
F( 1, 62) = 1.89
Prob > F = 0.17
R-squared = 0.18
Root MSE = 1.53
--------------------------------------------------------------------------
Robust
   ratings |   Coef.  Std. Err.       t   P>|t|   [95% Conf. Inter
-----------+--------------------------------------------------------------
     After |   0.551     0.401    1.375   0.175    -0.251   1.3
     _cons |   2.068     0.283    7.293   0.000     1.500   2.6
--------------------------------------------------------------------------
What is the average preference in the sample for brand X before the TV programs?
2. What is the average preference in the sample for brand X after the TV programs?
3. Based on the above regression, do the ads for brand X have a statistically
significant effect on the average preference for the brand in the target market?
Be precise and show relevant numbers.
4. Your exposure to regression analysis suggests that it may be useful to include
other variables so as to improve the understanding of effects of interest. So
you decide to add two independent variables: PPur = 1, if the consumer has
purchased the product in the past, = 0 if not; Male = 1 if male, = 0 if female.
Linear regression Number of obs = 64
F( 3, 60) = 35.5
Prob > F = 0.00
R-squared = 0.81
Root MSE = 0.92
--------------------------------------------------------------------------
Robust
   ratings |   Coef.  Std. Err.       t   P>|t|   [95% Conf. Inter
-----------+--------------------------------------------------------------
     After |   0.521     0.240    2.170   0.030     0.051   0.9
      PPur |   2.452     0.254    9.649   0.000     1.942   2.9
      Male |  -0.047     0.201   -0.234   0.815    -0.441   0.3
     _cons |   0.742     0.237    3.126   0.002     0.266   1.2
--------------------------------------------------------------------------
Based on the above analysis, do the ads have a statistically significant effect
on the average preference for the brand in the target market? Be precise and
show relevant numbers.
5. Is your conclusion in (4) different from your conclusion in (3)? Explain the
difference, if any. Relate the idea of controlling for other variables (that is,
adding more relevant variables to the model or holding constant these other
relevant variables) to the difference between the test you did in (3) and in (4).
Sample Exam Question 2
Investment bankers earn large fees for making arrangements and giving advice relating
to mergers and acquisitions (M&A) when one firm joins with or purchases
another.
Consider the following regression on the total dollar amount of M&A activity
against the number of deals of the top 15 major firms in this industry.
Dependent Variable: Total M&A Volume (in millions of dollars) for a firm
Independent Variable: Number of Deals for the corresponding firm
Below is the regression output:
Linear regression Number of obs = 36
F( 1, 34) = 19.9
Prob > F = 0.000
R-squared = 0.604
Root MSE = 12286
------------------------------------------------------------------------------
Robust
M&AVolume | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-----------+----------------------------------------------------------------
Deals | 269.660 60.512 4.456 0.000 138.932 400.389
_cons | 1461.941 5737.309 0.254 0.802 -10932.8 13856.640
------------------------------------------------------------------------------
For all questions, assume that the linearity assumption holds.
1. Does the regression equation have significant explanatory power? Be precise
(use a specific result to explain).
2. How much of the variation in M&A volume across these firms is explained by
the number of deals that each firm handles?
3. What is the marginal increase in M&A volume attributable to an additional
deal that a firm makes (or what is the predicted difference in M&A volume
between firm B and firm A, if B has one more deal than A)? Be precise and
state the units (e.g. billions? millions?).
4. Your firm wants to be among the top players in this industry next year with
100 deals. Assuming that the estimated relationship applies to next year, what
is your best estimate of M&A volume for your firm if it achieved its goal next
year? Again, be sure to state the units (e.g. millions or billions). Note: The
question asks for your best estimate of the predicted value, so whether or not
something is statistically significant is irrelevant to this question.
Sample Exam Question 3
The movie-v2.dta dataset contains a sample of movies shown on U.S. movie screens
between 1985 and 2001. It contains the title of the movie, the year of its premiere,
the number of screens per week, the movie's total U.S. box office (revenue from ticket
sales), a binary variable indicating whether the movie was produced in the U.S. or
not, and the movie's production budget, among other information.
Type describe to see the contents of the data set.
1. Run the regression of box office onto budget and whether the movie was pro-
duced in the U.S.
2. There is a claim that movies with larger budgets generate bigger box office
revenues. Holding the location of production constant, how does an increase
of one thousand dollars in budget change the movie's box office?
3. What is the predicted box office for a movie with a 50 million dollar budget
produced in the US?
4. What is the predicted box office for a movie with the same budget produced
outside the US?
5. How much of the variation in box office is explained by the production budget
and whether a movie is produced in the US?
Yale SOM
MGT 403: Statistics
Sample Exam Questions-Answers
Administrative Details
This final is open book. You can consult your class notes, problem set solutions,
and other materials. However, you cannot discuss the exam with anyone; doing so
constitutes a violation of the honor code. Show all your work, including all the Stata
output relevant to answering the questions.1
Sample Exam Question 1
1. The average preference for brand X before the TV programs = 2.07 (the intercept).
2. The average preference for brand X after the TV programs = 2.07 + 0.55 = 2.62
(the intercept plus the coefficient on After).
3. No, the p-value of the difference is 0.18 which is greater than 0.05. Therefore
we cannot reject the null that the difference of 0.55 (After-Before Advertising)
is equal to zero.
4. Yes, now they do.
The null is H0: After = 0
The alternative is Ha: After ≠ 0
The p-value is 0.03, which is less than 0.05, so we reject the null at the 5% level.
5. Yes, it is different. The reason is that there are other characteristics, such as
past purchase, that explain a difference in preferences between consumers. By
including such variables in a regression, we have a better chance of learning
the impact of ads on brand preference.
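The answers to Question 1 can be checked with a few lines of arithmetic. The sketch below is written in Python rather than Stata and only re-uses the rounded coefficients printed in the two regression tables. The variable names are illustrative, and 1.96 is the normal-approximation critical value used elsewhere in these notes, not the exact t critical value.

```python
# Rounded values copied from the two regression outputs in Sample Exam Question 1.
CONS_1, B_AFTER_1, SE_AFTER_1 = 2.068, 0.551, 0.401   # ratings on After only
B_AFTER_2, SE_AFTER_2 = 0.521, 0.240                  # After, controlling for PPur and Male

CRITICAL = 1.96  # assumption: normal approximation to the 5% two-sided critical value

# With a single dummy regressor, the intercept is the "before" mean and
# intercept + slope is the "after" mean.
mean_before = CONS_1
mean_after = CONS_1 + B_AFTER_1

# The t statistic for H0: coefficient on After = 0 is coefficient / standard error.
t_simple = B_AFTER_1 / SE_AFTER_1
t_controls = B_AFTER_2 / SE_AFTER_2

print(f"before = {mean_before:.2f}, after = {mean_after:.2f}")
print(f"t without controls = {t_simple:.2f}, reject H0: {abs(t_simple) > CRITICAL}")
print(f"t with controls    = {t_controls:.2f}, reject H0: {abs(t_controls) > CRITICAL}")
```

Note that the coefficient on After barely moves between the two regressions (0.551 versus 0.521); what changes is its standard error (0.401 versus 0.240), which is why the t statistic crosses the critical value once the controls are added.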
Sample Exam Question 2
1. The p-value of the F-statistic is 0.000 (Stata rounded it to zero), which is less
than 0.01 or any reasonable type I error probability. Hence the regression is
highly significant and has significant explanatory power.
2. 60.4% of the variation in M&A Volume is explained by the number of deals (as
seen in the R-square).
3. The slope coefficient is 269.66. Therefore, the marginal increase in M&A vol-
ume attributable to an additional deal is $269.66 million.
4. 1,462 + 269.66 × 100 = 28,428, that is, $28,428 million or $28.428 billion.
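The prediction in answer 4 is just the fitted line evaluated at 100 deals. A minimal Python sketch of that calculation follows; the function and constant names are illustrative, and the coefficients are the rounded values shown in the regression output (volume in millions of dollars).

```python
# Rounded coefficients from the regression of M&A volume (in $ millions) on number of deals.
INTERCEPT = 1461.941
SLOPE = 269.660   # $ millions per additional deal

def predicted_volume_millions(deals: float) -> float:
    """Fitted M&A volume in millions of dollars for a given number of deals."""
    return INTERCEPT + SLOPE * deals

volume_millions = predicted_volume_millions(100)
volume_billions = volume_millions / 1000   # convert millions to billions
print(f"{volume_millions:,.0f} million = {volume_billions:.3f} billion")

# The marginal effect of one more deal is the slope, whatever the starting point.
marginal = predicted_volume_millions(11) - predicted_volume_millions(10)
print(f"one extra deal adds {marginal:.2f} million")
```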
Sample Exam Question 3
1. The regression is implemented as follows:
. regress boxoffice budget usa, robust
Linear regression Number of obs =
F( 2, 941) = 6
Prob > F = 0.
R-squared = 0.
Root MSE = 4
--------------------------------------------------------------------------
| Robust
boxoffice | Coef. Std. Err. t P>|t| [95% Conf. Inter
-------------+------------------------------------------------------------
budget | 1.032913 .1143477 9.03 0.000 .8085068 1.25
usa | 23697.24 4742.385 5.00 0.000 14390.36 3300
_cons | -10530.44 4557.711 -2.31 0.021 -19474.9 -1585
--------------------------------------------------------------------------
2. Holding the location of production constant, an increase of one thousand dollars
in budget is associated with an increase in the movie's box office of 1.03 thousand dollars.
3. The predicted box office for a movie with a budget of 50 million dollars
produced in the US is:
-10,530.4 + 1.03 × 50,000 + 23,697.2 × 1 = 64,666.8 thousand.
That is 64.6668 million dollars.
4. The predicted box office for a movie with the same budget produced outside
the US is
-10,530.4 + 1.03 × 50,000 + 23,697.2 × 0 = 40,969.6 thousand.
That is 40.9696 million dollars.
5. We can see by the R-squared that 31.5% of the variation in box office is ex-
plained by the production budget and whether a movie is produced in the
US.
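The two predictions in answers 3 and 4 differ only in the value of the usa dummy. The Python sketch below reproduces them under the same rounding the answers use (budget coefficient 1.03 rather than 1.032913); the function name is illustrative, and both box office and budget are taken to be in thousands of dollars, as the answers imply.

```python
# Rounded coefficients as used in the answers; box office and budget are in $ thousands.
CONS, B_BUDGET, B_USA = -10530.4, 1.03, 23697.2

def predicted_boxoffice(budget_thousands: float, usa: int) -> float:
    """Fitted box office in thousands of dollars; usa is 1 for U.S.-produced, else 0."""
    return CONS + B_BUDGET * budget_thousands + B_USA * usa

us_movie = predicted_boxoffice(50_000, usa=1)       # 50 million dollar budget, U.S.
non_us_movie = predicted_boxoffice(50_000, usa=0)   # same budget, produced elsewhere
print(f"US: {us_movie:,.1f} thousand; non-US: {non_us_movie:,.1f} thousand")
print(f"difference: {us_movie - non_us_movie:,.1f} thousand")
```

The gap between the two predictions equals the coefficient on usa, since the budget is held fixed. Using the unrounded budget coefficient from the Stata output would give slightly different predictions.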
MGT 403 Statistics PRACTICE PROBLEMS
MGT 403: Probability Modeling and Statistics
STATISTICS: PRACTICE PROBLEMS
This is a PRACTICE PROBLEM SET. You do NOT need to turn it in. It is optional,
for students who would like a little more experience solving problems. Solutions will be
posted.
There are 3 QUESTIONS.
Question 1
The Internet portal Yahoo is considering allowing its members to customize their start
pages (homepages). As part of a short survey regarding likes and dislikes, users were asked
about their interests in options such as QuickTime movie clips with daily news and sports
events on their pages. Yahoo hopes that QuickTime will entice users to follow a larger
number of hyperlinks so that it can attract more advertisers.
The newly customized page option with QuickTime links was made available to 100 Inter-
net users who were randomly sampled from the target population. The prior benchmark
for Yahoo has been 6 non-Yahoo content links clicked on average by its members per visit.
It collected data on the 100 users over 1 week to see if the availability of the QuickTime
link options significantly changes the average non-Yahoo links clicked per visit.
Findings: After one week of access to the new customized homepage option with QuickTime
links, Yahoo observes that the average number of non-Yahoo links for each customer
in the sample per visit is 7.8 links and the standard deviation is 9.5 links.
Answer the following:
(i). Draw a graph and test the Null Hypothesis that the customization with QuickTime
does NOT alter the average number of non-Yahoo links clicked. Use the customary
95% Confidence Interval (t critical value is 1.96). State whether you reject the null
hypothesis or not. Also compute the t statistic.
(ii). Draw a new graph and show the 95% Confidence Interval for the estimated mean
number of non-Yahoo links clicked in the sample of 100 customers. Be precise about
where, numerically, the boundaries of the Confidence Interval lie. Does the
Confidence Interval include the previous benchmark of 6 or not? How does your
answer to this last question relate to your answer to (i)?
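As a way to check the arithmetic in (i) and (ii), here is a minimal Python sketch of the one-sample calculation. It uses only the numbers stated in the problem (sample mean 7.8, standard deviation 9.5, N = 100, benchmark 6) and the critical value 1.96 given in (i); the function name and return layout are illustrative.

```python
import math

def one_sample_summary(sample_mean, sample_sd, n, null_value, critical=1.96):
    """Return (standard error, t statistic, CI lower, CI upper) for a one-sample test of the mean."""
    se = sample_sd / math.sqrt(n)                 # standard deviation of the sample mean
    t_stat = (sample_mean - null_value) / se      # distance from the null, in standard errors
    return se, t_stat, sample_mean - critical * se, sample_mean + critical * se

# Numbers stated in the problem: mean 7.8, standard deviation 9.5, N = 100, benchmark 6.
se, t_stat, lower, upper = one_sample_summary(7.8, 9.5, 100, null_value=6)
reject = abs(t_stat) > 1.96
print(f"se = {se:.2f}, t = {t_stat:.3f}, 95% CI = [{lower:.3f}, {upper:.3f}], reject H0: {reject}")
```

The two-sided test and the confidence interval agree by construction: the benchmark lies inside the interval exactly when the absolute t statistic is below the critical value.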
Question 2
You have been hired to study executive compensation patterns. Your current project ex-
amines CEO salaries in the 1990s. You are curious whether some of the popular statements
about high CEO salaries during this time period are correct. You have collected data on
CEO salaries in the 90s - the data is in the STATA dataset ceosalary.dta (available on the
class website on Canvas).
A widely read commentator of the time is known to have stated that average CEO compensation
in the 90s (your sample period) was $1.2 million. You want to test this hypothesis.
(i). Carry out the appropriate t test in STATA, just as we did in class and you did in
Problem Set 1, Question 2. What is the t value? Is the Null Hypothesis rejected or
not? What is the p value?
(ii). You can also carry out this kind of test manually in STATA. To do this, run the
command summarize salary from the command line. This will show you the mean
of salary as well as its standard deviation. To proceed, assume the distribution
for salary is a Normal distribution. Now compute the standard deviation of the
Test Statistic, which is the average over the observations. To do this, recall that the
standard deviation for the Test Statistic is:
\[ \sigma_{\bar{X}} = \frac{\hat{\sigma}}{\sqrt{N}} \]
where $\hat{\sigma}$ is the estimated standard deviation of the underlying variable, and N is the size
of the sample. Once you compute this, go out 1.96 of these standard deviations in either
direction from the sample mean to construct
the Confidence Interval. Then check whether the Null Hypothesis value lies inside the
Confidence Interval or not.
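The manual procedure in (ii) can be sketched in Python starting from raw observations. The salary figures below are HYPOTHETICAL values invented for illustration and are not taken from ceosalary.dta; the sketch also assumes salary is recorded in $1000s, so that the null value of $1.2 million is 1200. The function name is illustrative.

```python
import math
import statistics

def ci_from_observations(values, critical=1.96):
    """Mean, standard error and confidence interval computed from raw observations."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)      # sample standard deviation (divides by n - 1)
    se = sd / math.sqrt(n)             # standard deviation of the Test Statistic
    return mean, se, (mean - critical * se, mean + critical * se)

# HYPOTHETICAL salaries in $1000s, invented for illustration; NOT taken from ceosalary.dta.
made_up_salaries = [900, 1100, 1250, 1400, 1600, 950, 1300, 1500]
NULL_VALUE = 1200  # assumption: $1.2 million expressed in $1000s

mean, se, (lower, upper) = ci_from_observations(made_up_salaries)
inside = lower <= NULL_VALUE <= upper
print(f"mean = {mean:.1f}, se = {se:.1f}, CI = [{lower:.1f}, {upper:.1f}], null inside CI: {inside}")
```

With a sample as small as this made-up one, the exact t critical value would be noticeably larger than 1.96; the sketch keeps 1.96 only because the problem asks for it.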
Question 2 - Continued - Regression
Next we will run a regression to explore factors that may influence CEO Salaries. Run
a regression in STATA in which salary is the dependent variable and the independent
variables are: sales - sales of the company in the preceding few years; roe - return on
equity for the company in the preceding few years; indus - a dummy variable set to 1
if the company is in an industrial sector; finance - a dummy variable set to 1 if the
company is in the financial sector; and utility - a dummy variable set to 1 if the company
is a utility (often regulated). Interpret your results. Then run the model without sales.
What is strange (or hard to interpret) about these results compared to the results when
sales is included? How can we interpret this?
Question 3
Consider the VERY small dataset that consists of 3 datapoints:
X1 = 1.0 Y1 = 200.0
X2 = 2.0 Y2 = 145.0
X3 = 3.0 Y3 = 20.0
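Use the formulas:
\[ \hat{\beta}_1 = \frac{\sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{N} (X_i - \bar{X})^2} \]
\[ \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X} \]
\[ \bar{X} = \frac{1}{N} \sum_{i=1}^{N} X_i, \qquad \bar{Y} = \frac{1}{N} \sum_{i=1}^{N} Y_i \]
to compute $\hat{\beta}_0$ and $\hat{\beta}_1$. Then compute $resid_i$ for each datapoint and finally $R^2$.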
After you have done this calculation manually, enter these 3 datapoints into STATA (or
use the dataset Q3-practice, where this has already been done for you, on the Canvas website under
STATS/FEINSTEIN/STATA Datasets). Run the regress command and check your work.
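Before turning to STATA, the hand calculation can also be checked with a short Python sketch that applies the slope and intercept formulas from the question to the three datapoints. The variable names are illustrative, and R-squared is computed with the standard definition 1 - SSR/SST, which the question does not spell out.

```python
# The three datapoints given in Question 3.
X = [1.0, 2.0, 3.0]
Y = [200.0, 145.0, 20.0]
N = len(X)

x_bar = sum(X) / N
y_bar = sum(Y) / N

# Slope and intercept from the formulas in the question.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))
s_xx = sum((x - x_bar) ** 2 for x in X)
beta1 = s_xy / s_xx
beta0 = y_bar - beta1 * x_bar

# Residuals (actual minus fitted) and R-squared = 1 - SSR / SST.
residuals = [y - (beta0 + beta1 * x) for x, y in zip(X, Y)]
ssr = sum(e ** 2 for e in residuals)
sst = sum((y - y_bar) ** 2 for y in Y)
r_squared = 1 - ssr / sst

print(f"beta1 = {beta1:.3f}, beta0 = {beta0:.3f}")
print("residuals:", [round(e, 3) for e in residuals])
print(f"R-squared = {r_squared:.4f}")
```

A useful sanity check on any hand calculation of this kind is that the residuals from a regression with an intercept sum to zero.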