0000 Stats Practice


  • 7/24/2019 0000 Stats Practice


    Yale SOM

    MGT 403: Statistics

    Practice Problem Set - P1-1

    Introduction. You have been hired to study the evolution of executive compensation over time. Specifically, you will examine how CEOs' salaries vary between different sectors and how they are related to a company's sales in the early 1990s. You receive data on a random sample of CEOs, which is contained in ceosalary1.dta. Type describe to see the contents of this data set.

    Question 1

    (a) There are two hypotheses concerning CEO compensation in the early 1990s. One is that average CEO salaries were at most $1,000,000. Another concerns the default belief that average CEO salaries were actually $1,200,000. You want to test these two hypotheses. Note that the data to test them are contained in the variable salary (which measures CEO salary in $1000). Can you reject the null hypotheses, at the 5% level, implied by these tests? To answer this question, write down the following steps for each test:

    1. The null hypothesis

    2. The alternative hypothesis

    3. The formula for the realization of the test statistic

    4. The rejection region: for which values of the test statistic you reject the null hypothesis

    Now use the data to carry out the two tests.

    First, do it manually by typing summarize salary or using the User Menu, Summarize and Describe Data, Simple Summary Statistics (summarize) to input this command. Use the result to calculate the realization (or outcome) of the test statistics.

    Can you reject the null hypotheses? Why or why not?

    Second, check your answer by conducting a test of means in Stata. You can use Simple Test of Association, Test of Means in the Stata User menu.

    What are the p-values for each of these two tests? Based on the p-value, can you reject the null hypothesis at the 5% level for each test? Explain why.

    (b) Compute the 95% confidence interval for the unknown population mean of CEO salaries by writing down the formula for the confidence interval and using the results from the command summarize salary to compute the confidence interval.

    Does 1200 fall in the confidence interval?


    Practice Problem Set - P1-2

    The Internet portal Yahoo may allow its members to customize their start pages (homepages). As part of a short survey regarding likes and dislikes, users were asked about their interests in options such as QuickTime movie clips with daily news and sports events on their pages. Yahoo hopes that QuickTime will entice users to follow a larger number of hyperlinks so that it can attract more advertisers.

    The newly customized page option was made available to 100 Internet users who were randomly sampled from the target population. The benchmark for Yahoo is 6 non-Yahoo content links clicked by all its customers on average prior to the availability of the QuickTime option (during any one-week period).

    After one week of access to the customized homepage option, Yahoo observes the (average) number of non-Yahoo links for each customer. For the sample of 100 customers, the average is 7.8 links and the standard deviation is 9.5 links.

    1. Test the two-tailed null hypothesis that the customization with QuickTime does not alter the true average (benchmark) number of links at the 5% level (critical value is 1.96). Specify null and alternative hypotheses, compute the value for the test statistic, and state whether or not you can reject the null hypothesis and why.

    2. Construct a 95 percent confidence interval for the true but unknown population parameter. Interpret the resulting interval statistically and managerially.

    3. Which of these two procedures is more informative, the test of the null hypothesis or the confidence interval? Explain.


    Practice Problem Set P1-1 Answers

    Question 1

    (a) To test the research hypothesis that the mean of salary is at most (less than or equal to) 1000, we have

    1. The null hypothesis: \( H_0: \mu > 1000 \)

    2. The alternative hypothesis: \( H_a: \mu \le 1000 \)

    3. The formula for the realization of the test statistic: \( t = \dfrac{\bar{x} - 1000}{\hat{\sigma}/\sqrt{N}} \), where \( \bar{x} \) is the sample mean, \( \hat{\sigma} \) the sample standard deviation, and \( N \) the number of observations

    4. The rejection region: reject if \( t < -1.65 \).

    To test if the mean of salary is equal to 1200:

    1. The null hypothesis: \( H_0: \mu = 1200 \)

    2. The alternative hypothesis: \( H_a: \mu \ne 1200 \)

    3. The formula for the realization of the test statistic: \( t = \dfrac{\bar{x} - 1200}{\hat{\sigma}/\sqrt{N}} \)

    4. The rejection region: reject if \( |t| > 1.96 \) (this is the same as saying that the rejection region is \( t < -1.96 \) or \( t > 1.96 \)).

    For the manual calculation of the realization of the test statistic we need the mean of salaries in the sample, the standard deviation, and the number of observations. We get all of these from Stata's summarize command.

    . summarize salary

        Variable |        Obs        Mean    Std. Dev.       Min        Max
    -------------+---------------------------------------------------------
          salary |        206    1141.063     611.193        223       4143

    Hence the realization or value for the test statistic for the first test equals

    \[ t = \frac{1141.063 - 1000}{611.193/\sqrt{206}} = 3.313. \]

    The realization or value for the test statistic for the second test equals

    \[ t = \frac{1141.063 - 1200}{611.193/\sqrt{206}} = -1.384. \]
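
    The manual calculation above can also be checked outside Stata. The following is a minimal Python sketch (Python is not part of the original exercise, and the helper name one_sample_t is hypothetical) that plugs the summarize results into the test-statistic formula and applies the two rejection rules stated earlier.

```python
import math

def one_sample_t(mean, sd, n, mu0):
    """Test statistic (mean - mu0) / (sd / sqrt(n)) for a one-sample test of the mean."""
    standard_error = sd / math.sqrt(n)
    return (mean - mu0) / standard_error

# Summary statistics reported by `summarize salary` (salary is in $1000)
n, mean, sd = 206, 1141.063, 611.193

t_1000 = one_sample_t(mean, sd, n, 1000)  # first test, benchmark 1000
t_1200 = one_sample_t(mean, sd, n, 1200)  # second test, benchmark 1200

# Rejection rules from the text (large-sample critical values)
reject_first = t_1000 < -1.65
reject_second = abs(t_1200) > 1.96

print(round(t_1000, 3), "reject:", reject_first)
print(round(t_1200, 3), "reject:", reject_second)
```

    Both flags come out False, which matches the conclusion that neither null hypothesis can be rejected at the 5% level.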


    For the first test, since the value for t is 3.3, which is not smaller than -1.65, we cannot reject the null hypothesis in favor of the alternative that the mean of salaries in the population of CEOs is at most $1,000,000 at the 5% level. For the second test, since the value for the test statistic t of -1.384 is not in the rejection region of t < -1.96 or t > 1.96, we also cannot reject the null hypothesis that the mean of salaries in the population of CEOs is equal to $1,200,000 at the 5% level.

    We get the same results using the ttest command in Stata. Note that when Stata sets as a default "95% confidence level", it is just asking you if you would like to see the 95% confidence interval for the unknown population mean of CEO salaries together with the value for t.

    . ttest salary == 1000

    One-sample t test
    ------------------------------------------------------------------------------
    Variable |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
    ---------+--------------------------------------------------------------------
      salary |     206    1141.063    42.58383     611.193    1057.105    1225.022
    ------------------------------------------------------------------------------
        mean = mean(salary)                                           t =   3.3126
    Ho: mean = 1000                                  degrees of freedom =      205

        Ha: mean < 1000             Ha: mean != 1000               Ha: mean > 1000
     Pr(T < t) = 0.9995         Pr(|T| > |t|) = 0.0011          Pr(T > t) = 0.0005

    . ttest salary == 1200

    One-sample t test
    ------------------------------------------------------------------------------
    Variable |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
    ---------+--------------------------------------------------------------------
      salary |     206    1141.063    42.58383     611.193    1057.105    1225.022
    ------------------------------------------------------------------------------
        mean = mean(salary)                                           t =  -1.3840
    Ho: mean = 1200                                  degrees of freedom =      205

        Ha: mean < 1200             Ha: mean != 1200               Ha: mean > 1200
     Pr(T < t) = 0.0839         Pr(|T| > |t|) = 0.1679          Pr(T > t) = 0.9161

    The p-value for the first test is 0.9995. Since the p-value is greater than 5%, we cannot reject the null hypothesis in favor of the alternative that average salaries in the population of CEOs are, at most, $1,000,000 at the 5% level. The p-value for the second test is 0.1679. Since the p-value is greater than 5%, we also cannot reject the null hypothesis that average salaries in the population of CEOs are $1,200,000 at the 5% level.

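    (b) The formula for the 95% confidence interval for the mean of salary in our CEO population is

    \[ \left[ \bar{x} - 1.96\,\frac{\hat{\sigma}}{\sqrt{N}},\; \bar{x} + 1.96\,\frac{\hat{\sigma}}{\sqrt{N}} \right]. \]

    With the results from summarize salary above we get

    \[ \left[ 1141.063 - 1.96\,\frac{611.193}{\sqrt{206}},\; 1141.063 + 1.96\,\frac{611.193}{\sqrt{206}} \right] = [1057.6, 1224.5]. \]

    Thus we are 95% confident that the true mean of CEO salaries is between 1057.6 and 1224.5 (in $1000). Note that the interval contains 1200 and that it is almost equal to the interval given in the results for the command ttest salary == 1200. We would have obtained the same values had we used the exact critical value of the t distribution with 205 degrees of freedom (about 1.97) instead of the rounded value 1.96.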
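
    As a cross-check on the interval above, here is a short Python sketch (not part of the original exercise; the function name mean_confidence_interval is hypothetical). It evaluates the formula with the critical value 1.96 and, for comparison, with 1.9716, which is assumed here as an approximation to the 97.5th percentile of the t distribution with 205 degrees of freedom.

```python
import math

def mean_confidence_interval(mean, sd, n, critical=1.96):
    """Interval mean +/- critical * sd / sqrt(n)."""
    half_width = critical * sd / math.sqrt(n)
    return mean - half_width, mean + half_width

# Summary statistics from `summarize salary` (salary is in $1000)
n, mean, sd = 206, 1141.063, 611.193

# Large-sample interval with the critical value 1.96 used in the text
low, high = mean_confidence_interval(mean, sd, n)

# Assumed approximate t critical value for 205 degrees of freedom
low_t, high_t = mean_confidence_interval(mean, sd, n, critical=1.9716)

print(round(low, 1), round(high, 1))
print(round(low_t, 1), round(high_t, 1))
print("1200 inside interval:", low <= 1200 <= high)
```

    The second interval should land very close to the one Stata reports in the ttest output, which illustrates where the small discrepancy comes from.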


    Practice Problem Set - P1-2 - Answers


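    1. Null hypothesis: \( H_0: \mu = 6 \)

    Alternative hypothesis: \( H_a: \mu \ne 6 \)

    Computing the value for the test statistic, with \( \mu_0 = 6 \) the benchmark value under the null hypothesis:

    \[ t = \frac{\bar{x} - \mu_0}{\sqrt{\hat{\sigma}^2/N}} = \frac{7.8 - 6}{\sqrt{9.5^2/100}} = 1.89 \]

    Given that t = 1.89 is not in the rejection region for a two-sided test at the 5% level (t > 1.96 or t < -1.96), we cannot reject the null hypothesis.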


    2. The 95% confidence interval for the true but unknown mean of the number of links is:

    \[ 7.8 \pm 1.96 \times \frac{9.5}{10} = 7.8 \pm 1.9 = (5.9, 9.7) \]

    Therefore, we can be quite sure or confident (95 percent) that the true but unknown population mean is between 5.9 and 9.7 links.

    3. Neither is more informative than the other; it depends on the type of question one is asking. The confidence interval gives us a range into which we are 95% confident that the population mean falls. It is useful when we want to get a sense of what the population mean could be. A hypothesis test, in contrast, allows us to answer a different question: whether a specific hypothesis about the population is true (supported by the data) or not.
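
    The link between the two procedures can be illustrated numerically. The sketch below is an illustrative addition in Python, not part of the original answer; it recomputes the test statistic and the interval from the figures in the problem statement and also reports a two-sided p-value under the large-sample normal approximation, which the original answer does not compute.

```python
import math
from statistics import NormalDist

# Summary figures from the problem statement
n, sample_mean, sample_sd, benchmark = 100, 7.8, 9.5, 6.0

standard_error = sample_sd / math.sqrt(n)
z = (sample_mean - benchmark) / standard_error

# Two-sided p-value under the large-sample normal approximation
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

# 95% confidence interval with the critical value 1.96 given in the question
ci_low = sample_mean - 1.96 * standard_error
ci_high = sample_mean + 1.96 * standard_error

reject_by_test = abs(z) > 1.96
benchmark_outside_ci = not (ci_low <= benchmark <= ci_high)

print(f"test statistic = {z:.2f}, p-value = {p_value:.3f}")
print(f"95% CI = ({ci_low:.1f}, {ci_high:.1f})")
print("reject H0:", reject_by_test, "| benchmark outside CI:", benchmark_outside_ci)
```

    The two flags always agree for a two-sided test at the matching level: the benchmark of 6 lies inside the interval exactly when the test fails to reject.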


    Practice Problem Set P2-1

    Introduction. You have been asked to analyze the relationship between research and development (R&D) spending and sales of firms in the chemical and telecommunications industries. You receive data on a random sample of firms contained in the data set rd.dta. Type describe to see the contents of the data set. The binary (dummy) variable chem is equal to one if the firm is in the chemical industry and equal to zero if the firm is in the telecommunications industry.

    Question 1

    1. Run the regression of sales as a function of R&D.

    2. Does the estimated coefficient suggest that sales and R&D spending are positively or negatively correlated?

    3. By how much do sales increase or decrease on average when R&D spending increases by one million dollars?

    4. Is this effect significantly different from zero at the 5% level and why?

    5. What is the interpretation of the estimate for the intercept parameter \( \beta_0 \)?

    6. How much does the variation in R&D spending explain the variation in sales?

    Question 2

    After analyzing the relationship between prices, you are asked how the returns of the DJIA and GE are related: when one increases, does the other decrease, or vice versa? Or when one increases, does the other also increase? To investigate this question, you first have to generate the returns using the User menu command Manipulate Variables and Obs, Generate New Variable and the formula

    \[ \text{return}_t = 100 \times \frac{\text{price}_t - \text{price}_{t-1}}{\text{price}_{t-1}}. \]

    Since each observation in our data set represents one date and the observations are chronologically sorted, we can implement this formula in Stata by 100 * (close_DJIA - close_DJIA[_n-1]) / close_DJIA[_n-1], for example, for DJIA. Here [_n-1] means that we are taking the observation from the previous period. Do the same for GE returns.

    1. Plot the relationship between the return for the GE stock and the return for DJIA. Is the relationship increasing or decreasing?

    2. We can define the beta of a given stock as \( \beta_a = \dfrac{\mathrm{cov}(\text{return}_a, \text{return}_p)}{\mathrm{var}(\text{return}_p)} \), where \( \text{return}_a \) and \( \text{return}_p \) are the returns of the stock in question and the stock market index, respectively, and the risk-free rate is constant over time. Given the previous plot, should the beta of the GE stock be positive or negative? Explain why.


    Practice Problem Set P2-1 Answers

    Question 1

    1. The Stata command for the regression is regress sales rd, robust, which yields the following output:

    Linear regression                               Number of obs =      61
                                                    F( 1, 59)     =   42.28
                                                    Prob > F      =  0.0000
                                                    R-squared     =  0.7971
                                                    Root MSE      =    3542

    ------------------------------------------------------------------------------
                 |               Robust
           sales |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
              rd |   18.00799   2.769413     6.50   0.000     12.46641    23.54957
           _cons |   1040.966    397.866     2.62   0.011     244.8379    1837.094
    ------------------------------------------------------------------------------

    2. The estimated parameter \( \beta_1 \) equals about 18, a positive number, which suggests that R&D spending and sales are positively correlated.

    3. When R&D spending increases by one million dollars, predicted sales increase by about 18 million dollars.

    4. This effect is significantly different from zero at the 5% level since the value of the t-statistic reported in the regression output is equal to 6.5, which is in the rejection region of t < -1.96 or t > 1.96. Accordingly, the p-value is lower than 5%, at nearly zero.

    5. The estimate for the intercept, \( \beta_0 \), is 1041, which implies that predicted sales are 1041 million dollars when R&D spending equals zero.

    6. The R-squared value tells us that variation in R&D spending explains about 80% of the variation in sales. The fit is quite good.
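
    The readings in items 2 to 5 follow mechanically from the coefficient table. The Python sketch below is an illustrative addition (the function name predicted_sales and the R&D value of 100 are hypothetical); it recomputes the t-statistic as coefficient divided by standard error and evaluates the fitted line.

```python
# Estimates copied from the regress output above
INTERCEPT = 1040.966   # _cons
SLOPE = 18.00799       # coefficient on rd
SLOPE_SE = 2.769413    # robust standard error of the rd coefficient

def predicted_sales(rd_spending):
    """Fitted sales (in millions of dollars) for R&D spending given in millions."""
    return INTERCEPT + SLOPE * rd_spending

# t-statistic for H0: slope = 0, compared with the large-sample critical value
t_slope = SLOPE / SLOPE_SE
significant = abs(t_slope) > 1.96

print(round(t_slope, 2), significant)
print(predicted_sales(0))     # equals the intercept
print(predicted_sales(100))   # hypothetical firm spending 100 million on R&D
```

    Raising rd_spending by one unit always changes the prediction by SLOPE, which is the "18 million dollars" interpretation in item 3.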


    Question 2

    First generate the returns for DJIA and GE using the following commands:

    generate return_DJIA = 100 * (close_DJIA - close_DJIA[_n-1]) / close_DJIA[_n-1]

    generate return_GE = 100 * (close_GE - close_GE[_n-1]) / close_GE[_n-1]

    1. We plot the returns of GE against the returns of DJIA using the Stata command twoway (scatter return_GE return_DJIA). See Figure 3.

    Figure 1: Question I.4

    2. From the graph we see that the relationship between GE returns and DJIA returns is positive: higher DJIA returns are associated with higher GE returns. Thus the covariance in this sample between GE and DJIA returns is positive, and so is the numerator for beta, which measures the covariance between GE returns and DJIA returns. A variance is never negative (recall that a variance involves sums of squared terms, and squared terms are always nonnegative), so the denominator for beta is positive. This implies that the beta of the GE stock in this sample is positive.
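
    The sign argument above can be tried out on numbers. The following Python sketch is an illustrative addition: the price series are made up for demonstration and are not the course data, and the function names are hypothetical. It computes percentage returns with the same formula as the generate commands and then beta as sample covariance divided by sample variance.

```python
def percent_returns(prices):
    """100 * (p_t - p_{t-1}) / p_{t-1} for every date after the first."""
    return [100 * (curr - prev) / prev for prev, curr in zip(prices, prices[1:])]

def sample_mean(values):
    return sum(values) / len(values)

def beta(stock_returns, market_returns):
    """Sample covariance of stock and market returns over sample variance of market returns."""
    mean_s = sample_mean(stock_returns)
    mean_m = sample_mean(market_returns)
    n = len(market_returns)
    cov = sum((s - mean_s) * (m - mean_m)
              for s, m in zip(stock_returns, market_returns)) / (n - 1)
    var = sum((m - mean_m) ** 2 for m in market_returns) / (n - 1)
    return cov / var

# Hypothetical closing prices, for illustration only
close_index = [100.0, 102.0, 101.0, 104.0, 103.0]
close_stock = [50.0, 52.0, 51.0, 54.0, 53.0]

return_index = percent_returns(close_index)
return_stock = percent_returns(close_stock)

print(return_index)
print(beta(return_stock, return_index))
```

    Unlike Stata, which leaves the first observation's return missing, the list returned here simply has one fewer element than the price series. Because the two made-up series move together, the covariance and hence beta come out positive.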

    Practice Problem Set P2-2

    A consulting firm wants to get a better understanding of its cost structure based on data on costs incurred for projects in the past so as to improve its bidding process for projects. Experience suggests that there are two main components of costs in a project: (1) variable costs that are directly related to the size of the project, which is reasonably proxied by the number of person-hours for the project, and (2) fixed costs, which are incurred irrespective of the size of the project.

    A regression of the total costs (in $) against the number of person-hours based on data on 42 projects gave the following results:

    Linear regression                               Number of obs =      42
                                                    F( 1, 40)     =   157.8
                                                    Prob > F      =   0.000
                                                    R-squared     =    0.87
                                                    Root MSE      =    2979

    ------------------------------------------------------------------------------
                 |               Robust
       totalcost |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
    Person-hours |     372.15     29.629    12.6    0.000        311.0       433.3
           _cons |    3209.76   1387.962    2.31    0.030        345.1      6074.4
    ------------------------------------------------------------------------------


    1. Test the null hypothesis that the slope parameter is zero. State the hypotheses in appropriate symbols, state the p-value, and interpret the result.

    2. Define the 95 percent confidence interval for the true slope parameter and interpret this interval.

    3. Assuming that the equation is a reasonable approximation of the nature of project costs, interpret the slope coefficient precisely in a manner understandable by a layperson.

    4. What is the best estimate of fixed costs?

    5. What is the predicted total cost for a project that will employ 1,000 person-hours?

    Practice Problem Set P2-2 - Answers

    1. \( H_0: \beta_1 = 0 \)

    \( H_a: \beta_1 \ne 0 \)

    The p-value for the test is 0.000.

    Since the p-value is less than 5%, we can reject the null hypothesis at the 5% level.

    We can therefore state that there is a relationship between the total costs of a project and the number of person-hours required to complete the project, and that this relationship is highly significant (given our data, it is very unlikely that no such relationship exists).

    2. The 95% confidence interval for the slope parameter is (311.0, 433.3). We can state with 95% confidence that each additional person-hour raises the total cost of a project by between $311.0 and $433.3.

    3. If the number of person-hours for a project increases by one hour, the total cost of the project is expected to increase by $372.

    4. The best estimate of fixed costs is given by the intercept: the cost of the project when the number of person-hours is zero. This cost is $3,210.

    5. The predicted total cost for a project that will employ 1,000 person-hours is

    \[ 3{,}210 + 372 \times 1{,}000 = \$375{,}210 \]
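
    The prediction in item 5 is a direct evaluation of the fitted line. The Python sketch below is an illustrative addition (the function name predicted_total_cost is hypothetical); it reproduces the figure with the rounded coefficients used in the answer and shows what the unrounded coefficients from the regression table would give.

```python
def predicted_total_cost(person_hours, intercept, slope):
    """Fitted total cost in dollars: fixed cost plus cost per person-hour times hours."""
    return intercept + slope * person_hours

# Rounded coefficients, as used in the written answer
cost_rounded = predicted_total_cost(1000, intercept=3210, slope=372)

# Unrounded coefficients from the regression table
cost_unrounded = predicted_total_cost(1000, intercept=3209.76, slope=372.15)

# t-statistic for the slope, coefficient divided by its robust standard error
t_slope = 372.15 / 29.629

print(cost_rounded)
print(round(cost_unrounded, 2))
print(round(t_slope, 2))
```

    The gap between the two predictions shows how much rounding the slope matters once it is multiplied by 1,000 hours.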


    Practice Problem Set P3-1

    Jim Douglas, the manager of Colonial Furniture, has been reviewing weekly advertising expenditures. All of his advertising thus far has been focused on radio. He is interested in learning how the effect of advertising might differ across different media. He recorded the following variables:

    Sales: Number of customers in each week (individuals visiting an outlet)

    # Ads: The number of ads in the week

    Medium (1=radio, 2=television).

    1. Jim recalled from a class he had taken that regression analysis could be used to estimate the effects of the different media. He proposed the following regression model:

    \[ \text{Sales} = \beta_0 + \beta_1\,\text{Ads} + \beta_2\,\text{Medium} \]

    and he seeks your advice. Would you propose an alternative model? If so, explain the problem with Jim's model. Write out your proposed model explicitly in the form of an equation.

    2. Jim then created one indicator variable: Radio (1 if radio, 0 otherwise, that is, television). He then ran a regression of Sales against # Ads and Radio. The results are reported below:


    Linear regression                               Number of obs =      52
                                                    F( 2, 49)     =   14.91
                                                    Prob > F      =  0.0000
                                                    R-squared     =    0.69
                                                    Root MSE      =   44.87

    --------------------------------------------------------------------------
                 |               Robust
           sales |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Inter
    -------------+------------------------------------------------------------
             Ads |         25       3.98    6.34     0.00      17.23     33.2
           Radio |        -47      16.44   -2.83     0.01     -79.64    -13.5
           _cons |        283      17.46   16.19     0.00     247.50    317.6
    --------------------------------------------------------------------------

    Interpret the effect for Radio precisely.

    3. What is the effect of a 1 unit increase in the number of ads? State this result precisely.

    4. What is the 95% prediction interval for sales when the company airs 50 ads on television?


    Practice Problem Set P3-1-Answers

    1. Medium is a categorical variable (with values 1 or 2), so I would not include it directly. I would create a dummy variable for the Medium category, for example Radio (1 if radio; 0 otherwise, that is, TV).

    I would then estimate the model (treating TV as the base category):

    \[ \text{Sales} = \beta_0 + \beta_1\,\text{Ads} + \beta_2\,\text{Radio} \]

    2. For any given level of advertising (number of ads), radio ads are expected to produce 47 fewer customers relative to ads shown on TV.

    3. An increase in the number of ads per week by 1 is expected to increase the number of customers per week by 25, holding constant the medium through which the ads are transmitted.

    4. To answer this question, we first need to compute the predicted sales value when airing 50 ads on television:

    \[ \text{Sales} = 283 + 25 \times 50 + (-47) \times 0 = 1{,}533 \text{ customers} \]

    Then the 95% prediction interval for sales is:

    \[ 1{,}533 \pm 1.96 \times 44.87 = (1{,}445.1;\ 1{,}620.9) \]
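
    The role of the dummy variable in this prediction can be made concrete with a small Python sketch (an illustrative addition; the function name predicted_customers is hypothetical). It evaluates the fitted equation for television and for radio and forms the approximate interval from the root MSE, as the answer above does. Note that this approximation uses only the root MSE and ignores the estimation uncertainty in the coefficients.

```python
# Coefficients and root MSE copied from the regression output
INTERCEPT, COEF_ADS, COEF_RADIO = 283, 25, -47
ROOT_MSE = 44.87

def predicted_customers(ads, radio):
    """Fitted weekly customers; radio is 1 for radio ads and 0 for television."""
    return INTERCEPT + COEF_ADS * ads + COEF_RADIO * radio

tv_prediction = predicted_customers(ads=50, radio=0)
radio_prediction = predicted_customers(ads=50, radio=1)

# Approximate 95% prediction interval: point prediction +/- 1.96 * root MSE
half_width = 1.96 * ROOT_MSE
interval = (tv_prediction - half_width, tv_prediction + half_width)

print(tv_prediction, radio_prediction)
print(round(interval[0], 1), round(interval[1], 1))
```

    The difference between the two predictions equals the Radio coefficient for any number of ads, which is the "holding the number of ads constant" reading from item 2.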


    Practice Problem Set P3-2

    Data Set and Questions

    You have been hired to investigate the relationship between individuals' physical attractiveness and their wage. You receive the data set beauty3.dta, which contains data on the wage and other characteristics, such as education and years of experience, for a random sample of individuals.

    The data set also contains the variable looks, which measures a given individual's subjective physical attractiveness. The variable looks encompasses five categories, where 5 denotes the highest level of attractiveness and 1 denotes the lowest level of attractiveness. The binary zero-one variable belavg is derived from looks: belavg is equal to 1 if looks equals 1 or 2 and 0 otherwise.

    1. By how much more/less do individuals with below-average looks earn per hour, on average, relative to individuals with average/above-average looks? Run the appropriate regression and answer the question.

    2. Is the above estimate significant at the 5% level?

    3. Does experience attenuate the "looks" advantage? Run the appropriate regression and answer the question.

    4. How much of the variation in wages is explained by the variation in looks and experience?


    Statistics Practice PS 3-2 Answers

    1. Regression: The Stata command is regress wage belavg, robust, which yields the following output:

    Linear regression                               Number of obs =
                                                    F( 1, 1257)   = 1
                                                    Prob > F      = 0.
                                                    R-squared     = 0.
                                                    Root MSE      = 4.

    --------------------------------------------------------------------------
                 |               Robust
            wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Inter
    -------------+------------------------------------------------------------
          belavg |  -1.118143   .3120741   -3.58    0.000    -1.730386   -.505
           _cons |   6.387627    .128631   49.66    0.000     6.135272    6.63
    --------------------------------------------------------------------------

    Individuals with below-average looks earn 1.12 dollars less per hour, on average, than individuals with average or above-average looks.

    2. The coefficient on belavg of -1.12 is significant at the 5% level since the value of the t-statistic (-3.58) is lower than the critical value of -1.96, which implies that one can reject the null hypothesis at the 5% level. Recall that the rejection region for a large-sample two-sided test that a regression coefficient is equal to zero, at the 5% level, is t < -1.96 or t > 1.96.


    3. regress wage belavg exper, robust

    Linear regression                               Number of obs =
                                                    F( 2, 1256)   = 5
                                                    Prob > F      = 0.
                                                    R-squared     = 0.
                                                    Root MSE      = 4.

    --------------------------------------------------------------------------
                 |               Robust
            wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Inter
    -------------+------------------------------------------------------------
          belavg |  -1.270895     .29984   -4.24    0.000    -1.859137   -.682
           exper |   .0966187   .0093563   10.33    0.000      .078263    .114
           _cons |   4.646653   .1693257   27.44    0.000     4.314461    4.97
    --------------------------------------------------------------------------

    It seems that experience does not attenuate the advantage of looks. Holding

    experience constant, those with below-average looks earn 1.27 dollars less per hour

    than those whose looks are not below average. Further, this coefficient is statistically

    significant at the 5% level as the p-value of 0.000 is less than 5%.

    4. The variation in looks and experience explains only 8% of the variation in wages.
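The significance checks in answers 2 and 3 can be reproduced from the reported coefficients and robust standard errors. The following Python sketch is only an illustration and is not part of the course's Stata workflow; it assumes the large-sample critical value of 1.96, and the function names are hypothetical.

```python
# Large-sample two-sided t test for a regression coefficient.
# Inputs are the coefficients and robust standard errors reported above.
CRITICAL_VALUE = 1.96  # assumed large-sample 5% critical value


def t_statistic(coef, std_err):
    """t statistic for H0: coefficient = 0."""
    return coef / std_err


def reject_at_5_percent(t_value):
    """Reject H0 when t falls in the region t < -1.96 or t > 1.96."""
    return abs(t_value) > CRITICAL_VALUE


def approx_confidence_interval(coef, std_err):
    """Approximate 95% interval: coefficient +/- 1.96 standard errors."""
    half_width = CRITICAL_VALUE * std_err
    return coef - half_width, coef + half_width


t_simple = t_statistic(-1.118143, 0.3120741)  # belavg, model without exper
t_multiple = t_statistic(-1.270895, 0.29984)  # belavg, holding exper constant

print(round(t_simple, 2), reject_at_5_percent(t_simple))
print(round(t_multiple, 2), reject_at_5_percent(t_multiple))
print(approx_confidence_interval(-1.270895, 0.29984))
```

The approximate interval is slightly narrower than the one Stata reports, because Stata uses the critical value of the t distribution with the regression's degrees of freedom rather than 1.96.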

    Yale SOM

    MGT 403: Statistics

    Sample Exam Questions

    Administrative Details

    This final is open book. You can consult your class notes, problem set solutions

    and other materials. But you cannot discuss the exam with anyone. This

    constitutes a violation of the honor code. Show all your work, including all the Stata

    output relevant to answer the questions.1

    Sample Exam Question 1

    The Nielsen Media organization conducts tests of commercials in its laboratories.

    The firm regularly invites members of identified target markets to its premises. At-

    tendees are shown one or more television programs in which commercials are embed-

    ded, and asked questions about products and other aspects both before and after

    they view programs.

    Each study is typically sponsored by a single company such as Procter & Gamble

    (P&G). On November 29, 2002, Nielsen Media Research did a study on a brand

    that was not performing well in the market. P&G was interested in whether new

    commercials it proposes to air might change target members' preferences for the

    brand.

    A total of 32 consumers participated in the study. They first provided preference

    and perception data on multiple brands. Then they watched two TV programs

    with a standard number of commercials. Thereafter they provided preference and

    perception data on some of the same brands and other brands. (Researchers also

    obtained brain scanner analyses based on principles of neuromarketing but those

    data are ignored here.)

    1 Though all the sample questions already show the Stata output, you will have to create your

    own Stata output when answering the questions in the exam.

    The data of interest pertain to brand X on which consumer preferences were

    obtained both before and after the TV programs (with relevant commercials on brand

    X as part of the TV program). The rating was on a 5 point scale, where 5=great

    and 1=lousy. The sample data about the preferences of the brand are summarized

    below:

    1. You have taken a regression course and want to use this "promising analytical

    technique". You create a dependent variable with 32 "before" and 32 "after"

    preference scores for brand X. The regression includes only one dummy (in-

    dicator) variable, AFTER, to distinguish the two categories of observations:

    AFTER = 1 if after, 0 if before.

    Linear regression Number of obs = 64

    F( 1, 62) = 1.89

    Prob > F = 0.17

    R-squared = 0.18

    Root MSE = 1.53

    --------------------------------------------------------------------------

    Robust

    ratings | Coef. Std. Err. t P>|t| [95% Conf. Inter

    -----------+--------------------------------------------------------------

    After | 0.551 0.401 1.375 0.175 -0.251 1.3

    _cons | 2.068 0.283 7.293 0.000 1.500 2.6

    --------------------------------------------------------------------------

    What is the average preference in the sample for brand X before the TV programs?

    2. What is the average preference in the sample for brand X after the TV programs?

    3. Based on the above regression, do the ads for brand X have a statistically

    significant effect on the average preference for the brand in the target market?

    Be precise and show relevant numbers.

    4. Your exposure to regression analysis suggests that it may be useful to include

    other variables so as to improve the understanding of effects of interest. So

    you decide to add two independent variables: PPur = 1, if the consumer has

    purchased the product in the past, = 0 if not; Male = 1 if male, = 0 if female.

    Linear regression Number of obs = 64

    F( 1, 60) = 35.5

    Prob > F = 0.00

    R-squared = 0.81

    Root MSE = 0.92

    --------------------------------------------------------------------------

    Robust

    ratings | Coef. Std. Err. t P>|t| [95% Conf. Inter

    -----------+--------------------------------------------------------------

    After | 0.521 0.240 2.170 0.030 0.051 0.9

    PPur | 2.452 0.254 9.649 0.000 1.942 2.9

    Male | -0.047 0.201 -0.234 0.815 -0.441 0.3

    _cons | 0.742 0.237 3.126 0.002 0.266 1.2

    --------------------------------------------------------------------------

    Based on the above analysis, do the ads have a statistically significant effect

    on the average preference for the brand in the target market? Be precise and

    show relevant numbers.

    5. Is your conclusion in (4) different from your conclusion in (3)? Explain the

    difference, if any. Relate the idea of controlling for other variables (that is,

    adding more relevant variables to the model or holding constant these other

    relevant variables) to the difference between the test you did in (3) and in (4).

    Sample Exam Question 2

    Investment bankers earn large fees for making arrangements and giving advice relating

    to mergers and acquisitions (M&A) when one firm joins with or purchases another.

    Consider the following regression on the total dollar amount of M&A activity

    against the number of deals of the top 15 major firms in this industry.

    Dependent Variable: Total M&A Volume (in millions of dollars) for a firm

    Independent Variable: Number of Deals for the corresponding firm

    Below is the regression output:

    Linear regression Number of obs = 36

    F( 1, 34) = 19.9

    Prob > F = 0.000

    R-squared = 0.604

    Root MSE = 12286

    ------------------------------------------------------------------------------

    Robust

    M&AVolume | Coef. Std. Err. t P>|t| [95% Conf. Interval]

    -----------+----------------------------------------------------------------

    Deals | 269.660 60.512 4.456 0.000 138.932 400.389

    _cons | 1461.941 5737.309 0.254 0.802 -10932.8 13856.640

    ------------------------------------------------------------------------------

    For all questions, assume that the linearity assumption holds.

    1. Does the regression equation have significant explanatory power? Be precise

    (use a specific result to explain).

    2. How much of the variation in M&A volume across these firms is explained by

    the number of deals that each firm handles?

    3. What is the marginal increase in M&A volume attributable to an additional

    deal that a firm makes (or what is the predicted difference in M&A volume

    between firm B and firm A, if B has one more deal than A)? Be precise and

    state the units (e.g. billions? millions?).

    4. Your firm wants to be among the top players in this industry next year with

    100 deals. Assuming that the estimated relationship applies to next year, what

    is your best estimate of M&A volume for your firm if it achieved its goal next

    year? Again, be sure to state the units (e.g. millions or billions). Note: The

    question asks for your best estimate of the predicted value, so whether or not

    something is statistically significant is irrelevant to this question.

    Sample Exam Question 3

    The movie-v2.dta dataset contains a sample of movies shown on U.S. movie screens

    between 1985 and 2001. It contains the title of the movie, the year of its premiere,

    the number of screens per week, the movie's total U.S. box office (revenue from ticket

    sales), a binary variable indicating whether the movie was produced in the U.S. or

    not, and the movie's production budget, among other information.

    Type describe to see the contents of the data set.

    1. Run the regression of box office onto budget and whether the movie was pro-

    duced in the U.S.

    2. There is a claim that movies with larger budgets generate bigger box office

    revenues. Holding the location of production constant, how does an increase

    of one thousand dollars in budget change the movie's box office?

    3. What is the predicted box office for a movie with a 50 million dollar budget

    produced in the US?

    4. What is the predicted box office for a movie with the same budget produced

    outside the US?

    5. How much of the variation in box office is explained by the production budget

    and whether a movie is produced in the US?

    Yale SOM

    MGT 403: Statistics

    Sample Exam Questions-Answers

    Sample Exam Question 1

    1. The average preference for brand X before TV programs=2.07 (intercept)

    2. The average preference for brand X after TV programs=2.07+0.55=2.62

    3. No, the p-value of the difference is 0.18 which is greater than 0.05. Therefore

    we cannot reject the null that the difference of 0.55 (After-Before Advertising)

    is equal to zero.

    4. Yes, now they do.

    The null is $H_0: \beta_{After} = 0$

    The alternative is $H_a: \beta_{After} \neq 0$

    The p-value is 0.03

    5. Yes, it is different. The reason is that there are other characteristics, such as

    past purchase, that explain a difference in preferences between consumers. By

    including such variables in a regression, we have a better chance of learning

    the impact of ads on brand preference.
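Answers 1 to 4 all follow from the two regression tables by simple arithmetic. The short Python sketch below is an illustration rather than the course's method; it takes the coefficients and robust standard errors as printed in the tables and assumes the large-sample critical value of 1.96.

```python
# Group means implied by a regression on a single dummy variable (After).
intercept = 2.068   # _cons: mean rating when After = 0 (before the programs)
after_coef = 0.551  # After: difference in mean ratings (after minus before)

mean_before = intercept
mean_after = intercept + after_coef

# t statistics for the After coefficient in the two regressions,
# computed as coefficient / robust standard error.
t_without_controls = 0.551 / 0.401  # regression with After only
t_with_controls = 0.521 / 0.240     # regression with After, PPur and Male

CRITICAL_VALUE = 1.96  # assumed large-sample 5% critical value

print(round(mean_before, 2), round(mean_after, 2))
print(round(t_without_controls, 2), abs(t_without_controls) > CRITICAL_VALUE)
print(round(t_with_controls, 2), abs(t_with_controls) > CRITICAL_VALUE)
```

The estimated effect of the ads is nearly the same in both regressions (0.551 versus 0.521); what changes is the standard error, which falls from 0.401 to 0.240 once past purchase is held constant. The t statistic recomputed here for the first regression differs from the printed one in the third decimal because the table entries are rounded.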

    Sample Exam Question 2

    1. The p-value of the F-statistic is 0.000 (Stata rounded it to zero), which is less

    than 0.01 or any reasonable type I error probability. Hence the regression is

    highly significant and has significant explanatory power.

    2. 60.4% of the variation in M&A Volume is explained by the number of deals (as

    seen in the R-square).

    3. The slope coefficient is 269.66. Therefore, the marginal increase in M&A vol-

    ume attributable to an additional deal is $269.66 million.

    4. 1,462+269.66*100=$28,428 million or $28.428 billion.
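The prediction in answer 4 and the marginal effect in answer 3 can be checked with a few lines of Python. This is a minimal sketch using the intercept and slope from the regression table; the function name `predicted_volume` is hypothetical.

```python
def predicted_volume(deals, intercept=1461.941, slope=269.660):
    """Predicted M&A volume in millions of dollars for a given number of deals."""
    return intercept + slope * deals


volume_millions = predicted_volume(100)
marginal_effect = predicted_volume(101) - predicted_volume(100)

print(round(volume_millions))            # predicted volume, millions of dollars
print(round(volume_millions / 1000, 3))  # the same prediction in billions
print(round(marginal_effect, 2))         # change from one additional deal
```

Because the model is linear, the difference between any two predictions one deal apart equals the slope coefficient, whatever the starting number of deals.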

    Sample Exam Question 3

    1. The regression is implemented as follows:

    . regress boxoffice budget usa, robust

    Linear regression Number of obs =

    F( 2, 941) = 6

    Prob > F = 0.

    R-squared = 0.

    Root MSE = 4

    --------------------------------------------------------------------------

    | Robust

    boxoffice | Coef. Std. Err. t P>|t| [95% Conf. Inter

    -------------+------------------------------------------------------------

    budget | 1.032913 .1143477 9.03 0.000 .8085068 1.25

    usa | 23697.24 4742.385 5.00 0.000 14390.36 3300

    _cons | -10530.44 4557.711 -2.31 0.021 -19474.9 -1585

    --------------------------------------------------------------------------

    2. Holding the location of production constant, an increase in one thousand dollars

    in budget increases the movie's box office by 1.03 thousand dollars.

    3. The predicted box office for a movie with a budget of 50 million dollars

    produced in the US is:

    $-10{,}530.4 + 1.03 \times 50{,}000 + 23{,}697.2 \times 1 = 64{,}666.8$ thousand.

    That is 64.6668 million dollars.

    4. The predicted box office for a movie with the same budget produced outside

    the US is

    $-10{,}530.4 + 1.03 \times 50{,}000 + 23{,}697.2 \times 0 = 40{,}969.6$ thousand.

    That is 40.9696 million dollars.

    5. We can see by the R-squared that 31.5% of the variation in box office is ex-

    plained by the production budget and whether a movie is produced in the

    US.
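The two predictions in answers 3 and 4 differ only in the value of the usa dummy. The Python sketch below reproduces them using the rounded coefficients from the answers; it assumes, as the answers do, that budget and box office are both measured in thousands of dollars. The function name is hypothetical.

```python
def predicted_box_office(budget_thousands, usa,
                         intercept=-10530.4, budget_coef=1.03, usa_coef=23697.2):
    """Predicted box office in thousands of dollars.

    budget_thousands: production budget in thousands of dollars
    usa: 1 if the movie was produced in the US, 0 otherwise
    """
    return intercept + budget_coef * budget_thousands + usa_coef * usa


budget = 50_000  # 50 million dollars expressed in thousands

us_movie = predicted_box_office(budget, usa=1)
non_us_movie = predicted_box_office(budget, usa=0)

print(round(us_movie, 1), round(non_us_movie, 1))
print(round(us_movie - non_us_movie, 1))  # equals the usa coefficient
```

Using the unrounded budget coefficient from the table (1.032913) instead of 1.03 raises each prediction by about 146 thousand dollars, so the figures depend slightly on how the coefficients are rounded.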

    MGT 403 Statistics PRACTICE PROBLEMS

    MGT 403: Probability Modeling and Statistics

    STATISTICS: PRACTICE PROBLEMS

    This is a PRACTICE PROBLEM SET. You do NOT need to turn it in. It is optional,

    for students who would like a little more experience solving problems. Solutions will be

    posted.

    There are 3 QUESTIONS.

    Question 1

    The Internet portal Yahoo is considering allowing its members to customize their start

    pages (homepages). As part of a short survey regarding likes and dislikes, users were asked

    about their interests in options such as QuickTime movie clips with daily news and sports

    events on their pages. Yahoo hopes that QuickTime will entice users to follow a larger

    number of hyperlinks so that it can attract more advertisers.

    The newly customized page option with QuickTime links was made available to 100 Inter-

    net users who were randomly sampled from the target population. The prior benchmark

    for Yahoo has been 6 non-Yahoo content links clicked on average by its members per visit.

    It collected data on the 100 users over 1 week to see if the availability of the QuickTime

    link options significantly changes the average non-Yahoo links clicked per visit.

    Findings: After one week of access to the new customized homepage option with QuickTime

    links, Yahoo observes that the average number of non-Yahoo links for each customer

    in the sample per visit is 7.8 links and the standard deviation is 9.5 links.

    Answer the following:

    (i). Draw a graph and test the Null Hypothesis that the customization with QuickTime

    does NOT alter the average number of non-Yahoo links clicked. Use the customary

    95% Confidence Interval (t critical value is 1.96). State whether you reject the null

    hypothesis or not. Also compute the t statistic.

    (ii). Draw a new graph and show the 95% Confidence Interval for the estimated mean

    number of non-Yahoo links clicked in the sample of 100 customers. Be precise as

    to where, numerically, the boundaries of the Confidence Interval lie. Does the

    Confidence Interval include the previous average of 6 or not? How does your

    answer to this last question relate to your answer to (i)?
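The numerical part of this question needs only the summary statistics given in the findings. The following Python sketch shows one way to organize the calculation; it is not the posted solution, and it uses the critical value of 1.96 stated in the question. Variable names are hypothetical.

```python
import math

# Summary statistics from the findings
n = 100            # sampled users
sample_mean = 7.8  # average non-Yahoo links clicked per visit
sample_sd = 9.5    # standard deviation of links clicked
benchmark = 6.0    # prior benchmark, the value under the null hypothesis
critical_value = 1.96

# Standard error of the sample mean and the t statistic
standard_error = sample_sd / math.sqrt(n)
t_stat = (sample_mean - benchmark) / standard_error

# 95% confidence interval around the sample mean
ci_low = sample_mean - critical_value * standard_error
ci_high = sample_mean + critical_value * standard_error

reject_null = abs(t_stat) > critical_value
benchmark_in_interval = ci_low <= benchmark <= ci_high

print(round(t_stat, 2), reject_null)
print(round(ci_low, 3), round(ci_high, 3), benchmark_in_interval)
```

The two checks always agree: the null value lies inside the 95% interval exactly when the absolute t statistic is below 1.96, which is the link between parts (i) and (ii).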

    Question 2

    You have been hired to study executive compensation patterns. Your current project ex-

    amines CEO salaries in the 1990s. You are curious whether some of the popular statements

    about high CEO salaries during this time period are correct. You have collected data on

    CEO salaries in the 90s - the data is in the STATA dataset ceosalary.dta (available on the

    class website on Canvas).

    A widely read commentator of the time is known to have stated that average CEO compen-

    sation in the 90s (your sample period) was 1.2 million. You want to test this hypothesis.

    (i). Carry out the appropriate t test in STATA, just as we did in class and you did in

    Problem Set 1, Question 2. What is the t value? Is the Null Hypothesis rejected or

    not? What is the p value?

    (ii). You can also carry out this kind of test manually in STATA. To do this run the

    command summarize salary from the command line. This will show you the mean

    of salary as well as its standard deviation. To proceed assume the distribution

    for salary is a Normal distribution. Now compute the standard deviation of the

    Test Statistic which is the average over the observations. To do this recall that the

    standard deviation for the Test Statistic is:

    $\sigma_{\bar{X}} = \hat{\sigma} / \sqrt{N}$

    where $\hat{\sigma}$ is the estimated standard deviation of the underlying variable, and N is the size

    of the sample. Once you compute this, go out $1.96\,\sigma_{\bar{X}}$ in either direction from the sample mean to construct

    the Confidence Interval. Then check whether the Null Hypothesis value lies inside the

    Confidence Interval or not.
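The manual procedure in (ii) can be sketched in Python starting from raw observations. The salary values below are hypothetical placeholders and are not taken from ceosalary.dta; only the steps are meant to carry over. The sketch assumes salaries are recorded in thousands of dollars, so the commentator's claim of 1.2 million corresponds to 1200.

```python
import math
import statistics


def mean_confidence_interval(values, critical_value=1.96):
    """Return (low, high) for the mean, using mean +/- critical_value * sd / sqrt(N)."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (N - 1 denominator)
    standard_error = sd / math.sqrt(n)
    return (mean - critical_value * standard_error,
            mean + critical_value * standard_error)


# Hypothetical salaries in $1000, for illustration only
salaries = [1000, 1200, 1400, 1600]
null_value = 1200  # 1.2 million dollars expressed in $1000

low, high = mean_confidence_interval(salaries)
print(round(low, 1), round(high, 1))
print(low <= null_value <= high)  # True means the null is not rejected
```

With a sample as small as this placeholder list, 1.96 would not be an appropriate critical value; it is used here only because the problem specifies it for the much larger course dataset.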

    Question 2 - Continued - Regression

    Next we will run a regression to explore factors that may influence CEO Salaries. Run

    a regression in STATA in which salary is the dependent variable and the independent

    variables are: sales - sales of the company in the preceding few years; roe - return on

    equity for the company in the preceding few years; indus - a dummy variable set to 1

    if the company is in an industrial sector; finance - a dummy variable set to 1 if the

    company is in the financial sector; and utility - a dummy variable set to 1 if the company

    is a utility (often regulated). Interpret your results. Then run the model without sales.

    What is strange (or hard to interpret) about these results compared to the results when

    sales is included? How can we interpret this?

    Question 3

    Consider the VERY small dataset that consists of 3 datapoints:

    X1 = 1.0 Y1 = 200.0

    X2 = 2.0 Y2 = 145.0

    X3 = 3.0 Y3 = 20.0

    Use the formulas:

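    $\hat{\beta}_1 = \frac{\sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{N} (X_i - \bar{X})^2}$

    $\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$

    $\bar{X} = \frac{1}{N} \sum_{i=1}^{N} X_i \qquad \bar{Y} = \frac{1}{N} \sum_{i=1}^{N} Y_i$

    to compute $\hat{\beta}_0$ and $\hat{\beta}_1$. Then compute $resid_i$ for each datapoint and finally $R^2$.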

    After you have done this calculation manually, enter these 3 datapoints into STATA (or use

    the dataset Q3-practice, where this has already been done for you, on the Canvas website under

    STATS/FEINSTEIN/STATA Datasets). Run the regress command and check your work.
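As a further check on the hand calculation, the least-squares formulas can be applied to the three datapoints in a few lines of Python. This is a sketch of the textbook formulas, independent of STATA; the variable names are hypothetical.

```python
# The three datapoints from the question
xs = [1.0, 2.0, 3.0]
ys = [200.0, 145.0, 20.0]
n = len(xs)

# Sample means
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Slope: sum of cross-deviations divided by sum of squared X deviations
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
beta1 = s_xy / s_xx

# Intercept: the fitted line passes through the point of means
beta0 = y_bar - beta1 * x_bar

# Residuals: observed minus fitted values
residuals = [y - (beta0 + beta1 * x) for x, y in zip(xs, ys)]

# R-squared: 1 minus residual sum of squares over total sum of squares
ss_resid = sum(e ** 2 for e in residuals)
ss_total = sum((y - y_bar) ** 2 for y in ys)
r_squared = 1 - ss_resid / ss_total

print(beta0, beta1)
print(residuals)
print(r_squared)
```

The residuals should sum to zero up to floating-point error, which is a quick sanity check on both the manual and the computed results.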
