MA Statistics Tutorial

8/21/2019 MA Statistics Tutorial

1/61

STATISTICS TUTORIAL

FOR ECON MA STUDENTS


2/61

This tutorial offers students with a limited statistics background a concise review of and introduction to fundamental topics in the MA program. It also provides a refresher for students with more extensive statistics backgrounds.

To encourage a practical understanding, topics are presented using actual air travel data and Excel screenshots of statistical results.

There is a self-test at the end of each section to help each student evaluate their grasp of the material.

No one will grade these self-tests; responsibility rests with the student. Students are advised to review incorrect answers and to seek additional assistance in understanding them if needed. Students may contact [email protected] with questions.

Additional concise sources of information on the topics presented are available from:

Hyperstats -- http://davidmlane.com/hyperstat/

Statsoft Electronic Textbook -- http://www.statsoft.com/textbook/stathome.html


3/61

Section I

Descriptive Statistics and

Measures of Sampling Error


4/61

Air Travel Data

For 21 cities, the following data have been

recorded or computed:

City = city identifying code

Fare = cheapest coach fare in $ from Nashville to the city, as listed on Orbitz on a given day

Distance = distance in miles for the route

Fare per Mile = Fare divided by Distance


5/61

Excel Screen Shot of Data


6/61

Distribution of Fare per Mile

The histogram has a normal (bell-shaped) distribution curve superimposed

The distribution of fare per mile is similar to the normal after smoothing out the rectangles, but is slightly skewed (tilted) to the right

This graphic was produced by the statistical software package SPSS


7/61

Table 1-- Descriptive Statistics for Fare per Mile

Key Univariate Descriptive Statistics

Mean = average; here, about 28 cents per mile

Median = middle value (50th percentile); half of the values are above 28.8 cents per mile and half are below; the median is a better measure of the center of the data set when the data are highly skewed

Standard deviation = typical distance of the observations from the mean, a measure of variability; in this case, the 21 observations differ from the mean by an average of roughly 9.6 cents per mile

Range = difference between the minimum and maximum values

Skewness = degree of asymmetry; zero is perfectly symmetric; large positive values (1.0 or larger) indicate a long tail to the right; large negative values (-1.0 or smaller) indicate a long tail to the left; the value of 0.559 indicates a slight rightward skew, as shown in the graph on the prior page
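The statistics defined above can be sketched in a few lines of Python. This is only an illustration: the `fare_per_mile` values are made up (they are not the tutorial's 21 observations), and the skewness formula is the adjusted sample version that Excel documents for its SKEW function.

```python
import statistics

# Hypothetical fare-per-mile values in dollars (NOT the tutorial's data).
fare_per_mile = [0.15, 0.18, 0.22, 0.25, 0.28, 0.30, 0.33, 0.41, 0.47]

n = len(fare_per_mile)
mean = statistics.mean(fare_per_mile)
median = statistics.median(fare_per_mile)          # 50th percentile
std_dev = statistics.stdev(fare_per_mile)          # sample standard deviation (divides by n - 1)
value_range = max(fare_per_mile) - min(fare_per_mile)

# Adjusted sample skewness: n / ((n-1)(n-2)) * sum(((x - mean) / s)^3)
skewness = n / ((n - 1) * (n - 2)) * sum(
    ((x - mean) / std_dev) ** 3 for x in fare_per_mile
)

print(f"mean={mean:.3f} median={median:.3f} sd={std_dev:.3f} "
      f"range={value_range:.3f} skew={skewness:.3f}")
```

A positive skewness here reflects the few large values (0.41, 0.47) that stretch the right tail, which is also why the mean sits slightly above the median.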


8/61

Sample Statistics v. Population Parameters

The statistics reported in Table 1 are sample statistics: they summarize the 21 observations in the sample

The full set of all possible fares between all cities of interest would represent the population of fares and fares per mile

Population Parameter refers to a summary measure using all possible data; for example, the population mean or population standard deviation

The sample statistics reported in Table 1 provide estimates of these population parameters

Table 1 also provides numerical estimates of the accuracy and reliability of the sample mean in estimating the population mean (see next slide)


9/61

Table 1-- Estimates of Sampling Error

Key Univariate Descriptive Statistics

Standard Error (of the sample mean) = estimate of the likely sampling error between the sample mean and the population mean; 0.021 implies that repeated samples of the same size could easily find sample means 2.1 cents higher or lower

Confidence Level (95%) = roughly two times the standard error (for 99%, roughly 2.5 times the standard error); it provides a figure similar to the standard error, but with a wider margin for error; 0.044 with a 95% confidence level implies that about 95 out of 100 samples of this size would likely result in sample means within 4.4 cents of the population mean
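As a rough check on these figures, the standard error and the 95% margin can be reproduced from the numbers reported in this tutorial (standard deviation 0.096, n = 21, mean 0.282). The multiplier 2.086 is an assumption taken from a standard t-table for 20 degrees of freedom; it is the more precise version of "roughly two".

```python
import math

sample_mean = 0.282   # mean fare per mile reported in the tutorial
std_dev = 0.096       # sample standard deviation reported in the tutorial
n = 21                # number of cities

# Standard error of the mean = standard deviation / sqrt(sample size)
standard_error = std_dev / math.sqrt(n)

# Assumption: tabulated two-sided 95% t critical value for n - 1 = 20 df
t_critical = 2.086
margin = t_critical * standard_error

lower = sample_mean - margin
upper = sample_mean + margin

print(f"standard error = {standard_error:.3f}")
print(f"95% margin     = {margin:.3f}")
print(f"95% interval   = {lower:.3f} to {upper:.3f}")
```

Rounded to three decimals, the sketch gives a standard error of 0.021 and a margin of 0.044, in line with the values quoted from Table 1.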


10/61

How Reliable are Sample Error Estimates?

Standard Errors and Confidence Intervals estimate sampling error

Sampling Error is error arising because one is using less than the entire population

To accurately estimate population parameters and sampling error, samples must be representative of the population

Randomly selected samples are the best (though not foolproof) way of assuring this

Error not related to sample selection (question bias, response bias, dishonest responses, data entry errors, and so on) must be small relative to the size of the sampling error

This kind of error is called non-sampling error


11/61

Using Sampling Error in Testing Claims

(Hypothesis Testing)

Estimates of sampling error permit claims or conjectures (hypotheses) concerning population parameters to be tested with sample statistics while taking into account a margin for error

Testing a claim for the population mean:

Suppose someone thinks that the mean fare per mile for the full population is 30 cents or higher

Given the sample mean (0.282) and the standard error of 0.021, it is quite likely that another sample would yield an estimate of 30 cents or higher

If we double the standard error to get a 95% confidence interval with a margin for error of 0.042, the interval runs from about 0.240 to 0.324, so the claim of 30 cents or higher remains plausible

In contrast, if someone were to claim that the mean is 35 cents or higher, the standard error and confidence interval suggest that such a figure is not very likely, since 0.35 lies outside the interval


12/61

Testing Claims with P-values

Put briefly, a p-value shows the likelihood of obtaining the sample estimate by chance if the null hypothesis were true

Take the claim of a mean of .30, tested here (using SPSS software) given the sample mean of 0.282 and s.e. of 0.021

The estimated p-value (called Sig. (2-tailed)) is 0.425

The probability of finding a sample mean this far from 0.30 by chance is 42.5 percent

Typically, we reject the null only if this p-value is below a 5 percent threshold

Note: our test is really 1-tailed since we are testing greater than 0.30. We should cut the p-value in half to 0.2125 (21.25 percent), but this is still well above 0.05

One-Sample Test (Test Value = .30)

                                                              95% Confidence Interval of the Difference
               t       df   Sig. (2-tailed)   Mean Difference   Lower     Upper
farepermile    -.814   20   .425              -.01714           -.0611    .0268


13/61

Testing Claims with P-values

Now, test a mean of .35 or higher:

The estimated p-value (called Sig. (2-tailed)) is 0.005

The probability of finding such a value by chance is 0.5 percent, which is far below the 5 percent threshold even before cutting it in half for a 1-tailed test

The p-value indicates that there is only a 0.5 percent chance of finding our mean of 0.282 if the true mean were 0.35 or higher

The null hypothesis of a mean of 0.35 or higher is rejected

One-Sample Test (Test Value = .35)

                                                              95% Confidence Interval of the Difference
               t        df   Sig. (2-tailed)   Mean Difference   Lower     Upper
farepermile    -3.189   20   .005              -.06714           -.1111    -.0232
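The t-statistics in the two SPSS tables can be reproduced by hand as (sample mean - test value) / standard error. The sketch below backs the sample mean and standard error out of the reported Mean Difference and t values, so both inputs are approximations; the 2.086 cutoff is an assumed t-table value for 20 degrees of freedom, and the function name is hypothetical.

```python
def t_statistic(sample_mean, test_value, standard_error):
    """One-sample t-statistic: distance from the test value in standard errors."""
    return (sample_mean - test_value) / standard_error

# Backed out of the SPSS output (approximate):
sample_mean = 0.28286      # 0.30 + (-0.01714), the reported mean difference
standard_error = 0.02106   # -0.01714 / -0.814

T_CRITICAL_95 = 2.086      # assumption: two-sided 95% t-table value for 20 df

for test_value in (0.30, 0.35):
    t = t_statistic(sample_mean, test_value, standard_error)
    decision = "reject" if abs(t) > T_CRITICAL_95 else "do not reject"
    print(f"test value {test_value:.2f}: t = {t:.3f} -> {decision} the null")
```

The first test value gives a t-statistic well inside the cutoff (the null is not rejected), while the second gives a t-statistic beyond it (the null is rejected), matching the p-values of 0.425 and 0.005.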


14/61

Sidebar on Hypothesis Testing

In the previous slide, the proposition that the mean was 0.35 or higher was tested using the p-value

Any time that a p-value appears, a null hypothesis is being tested

The proposition being examined is called the null hypothesis

Using p-values from the output of software is the simplest way of testing a hypothesis

With small data sets, especially with small effects being tested, a p-value may not be below 0.05. This does not mean that the null hypothesis is true

It may indicate that the test lacks Power to reject a false null (due to lack of data); see Statsoft textbook under xxxxxxx for further information


15/61

Sidebar on Hypothesis Testing

In addition to p-values, t-statistics and confidence intervals (all derived from standard errors) can also test a hypothesis

As a rule of thumb, t-values greater than 2 in absolute value are roughly equivalent to p-values below 0.05


16/61

Self Test Section I

The self-test uses a data set on 5K running times; the raw data appear on the next slide; the variables are

Time = 5k time in minutes (decimals are fractions of minutes)

Age = age in years

Intervals = 1 if hard interval workouts were used and 0 if not

Miles Per Week = number of miles per week in training at peak of training


17/61


18/61

Self-Test for Section I

1. The measure that provides the middle or 50th percentile observation is

a. 19.30

b. 19.50

c. 0.800

d. 19.00

2. The statistic that indicates how spread out the individual 5k times are from the average time is

a. 3.250

b. 0.160

c. 0.800

d. 0.192

3. Based on the data, you can say that the times are

a. Nearly symmetric

b. Highly skewed to the right

c. Highly skewed to the left

d. Not enough information


19/61

Self-Test for Section I

4. The likely sampling error for the mean is

a. 0.160

b. 0.192

c. 0.800

d. 3.250

5. The 95% confidence interval for the mean is computed by

a. Multiplying the standard error by about 2.0

b. Multiplying the standard deviation by 95%

c. Dividing the range by about 10

d. Dividing the mean by the sample size

6. The value for Age for the second observation is

a. 42

b. 21

c. 22

d. 44


20/61

Self-Test for Section I

7. In the output, a test of the mean is provided. The null hypothesis being tested is

a. That the population mean equals 19.3

b. That the population mean equals -4.373

c. That the population mean equals -0.700

d. That the population mean equals 20

8. The results in the table provide a 2-tailed test. To compute a 1-tailed test, you would

a. Double the p-value

b. Divide the t-statistic by two

c. Divide the p-value by two

d. Double the size of the confidence interval

9. Which of the following indicates that the null hypothesis should be rejected?

a. t = -4.373

b. p-value (Sig. 2-tailed) = 0.000

c. Both a and b

d. Neither a nor b

One-Sample Test (Test Value = 20)

                                                        95% Confidence Interval of the Difference
        t        df   Sig. (2-tailed)   Mean Difference   Lower      Upper
Time    -4.373   24   .000              -.70000           -1.0304    -.3696


21/61

Correct Answers to Self-Test Section I

1. A

2. C

3. A (the skewness statistic is very small, 0.192, indicating only a slight amount of positive skew; 0 would be perfectly symmetric; values above 1.0 or below -1.0 would indicate substantial asymmetry)

4. A

5. A

6. C (go back to the original data sheet for this)

7. D (this is indicated by the Test Value = 20 in the SPSS output)

8. C (the test provided is 2-tailed because it tests whether the mean equals 20 or not; 1-tailed would test whether it was 20 or more)

9. C (the p-value is less than the typical 0.05 threshold for rejecting the null hypothesis; the t-value's absolute value is greater than 2.0)


22/61

Section II

Regression Analysis


23/61

Relationships Between Variables

In economics, investigators are frequently interested in how one variable interacts with another; Example: sales and income

Often, one of the variables causes changes in the other, such as higher incomes causing more sales

The causal variable is referred to as the X, Independent, or Explanatory Variable

The responding variable is referred to as the Y or Dependent Variable

Sometimes the relationship is not causal but merely one of association because of links to a third variable

Example: SAT & ACT scores, which are both caused by academic ability and achievement

The statistical technique most frequently used to examine relationships between variables is Regression Analysis or some technique that is very similar to regression analysis

Regression analysis can be used for all kinds of data and relationships including

Linear relationships and Curved relationships

Quantitative data and Qualitative data

Cross-sectional and time series data

The following slides present the simplest form of Regression Analysis

A quantitative dependent variable (Air Fare) and one quantitative independent variable (Distance)

The relationship is treated as linear


24/61

Scatterplot for Fare & Distance

The scatterplot presented in Figure 2 depicts the 21 Fare (Y-axis) and Distance (X-axis) combinations in the data set

The graph shows that as distance increases, fare also tends to increase, but that the relationship is not perfect; otherwise, the points would lie on a straight line

Fig. 2 -- Scatterplot of Fare (Y) and Distance (X) [plot not reproduced; X-axis: Distance, 0 to 2500; Y-axis: Fare, 0 to 400]


25/61

Regression from a Visual Standpoint

Figure 3. Scatterplot and Regression Plot for Fare-Distance [plot not reproduced; X-axis: Distance, 0 to 2500; Y-axis: Fare, 0 to 400]

Figure 3 adds another element to the plot: a straight line of points (a line connecting the pink points)

These points represent the regression line that Excel chose as the straight line that best fit the scatterplot points

Software chooses the line to minimize the sum of the squared vertical distances between the blue points and the pink line; this method is called the Least Squares or Ordinary Least Squares (OLS) method and is widely used
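For a single X-variable, the least-squares line has a closed-form solution, which the following sketch computes by hand. The `distance` and `fare` lists are hypothetical values (not the tutorial's 21 cities), and the function names are illustrative only; this is not a description of how Excel implements its regression tool.

```python
def ols_fit(x, y):
    """Return (intercept, slope) of the least-squares line for paired data."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # slope = sum of cross-deviations / sum of squared X deviations
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def sse(x, y, intercept, slope):
    """Sum of squared vertical distances between the points and the line."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

# Hypothetical data (NOT the tutorial's data set)
distance = [200, 500, 800, 1200, 2000]
fare = [170, 210, 215, 270, 320]

intercept, slope = ols_fit(distance, fare)
print(f"Fare = {intercept:.1f} + {slope:.3f} * Distance")
print(f"sum of squared errors = {sse(distance, fare, intercept, slope):.1f}")
```

Any other line through the same points, for example one with the intercept shifted by a dollar, produces a larger sum of squared errors; that is what "best fit" means under OLS.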


26/61

Fare-Distance Regression as Tabular Output


27/61

Regression in Table Form

Table R1 presents the same regression results as Figure 3

The Regression Statistics and ANOVA parts of the table evaluate the overall performance of the regression in predicting Fare to different cities

The bottom part, with Coefficients for Intercept and Distance, presents the regression line as numbers that can be put into an equation, along with estimates of sampling error

The following slides break down the different parts of the table


28/61

Regression output always implies an equation written generally as

Y = b0 + b1*X

b0 = y-intercept

b1 = slope (change in Y over change in X)

b0 and b1 are referred to as regression coefficients, or as the intercept coefficient and slope coefficient

The pink line in Figure 3 can be written down as an equation

Recall the slope-intercept form of a line (y = mx + b) from basic algebra: if you draw a line through the pink points in Figure 3 and extend it to where Distance (X) = 0, the intercept should be obvious

The equation for this line is

Fare = 157 + 0.084*Distance + Error, where 157 is the intercept and 0.084 is the slope

            Coefficients
Intercept   157.614
Distance    0.084


29/61

Slope & Intercept Meaning

The slope indicates that for every additional mile of Distance, the Fare increases by $0.084 (about 8 cents)

The slope produced in regression analysis always shows the amount of increase in Y (or decrease if negative) for a 1 unit increase in X

To correctly interpret the slope for a regression, it is critical to know the units in which X and Y are measured; here, the units are miles and dollars

A 100 mile increase implies an $8.40 (100 x 0.084) increase in Fare

The y-intercept indicates that if distance were 0, the fare would be $157

The intercept in this case is not an economically meaningful number because there are no flights of 0 miles

The intercept merely extends the line to the Y-axis (where X = 0) for statistical purposes

Be aware of the relevant range (min, max) of the X-variable


30/61

Regression Line Errors (Residuals)

Using the regression equation, Y-values for given X-values can be calculated

Predicted Y = intercept + slope*(X-value)

Example: Observation 1 is Dallas, with a distance of 600 miles:

Predicted Fare = 157.6 + 0.084*(600) = 208

(Excel's prediction is 208.310; we rounded)

The regression Error (residual) = Actual Y value - Predicted Y value

For Dallas (observation 1), the actual fare was $250, so we calculate

Residual = 250 - 208.310 = 41.690

Each observation has a predicted fare and error associated with it
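The Dallas calculation can be written as two small functions. The coefficients are the rounded values shown in the tutorial's output, so the hand prediction comes out slightly below Excel's 208.310, which uses an unrounded slope; the function names are hypothetical.

```python
INTERCEPT = 157.614   # intercept coefficient from the tutorial's output
SLOPE = 0.084         # slope coefficient (rounded) from the tutorial's output

def predicted_fare(distance_miles):
    """Predicted Y = intercept + slope * X."""
    return INTERCEPT + SLOPE * distance_miles

def residual(actual, predicted):
    """Regression error = actual Y - predicted Y."""
    return actual - predicted

# Dallas: 600 miles, actual fare $250
hand_prediction = predicted_fare(600)
print(f"hand prediction with rounded slope: {hand_prediction:.3f}")

# Using Excel's unrounded prediction reported in the tutorial
excel_prediction = 208.310
print(f"residual: {residual(250, excel_prediction):.3f}")
```

A positive residual means the actual fare is above the regression line; a negative residual means it is below.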


31/61

R Square reports the share of the variation in the Y-variable explained by the X-variable

In other words, it expresses (as a percent) how close the regression line points come to predicting the actual scatterplot points

The maximum R-square is 1.0 (100%) and the minimum is 0

In this case, Distance, by itself, can account for 48.6% of the Fare differences between cities

In a 2-variable regression like this one, the Multiple R is the same thing as the Correlation Coefficient between X and Y (reported without its sign)

The R-square is the squared correlation coefficient in such cases

The correlation coefficient ranges from -1.0 to 1.0; values of 1.0 or -1.0 indicate perfect correlation and 0 indicates no linear relationship

It can take on positive or negative values depending on the direction of the relationship between the two variables

Regression Statistics
Multiple R           0.697
R Square             0.486
Adjusted R Square    0.459
Standard Error       43.294
Observations         21.000
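The links among Multiple R, R Square, and Adjusted R Square in this table can be checked with a little arithmetic. The adjusted R-squared formula used below, 1 - (1 - R^2)(n - 1)/(n - k - 1), is the standard textbook one; the tutorial itself does not state it, so treat it as an assumption about how the table value was produced.

```python
multiple_r = 0.697   # from the Regression Statistics table
n = 21               # observations
k = 1                # number of X-variables (Distance only)

# With one X-variable, R Square is the squared correlation coefficient
r_squared = multiple_r ** 2

# Standard adjustment for the number of X-variables (assumed formula)
adjusted_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

print(f"R Square          = {r_squared:.3f}")
print(f"Adjusted R Square = {adjusted_r_squared:.3f}")
```

Both results round to the values in the table (0.486 and 0.459).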


32/61

Regression Coefficient Accuracy

Just like the sample mean, the regression coefficients are sample statistics that are usually used to estimate what the true relationship would be if all possible data were used

Regression coefficients, therefore, also have standard errors that estimate their sampling error

The slope coefficient for distance (0.08) has a standard error of 0.02

This implies that the population parameter (regression coefficient using all possible data) may easily be 2 cents higher or lower than the 0.08 coefficient estimated by this sample

For a wider (approx. 95%) margin for error, this standard error can be multiplied by about 2.0


33/61

More on Regression Coefficient Accuracy

The t-stat and p-value are also ways of assessing the reliability of the coefficient

They test whether the coefficient is significantly different from zero

As a rule of thumb, if the t-statistic is greater than 2.0 (or less than -2.0), the coefficient is viewed as significantly different from zero

The t-Stat on Distance is 4.239, so it is statistically significant

The p-value estimates the likelihood of finding the coefficient of 0.084 by mere chance if the true value were zero

The p-value of 0.000 indicates that this would be very unlikely, also showing a statistically significant result

In scientific research, p-values below 5 percent (0.05) are taken as statistically significant

In other settings, the cutoff level for the p-value may vary


34/61

Expanded Regression Analysis

In most situations in economics, investigators look at the effects of multiple variables on a dependent variable when using regression analysis

Example: price and income effects on sales

Such regressions are sometimes called multiple regression analysis and involve only slight modifications of the earlier points

Also, economists widely use qualitative variables as independent variables. When these take on only two values (male, female) they are usually coded as (1,0) and called binary or dummy variables

In the Air Travel data, we have such a variable, Direct SWA, that indicates whether Southwest Airlines flies this route directly (1) or not (0). This variable is added to the regression analysis, resulting in the following Excel output:


35/61

Fare Regression with Distance and Direct SWA


36/61

The regression equation is now

Fare = 193 + 0.08*Distance - 66*Direct SWA + Residual

The slope coefficient for Distance is still about 0.08

The y-intercept coefficient was 157; it is now 193

The Direct SWA variable has these effects:

When SWA = 0 (when SWA does not fly that route), the regression equation is

Fare = 193 + 0.081*Distance, because -66*(0) = 0

When SWA = 1 (when SWA flies the route), the regression equation is

Fare = 193 + 0.081*Distance - 66*(1) = 127 + 0.081*Distance

Note that the SWA dummy variable only influences the y-intercept

The SWA variable does not influence the slope for distance (see next slide)

              Coefficients   Standard Error   t Stat    P-value
Intercept     193.032        14.411           13.395    0.000
Distance      0.081          0.012            6.698     0.000
Direct SWA    -66.779        11.446           -5.834    0.000
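A short sketch makes the "shift in intercept only" point concrete and also shows where the t Stat column comes from (coefficient divided by standard error). The coefficients and standard errors are the rounded values from the table above; the 1000-mile route is a made-up example and the function name is hypothetical.

```python
# Rounded coefficients and standard errors from the regression table
COEF = {"Intercept": 193.032, "Distance": 0.081, "Direct SWA": -66.779}
STD_ERR = {"Intercept": 14.411, "Distance": 0.012, "Direct SWA": 11.446}

def predicted_fare(distance_miles, direct_swa):
    """direct_swa is the dummy variable: 1 if Southwest flies the route, else 0."""
    return (COEF["Intercept"]
            + COEF["Distance"] * distance_miles
            + COEF["Direct SWA"] * direct_swa)

# Hypothetical 1000-mile route, without and with direct Southwest service
without_swa = predicted_fare(1000, 0)
with_swa = predicted_fare(1000, 1)
print(f"no SWA: {without_swa:.2f}  SWA: {with_swa:.2f}  "
      f"difference: {with_swa - without_swa:.3f}")

# t Stat = coefficient / standard error
for name in COEF:
    print(f"{name}: t = {COEF[name] / STD_ERR[name]:.3f}")
```

The fare difference between the two cases equals the Direct SWA coefficient at any distance, which is why the two fitted lines are parallel. The t-statistic for Distance computed this way (6.75) differs a little from the table's 6.698 because the table's coefficient and standard error are rounded.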


37/61

Distance Line Fit Plot [plot not reproduced; X-axis: Distance, 0 to 2500; Y-axis: Fare, 0 to 400]

The line connecting the upper pink dots shows the regression line when SWA = 0

The line connecting the lower pink dots shows the regression line when SWA = 1

The Fare-Distance slope for both lines is 0.08



38/61

Another important difference that results from adding the SWA variable is the increase in the R-Square value

It is now 82.2% (it was about 48% when using only Distance)

The combination of Distance and Direct SWA accounts for 82.2% of the differences in Fares across cities

Adding SWA increased this value by about 34 percentage points

Table R2. Regression with Multiple X-

Regression Statistics
Multiple R           0.907
R Square             0.822
Adjusted R Square    0.802
Standard Error       26.161
Observations         21.000


39/61

From the regression predictions and errors, Excel (and other software) computes an Analysis of Variance, or ANOVA

The F-Statistic is the most important number here; it is the ratio of the mean regression sum of squares to the mean residual sum of squares

Unlike the R-Square value, the F-statistic adjusts for the number of variables used

The Significance F is simply a p-value for the F-statistic, testing the null hypothesis that the X-variables as a group have no effect on Y; with this data, this null hypothesis is rejected because the p-value is very low

In effect, the F-statistic tests whether the X-variables, as a group, matter in explaining the Y-variable

The SS in the table refers to Sum of Squares

The Residual SS simply squares the individual errors and adds them up

MS refers to the mean sum of squares, which divides each SS by its degrees of freedom (df); for the residual, df is the number of observations minus the number of coefficients estimated in the regression

The Regression (predicted) SS squares the differences between the predicted values for Fare and the mean Fare and then adds them up

The Total SS adds the Regression and Residual SS together

The R-Square is simply the regression sum of squares divided by the total

The Adjusted R-squared, like the F-statistic, adjusts for the number of variables used

ANOVA
              df       SS           MS           F        Significance F
Regression    2.000    56977.083    28488.541    41.627   0.000
Residual      18.000   12318.727    684.374
Total         20.000   69295.810
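Every derived number in the ANOVA table follows from the two sums of squares and their degrees of freedom. The sketch below recomputes them from the table values; the observation that the regression Standard Error equals the square root of the residual MS is standard, though the tutorial does not state it.

```python
import math

# From the ANOVA table
ss_regression = 56977.083
ss_residual = 12318.727
df_regression = 2    # number of X-variables
df_residual = 18     # 21 observations - 3 estimated coefficients

ss_total = ss_regression + ss_residual

# Mean squares: each SS divided by its degrees of freedom
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual

f_statistic = ms_regression / ms_residual
r_squared = ss_regression / ss_total
standard_error = math.sqrt(ms_residual)   # "Standard Error" in Regression Statistics

print(f"SS total       = {ss_total:.3f}")
print(f"F              = {f_statistic:.3f}")
print(f"R Square       = {r_squared:.3f}")
print(f"Standard Error = {standard_error:.3f}")
```

The results line up with the F of 41.627, the R Square of 0.822, and the Standard Error of 26.161 reported in the two tables.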


40/61

Regression Pointers

Regressions that are well done have residuals that have no obvious patterns and are roughly bell shaped; checking the residuals for these and other characteristics is called Residual Analysis

Regressions that leave out key explanatory (X) variables can yield misleading slopes; this is called Omitted Variables Bias

Regressions leaving out key variables should be viewed as exploratory or preliminary in nature

There is no magical R-squared value to be obtained; if a model is put together well, then a low R-squared is fine; if a model has key flaws in it, then a high R-squared value does not make it good

Only humans can determine if a regression is causal (Income-Sales) or merely associative (SAT-ACT); the software treats both cases the same


41/61

Self Test Section II

The self-test again uses a data set on 5K running times, shown on the next slide

Time = 5k time in minutes (decimals are fractions of minutes)

Age = age in years

Intervals = 1 if hard interval workouts were used and 0 if not

Miles Per Week = number of miles per week in training at peak of training


42/61



43/61

For These Questions, Refer to this Output


44/61

1. The regression equation depicted by the table is

a. 5k Time = 0.731 + Age + Intervals + Residual

b. 5k Time = 17.554 + Age*Intervals + Residual

c. 5k Time = 17.554 + 0.071*(-0.863)*Age*Intervals + Residual

d. 5k Time = 17.554 + 0.071*Age - 0.863*Intervals + Residual

2. The percent of 5k time differences accounted for by Age and Intervals in the regression model is

a. 0.731

b. 17.554

c. 12.660

d. 0.535

3. The slope coefficient for Age is

a. 0.071

b. 0.731

c. 17.554

d. 0.016



45/61

4. The likely sampling error in the slope coefficient for Age is

a. 0.071

b. 0.731

c. 17.554

d. 0.016

5. The slope coefficient for Age implies that

a. For each 1 minute increase in Time, Age increases by 0.071 years

b. For each 1 year increase in Age, Time increases by 1 minute

c. For each 1 year increase in Age, Time increases by 0.071 minutes

d. For each 1 year increase in Time, Age increases by about 53%

6. The regression results imply that if Age were 0, then Time would be

a. 0.731

b. 12.660

c. 24.000

d. 17.554



46/61

7. The value in the preceding question

a. Means that a newborn baby would be predicted to run this time in a 5k

b. Means that the value is really only a hypothetical extension of the regression line because none of the actual data go back to zero years of Age

c. Means that the regression is not reliable at any values

d. Means that babies should compete in the Olympics

8. The coefficient for Intervals implies that

a. When interval equals 1, the Age slope is reduced by 0.863

b. When interval equals 0, the y-intercept value is reduced by 0.863 minutes

c. When interval equals 1, the Age slope is the same but the entire regression line shifts down by 0.863 minutes

d. When interval equals 0, the Age slope is the same but the entire regression line shifts down by 0.863 minutes

9. If you wanted to compute the effects of 10 more years of Age on the predicted 5k Time, you should multiply

a. 0.10 x 0.071

b. 10 x 0.071

c. 10 x 1.0

d. 100 x 0.071



47/61

10. The predicted value for 5k Time when a person is 47 and using intervals in

training would be found by which of the following equations?

a. Predicted 5k Time = 17.554 + 0.071*(47)

b. Predicted 5k Time = 0.731 + 0.072*(47) - 0.863*(1)

c. Predicted 5k Time = 17.554 + 0.071*(47) - 0.863*(1)

d. Predicted 5k Time = 0.071*(47) - 0.863*(1)

11. Using the data sheet provided earlier, compute the residual for the first observation. (Note: you will first have to compute the predicted time)

a. -0.545

b. 0.631

c. -0.034

d. 1.232

12. The data provided on the accuracy of the coefficients indicates that

a. None are significantly different from zero

b. Age is significantly different from zero but not Intervals

c. Intervals is significantly different from zero but not Age

d. All are significantly different from zero


48/61

Correct Answers Section II Self Test

1. D

2. D

3. A

4. D

5. C

6. D

7. B

8. C

9. B (the slope for a 1 unit (year) change in Age is 0.071; a 10 year change is simply 10 x slope)

10. C (the prediction uses the intercept, the Age slope, and the Intervals coefficient together)

11. A (Predicted Time = 17.554 + 0.071*(21) - 0.86*(0) = 19.045; Residual = Actual - Predicted = 18.50 - 19.045 = -0.545)

12. D (All of the p-values for the coefficients are below the 0.05 threshold for significance; all of the t-statistics are above 2.0 in absolute value, the rule-of-thumb value for significance)
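The arithmetic behind questions 10 and 11 can be checked with the regression equation from question 1. The coefficients come from the self-test output; the first observation's Age (21), Intervals (0), and actual Time (18.50) are taken from the answer key above; the function name is hypothetical.

```python
def predicted_5k_time(age, intervals):
    """5k Time = 17.554 + 0.071*Age - 0.863*Intervals (Intervals is 1 or 0)."""
    return 17.554 + 0.071 * age - 0.863 * intervals

# Question 10: a 47-year-old who uses interval training
q10_prediction = predicted_5k_time(47, 1)
print(f"Q10 predicted time: {q10_prediction:.3f} minutes")

# Question 11: first observation (Age 21, no intervals, actual time 18.50)
q11_prediction = predicted_5k_time(21, 0)
q11_residual = 18.50 - q11_prediction
print(f"Q11 predicted time: {q11_prediction:.3f}, residual: {q11_residual:.3f}")

# Question 9: effect of 10 more years of Age on predicted time
ten_year_effect = predicted_5k_time(31, 0) - predicted_5k_time(21, 0)
print(f"Q9 ten-year effect: {ten_year_effect:.2f} minutes")
```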


49/61

Section III

Statistical Software


50/61

Overview

Personal computers and software make it possible for almost anyone to complete the complicated or lengthy computations needed for statistics; knowing what to do with them is the hard part

Excel contains many useful statistical and graphing capabilities; these are introduced in the next few slides

Software dedicated to statistical operations vastly expands the breadth of procedures possible and makes some much easier than in Excel. Some commonly used statistical software includes:

SAS (www.sas.com); the company offers many varieties; JMP is a point-click product; SAS is available in some places at WKU

SPSS (www.spss.com); this software is available in most computer labs on campus; it is not as widely used by economists as SAS but contains most of the same features, especially for basic purposes

Stata (www.stata.com) is widely used by economists and contains broad and very powerful tools

Eviews (www.eviews.com) is also very powerful and especially useful for time series and forecasting applications

Both Stata and Eviews provide point-click functionality


51/61

Excel Stat Introduction 1

Making Application

While there is no self-test with this section, you are strongly encouraged to practice on Excel; even if you use other software in later classes, the practice in Excel will be helpful

One of the main differences between Excel and the spreadsheets in statistical software is that Excel is address driven (each cell has an address), whereas the stat software is variable driven: once a column of data exists for a variable, the entire column can be manipulated simply by referring to the name


52/61


Click the Tools menu in Excel; if Data Analysis appears as an option you may skip to the next slide; if not, then

Select the Add-Ins option under the Tools menu

Check the box for Analysis ToolPak

The Data Analysis option should now appear under the Tools menu

(Note: If you opened Excel from your desktop, the procedures above

should work; if you happened to open Excel by opening an Excel-

based spreadsheet while browsing on the internet, it may not work)


53/61


Take one of the data sheets, Air Travel or 5k Times, used in this tutorial and enter the data into Excel. The instructions here proceed using the Air Travel data.

To compute descriptive statistics for a variable

Select the Tools menu

Select the Data Analysis option

Select the Descriptive Statistics option

Click on the icon next to the blank for Input Range

Highlight the column for Fare including the label

Check the Labels in the First Row box

Check the Summary Statistics box

Check the Confidence Interval for the Mean box

Click the OK button


54/61


You should now have an output table on a new sheet

One disadvantage of Excel is that statistical output tables like this one tend to be collapsed or condensed and need to be formatted

Formatting the output table (this is something you should always do in Excel)

Highlight the columns with the table

Select the Format menu

Select the Column and AutoFit Selection options

Again, select the Format menu

Select the Cells option

In the Number menu, choose the Number option

Pick a number for the Decimal Places box (the number of decimal places depends somewhat on the data; 3 will be fine here)

Make sure to do this step in Excel; tables with a lot of insignificant decimal places are very messy to read


55/61


Return to the original data sheet

Create a regression analysis:

Select the Tools menu and the Data Analysis option

Select the Regression Analysis option in the window

Select the icon next to the Input Y Range blank and highlight the data containing Fare including the label

Select the icon next to the Input X Range and highlight the data containing Distance and Direct SWA including the labels

(Note: if you try to highlight the whole columns you may get an error)

Check the Labels box, the Residuals box, and the Line Fit box

Select the OK button and reformat the output tables as before

You will also need to reformat the Line Fit plot (another small hassle in Excel); just expand it using the mouse


56/61


Return to the original data sheet

Charts in Excel

Excel can also be used to create scatterplots, histograms, and other types of plots

This is an area where statistical software is much easier to use

If you want to tinker some, click on the Chart Wizard icon that should appear below the top level menus

The icon has the appearance of a bar chart

Also, under the Data menu, there is a Pivot Table and Pivot Chart option that provides further capabilities

If you would like a hands-on introduction to other statistical software, please contact Brian Goff at [email protected]. Several other economics professors can also provide assistance in becoming acquainted with software.


57/61

Probability Distributions

A final topic briefly introduced here is that of probability distributions (PDs)

A PD is a formula (often presented as a graphic or table) that links values of a variable with the probability of those values

PDs are used in many ways; for statistics, one of the key uses is to assess hypotheses, including the use of t-statistics and p-values

Statistical software makes an extensive knowledge of PDs unnecessary because the relevant information about the PD is stored by the computer and used as needed; however, a few basic points are worthwhile even for basic statistics users


58/61

Probability Distributions 2

PDs have a center, dispersion, and symmetry or skew (asymmetry)

measures of location of center include mean & median

measures of dispersion include the standard deviation and range

PDs have tails (the ends), measured by the amount of kurtosis

Normal (Probability) Distribution

Most widely known due to its bell shape

Many real life situations are approximately (though not perfectly) distributed Normal

The mother of PDs, in that many other distributions are related to it or converge to it with large samples or other conditions

t-Distribution

Also bell-shaped

Is wider in its tails than the normal but converges to it with large samples

Binomial Distribution: deals with 2 outcome situations

F-Distribution, Chi-Square Distribution: commonly used distributions when the topic is variability



59/61

Excel permits PDs to be used directly if desired

Click on the function icon (the script f) just below the top menus

Select Statistical in the window and scroll to the desired distribution such as

NORMDIST for normal

We can now produce probabilities for a variable assumed to be normal or near

normal

Example: Let's assume that male height is approx. Normal with a mean of 70 inches and a standard deviation of 2 inches; what is the probability of finding someone taller than 74 inches?

In the NORMDIST window, plug in 74 for X, 70 for Mean, and 2 for Standard Deviation

In the Cumulative box, put True

Excel will produce a number that is the probability of being 74 or less (that is, the cumulative probability)

This number is 0.977

The probability of being taller than 74 is 1 - .977 = .023 or 2.3%

The same or similar procedures can be used for 2 outcome (binomial) problems

or many others and opens up a wide array of uses
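The same height example can be checked outside Excel. The sketch below uses `statistics.NormalDist` from the Python standard library as a stand-in for NORMDIST with Cumulative set to True; the mean of 70 and standard deviation of 2 are the assumed values from the example.

```python
from statistics import NormalDist

# Assumed distribution of male height from the example: mean 70 in., sd 2 in.
height = NormalDist(mu=70, sigma=2)

# Cumulative probability of being 74 inches or shorter
p_at_most_74 = height.cdf(74)

# Probability of being taller than 74 inches
p_taller_than_74 = 1 - p_at_most_74

print(f"P(height <= 74) = {p_at_most_74:.3f}")
print(f"P(height >  74) = {p_taller_than_74:.3f}")
```

Since 74 inches is exactly two standard deviations above the mean, the upper-tail probability is a little over 2 percent, consistent with the 0.977 and 0.023 figures in the example.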



60/61

Clockwise from Left Corner:

Normal, t-, F-, and Chi-Square Distributions


61/61

A gallery of PDs and more background is offered at the Engineering Statistics Handbook: http://www.itl.nist.gov/div898/handbook/eda/section3/eda366.htm
