Malimu descriptive statistics

Descriptive Statistics

Five types of statistical analysis

• Descriptive: What are the characteristics of the respondents?

• Inferential: What are the characteristics of the population?

• Differences: Are two or more groups the same or different?

• Associative: Are two or more variables related in a systematic way?

• Predictive: Can we predict one variable if we know one or more other variables?

Summarization of a collection of data in a clear and understandable way

The most basic form of statistics; it lays the foundation for all statistical knowledge

Descriptive Statistics

Measures of central tendency • mean, median, mode

Measures of dispersion • range, standard deviation, and coefficient of variation

Measures of shape • skewness and kurtosis

• If you use fewer statistics to describe the distribution of a variable, you lose information but gain clarity.

Type of measurement                  Type of descriptive analysis
Nominal, two categories              Frequency table; proportion (percentage)
Nominal, more than two categories    Frequency table; category proportions (percentages); mode
Ordinal                              Rank order; median
Interval                             Arithmetic mean
Ratio                                Means

Data Tabulation

• Tabulation: The organized arrangement of data in a table format that is easy to read and understand.
– A count of the number of responses to each question.

• Simple Tabulation: tabulating of results of only one variable informs you how often each response was given.

• Frequency Distribution: A distribution of data that summarizes the number of times a certain value of a variable occurs expressed in terms of percentages.

The arrangement of statistical data in a row-and-column format that exhibits the count of responses or observations for each category assigned to a variable
• How many of certain brand users can be called loyal?
• What percentage of the market are heavy users and light users?
• How many consumers are aware of a new product?
• What brand is the “Top of Mind” of the market?

Frequency Tables

More on relative frequency distributions

• Rules for relative frequency distributions:
– Make sure each observation is in one and only one category.
– Use categories of equal width.
– Choose an appealing number of categories.
– Provide labels.
– Double-check your graph.

A histogram is a relative frequency distribution of a quantitative variable

How did you find your last job?

Source                     Count    Percent
Networking                   643     55.2 %
Print ad                     213     18.3 %
Online recruitment site      179     15.4 %
Placement firm               112      9.6 %
Temporary agency              18      1.5 %

A bar graph is a relative frequency distribution of a qualitative variable

How many times per week do you use mouthwash ?

1__ 2__ 3__ 4__ 5__ 6__ 7__

1 1 2 2 2 3 3 3 3 3 4 4 4 4 4 4 4 5 5 5 5 5 6 6 6 7 7

Times per week    Frequency
1                 2
2                 3
3                 5
4                 7
5                 5
6                 3
7                 2

[Figure: bar chart of the frequencies for 1 to 7 uses per week]
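The tally above can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library; the variable names are illustrative and not part of the original slides.

```python
from collections import Counter

# Raw responses to "How many times per week do you use mouthwash?"
responses = [1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4,
             5, 5, 5, 5, 5, 6, 6, 6, 7, 7]

counts = Counter(responses)   # value -> number of times it occurs
n = len(responses)

# Print each value with its frequency and its share of all responses.
for value in sorted(counts):
    share = 100 * counts[value] / n
    print(f"{value}  {counts[value]:2d}  {share:5.1f}%")
```

Dividing each count by the number of responses turns the frequency table into a relative frequency distribution.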

Normal Distribution

[Figure: normal curve of IQ scores, with two values a and b marked on the horizontal axis]

The total area under the curve is equal to 1, i.e. it takes in all observations

The area of a region under the normal distribution between any two values equals the probability of observing a value in that range when an observation is randomly selected from the distribution

For example, on a single draw there is a 34% chance of selecting from the distribution a person with an IQ between 100 and 115
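The 34% figure can be checked numerically. The sketch below assumes IQ scores are normally distributed with mean 100 and standard deviation 15; those parameters are a common convention and are not stated on the slide.

```python
from statistics import NormalDist

# Assumption: IQ ~ Normal(mean=100, sd=15); the slide does not give these values.
iq = NormalDist(mu=100, sigma=15)

# Area under the curve between 100 and 115 = probability of a draw in that range.
p_100_to_115 = iq.cdf(115) - iq.cdf(100)
print(round(p_100_to_115, 4))
```

Under that assumption the range from 100 to 115 runs from the mean to one standard deviation above it, which is where the roughly 34% comes from.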

Normal Distributions

The curve is basically bell shaped, extending from −∞ to +∞.

It is symmetric, with scores more concentrated in the middle (i.e. around the mean) than in the tails.

The mean, median and mode coincide.

They differ in how spread out they are.

The area under each curve is 1.

The height of a normal distribution can be specified mathematically in terms of two parameters: the mean (μ) and the standard deviation (σ).
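For reference, the standard formula for that height (the normal probability density) is given below. It is not on the original slide; x denotes a score, μ the mean and σ the standard deviation.

```latex
f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)
```

A larger σ lowers the peak and widens the curve, while changing μ only shifts the curve left or right.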

Occur when one tail of the distribution is longer than the other.

Positive Skew
• Distributions have a long tail in the positive direction.
• Sometimes called "skewed to the right".
• More common than distributions with negative skew.
• E.g. distribution of income. Most people make under $80,000 a year, but some make quite a bit more, with a small number making many millions of dollars per year. The positive tail therefore extends out quite a long way, whereas the negative tail stops at zero.

Negative Skew
• Distributions have a long tail in the negative direction.
• Called "skewed to the left".
• E.g. GPA.

Skewed Distributions

• Kurtosis: how peaked a distribution is. A value of zero indicates a normal distribution, positive values indicate a more peaked distribution, and negative values indicate a flatter distribution.

[Figure: a peaked distribution compared with a flat distribution]
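As an illustration of the two shape measures, the sketch below computes skewness and kurtosis for the mouthwash responses using the plain moment-based definitions. This is an assumption about the formulas: statistical packages often apply small-sample corrections, so their output can differ slightly.

```python
# Mouthwash responses from the frequency table earlier in the deck.
responses = [1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4,
             5, 5, 5, 5, 5, 6, 6, 6, 7, 7]

n = len(responses)
mean = sum(responses) / n

def central_moment(k):
    """k-th central moment: average of (x - mean)**k over all scores."""
    return sum((x - mean) ** k for x in responses) / n

m2 = central_moment(2)                              # population variance
skewness = central_moment(3) / m2 ** 1.5            # 0 for a symmetric distribution
excess_kurtosis = central_moment(4) / m2 ** 2 - 3   # 0 for a normal distribution

print(round(skewness, 3), round(excess_kurtosis, 3))
```

The responses are symmetric around 4, so the skewness is zero, and the negative kurtosis value indicates a distribution somewhat flatter than the normal curve.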


– Central tendency

– Dispersion or variability: a quantitative measure of the degree to which scores in a distribution are spread out or are clustered together

Summary statistics

Measures of Central Tendency

• Mode: the number that occurs most often in a string (nominal data)

• Median: half of the responses fall above this point, half fall below this point (ordinal data)

• Mean: the average (interval/ratio data)
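The three measures can be computed directly with Python's standard library. The sketch below uses the mouthwash responses from the frequency table as sample data.

```python
from statistics import mean, median, mode

# Number of times per week consumers use mouthwash.
responses = [1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4,
             5, 5, 5, 5, 5, 6, 6, 6, 7, 7]

most_frequent = mode(responses)    # value that occurs most often
middle_value = median(responses)   # half the responses fall on each side
average = mean(responses)          # sum of the scores divided by their number

print(most_frequent, middle_value, average)
```

For this data set all three measures equal 4, as expected for a symmetric distribution.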

Mode: the most frequent category

users 25%
non-users 75%

Advantages:
• meaning is obvious
• the only measure of central tendency that can be used with nominal data.

Disadvantages:
• many distributions have more than one mode, i.e. are “multimodal”
• greatly subject to sample fluctuations
• therefore not recommended to be used as the only measure of central tendency.

Median: the middle observation of the data

Number of times per week consumers use mouthwash

1 1 2 2 2 3 3 3 3 3 4 4 4 4 4 4 4 5 5 5 5 5 6 6 6 7 7

[Figure: frequency distribution of mouthwash use per week, marking light users, heavy users, and the mode, median and mean]

The Mean (average value): the sum of all the scores divided by the number of scores.

a good measure of central tendency for roughly symmetric distributions

can be misleading in skewed distributions since it can be greatly influenced by extreme scores in which case other statistics such as the median may be more informative

Formula:

$\mu = \sum X / N$ (population)

$\bar{X} = \sum x_i / n$ (sample)

where $\mu$ and $\bar{X}$ are the population and sample means

and N and n are the number of scores.

Normal Distributions with different Means


• Minimum, Maximum, and Range (Highest value minus the lowest value)

• Variance

• Standard Deviation (A measure’s distance from the mean)

Measures of Dispersion or Variability

Distribution of Final Course Grades in MGMT 3220Y

Grade        F    D    C    B    A
Frequency    3   10   20   23   12

[Figure: bar chart of the grade frequencies, annotated with the range, −1 SD and +1 SD]

Variance

• The difference between an observed value and the mean is called the deviation from the mean

• The variance is the mean squared deviation from the mean

• i.e. you subtract each value from the mean, square each result and then take the average.

• Because it is squared it can never be negative

$\sigma^2 = \sum (\bar{x} - x_i)^2 / n$

Standard Deviation

• The standard deviation is the square root of the variance

• Thus the standard deviation is expressed in the same units as the variables
• Helps us to understand how clustered or spread the distribution is around the mean value.

$S = \sqrt{\sum (\bar{x} - x_i)^2 / n}$

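Measures of Dispersion

Suppose we are testing the new flavor of a fruit punch, rated on a scale from 1 (Dislike) to 5 (Like).

Data: 1. 3   2. 5   3. 3   4. 5   5. 3   6. 5

[Figure: dot plot of the six ratings]

$\bar{X} = 4$, $\sigma^2 = 1$, $S = 1$

$\sigma^2 = \sum (\bar{x} - x_i)^2 / n$    $S = \sqrt{\sum (\bar{x} - x_i)^2 / n}$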
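The figures for the six fruit punch ratings (3, 5, 3, 5, 3, 5) can be verified with a short sketch. It uses the population forms `pvariance` and `pstdev`, which divide by n as the formulas above do; the variable names are illustrative.

```python
from statistics import fmean, pstdev, pvariance

# Ratings from the six respondents (1 = Dislike, 5 = Like).
ratings = [3, 5, 3, 5, 3, 5]

mean_rating = fmean(ratings)   # sum of the scores divided by n
variance = pvariance(ratings)  # mean squared deviation from the mean (divides by n)
std_dev = pstdev(ratings)      # square root of the variance

print(mean_rating, variance, std_dev)
```

Note that `statistics.variance` and `statistics.stdev` divide by n − 1 instead, so they would give slightly larger values for the same data.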

Measures of Dispersion

[Figure: dot plot of a second set of six ratings]

$\bar{X} = 4.67$, $\sigma^2 = 0.22$, $S = 0.47$

Measures of Dispersion

[Figure: dot plot of a third set of six ratings]

$\bar{X} = 3$, $\sigma^2 = 4$, $S = 2$


Normal Distributions with different SD

• A statistical technique that involves tabulating the results of two or more variables simultaneously

• informs you how often each response was given
• Shows relationships among and between variables
• frequency distribution for each subgroup compared to the frequency distribution for the total sample

• must be nominally scaled

Cross Tabulation

Cross-tabulation

• Helps answer questions about whether two or more variables of interest are linked:
– Is the type of mouthwash user (heavy or light) related to gender?
– Is the preference for a certain flavor (cherry or lemon) related to the geographic region (north, south, east, west)?
– Is income level associated with gender?
• Cross-tabulation determines association not causality.

• The variable being studied is called the dependent variable or response variable.

• A variable that influences the dependent variable is called an independent variable.

Dependent and Independent Variables

Cross-tabulation

• Cross-tabulation of two or more variables is possible if the variables are discrete:
– The frequency of one variable is subdivided by the other variable categories.
• Generally a cross-tabulation table has:
– Row percentages
– Column percentages
– Total percentages
• Which one is better? DEPENDS on which variable is considered as independent.

Contingency Table

• A contingency table shows the conjoint distribution of two discrete variables

• This distribution represents the probability of observing a case in each cell
– Probability is calculated as:

$P = \frac{\text{Observed cases}}{\text{Total cases}}$

Cross tabulation

GROUPINC * Gender Crosstabulation

GROUPINC            Statistic              Female     Male      Total
income <= 5         Count                      10        9         19
                    % within GROUPINC       52.6%    47.4%     100.0%
                    % within Gender         55.6%    18.8%      28.8%
                    % of Total              15.2%    13.6%      28.8%
5 < income <= 10    Count                       5       25         30
                    % within GROUPINC       16.7%    83.3%     100.0%
                    % within Gender         27.8%    52.1%      45.5%
                    % of Total               7.6%    37.9%      45.5%
income > 10         Count                       3       14         17
                    % within GROUPINC       17.6%    82.4%     100.0%
                    % within Gender         16.7%    29.2%      25.8%
                    % of Total               4.5%    21.2%      25.8%
Total               Count                      18       48         66
                    % within GROUPINC       27.3%    72.7%     100.0%
                    % within Gender        100.0%   100.0%     100.0%
                    % of Total              27.3%    72.7%     100.0%
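The three kinds of percentages in the GROUPINC by Gender table can be derived from the counts alone. The sketch below is one possible way to do it in plain Python; the dictionary layout and the helper `pct` are illustrative choices, not the method used to produce the original output.

```python
# Counts from the crosstabulation: rows are income groups,
# columns are (Female, Male).
counts = {
    "income <= 5": (10, 9),
    "5 < income <= 10": (5, 25),
    "income > 10": (3, 14),
}

col_totals = [sum(row[j] for row in counts.values()) for j in range(2)]
grand_total = sum(col_totals)

def pct(part, whole):
    """Percentage rounded to one decimal place, as in the table."""
    return round(100 * part / whole, 1)

table = {}
for group, row in counts.items():
    row_total = sum(row)
    table[group] = {
        "row_pct": [pct(c, row_total) for c in row],              # % within GROUPINC
        "col_pct": [pct(c, t) for c, t in zip(row, col_totals)],  # % within Gender
        "total_pct": [pct(c, grand_total) for c in row],          # % of Total
    }
    print(group, table[group])
```

Row percentages answer "within this income group, what share is female or male?", while column percentages answer "within this gender, what share falls in this income group?".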

General Procedure for Hypothesis Test

1. Formulate H0 (null hypothesis) and H1 (alternative hypothesis)
2. Select appropriate test
3. Choose level of significance
4. Calculate the test statistic (SPSS)
5. Determine the probability associated with the statistic.
   • Determine the critical value of the test statistic.
6. a) Compare with the level of significance,
   b) Determine if the test statistic falls in the rejection region. (check tables)
7. Reject or do not reject H0
8. Draw a conclusion

1. Formulate H1 and H0

• The hypothesis the researcher wants to test is called the alternative hypothesis H1.

• The opposite of the alternative hypothesis is the null hypothesis H0 (the status quo) (no difference between the sample and the population, or between samples).

• The objective is to DISPROVE the null hypothesis.

• The Significance Level is the critical probability of choosing between the null hypothesis and the alternative hypothesis

2. Select Appropriate Test

• The selection of a proper test depends on:
– Scale of the data
   • nominal
   • interval
– the statistic you seek to compare
   • Proportions (percentages)
   • means
– the sampling distribution of such statistic
   • Normal distribution
   • t distribution
   • χ² distribution
– Number of variables
   • Univariate
   • Bivariate
   • Multivariate
– Type of question to be answered

Example

A tire manufacturer believes that men are more aware of their brand than women. To find out, a survey is conducted of 100 customers, 65 of whom are men and 35 of whom are women.

The question they are asked is: Are you aware of our brand: Yes or No. 50 of the men were aware and 15 were not, whereas 10 of the women were aware and 25 were not.

Are these differences significant?

            Men    Women    Total
Aware        50       10       60
Unaware      15       25       40
Total        65       35      100

1. Formulate H1 and H0

We want to know whether brand awareness is associated with gender. What are the hypotheses?

H0: There is no difference in brand awareness based on gender

H1: There is a difference in brand awareness based on gender

Chi-square test results are unstable if any expected cell count is lower than 5

2. Select Appropriate Test: χ² (Chi Square)

• Used to discover whether 2 or more groups of one variable (dependent variable) vary significantly from each other with respect to some other variable (independent variable).

• Are the two variables of interest associated:
– Do men and women differ with respect to product usage (heavy, medium, or light)?
– Is the preference for a certain flavor (cherry or lemon) related to the geographic region (north, south, east, west)?

H0: Two variables are independent (not associated)

H1: Two variables are not independent (associated)

• Must be nominal level, or, if interval or ratio, must be divided into categories

Awareness of Tire Manufacturer’s Brand (observed / estimated cell frequency)

            Men      Women    Total
Aware       50/39    10/21       60
Unaware     15/26    25/14       40
Total       65       35         100

Estimated cell frequency:

$E_{ij} = \frac{R_i C_j}{n}$

R_i = total observed frequency in the ith row
C_j = total observed frequency in the jth column
n = sample size
E_ij = estimated cell frequency
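The estimated cell frequencies in the awareness table follow directly from the row totals, column totals and sample size. A minimal sketch, with illustrative variable names:

```python
# Observed counts: rows are Aware / Unaware, columns are Men / Women.
observed = [[50, 10],
            [15, 25]]

row_totals = [sum(row) for row in observed]        # R_i
col_totals = [sum(col) for col in zip(*observed)]  # C_j
n = sum(row_totals)                                # sample size

# E_ij = R_i * C_j / n for every cell.
expected = [[r * c / n for c in col_totals] for r in row_totals]

print(expected)
```

The result matches the second number in each cell of the table: 39 and 21 for the aware row, 26 and 14 for the unaware row.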

3. Choose Level of Significance

Whenever we draw inferences about a population, there is a risk that an incorrect conclusion will be reached.

The real question is how strong the evidence in favor of the alternative hypothesis must be to reject the null hypothesis.

The significance level states the probability of rejecting H0 when in fact it is true.

In this example an error would be committed if we said that there is a difference between men and women with respect to brand awareness when in fact there was no difference, i.e. we have rejected the null hypothesis when it is in fact true.

This error is commonly known as a Type I error. The value of α is called the significance level of the test.

• Significance Level selected is typically .05 or .01

• i.e. 5% or 1%

• In other words, we are willing to accept the risk that 5% (or 1%) of the time the results we get indicate that we should reject the null hypothesis when it is in fact true.

• 5% (or 1%) of the time we are willing to commit a Type I error

• stating there is a difference between men and women with respect to brand awareness when in fact there is no difference

3. Choose Level of Significance

• We commit a Type II error when we incorrectly accept a null hypothesis that is false. The probability of committing a Type II error is denoted by β.

• In our example we commit a Type II error when we say that there is NO difference between men and women with respect to brand awareness (we accept the null hypothesis) when in fact there is.

Type I and Type II Errors

                 Accept null            Reject null
Null is true     Correct (no error)     Type I error
Null is false    Type II error          Correct (no error)

Which is worse?

• Both are serious, but traditionally Type I error has been considered more serious; that is why the objective of hypothesis testing is to reject H0 only when there is enough evidence to support rejecting it.

• Therefore, we choose α to be as small as possible without compromising β (the probability of accepting H0 when it is false).

• Increasing the sample size for a given α will decrease β (i.e. the probability of accepting the null hypothesis when it is in fact false)



Chi-Square Test

Chi-square statistic:

$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$

χ² = chi-square statistic
O_i = observed frequency in the ith cell
E_i = expected frequency in the ith cell

Estimated cell frequency:

$E_{ij} = \frac{R_i C_j}{n}$

Degrees of Freedom

d.f. = (R − 1)(C − 1)

The degrees of freedom are the number of values in the final calculation of a statistic that are free to vary.

For example, to calculate the standard deviation of a random sample, we must first calculate the mean of that sample and then compute the sum of the squared deviations from that mean.

While there will be n such squared deviations, only (n − 1) of them are free to assume any value whatsoever.

This is because the final squared deviation from the mean must include the one value of X such that the sum of all the Xs divided by n will equal the obtained mean of the sample.

All of the other (n − 1) squared deviations from the mean can, theoretically, have any values whatsoever.

$\chi^2 = \frac{(50-39)^2}{39} + \frac{(10-21)^2}{21} + \frac{(15-26)^2}{26} + \frac{(25-14)^2}{14}$

$\chi^2 = 3.102 + 5.762 + 4.654 + 8.643 = 22.161$

d.f. = (R − 1)(C − 1) = (2 − 1)(2 − 1) = 1

4. Calculate the Test Statistic
Chi-Square Test: Differences Among Groups
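The chi-square calculation for the brand awareness example can be sketched in a few lines. The observed and expected counts come from the awareness table, and 3.84 is the critical value quoted in these slides for 1 degree of freedom at the .05 level; the variable names are illustrative.

```python
# Cells in the order: aware men, aware women, unaware men, unaware women.
observed = [50, 10, 15, 25]
expected = [39, 21, 26, 14]

# Sum of (O - E)^2 / E over all cells.
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

rows, cols = 2, 2
df = (rows - 1) * (cols - 1)   # degrees of freedom
critical_value = 3.84          # chi-square, 1 d.f., .05 level (from the slides)

print(round(chi_square, 2), df, chi_square > critical_value)
```

Because the statistic is far above the critical value, the sample falls in the rejection region.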


5. Determine the Probability-value (Critical Value)

• The p-value is the probability of seeing a random sample at least as extreme as the sample observed, given that the null hypothesis is true.
• Given the value of alpha, we use statistical theory to determine the rejection region.
• If the sample falls into this region we reject the null hypothesis; otherwise, we accept it.
• Sample evidence that falls into the rejection region is called statistically significant at the alpha level.

A combination is the selection of a certain number of objects taken from a group of objects without regard to order. We use the symbol $\binom{5}{3}$ (5 choose 3) to indicate that we have five objects taken three at a time, without regard to order.

To calculate the possible number of combinations, the formula used is

$\binom{5}{3} = \frac{5 \times 4 \times 3 \times 2 \times 1}{(3 \times 2 \times 1) \times (2 \times 1)} = \frac{120}{12} = 10$

If we choose a sample of 5 from a total of 20 there are 15,504 possible combinations.

If we took the means of some measurement for each of the possible combinations, those means would form an approximately normal distribution.

COMBINATIONS
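Both counts can be checked with Python's standard library. The sketch below computes 5 choose 3 from factorials, as in the worked formula, and compares it with the built-in `math.comb`.

```python
from math import comb, factorial

# 5 choose 3 written out with factorials: 5! / (3! * 2!)
five_choose_three = factorial(5) // (factorial(3) * factorial(2))

# The built-in helper gives the same count directly.
samples_of_5_from_20 = comb(20, 5)

print(five_choose_three, comb(5, 3), samples_of_5_from_20)
```

The general pattern is n! / (k! (n − k)!), which is what `comb(n, k)` evaluates.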

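[Figure: distribution centered at 0 with rejection regions of area α/2 beyond the critical values −2.023 and 2.023, and a test statistic of 2.816]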

A critical value is the value that a test statistic must exceed in order for the null hypothesis to be rejected.

For example, the critical value of t (with 12 degrees of freedom using the .05 significance level) is 2.18.

This means that for the probability value to be less than or equal to .05, the absolute value of the t statistic must be 2.18 or greater.

Critical value


Significance from p-values -- continued

• How small is a “small” p-value? This is largely a matter of semantics, but if the
– p-value is less than 0.01, it provides “convincing” evidence that the alternative hypothesis is true;
– p-value is between 0.01 and 0.05, there is “strong” evidence in favor of the alternative hypothesis;
– p-value is between 0.05 and 0.10, it is in a “gray area”;
– p-values greater than 0.10 are interpreted as weak or no evidence in support of the alternative.

Chi-square Test for Independence

Under H0, the test statistic is approximately distributed as a Chi-square distribution (χ²).

[Figure: Chi-square (χ²) distribution with the "Reject H0" region to the right of 3.84; the computed statistic is 22.16]

χ² with 1 d.f. at .05: critical value = 3.84

5. Determine the Probability-value (Critical Value)
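For the special case of 1 degree of freedom, the p-value can be sketched without statistical tables, because a chi-square variable with 1 d.f. is the square of a standard normal variable. The helper name `chi_square_p_value_1df` is hypothetical, and the shortcut does not apply to other degrees of freedom.

```python
from math import sqrt
from statistics import NormalDist

def chi_square_p_value_1df(x):
    """Upper-tail probability P(chi-square > x) for 1 degree of freedom.

    Uses the identity chi-square(1 d.f.) = Z**2, so the tail area equals
    the two-sided tail area of a standard normal beyond sqrt(x).
    """
    z = sqrt(x)
    return 2 * (1 - NormalDist().cdf(z))

p_critical = chi_square_p_value_1df(3.84)    # the tabled critical value
p_observed = chi_square_p_value_1df(22.16)   # the statistic from the example

print(round(p_critical, 3), p_observed < 0.001)
```

The critical value 3.84 corresponds to a tail area of about .05, and the observed statistic of 22.16 gives a p-value well below .001.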

6 a) Compare with the level of significance, b) Determine if the test statistic falls in the rejection region. (check tables)

22.16 is greater than 3.84 and falls in the rejection area

In fact it is significant at the .001 level, which means that if our variables were independent, the chance of picking a sample at least this extreme would be less than 1/1000

Or, in other words, if the null hypothesis is true, the chance of committing a Type I error with a result this extreme is less than .1%

i.e. there is less than a .1% chance that we reject the null hypothesis (that there is no difference between men and women with respect to brand awareness) when it is in fact true.

7 Reject or do not reject H0

Since 22.16 is greater than 3.84 we reject the null hypothesis

8 Draw a conclusion

Men and women differ with respect to brand awareness; specifically, men are more brand aware than women