Introduction to Data Analysis

What is a Statistic?

Statistic
• Measure based on a sample
• Different from a parameter

Parameter
• Measure taken from a population
• “True measure” (Reality)
• Sometimes theoretical in nature

• Since we most often use statistics, not parameters, our goal in data analysis is to determine, with a fair degree of certainty, how closely our statistics represent the population measures (parameters)

What are Statistics?

• Mathematical methods to collect, organize, summarize and analyze data
• Provide valid and reliable results only when the data collection and research methods follow established scientific procedures
• In other words…
  • The results are only as good as the data
  • The data are only as good as the measurement tool and the sampling procedures
  • You cannot fix bad methods with good data analysis
• Note: "data" is plural

Basic Types of Statistics

• Descriptive Statistics
• Summary Statistics
• Central Tendency
• Dispersion

Descriptive Statistics

• Condense data sets to allow for easier interpretation
• Used to describe the data
• Allow researchers to take random data and organize them into some type of order
  • Raw Frequencies/Distribution
  • Percentages

How do you make sense of this?

Distributions

• Collection of numbers
• Range of frequencies – counting how many subjects fall into particular variable levels

Distributions

Raw scores (excerpt)

Test 1   Midterm   Test 2
100      192       94
88       172       88
97       188       91
92       164       85
109      184       100
97       184       79
97       180       91
106      188       91
88       176       103
91       196       94
97       180       106
109      184       94
94       156       91
100      192       97
73       152       88
97       156       106

Test 1 (valid scores)

Score   Frequency   Percent   Valid Percent   Cumulative Percent
52      1           1.5       1.5             1.5
64      1           1.5       1.5             2.9
73      3           4.4       4.4             7.4
74      1           1.5       1.5             8.8
76      4           5.9       5.9             14.7
79      2           2.9       2.9             17.6
82      4           5.9       5.9             23.5
85      5           7.4       7.4             30.9
87      1           1.5       1.5             32.4
88      4           5.9       5.9             38.2
91      6           8.8       8.8             47.1
92      1           1.5       1.5             48.5
94      4           5.9       5.9             54.4
97      9           13.2      13.2            67.6
100     10          14.7      14.7            82.4
103     6           8.8       8.8             91.2
106     3           4.4       4.4             95.6
109     3           4.4       4.4             100.0
Total   68          100.0     100.0

How many tests achieved these grades?

What % of the total sample does this represent?
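The counting behind a frequency table can be sketched in a few lines of Python. This is only an illustration, assuming the sixteen Test 1 scores visible in the raw-data excerpt (not the full sample of 68); the function name `frequency_table` is made up for this example.

```python
from collections import Counter

# Test 1 scores from the raw-data excerpt (16 of the 68 students).
test1_excerpt = [100, 88, 97, 92, 109, 97, 97, 106,
                 88, 91, 97, 109, 94, 100, 73, 97]


def frequency_table(scores):
    """Return rows of (score, frequency, percent, cumulative percent)."""
    counts = Counter(scores)
    total = len(scores)
    rows = []
    running = 0  # running count used for the cumulative percent
    for score in sorted(counts):
        running += counts[score]
        rows.append((score,
                     counts[score],
                     100 * counts[score] / total,
                     100 * running / total))
    return rows


for score, freq, pct, cum_pct in frequency_table(test1_excerpt):
    print(f"{score:>5} {freq:>3} {pct:6.1f} {cum_pct:6.1f}")
```

The frequency column answers "how many tests achieved these grades?" and the percent column answers "what % of the total sample does this represent?".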

Central Tendency Statistics

• Answers the question: “What is a typical score?” (or, What is the tendency of the data?)
• Mean – Average score (Sum / # of scores)
• Median – Midpoint of the scores
• Mode – Most frequently occurring score

Mean

• Average of all available scores
• Takes all values into account, so especially sensitive to extreme scores (outliers)
• Only measure of central tendency (CT) that can be defined algebraically
• Good for interval and ratio level data

Descriptive Statistics

                     N    Mean
Test 1               68   91.37
Midterm              68   171.15
Test 2               68   89.94
Valid N (listwise)   68
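As a rough check, the Test 1 mean can be recomputed from the Test 1 frequency table shown earlier in these slides. The sketch below assumes that table was read correctly (score mapped to number of students); the variable names are illustrative only.

```python
# Test 1 frequency table from the slides: score -> number of students.
test1_freq = {52: 1, 64: 1, 73: 3, 74: 1, 76: 4, 79: 2, 82: 4, 85: 5,
              87: 1, 88: 4, 91: 6, 92: 1, 94: 4, 97: 9, 100: 10,
              103: 6, 106: 3, 109: 3}

n = sum(test1_freq.values())                      # number of scores
total = sum(score * count for score, count in test1_freq.items())
mean = total / n                                  # Sum / # of scores

print(n, round(mean, 2))  # 68 91.37

# Sensitivity to extreme scores: drop the single lowest score (52)
# and the mean moves upward.
mean_without_lowest = (total - 52) / (n - 1)
print(round(mean_without_lowest, 2))
```

The second calculation illustrates why the mean is described as sensitive to outliers: removing one extreme score shifts it.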

Median

• Midpoint of the distribution
  • If odd # of scores → middle score
  • If even # of scores → ½ way between the two middle scores
• To find the median…
  • Sort scores
  • Count mid-way
• Good with ordinal, interval and ratio data

Example (30 sorted scores):
73 73 76 76 76 82 82 82 85 85 85 85 88 88 91
91 91 91 91 91 91 94 94 97 97 97 97 97 100 100
Median = 91 (midway between the 15th and 16th scores, both 91)
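The sort-then-count-midway rule can be written out directly. This is a minimal sketch, assuming the 30 example scores were read correctly from the slide; the hand-written `median` function is for illustration and is compared against Python's standard `statistics.median`.

```python
import statistics

# The 30 example scores from the slide.
scores = [73, 73, 76, 76, 76, 82, 82, 82, 85, 85, 85, 85, 88, 88, 91,
          91, 91, 91, 91, 91, 91, 94, 94, 97, 97, 97, 97, 97, 100, 100]


def median(values):
    """Sort the scores, then count mid-way."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        # Odd number of scores: the middle score.
        return ordered[mid]
    # Even number of scores: halfway between the two middle scores.
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median(scores))             # 91.0
print(statistics.median(scores))  # same result from the standard library
```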

Mode

• Most frequently occurring score in a distribution
• Focuses attention on only one possible score
• Only way of summarizing nominal data, also works for ordinal, interval or ratio

Chart: Level of Mass Media/Market Research Knowledge
18%   I know nothing about this topic
44%   I know very little about this topic
37%   I know some information about this topic
2%    I know a great deal of information about this topic
0%    I am an expert on this topic
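Because the mode only needs counting, it works on nominal labels as well as numbers. The sketch below is illustrative only and borrows the eye-color counts that appear in the Chi-Square slides later in this deck (20 brown, 30 blue, 21 green).

```python
from collections import Counter

# Nominal data: eye colors, using the counts from the Chi-Square example.
eye_colors = ["Brown"] * 20 + ["Blue"] * 30 + ["Green"] * 21

counts = Counter(eye_colors)
mode, frequency = counts.most_common(1)[0]  # most frequent category

print(mode, frequency)  # Blue 30
```

No mean or median could be computed for these labels, which is why the mode is the only summary available for nominal data.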

Dispersion Statistics

• Answers the question: “How are scores spread around the central point?” (or, How much variability do I have in my data?)
• Range – Space between highest and lowest scores
• Standard Deviation – How far any one score is from the central tendency
• Variance – Degree to which scores deviate from the mean

Range

• Difference between highest and lowest scores in a distribution
• Not particularly descriptive
• Range naturally increases with larger samples, due to the tendency to include outliers

Example (30 sorted scores):
73 73 76 76 76 82 82 82 85 85 85 85 88 88 91
91 91 91 91 91 91 94 94 97 97 97 97 97 100 100
Range: 100 - 73 = 27

Standard Deviation

• Distance of a given score from the mean of a distribution
• Roughly, the average distance of scores from the mean of the data set

[Chart: Test 1 Scores, with the mean and the standard deviation marked]

Descriptive Statistics

                     N    Mean    Std. Deviation
Test 1               68   91.37   11.543
Valid N (listwise)   68

Variance

• Mathematical index of the degree to which scores deviate from, or are at variance with, the mean
• How wide or narrow is your distribution?
• Variance = SD²
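The three dispersion measures can be recomputed from the Test 1 frequency table shown earlier in these slides. This is a sketch under the assumption that the table was read correctly. It uses the sample formulas (n - 1 in the denominator) from Python's `statistics` module, which is the version that reproduces the reported Std. Deviation of 11.543.

```python
import math
import statistics

# Test 1 frequency table from the slides: score -> number of students.
test1_freq = {52: 1, 64: 1, 73: 3, 74: 1, 76: 4, 79: 2, 82: 4, 85: 5,
              87: 1, 88: 4, 91: 6, 92: 1, 94: 4, 97: 9, 100: 10,
              103: 6, 106: 3, 109: 3}

# Expand the table back into the 68 individual scores.
scores = [score for score, count in test1_freq.items() for _ in range(count)]

score_range = max(scores) - min(scores)   # highest minus lowest
sd = statistics.stdev(scores)             # sample standard deviation
variance = statistics.variance(scores)    # sample variance

print(score_range)         # 57
print(round(sd, 3))        # 11.543
print(round(variance, 2))  # 133.25
print(math.isclose(variance, sd ** 2))  # True: Variance = SD squared
```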

Basic Statistical Procedures

Purposes of Statistics (Tukey, 1986)

• To aid in summarization
• To aid in “getting what is going on”
• To aid in extracting “information” from the data
• To aid in communication

Parametric or Nonparametric

• Statistical methods are commonly divided into two broad categories…

Parametric
• Appropriate for interval and ratio data
• Only possible way to generalize findings to the population
• Parametric statistics assume normal distribution of the population parameters

Nonparametric
• Appropriate only with nominal and ordinal level data
• Results cannot be generalized to the population
• Make no assumption about normality

Nonparametric: Chi-Square

• Tests for “goodness of fit”
  • How well do the data fit into assigned categories?
• Common in MMR
• Used to compare the observed frequencies of a phenomenon with the frequencies that might be expected or hypothesized
• Symbol = χ²

Nonparametric: Chi-Square

• Less and less helpful the more nominal categories you have (cells)
• Tells you if the distribution of students’ eye colors deviates from the expected
  • Observed – Expected (Default is that they are equal across groups)
• Cannot conclude that men are more likely than women to have blue eyes
  • Need a follow-up test

Eye Color

        Observed N   Expected N   Residual
Brown   20           23.7         -3.7
Blue    30           23.7         6.3
Green   21           23.7         -2.7
Total   71

Test Statistics

               Eye Color
Chi-Square(a)  2.563
df             2
Asymp. Sig.    .278
a. 0 cells (.0%) have expected fr…
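The Chi-Square value in the table can be reproduced by hand from the observed counts. The sketch below assumes the default of equal expected counts across the three eye colors. To stay within the standard library, the significance level uses a shortcut that is valid only when df = 2 (the upper-tail probability is then exp(-x / 2)); for other degrees of freedom a statistics package would be needed.

```python
import math

# Observed eye-color counts from the slide.
observed = {"Brown": 20, "Blue": 30, "Green": 21}

n = sum(observed.values())
expected = n / len(observed)  # equal across groups: 71 / 3, about 23.7

# Sum of (Observed - Expected)^2 / Expected over the categories.
chi_square = sum((o - expected) ** 2 / expected for o in observed.values())
df = len(observed) - 1

# Shortcut valid for df = 2 only: upper-tail probability is exp(-x / 2).
p_value = math.exp(-chi_square / 2)

print(round(chi_square, 3), df, round(p_value, 3))  # 2.563 2 0.278
```

With a significance level of .278, the observed eye-color distribution does not deviate reliably from equal counts.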

Nonparametric: Cross-Tabulations

• aka: Cross-Tabs
• Still compares frequencies, but now along two dimensions
  • Two or more variables tested simultaneously
• Still use Chi-Square goodness of fit
• Test distribution of eye color among men and women to investigate difference

Nonparametric: Cross-Tabulations

Gender * Eye Color Crosstabulation

                                       Eye Color
Gender                        Brown    Blue     Green    Total
Male     Count                14       17       6        37
         Expected Count       10.4     15.6     10.9     37.0
         % within Gender      37.8%    45.9%    16.2%    100.0%
         % within Eye Color   70.0%    56.7%    28.6%    52.1%
Female   Count                6        13       15       34
         Expected Count       9.6      14.4     10.1     34.0
         % within Gender      17.6%    38.2%    44.1%    100.0%
         % within Eye Color   30.0%    43.3%    71.4%    47.9%
Total    Count                20       30       21       71
         Expected Count       20.0     30.0     21.0     71.0
         % within Gender      28.2%    42.3%    29.6%    100.0%
         % within Eye Color   100.0%   100.0%   100.0%   100.0%
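The Expected Count rows follow from the row and column totals: expected = row total × column total / grand total. The sketch below reproduces them from the observed counts and then forms the Chi-Square statistic for the whole table. The statistic and its significance level are computed here and are not reported in the slides; the significance shortcut is again valid only because df = 2.

```python
import math

# Observed counts from the crosstab: rows = Male, Female;
# columns = Brown, Blue, Green.
observed = [[14, 17, 6],
            [6, 13, 15]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count for each cell: row total * column total / grand total.
expected = [[r * c / n for c in col_totals] for r in row_totals]
print([[round(e, 1) for e in row] for row in expected])
# [[10.4, 15.6, 10.9], [9.6, 14.4, 10.1]]

chi_square = sum((o - e) ** 2 / e
                 for obs_row, exp_row in zip(observed, expected)
                 for o, e in zip(obs_row, exp_row))
df = (len(observed) - 1) * (len(observed[0]) - 1)

# Shortcut valid for df = 2 only: upper-tail probability is exp(-x / 2).
p_value = math.exp(-chi_square / 2)
print(round(chi_square, 2), df, round(p_value, 3))
```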

Parametric: T-Test

• Tests for differences between two groups along some dimension
  • Test-and-control for treatment effects or between-group differences
• Most elementary method for comparing two groups’ mean scores
• Assumes that the variables in the populations from which the samples are drawn are normally distributed
• Only works with 2 groups

Parametric: T-Test

• Do men or women score better on MMC3420 tests?

Group Statistics

Test 1
Gender   N    Mean    Std. Deviation   Std. Error Mean
Male     34   90.38   10.985           1.884
Female   34   92.35   12.160           2.085
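The group statistics are enough to sketch the t statistic itself. The example below first rebuilds the Std. Error Mean column (SD / √N), then divides the difference in means by the standard error of that difference. It uses the unpooled form, which gives the same t as the pooled form here because both groups have N = 34. No p-value is computed, since that needs the t distribution from a statistics package, and the slides do not report the test result.

```python
import math

# Group statistics for Test 1, taken from the slide.
male = {"n": 34, "mean": 90.38, "sd": 10.985}
female = {"n": 34, "mean": 92.35, "sd": 12.160}


def standard_error(group):
    """Standard error of the mean: SD / sqrt(N)."""
    return group["sd"] / math.sqrt(group["n"])


se_male = standard_error(male)
se_female = standard_error(female)
print(round(se_male, 3), round(se_female, 3))  # 1.884 2.085

# Standard error of the difference between the two means (unpooled).
se_difference = math.sqrt(se_male ** 2 + se_female ** 2)
t_statistic = (male["mean"] - female["mean"]) / se_difference

print(round(t_statistic, 2))  # -0.7
```

A t value this close to zero suggests the two-point gap between the group means is small relative to the spread of scores within each group.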

Parametric: Analysis of Variance

• ANOVA
• Like T-Test, but can be used with more than two groups
• Tests whether the differences between groups are greater than the differences within the groups
• Can also be used to test interaction effect of multiple independent variables

Parametric: Analysis of Variance

• Do test scores vary by eye color?

Eye Color
Dependent Variable: Test 1

                                   95% Confidence Interval
Eye Color   Mean     Std. Error   Lower Bound   Upper Bound
Brown       93.250   2.602        88.054        98.446
Blue        91.207   2.161        86.892        95.522
Green       89.632   2.669        84.301        94.962
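The between-groups versus within-groups comparison can be sketched as an F ratio. The individual scores per eye color are not given in the slides, so the three small groups below are hypothetical and exist only to show the arithmetic; `one_way_anova_f` is an illustrative name, not a library function.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of score lists."""
    all_scores = [x for group in groups for x in group]
    grand_mean = sum(all_scores) / len(all_scores)
    group_means = [sum(group) / len(group) for group in groups]

    # Between-groups variation: how far each group mean sits from the
    # grand mean, weighted by group size.
    ss_between = sum(len(group) * (m - grand_mean) ** 2
                     for group, m in zip(groups, group_means))
    # Within-groups variation: how far scores sit from their own group mean.
    ss_within = sum((x - m) ** 2
                    for group, m in zip(groups, group_means)
                    for x in group)

    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within


# Hypothetical Test 1 scores by eye color (NOT from the slides).
brown = [95, 88, 100, 91]
blue = [90, 85, 97, 93]
green = [84, 92, 89, 94]

f_ratio, df_b, df_w = one_way_anova_f([brown, blue, green])
print(round(f_ratio, 2), df_b, df_w)
```

An F ratio near 1 means the groups differ about as much as scores differ inside each group; larger values point toward real between-group differences.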

Parametric: Correlations

• Tests for the strength of relationship between two variables (interval or ratio)
• Degree to which variables change in relationship to one another
• The higher the number (in absolute value), the stronger the relationship


[Figure: example scatterplots labeled Strong Positive, Weak Positive, Flat (No Relationship), Weak Negative, Strong Negative]

• Are grades on Test 1 related to grades on Test 2?

Correlations

                               Test 1   Test 2
Test 1   Pearson Correlation   1        .448**
         Sig. (2-tailed)                .000
         N                     68       68
Test 2   Pearson Correlation   .448**   1
         Sig. (2-tailed)       .000
         N                     68       68
**. Correlation is significant at the 0.01 level (2-tailed).
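The Pearson correlation can be written out from its definition. The sketch below applies it to the sixteen Test 1 / Test 2 pairs visible in the raw-data excerpt earlier in these slides. Because that is only part of the sample, the result is not expected to equal the .448 reported for all 68 students; `pearson_r` is an illustrative helper name.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: co-variation divided by total variation."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return co_variation / math.sqrt(ss_x * ss_y)


# Sixteen score pairs from the raw-data excerpt (not the full sample).
test1 = [100, 88, 97, 92, 109, 97, 97, 106, 88, 91, 97, 109, 94, 100, 73, 97]
test2 = [94, 88, 91, 85, 100, 79, 91, 91, 103, 94, 106, 94, 91, 97, 88, 106]

r = pearson_r(test1, test2)
print(round(r, 3))  # between -1 and +1; sign shows the direction
```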

Note: Correlation and Causation

• Just because variables are related does not mean that one variable causes another
• Correlations do NOT tell you the causal direction of the relationship between two variables

Parametric: Regression

• Related to correlations and ANOVA
• But goes a step further: models one variable as a predictor of another (establishing causation still depends on the research design, not on the statistic alone)
• One of the only ways to use data to predict the nature of relationships
• Tests the degree to which a change in one variable is accompanied by a predictable change in another variable
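The "predictable rate" in a simple regression is the slope of a fitted line. The sketch below fits a least-squares line predicting Test 2 from Test 1, using the sixteen score pairs visible in the raw-data excerpt earlier in these slides (not the full sample). `fit_line` is an illustrative helper, and the fitted numbers describe only this excerpt.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for predicting y from x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Sixteen score pairs from the raw-data excerpt (not the full sample).
test1 = [100, 88, 97, 92, 109, 97, 97, 106, 88, 91, 97, 109, 94, 100, 73, 97]
test2 = [94, 88, 91, 85, 100, 79, 91, 91, 103, 94, 106, 94, 91, 97, 88, 106]

slope, intercept = fit_line(test1, test2)

# Predicted Test 2 score for a student who scored 100 on Test 1.
predicted = intercept + slope * 100
print(round(slope, 3), round(intercept, 1), round(predicted, 1))
```

The slope is the expected change in Test 2 for each additional point on Test 1. It describes a pattern in the data and does not by itself show that one score causes the other.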

How do I know what statistic to use?

Visual Data Representation

Histogram

• Displays frequencies occurring within a specified range

Normal Curve

• Scores are normally distributed around the mean

Skewness

• Concentration of scores around a particular point on the x-axis
• Direction of tail indicates direction of skew

Frequency Chart

• Histogram data, but turned on its side
• Useful when levels of variable or characteristics are interrelated

Chart: Level of Mass Media/Market Research Knowledge
18%   I know nothing about this topic
44%   I know very little about this topic
37%   I know some information about this topic
2%    I know a great deal of information about this topic
0%    I am an expert on this topic

Waterfall Chart

• Visually represents frequencies, but ordered from most frequent to least frequently occurring characteristics
• Helpful for checklist data or data where there is no concern about relationship between characteristics

Chart: Interest in MMR Topics (% Extremely Interested / % Very Interested)
43%   Internet Research
33%   Online Surveys
32%   Non-Traditional Research Methods
30%   Focus Groups
29%   Research Ethics
29%   Observational Methods/Ethnography
27%   Research Vendors/Suppliers
25%   Qualitative Interviewing
25%   Reading and Interpreting Research Reports
21%   Study Design
16%   Paper-and-Pencil Surveys

Scatterplot

• Shows the relationship between two variables
• The more tightly clustered the dots, the more closely related the two variables
• Easiest way to show correlations

[Chart: Relationship Between Test Scores; x-axis: Test 1 Scores, y-axis: Test 2 Scores]
