Introduction to Data Analysis
What is a Statistic?
Statistic
• Measure based on a sample
• Different than a parameter
Parameter
• Measure taken from a population
• “True measure” (Reality)
• Sometimes theoretical in nature
• Since we most often use statistics, not parameters, our goal in data analysis is to determine, with a fair degree of certainty, how closely our statistics represent the population measures (parameters)
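As a rough illustration of the statistic/parameter distinction, the sketch below draws a sample of 68 scores from a simulated population and compares the two means. The population itself (10,000 scores centered near 90) is an assumption made up for the example, not data from these slides.

```python
import random
import statistics

# Hypothetical population: 10,000 test scores (assumed values, not from the slides).
rng = random.Random(0)
population = [rng.gauss(90, 12) for _ in range(10_000)]

# Parameter: a measure taken from the whole population.
parameter_mean = statistics.mean(population)

# Statistic: the same measure taken from a sample of 68 people.
sample = rng.sample(population, 68)
statistic_mean = statistics.mean(sample)

print(f"population mean (parameter): {parameter_mean:.2f}")
print(f"sample mean (statistic):     {statistic_mean:.2f}")
print(f"difference:                  {statistic_mean - parameter_mean:.2f}")
```

The gap between the two printed means is the kind of uncertainty that data analysis tries to quantify.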
What are Statistics?
• Mathematical methods to collect, organize, summarize and analyze data
• Provide valid and reliable results only when the data collection and research methods follow established scientific procedures
• In other words…
  • The results are only as good as the data
  • The data are only as good as the measurement tool and the sampling procedures
  • You cannot fix bad methods with good data analysis
• Note: Data - Plural
Basic Types of Statistics
• Descriptive Statistics
• Summary Statistics
• Central Tendency
• Dispersion
Descriptive Statistics
• Condense data sets to allow for easier interpretation
• Used to describe the data
• Allow researchers to take random data and organize them into some type of order
  • Raw Frequencies/Distribution
  • Percentages
How do you make sense of this?
Distributions
• Collection of numbers
• Range of frequencies – counting how many subjects fall into particular variable levels
Test 1   Midterm   Test 2
100      192       94
88       172       88
97       188       91
92       164       85
109      184       100
97       184       79
97       180       91
106      188       91
88       176       103
91       196       94
97       180       106
109      184       94
94       156       91
100      192       97
73       152       88
97       156       106

Test 1
Score   Frequency   Percent   Valid Percent   Cumulative Percent
52      1           1.5       1.5             1.5
64      1           1.5       1.5             2.9
73      3           4.4       4.4             7.4
74      1           1.5       1.5             8.8
76      4           5.9       5.9             14.7
79      2           2.9       2.9             17.6
82      4           5.9       5.9             23.5
85      5           7.4       7.4             30.9
87      1           1.5       1.5             32.4
88      4           5.9       5.9             38.2
91      6           8.8       8.8             47.1
92      1           1.5       1.5             48.5
94      4           5.9       5.9             54.4
97      9           13.2      13.2            67.6
100     10          14.7      14.7            82.4
103     6           8.8       8.8             91.2
106     3           4.4       4.4             95.6
109     3           4.4       4.4             100.0
Total   68          100.0     100.0
How many tests achieved these grades?
What % of the total sample does this represent?
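A frequency table like the one above can be built by counting each score and converting the counts to percentages. The sketch below is one possible way to do it, using only the first six Test 1 scores listed in the data table (an excerpt, so the numbers will not match the full 68-score output).

```python
from collections import Counter

# Test 1 scores from the first six rows of the data table (an excerpt, not all 68).
scores = [100, 88, 97, 92, 109, 97]

counts = Counter(scores)
total = len(scores)

rows = []          # (score, frequency, percent, cumulative percent)
cumulative = 0.0
for score in sorted(counts):
    percent = 100 * counts[score] / total
    cumulative += percent
    rows.append((score, counts[score], percent, cumulative))

print(f"{'Score':>5} {'Frequency':>9} {'Percent':>8} {'Cumulative':>10}")
for score, freq, percent, cum in rows:
    print(f"{score:>5} {freq:>9} {percent:>8.1f} {cum:>10.1f}")
```

The frequency column answers "how many tests achieved this grade?" and the percent column answers "what % of the sample does this represent?".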
Central Tendency Statistics
• Answers the question: “What is a typical score?” (or, What is the tendency of the data?)
• Mean – Average score (Sum / # of scores)
• Median – Midpoint of the scores
• Mode – Most frequently occurring score
Mean
• Average of all available scores
• Takes all values into account, so especially sensitive to extreme scores (outliers)
• Only measure of CT that can be defined algebraically
• Good for interval and ratio level data
Descriptive Statistics
                     N    Mean
Test 1               68   91.37
Midterm              68   171.15
Test 2               68   89.94
Valid N (listwise)   68
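To see why the mean is sensitive to extreme scores, the sketch below computes it before and after adding one outlier and compares it with the median. The five scores and the outlier are hypothetical values chosen only for illustration.

```python
import statistics

# Hypothetical scores, chosen only to illustrate the point.
scores = [85, 88, 91, 94, 97]
with_outlier = scores + [20]   # one extreme low score

mean_before = statistics.mean(scores)        # sum / number of scores
mean_after = statistics.mean(with_outlier)
median_before = statistics.median(scores)
median_after = statistics.median(with_outlier)

print(f"mean:   {mean_before} -> {mean_after:.2f}")
print(f"median: {median_before} -> {median_after}")
```

A single outlier pulls the mean down by roughly twelve points here, while the median moves by only a point and a half.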
Median
• Midpoint of the distribution
• If odd # of scores → middle score
• If even # of scores → ½ way between two scores
• To find the median…
  • Sort scores
  • Count mid-way
• Good with ordinal, interval and ratio
Sorted scores: 73 73 76 76 76 82 82 82 85 85 85 85 88 88 91 91 91 91 91 91 91 94 94 97 97 97 97 97 100 100
Median = 91 (the 15th and 16th of the 30 sorted scores are both 91)
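The sort-and-count procedure can be sketched in a few lines. The list below reproduces the 30 sorted scores shown on the slide; the odd/even handling follows the two rules in the bullets.

```python
import statistics

# The 30 sorted scores shown on the slide.
scores = ([73] * 2 + [76] * 3 + [82] * 3 + [85] * 4 + [88] * 2
          + [91] * 7 + [94] * 2 + [97] * 5 + [100] * 2)

ordered = sorted(scores)          # step 1: sort
n = len(ordered)
if n % 2 == 1:                    # odd count: the middle score
    median = ordered[n // 2]
else:                             # even count: halfway between the two middle scores
    median = (ordered[n // 2 - 1] + ordered[n // 2]) / 2

print(n, median)
```

The standard library's `statistics.median(scores)` applies the same rule and should give the same answer.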
Mode
• Most frequently occurring score in a distribution
• Focuses attention on only one possible score
• Only way of summarizing nominal data, also works for ordinal, interval or ratio
Level of Mass Media/Market Research Knowledge
I know nothing about this topic                        18%
I know very little about this topic                    44%
I know some information about this topic               37%
I know a great deal of information about this topic    2%
I am an expert on this topic                           0%
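For nominal data such as the knowledge-level answers, the mode is simply the category chosen most often. The sketch below assumes the percentages pair with the answer labels in the order they appear on the chart.

```python
# Share of respondents per answer, as read off the chart (percent).
knowledge = {
    "I know nothing about this topic": 18,
    "I know very little about this topic": 44,
    "I know some information about this topic": 37,
    "I know a great deal of information about this topic": 2,
    "I am an expert on this topic": 0,
}

# The mode is the category with the largest share of responses.
mode = max(knowledge, key=knowledge.get)
print(mode, knowledge[mode])
```

Note that the mode reports only the winning category; it says nothing about how close the runner-up (37%) is.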
Dispersion Statistics
• Answers the question: “How are scores spread around the central point?” (or, How much variability do I have in my data?)
• Range – Space between highest and lowest scores
• Standard Deviation – How far any one score is from the central tendency
• Variance – Degree to which scores deviate from the mean
Range
• Difference between highest and lowest scores in a distribution
• Not particularly descriptive
• Range naturally increases with larger samples, due to the tendency to include outliers
Sorted scores: 73 73 76 76 76 82 82 82 85 85 85 85 88 88 91 91 91 91 91 91 91 94 94 97 97 97 97 97 100 100
Range: 100 - 73 = 27
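The same subtraction in code, as a minimal sketch using the 30 scores from the slide:

```python
# The 30 scores shown on the slide.
scores = ([73] * 2 + [76] * 3 + [82] * 3 + [85] * 4 + [88] * 2
          + [91] * 7 + [94] * 2 + [97] * 5 + [100] * 2)

# Range: highest score minus lowest score.
score_range = max(scores) - min(scores)
print(score_range)  # 27
```

Because the range depends only on the two most extreme scores, a single new outlier changes it, which is why it tends to grow with larger samples.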
Standard Deviation
• Distance of a given score from the mean of a distribution
• Each element’s average distance from the mean of the data set
[Figure: Test 1 Scores, with the Mean and Standard Deviation marked]
Descriptive Statistics
                     N    Mean    Std. Deviation
Test 1               68   91.37   11.543
Valid N (listwise)   68
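The standard deviation can be worked out step by step: find the mean, square each score's distance from it, average those squares, and take the square root. The sketch below uses five hypothetical scores (not the 68 Test 1 scores) and divides by n - 1, the sample formula that SPSS reports.

```python
import math
import statistics

# Hypothetical scores for illustration (not the 68 Test 1 scores).
scores = [82, 85, 91, 94, 103]

mean = sum(scores) / len(scores)
squared_deviations = [(x - mean) ** 2 for x in scores]

# Sample standard deviation: divide by n - 1, then take the square root.
sample_sd = math.sqrt(sum(squared_deviations) / (len(scores) - 1))

print(f"mean = {mean}, SD = {sample_sd:.3f}")
```

`statistics.stdev(scores)` performs the same calculation and can serve as a cross-check.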
Variance
• Mathematical index of the degree to which scores deviate from, or are at variance with, the mean
• How wide or narrow is your distribution?
• Variance = SD²
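As a quick check of the relationship between variance and standard deviation, the sketch below squares the Test 1 standard deviation reported in the Descriptive Statistics output (11.543) and then confirms the same identity on a few hypothetical raw scores.

```python
import statistics

# Standard deviation reported for Test 1 in the Descriptive Statistics output.
sd_test1 = 11.543
variance_test1 = sd_test1 ** 2
print(f"variance = {variance_test1:.2f}")

# Same relationship checked on hypothetical raw scores (illustrative values).
scores = [82, 85, 91, 94, 103]
assert abs(statistics.variance(scores) - statistics.stdev(scores) ** 2) < 1e-9
```

Variance is expressed in squared units, which is why the standard deviation is usually the easier of the two to interpret.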
Basic Statistical Procedures
Purposes of Statistics (Tukey, 1986)
• To aid in summarization
• To aid in “getting what is going on”
• To aid in extracting “information” from the data
• To aid in communication
Parametric or Nonparametric
• Statistical methods are commonly divided into two broad categories…
Parametric
• Appropriate for interval and ratio data
• Only possible way to generalize findings to the population
• Parametric statistics assume normal distribution of the population parameters
Nonparametric
• Appropriate only with nominal and ordinal level data
• Results cannot be generalized to the population
• Make no assumption about normality
Nonparametric: Chi-Square
• Tests for “goodness of fit”
  • How well do the data fit into assigned categories?
• Common in MMR
• Used to compare the observed frequencies of a phenomenon with the frequencies that might be expected or hypothesized
• Symbol = χ²
• Less and less helpful the more nominal categories you have (cells)
• Tells you if the distribution of students’ eye colors deviates from the expected
• Observed – Expected (Default is that they are equal across groups)
• Cannot conclude that men are more likely than women to have blue eyes
• Need a follow-up test
Eye Color
        Observed N   Expected N   Residual
Brown   20           23.7         -3.7
Blue    30           23.7         6.3
Green   21           23.7         -2.7
Total   71
Test Statistics
                Eye Color
Chi-Square(a)   2.563
df              2
Asymp. Sig.     .278
a. 0 cells (.0%) have expected fr…
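The chi-square value in the output can be reproduced from the observed counts. The sketch below assumes the default expectation of equal counts per eye color; the p-value line relies on the fact that, for exactly 2 degrees of freedom, the chi-square tail probability simplifies to exp(-x/2). Other degrees of freedom would need a statistics library.

```python
import math

observed = {"Brown": 20, "Blue": 30, "Green": 21}

# Default expectation: the same count in every category.
total = sum(observed.values())
expected = total / len(observed)

chi_square = sum((n - expected) ** 2 / expected for n in observed.values())
df = len(observed) - 1

# With exactly 2 degrees of freedom the chi-square tail probability
# has the closed form exp(-x / 2); other df values need a stats library.
p_value = math.exp(-chi_square / 2) if df == 2 else None

print(f"chi-square = {chi_square:.3f}, df = {df}, p = {p_value:.3f}")
```

The result agrees with the Test Statistics table (2.563, df 2, Sig. .278): the eye-color counts do not deviate significantly from an even split.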
Nonparametric: Cross-Tabulations
• aka: Cross-Tabs
• Still compares frequencies, but now along two dimensions
• Two or more variables tested simultaneously
• Still use Chi-Square goodness of fit
• Test distribution of eye color among men and women to investigate difference
Gender * Eye Color Crosstabulation
                               Eye Color
                               Brown    Blue     Green    Total
Male    Count                  14       17       6        37
        Expected Count         10.4     15.6     10.9     37.0
        % within Gender        37.8%    45.9%    16.2%    100.0%
        % within Eye Color     70.0%    56.7%    28.6%    52.1%
Female  Count                  6        13       15       34
        Expected Count         9.6      14.4     10.1     34.0
        % within Gender        17.6%    38.2%    44.1%    100.0%
        % within Eye Color     30.0%    43.3%    71.4%    47.9%
Total   Count                  20       30       21       71
        Expected Count         20.0     30.0     21.0     71.0
        % within Gender        28.2%    42.3%    29.6%    100.0%
        % within Eye Color     100.0%   100.0%   100.0%   100.0%
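The Expected Count rows in a crosstab come from the row and column totals: expected = row total × column total / grand total. The sketch below recomputes them from the observed counts and then sums the chi-square contributions. The slides do not show the chi-square result for this table, so the printed value is a calculation from the counts, not a figure taken from the source.

```python
import math

# Observed counts from the crosstabulation (rows: gender, columns: eye color).
observed = {
    "Male":   {"Brown": 14, "Blue": 17, "Green": 6},
    "Female": {"Brown": 6,  "Blue": 13, "Green": 15},
}

row_totals = {g: sum(cells.values()) for g, cells in observed.items()}
col_totals = {}
for cells in observed.values():
    for color, n in cells.items():
        col_totals[color] = col_totals.get(color, 0) + n
grand_total = sum(row_totals.values())

chi_square = 0.0
expected = {}
for gender, cells in observed.items():
    for color, n in cells.items():
        # Expected count if gender and eye color were unrelated.
        e = row_totals[gender] * col_totals[color] / grand_total
        expected[(gender, color)] = e
        chi_square += (n - e) ** 2 / e

df = (len(row_totals) - 1) * (len(col_totals) - 1)
# Closed-form tail probability, valid for df = 2 only.
p_value = math.exp(-chi_square / 2) if df == 2 else None

print(f"chi-square = {chi_square:.3f}, df = {df}")
```

The recomputed expected counts round to the values in the table (for example 10.4 for Male/Brown).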
Parametric: T-Test
• Tests for differences between two groups along some dimension
• Test-and-control for treatment effects or between-group differences
• Most elementary method for comparing two groups’ mean scores
• Assumes that the variables in the populations from which the samples are drawn are normally distributed
• Only works with 2 groups
• Do men or women score better on MMC3420 tests?
Group Statistics (Test 1)
Gender   N    Mean    Std. Deviation   Std. Error Mean
Male     34   90.38   10.985           1.884
Female   34   92.35   12.160           2.085
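The slides show the group statistics but not the t value itself. As a sketch, an independent-samples t statistic can be computed from those summary numbers, assuming equal variances in the two groups (the pooled-variance form); that assumption is ours, not something stated in the source.

```python
import math

# Group statistics for Test 1 as reported in the Group Statistics output.
n_m, mean_m, sd_m = 34, 90.38, 10.985   # male
n_f, mean_f, sd_f = 34, 92.35, 12.160   # female

# Pooled variance (assumes the two groups have equal population variances).
df = n_m + n_f - 2
pooled_var = ((n_m - 1) * sd_m ** 2 + (n_f - 1) * sd_f ** 2) / df

std_error = math.sqrt(pooled_var * (1 / n_m + 1 / n_f))
t = (mean_m - mean_f) / std_error

print(f"t = {t:.3f}, df = {df}")
```

With 66 degrees of freedom the two-tailed critical value at the .05 level is about 2.0, so a t this close to zero would not indicate a significant difference between men's and women's scores.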
Parametric: Analysis of Variance
• ANOVA
• Like T-Test, but can be used with more than two groups
• Tests whether the differences between groups are larger than the differences within the groups
• Can also be used to test interaction effect of multiple independent variables
• Do test scores vary by eye color?
Eye Color (Dependent Variable: Test 1)
                                 95% Confidence Interval
Eye Color   Mean     Std. Error  Lower Bound   Upper Bound
Brown       93.250   2.602       88.054        98.446
Blue        91.207   2.161       86.892        95.522
Green       89.632   2.669       84.301        94.962
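The idea behind ANOVA, comparing variability between groups with variability within groups, can be sketched as an F ratio. The output above gives only means and standard errors, which is not enough to recompute F, so the scores below are hypothetical values invented for illustration.

```python
# Hypothetical Test 1 scores by eye color (illustrative values only).
groups = {
    "Brown": [95, 96, 97],
    "Blue":  [92, 93, 94],
    "Green": [91, 92, 93],
}

all_scores = [x for scores in groups.values() for x in scores]
grand_mean = sum(all_scores) / len(all_scores)

# Between-group variability: how far each group mean sits from the grand mean.
ss_between = sum(
    len(s) * (sum(s) / len(s) - grand_mean) ** 2 for s in groups.values()
)
# Within-group variability: how far scores sit from their own group mean.
ss_within = sum(
    (x - sum(s) / len(s)) ** 2 for s in groups.values() for x in s
)

df_between = len(groups) - 1
df_within = len(all_scores) - len(groups)
f_ratio = (ss_between / df_between) / (ss_within / df_within)
print(f"F({df_between}, {df_within}) = {f_ratio:.2f}")
```

A large F means the group means differ by more than the spread inside each group would suggest; an F near 1 means they do not.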
Parametric: Correlations
• Tests for the strength of relationship between two variables (interval or ratio)
• Degree to which variables change in relationship to one another
• The higher the number (in absolute value), the stronger the relationship
[Figure: example scatterplots labeled Strong Positive, Weak Positive, Flat (No Relationship), Weak Negative, Strong Negative]
• Are grades on Test 1 related to grades on Test 2?
Correlations
                               Test 1    Test 2
Test 1   Pearson Correlation   1         .448**
         Sig. (2-tailed)                 .000
         N                     68        68
Test 2   Pearson Correlation   .448**    1
         Sig. (2-tailed)       .000
         N                     68        68
**. Correlation is significant at the 0.01 level (2-tailed).
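A Pearson correlation can be computed directly from paired scores. The sketch below defines a small helper (`pearson_r` is our own name, not a library function) and applies it to the first six Test 1/Test 2 pairs from the data table. Because that is only a six-row excerpt of the 68 students, the result should not be expected to match the .448 in the output.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# First six Test 1 / Test 2 pairs from the data table (an excerpt, not all 68).
test1 = [100, 88, 97, 92, 109, 97]
test2 = [94, 88, 91, 85, 100, 79]

r = pearson_r(test1, test2)
print(f"r = {r:.3f}")
```

Values near +1 or -1 indicate a strong relationship, and values near 0 indicate little or no linear relationship.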
Note: Correlation and Causation
• Just because variables are related does not mean that one variable causes another
• Correlations do NOT tell you the causal direction of the relationship between two variables
Parametric: Regression
• Related to correlations and ANOVA
• Can be used to model one variable as a predictor of another; on its own it does not establish causation
• One of the only ways to use data to predict the nature of relationships
• Tests the degree to which varying one variable is accompanied by another variable fluctuating at a predictable rate
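The "predictable rate" is the slope of a fitted line. The sketch below fits a least-squares line to four hypothetical score pairs (invented for illustration) and uses it to predict one value; `fit_line` is our own helper name.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical paired scores (illustrative only).
test1 = [70, 80, 90, 100]
test2 = [74, 80, 88, 94]

slope, intercept = fit_line(test1, test2)
predicted = intercept + slope * 85   # predicted Test 2 score for a Test 1 score of 85
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}, prediction = {predicted:.1f}")
```

Here each extra point on Test 1 goes with about 0.68 extra points on Test 2. The fitted line describes how the two scores move together; it does not show that one causes the other.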
How do I know what statistic to use?
Visual Data Representation
Histogram
• Displays frequencies occurring within a specified range
Normal Curve
• Scores are normally distributed around the mean
Skewness
• Concentration of scores around a particular point on the x-axis
• Direction of tail indicates direction of skew
Frequency Chart
• Histogram data, but turned on its side
• Useful when levels of variable or characteristics are interrelated
Level of Mass Media/Market Research Knowledge
I know nothing about this topic                        18%
I know very little about this topic                    44%
I know some information about this topic               37%
I know a great deal of information about this topic    2%
I am an expert on this topic                           0%
Waterfall Chart
• Visually represents frequencies, but ordered from most frequent to least frequently occurring characteristics
• Helpful for checklist data or data where there is no concern about relationship between characteristics
Interest in MMR Topics (% Extremely Interested / % Very Interested)
Internet Research                            43%
Online Surveys                               33%
Non-Traditional Research Methods             32%
Focus Groups                                 30%
Research Ethics                              29%
Observational Methods/Ethnography            29%
Research Vendors/Suppliers                   27%
Qualitative Interviewing                     25%
Reading and Interpreting Research Reports    25%
Study Design                                 21%
Paper-and-Pencil Surveys                     16%
Scatterplot
• Shows the relationship between two variables
• The more tightly clustered the dots, the more closely related the two variables
• Easiest way to show correlations
[Figure: Relationship Between Test Scores; x-axis: Test 1 Scores, y-axis: Test 2 Scores]