STA 490H1S: Initial Examination of Data
Alison L. Gibbs
Department of Statistics, University of Toronto
Winter 2011
Gibbs STA 490H1S
Course mantra
It’s OK not to know.
Expressing ignorance is encouraged.
It’s not OK to not have a willingness to learn.
Initial Examination of Data
Purpose:
- Understand the structure of the data.
  Types of variables:
  - Quantitative: continuous or discrete
  - Categorical: nominal, ordinal (e.g., Likert scales or binned quantitative variables), binary
- Check the quality of the data.
  - Find errors (data cleaning). Check for credibility, consistency, completeness.
  - Identify potential outliers.
  - Are there missing observations?
  - Clear up any problems.
- Get ideas for more sophisticated analyses.
- Check whether the assumptions of more sophisticated analyses seem reasonable.
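
The quality checks above can be sketched in a few lines of Python. This is only an illustration: the `ages` values are made up, `None` is assumed to mark a missing observation, and the 1.5 × IQR rule is one common convention for flagging potential outliers, not a method prescribed in these slides.

```python
import statistics

# Hypothetical records; None marks a missing observation (an assumption).
ages = [23, 25, 31, 28, None, 27, 260, 24, 29, None, 26]

# Completeness: count the missing observations.
n_missing = sum(1 for a in ages if a is None)
observed = [a for a in ages if a is not None]

# Potential outliers: values more than 1.5 * IQR beyond the quartiles.
q1, _, q3 = statistics.quantiles(observed, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
flagged = [a for a in observed if a < low or a > high]

print(f"missing: {n_missing} of {len(ages)}")
print(f"flagged for review: {flagged}")
```

A flagged value is a prompt to check credibility (an age of 260 is probably a typing error), not an automatic deletion.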
IDA
- Should be motivated by the original research questions.
- Avoid data dredging. (Look long enough and you’ll find some meaningless pattern.)
- Trivial? Requires judgment and common sense.
Types of Missing Data
1. Missing Completely At Random (MCAR)
   The probability that a data value is missing does not depend on the missing value, nor on the values of any other variables.
2. Missing At Random (MAR)
   The probability that a data value is missing, conditional on the values of the other variables for the observation, is not related to the missing value.
3. Informative / Non-ignorable (Not Missing At Random, NMAR)
   The probability that a data value is missing depends on the missing value itself. Difficult to deal with.
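
A small simulation may help make the three mechanisms concrete. Everything in it is an assumption chosen for illustration: the variables `x` (always observed) and `y` (sometimes missing), the missingness probabilities, and the seed. It compares the mean of the observed `y` values with the mean of all `y` values, and also tries a simple adjustment that averages within strata of `x`.

```python
import random
import statistics

random.seed(490)  # arbitrary seed so the sketch is repeatable

# Hypothetical population: x is always observed, y may be missing.
n = 20000
x = [random.gauss(0, 1) for _ in range(n)]
y = [xi + random.gauss(0, 1) for xi in x]

# keep[i] is True when y[i] is observed.
# MCAR: every value has the same 30% chance of being missing.
mcar = [random.random() > 0.3 for _ in range(n)]
# MAR: missingness depends only on the observed x (more missing when x > 0).
mar = [random.random() > (0.5 if xi > 0 else 0.1) for xi in x]
# NMAR: missingness depends on the unobserved y itself.
nmar = [random.random() > (0.5 if yi > 0 else 0.1) for yi in y]

def observed_mean(keep):
    """Mean of y over the observations that were not lost."""
    return statistics.fmean(yi for yi, k in zip(y, keep) if k)

def stratified_mean(keep):
    """Average observed y within x > 0 and x <= 0, weighting each stratum by its full size."""
    total = 0.0
    for positive in (True, False):
        idx = [i for i in range(n) if (x[i] > 0) == positive]
        kept = [y[i] for i in idx if keep[i]]
        total += len(idx) / n * statistics.fmean(kept)
    return total

true_mean = statistics.fmean(y)
print(f"mean of all y: {true_mean:.3f}")
for name, keep in [("MCAR", mcar), ("MAR", mar), ("NMAR", nmar)]:
    print(f"{name:4s} observed: {observed_mean(keep):.3f}  "
          f"stratified by x: {stratified_mean(keep):.3f}")
```

In this setup the observed mean stays close to the full mean under MCAR, and under MAR the bias is removed by conditioning on `x`. Under NMAR neither estimate recovers the full mean, which is one way to see why that case is difficult to deal with.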
Tools for IDA
- Five-number summary (for all data and for subsets).
- Other summary statistics, e.g., mean and s.d.
- Histograms / stem-and-leaf plots.
- Frequency tables (1- and 2-way) for categorical variables.
- Scatterplots.
- Correlations.
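
The numerical tools in this list can be sketched with the Python standard library. The data set below is invented for illustration, and the helper names `five_number_summary` and `pearson` are hypothetical, not from any course material. Quartiles can be defined in several ways; this sketch uses the "inclusive" method of `statistics.quantiles`. Plots are left out.

```python
import statistics
from collections import Counter

# Hypothetical data set: (group, height in cm, weight in kg).
rows = [
    ("A", 160, 55), ("A", 172, 68), ("B", 181, 80), ("B", 165, 62),
    ("A", 158, 52), ("B", 190, 91), ("A", 175, 70), ("B", 169, 66),
]
groups = [r[0] for r in rows]
heights = [r[1] for r in rows]
weights = [r[2] for r in rows]

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max)."""
    q1, med, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, med, q3, max(values)

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    sab = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    saa = sum((ai - ma) ** 2 for ai in a)
    sbb = sum((bi - mb) ** 2 for bi in b)
    return sab / (saa * sbb) ** 0.5

# Five-number summary for all data and for subsets.
print("all heights:", five_number_summary(heights))
for g in sorted(set(groups)):
    subset = [h for grp, h in zip(groups, heights) if grp == g]
    print(f"group {g}:", five_number_summary(subset))

# Other summary statistics.
print("mean:", statistics.mean(heights), "s.d.:", round(statistics.stdev(heights), 2))

# One-way table of group, then a two-way table of group by "taller than 170 cm".
print(Counter(groups))
print(Counter((g, h > 170) for g, h in zip(groups, heights)))

# Correlation between two quantitative variables.
print("correlation:", round(pearson(heights, weights), 3))
```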
Some More Sophisticated Tools for IDA
- Kernel Density Estimation
  - Smoothed function to estimate the density function.
  - Amount of smoothing controlled by the bandwidth.
  - Non-parametric (that is, doesn’t make an assumption about the distribution).
- LOWESS (LOESS): Locally Weighted Scatterplot Smoothing
  - Idea: fit a simple polynomial using regression on small ranges of the independent variable, and smoothly join up the pieces.
  - Amount of smoothing controlled by a smoothing parameter.
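
Both ideas can be sketched in plain Python. The choices below are assumptions made for illustration, not details from these slides: a Gaussian kernel for the density estimate, and for the smoother a degree-1 polynomial with tricube weights and no robustness iterations, which is a simplified version of LOWESS. The names `gaussian_kde` and `lowess_point` and all data values are hypothetical.

```python
import math

def gaussian_kde(data, bandwidth):
    """Return a function that estimates the density as an average of Gaussian bumps."""
    norm = len(data) * bandwidth * math.sqrt(2 * math.pi)
    def density(t):
        return sum(math.exp(-0.5 * ((t - d) / bandwidth) ** 2) for d in data) / norm
    return density

def lowess_point(xs, ys, x0, frac):
    """Fit a tricube-weighted straight line to the points nearest x0; return its value at x0."""
    k = min(len(xs), max(2, math.ceil(frac * len(xs))))  # number of neighbours used
    dists = sorted(abs(x - x0) for x in xs)
    # Distance to the k-th nearest point, slightly inflated so ties keep a nonzero weight.
    h = dists[k - 1] * 1.000001 + 1e-12
    w = [(1 - min(abs(x - x0) / h, 1.0) ** 3) ** 3 for x in xs]
    sw = sum(w)
    mx = sum(wi * x for wi, x in zip(w, xs)) / sw
    my = sum(wi * y for wi, y in zip(w, ys)) / sw
    sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
    sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    return my + slope * (x0 - mx)

# Kernel density estimate: a small bandwidth gives a bumpy curve, a large one a flat curve.
sample = [1.2, 1.9, 2.1, 2.4, 3.0, 3.3, 4.8]
narrow = gaussian_kde(sample, bandwidth=0.2)
wide = gaussian_kde(sample, bandwidth=1.0)
for t in (1.0, 2.0, 3.0, 4.0, 5.0):
    print(f"t={t}: narrow={narrow(t):.3f}  wide={wide(t):.3f}")

# Local regression: a sine curve with an alternating perturbation added.
xs = [i / 2 for i in range(21)]
ys = [math.sin(x) + 0.2 * (-1) ** i for i, x in enumerate(xs)]
smooth = [lowess_point(xs, ys, x0, frac=0.4) for x0 in xs]
for x0, y0, s0 in list(zip(xs, ys, smooth))[::5]:
    print(f"x={x0:4.1f}  y={y0:6.3f}  smoothed={s0:6.3f}")
```

Here `bandwidth` and `frac` play the roles of the bandwidth and the smoothing parameter: larger values give smoother curves.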
Course mantra
It’s OK not to know.
Expressing ignorance is encouraged.
It’s not OK to not have a willingness to learn.
For Thursday:
- Hand in your meeting summary to your TA advisor.
- Be ready for a discussion about a plan for the project:
  - Data cleaning / IDA
  - What methods of analysis might be appropriate.
For next class (Tuesday, February 1)
Read Chapters 11 and 12 in Chatfield.
Bring the text to class.