STA 490H1S
Initial Examination of Data

Alison L. Gibbs

Department of Statistics, University of Toronto

Winter 2011

Gibbs STA 490H1S

Course mantra

It’s OK not to know.

Expressing ignorance is encouraged.

It’s not OK to not have a willingness to learn.

Initial Examination of Data

Purpose:

• Understand the structure of the data.
  Types of variables:
    • Quantitative: continuous or discrete
    • Categorical: nominal, ordinal (e.g., Likert scales or binned quantitative), binary

• Check the quality of the data.
    • Find errors (data cleaning). Check for credibility, consistency, completeness.
    • Identify potential outliers.
    • Are there missing observations?

• Clear up any problems.

• Get ideas for more sophisticated analyses.

• Check on whether or not assumptions of more sophisticated analyses seem reasonable.
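A first pass over these checks might look like the sketch below. It is only an illustration, assuming pandas is available: the data frame, its column names, the plausible age range and the 1.5 × IQR outlier rule are all hypothetical choices, not part of the course material.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data, invented for illustration only.
df = pd.DataFrame({
    "age": [23, 35, 41, 29, 230, 38, np.nan, 31],   # 230 looks like a typing error
    "smoker": ["no", "yes", "no", "no", "yes", None, "no", "no"],
    "satisfaction": pd.Categorical(
        ["low", "high", "medium", "medium", "high", "low", "medium", "high"],
        categories=["low", "medium", "high"],
        ordered=True,                               # ordinal variable
    ),
})

# Structure: which variables are quantitative and which are categorical?
print(df.dtypes)

# Completeness: how many observations are missing in each variable?
missing_counts = df.isna().sum()
print(missing_counts)

# Credibility: values outside a range judged plausible (assumed here: 0 to 120).
implausible_age = df[(df["age"] < 0) | (df["age"] > 120)]
print(implausible_age)


def flag_outliers_iqr(values, k=1.5):
    """Flag values more than k * IQR beyond the quartiles (a common rule of thumb)."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Potential outliers: flagged for a closer look, not for automatic deletion.
age_outliers = df.loc[flag_outliers_iqr(df["age"]), "age"]
print(age_outliers)
```

A flagged value is only a candidate for follow-up; whether it is an error or a genuine extreme observation still needs judgment.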

IDA

• Should be motivated by original research questions.

• Avoid data dredging. (Look long enough and you’ll find some meaningless pattern.)

• Trivial? Requires judgment and common sense.
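The warning about data dredging can be illustrated with a small simulation. This is a sketch under invented settings (30 observations, 200 candidate variables, an arbitrary seed): every variable is pure noise, yet the strongest of the 200 correlations with the outcome usually looks sizeable.

```python
import numpy as np

rng = np.random.default_rng(2011)   # arbitrary seed, for reproducibility only

n_obs, n_vars = 30, 200             # hypothetical study size
outcome = rng.normal(size=n_obs)               # outcome: pure noise
candidates = rng.normal(size=(n_obs, n_vars))  # candidate predictors: pure noise


def correlations_with(y, predictors):
    """Pearson correlation of y with each column of predictors."""
    y_std = (y - y.mean()) / y.std()
    x_std = (predictors - predictors.mean(axis=0)) / predictors.std(axis=0)
    return x_std.T @ y_std / len(y)


r = correlations_with(outcome, candidates)
strongest = int(np.argmax(np.abs(r)))
print(f"strongest correlation among {n_vars} noise variables: "
      f"variable {strongest}, r = {r[strongest]:.2f}")
```

None of these variables is related to the outcome, so any "pattern" found by scanning all of them is meaningless. Starting from the original research questions limits how many such comparisons get made.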

Types of Missing Data

1. Missing Completely At Random (MCAR)
   The probability that a data value is missing does not depend on the missing value, nor on the values of any other variables.

2. Missing At Random (MAR)
   The probability that a data value is missing, conditional on the values of the other variables for the observation, is not related to the missing value.

3. Informative / Non-ignorable (Not Missing At Random, NMAR)
   The probability that a data value is missing depends on the missing value itself, even after conditioning on the other variables. Difficult to deal with.
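The three mechanisms can be compared in a simulation. The sketch below is illustrative only: the variables x and y, the logistic missingness probabilities and the sample size are assumptions chosen to make the contrast visible. In each case some values of y are deleted and the mean of the remaining values is compared with the mean of the full data.

```python
import numpy as np

rng = np.random.default_rng(490)    # arbitrary seed
n = 20_000
x = rng.normal(size=n)              # fully observed covariate
y = x + rng.normal(size=n)          # variable that will have missing values


def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))


# MCAR: every value of y has the same 30% chance of being missing.
miss_mcar = rng.random(n) < 0.3
# MAR: missingness depends only on the observed x.
miss_mar = rng.random(n) < sigmoid(2 * x)
# NMAR: missingness depends on the value of y itself.
miss_nmar = rng.random(n) < sigmoid(2 * y)


def observed_mean(values, missing):
    """Mean of the values that were not deleted (complete-case mean)."""
    return values[~missing].mean()


full_mean = y.mean()
mcar_mean = observed_mean(y, miss_mcar)
mar_mean = observed_mean(y, miss_mar)
nmar_mean = observed_mean(y, miss_nmar)

print(f"full data: {full_mean:.3f}")
print(f"MCAR:      {mcar_mean:.3f}")
print(f"MAR:       {mar_mean:.3f}")
print(f"NMAR:      {nmar_mean:.3f}")
```

Under MCAR the complete-case mean stays close to the full-data mean. Under MAR the naive mean is biased, but the bias can in principle be corrected using x, because x is observed and explains the missingness. Under NMAR the information needed for a correction is exactly what is missing.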

Tools for IDA

• Five-number summary (for all data and for subsets).

• Other summary statistics, e.g., mean and s.d.

• Histograms / stem-and-leaf plots.

• Frequency tables (1- and 2-way) for categorical variables.

• Scatterplots.

• Correlations.
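The numerical tools in this list are all short calls in most statistical software. As one possible illustration, here is a sketch in Python with pandas; the data frame and its column names are invented, and the plots are left out so that the example runs without a plotting library.

```python
import pandas as pd

# Hypothetical data, invented for illustration only.
df = pd.DataFrame({
    "group":  ["A", "A", "A", "B", "B", "B", "B", "A", "B"],
    "passed": ["yes", "no", "yes", "yes", "no", "no", "yes", "yes", "no"],
    "hours":  [1, 2, 3, 4, 5, 6, 7, 8, 9],
    "score":  [52, 55, 61, 58, 70, 72, 69, 80, 85],
})


def five_number_summary(values):
    """Minimum, lower quartile, median, upper quartile, maximum."""
    return values.quantile([0.0, 0.25, 0.5, 0.75, 1.0])


# Five-number summary for all data and for subsets.
print(five_number_summary(df["score"]))
print(df.groupby("group")["score"].apply(five_number_summary))

# Other summary statistics: mean and standard deviation.
print(df[["hours", "score"]].agg(["mean", "std"]))

# Frequency tables: 1-way and 2-way.
one_way = df["passed"].value_counts()
two_way = pd.crosstab(df["group"], df["passed"])
print(one_way)
print(two_way)

# Correlations between the quantitative variables.
corr = df[["hours", "score"]].corr()
print(corr)
```

Quartile conventions differ between software packages, so a five-number summary computed elsewhere may differ slightly from the pandas default (linear interpolation).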

Some More Sophisticated Tools for IDA

• Kernel Density Estimation
    • Smoothed function to estimate the density function.
    • Amount of smoothing controlled by the bandwidth.
    • Non-parametric (that is, doesn’t make an assumption about the distribution).

• LOWESS (LOESS): Locally Weighted Scatterplot Smoothing
    • Idea: fit a simple polynomial using regression on small ranges of the independent variable, and smoothly join up the pieces.
    • Amount of smoothing controlled by a smoothing parameter.
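Both ideas can be written out in a few lines, which may help show what the bandwidth and the smoothing parameter actually do. The sketch below is a simplified illustration using only NumPy, not a substitute for a library routine: the function names are hypothetical, the kernel is assumed to be Gaussian, and the smoother fits local straight lines with tricube weights but omits the robustness iterations of the full LOWESS procedure.

```python
import numpy as np


def kernel_density(data, grid, bandwidth):
    """Gaussian kernel density estimate of `data`, evaluated at `grid`."""
    data = np.asarray(data, dtype=float)
    grid = np.asarray(grid, dtype=float)
    z = (grid[:, None] - data[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    # Average one kernel per data point; a larger bandwidth gives a smoother curve.
    return kernels.mean(axis=1) / bandwidth


def local_linear_smooth(x, y, frac=0.5):
    """Simplified LOWESS: a weighted straight-line fit around each x value.

    `frac` is the smoothing parameter: the fraction of the data used
    in each local fit.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    k = max(2, int(np.ceil(frac * n)))
    fitted = np.empty(n)
    for i in range(n):
        dist = np.abs(x - x[i])
        h = np.sort(dist)[k - 1]            # distance to the k-th nearest point
        if h == 0:
            w = (dist == 0).astype(float)   # all neighbours share the same x
        else:
            w = np.clip(1 - (dist / h) ** 3, 0, None) ** 3   # tricube weights
        # Weighted least squares for a line centred at x[i];
        # the intercept is the fitted value at x[i].
        design = np.column_stack([np.ones(n), x - x[i]])
        sw = np.sqrt(w)
        beta, *_ = np.linalg.lstsq(design * sw[:, None], y * sw, rcond=None)
        fitted[i] = beta[0]
    return fitted


# Small demonstration on made-up data.
grid = np.linspace(-6.0, 8.0, 1401)
density = kernel_density([-1.0, 0.0, 0.5, 2.0], grid, bandwidth=0.4)

rng = np.random.default_rng(1)              # arbitrary seed
x = np.linspace(0.0, 10.0, 50)
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)
smooth = local_linear_smooth(x, y, frac=0.3)
```

Trying several values of `bandwidth` and `frac` on the same data is a quick way to see the trade-off: too little smoothing follows the noise, too much hides real structure.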

Course mantra

It’s OK not to know.

Expressing ignorance is encouraged.

It’s not OK to not have a willingness to learn.

For Thursday:

• Hand in your meeting summary to your TA advisor.

• Be ready for a discussion about a plan for the project:
    • Data cleaning / IDA
    • What methods of analysis might be appropriate.

For next class (Tuesday, February 1)

Read Chapters 11 and 12 in Chatfield. Bring the text to class.
