Quantitative Data Analysis I./II. Missing Values (I.) Identification, assigning and their analysis Jiří Šafr jiri.safr(AT)seznam.cz Last revision 22/3/2015 UK FHS Historical sociology (2015+)


Quantitative Data Analysis I./II.

Missing Values (I.)
Identification, assigning and their analysis

Jiří Šafr
jiri.safr(AT)seznam.cz

Last revision 22/3/2015

UK FHS
Historical sociology (2015+)


Missing data: definition and relevance

• Missing data (also called „item nonresponse“) means that, for some reason, data on particular items or questions are not available for analysis.

• The fraction of missing data is an important indicator of data quality.

• A prerequisite for analysing the data (and especially for the statistical treatment of missing data) is to understand why the data are missing. (A missing value that results from accidentally skipping a question differs from one that results from a respondent's reluctance to reveal sensitive information.)

Source: [Lavrakas 2008: 467]


The first step of any analysis is to examine the data, i.e. to search for „inappropriate“ values and to exclude them from the range of valid values

→ MISSING VALUES


Two types of missing values (in SPSS):

1. System = SYSMIS (in data: „ . “)

This is the most basic form of marking missing values (and a reliable format when the data are transferred to other software), but strictly speaking it carries no information about why the value is missing. Most often it arises when the value for that variable was not recorded, or when the variable does not apply to the case (the respondent), e.g. the year of divorce for single or married persons.

If the questionnaire provides more detailed information, such as „Not applicable“, „Refused to answer“ or „Do not know“, we code these answers with specific „inappropriate“ values, which we can later assign as missing.

2. User defined = MISSING VALUES

In the data we use values outside the standard range, e.g. „9“ or „99“. We can label them, e.g. 8 = Refused to answer, 9 = Didn't know.

These values are not included in the main part of the analyses and are reported separately (for now, we „turn them off“ with the MISSING VALUES command or in the menu).
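
A minimal sketch of how such codes might be labelled and declared as user-defined missing in syntax. The variable name q1 is hypothetical; the codes 8 and 9 follow the labelling example above. ADD VALUE LABELS is used here so that labels already attached to the valid categories are kept.

```spss
* Hypothetical variable q1 with valid answers 1 to 5.
* Label the special codes so that they stay readable in the output.
ADD VALUE LABELS q1
  8 'Refused to answer'
  9 'Did not know'.

* Declare both codes as user-defined missing values.
MISSING VALUES q1 (8 9).

* The frequency table now lists 8 and 9 separately among the missing values.
FREQUENCIES VARIABLES = q1.
```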


Missing values - procedure

Any time you get a new data set:

1. Determine whether any missing values are defined in the data set, and how. Don't rely on the documentation (e.g. the codebook); always check in the data yourself. If none are defined:

2. Assign the „inappropriate“ values as missing values (possibly with recoding or other data transformations).

- - - (see QDA II.)

3. Substantive analysis of missing values:
a) Can we ignore them? Are they missing at random? If not:
b) Analysis of their dependency on other variables.

- - - (very advanced strategy)

(4. Imputation of missing values (estimation of the values that are missing) and their handling in multivariate analysis (listwise/pairwise deletion and various imputation methods).)


1. Inspecting data

the easiest approach to MV

• Looking over the settings in the Data editor (the MV column) is not enough; we must always inspect the data themselves.

• For a larger number of variables, as a first step it sometimes suffices to use the simple command DESCRIPTIVES → we inspect the Minimum and Maximum values in the data and compare them with the admissible values in the questionnaire. This mostly reveals suspicious maximum values, but be careful: it is not reliable, especially for categorical variables!

• The FREQUENCIES command is the only reliable one, because it lists the occurrences of all values and shows whether or not they are designated as MV. For many variables, however, we get a lot of tables.

• A clear overview of the number of MV (but not of which values in detail) is given by the MVA (MISSING VALUE ANALYSIS) command. For detecting MV it is a slightly better strategy than DESCRIPTIVES, but it is not available in the Base version of SPSS.


Missing values – identification (DESC, FREQ, MVA)

DESCRIPTIVES PI.1a.

→ not always reliable!

FREQUENCIES PI.1a.

→ complete information about all values/categories

MVA PI.1a.


2. Assigning missing values

MISSING VALUES Var1 (possibly further variables: Var2 Var3 …) (0 8 9).

→ you can set at most 3 specific values, or a range:

(LOWEST THRU 5). or (8 THRU HIGHEST).

or a combination of a range and one value:

(5 8 THRU HIGHEST).

It can also be done in the Data editor (by clicking with the mouse), but syntax provides easy checking and documentation of the data manipulation, and it can be reused later on other data.


Identification and assigning missing values
Example: „Age of university students at HiSo“

FREQUENCIES age.

Values 12 and 92 are outside the plausible age range of university (HiSo) students, so we assign them as missing. Use the command syntax:

MISSING VALUES age (12 92).

Or do it in the Data editor (by clicking in the menu).

At the same time, we see that no User Missing values have been defined in the data set yet. (There are only 2 cases of System Missing, SYSMIS.)

Note: Once the MV are assigned, nothing seems to happen; in fact we have only marked the MV in the data. So it is good to print the frequency table again:

FREQUENCIES age.


Assigning missing values as a range
Example: „Age of university students at HiSo“

Assigning a range of MV is practical because it also applies to any cases added in the future (or to later data manipulation).

• from the minimum to a specific value:

MISSING VALUES age (LOWEST THRU 20).

• from a specific value to the maximum:

MISSING VALUES age (50 THRU HIGHEST).

• and we can add one specific value to the range:

MISSING VALUES age (50 THRU HIGHEST 12).


„Switching off/on“ of missing values in Syntax

• Missing values can be „switched off“ simply → leave the brackets empty.

MISSING VALUES age ( ).

FREQUENCIES age.

Now all values are included in the analysis. (Of course, this does not apply to System Missing values; they remain excluded.)

And we can just as simply „switch them on“ again.

MISSING VALUES age (12 92).

FREQUENCIES age.
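
If the user-missing values should be shown as valid in a single table only, an alternative to redefining them is the MISSING subcommand of FREQUENCIES. The sketch below assumes the variable age with the missing values (12 92) already assigned as above; it is an illustration, not part of the original procedure.

```spss
* Treat the user-missing values of age as valid in this one table only.
* The MISSING VALUES definition itself stays untouched.
FREQUENCIES VARIABLES = age
  /MISSING = INCLUDE.

* A plain call afterwards excludes the user-missing values again.
FREQUENCIES VARIABLES = age.
```

The advantage of this variant is that the definition cannot be forgotten in the „switched off“ state.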


Why is it so important to detect and exclude user-defined missing values?

„Inappropriate“ values can distort our estimates, particularly the mean, variance and correlation!

The risk of obtaining biased results is lower for categorical variables, which we present in frequency tables, where we usually notice such values.
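
The size of the distortion can be seen on a tiny made-up example. The sketch below assumes five invented cases, four plausible ages and one code 99 standing for „no answer“; the data are illustrative only. Note that DATA LIST replaces the active data set, so it should be run in a separate session.

```spss
* Made-up data: four plausible ages and one code 99 meaning no answer.
DATA LIST FREE / age.
BEGIN DATA
20 21 22 23 99
END DATA.

* The code 99 is still treated as valid: mean = (20+21+22+23+99)/5 = 37.
DESCRIPTIVES VARIABLES = age
  /STATISTICS = MEAN STDDEV MIN MAX.

* After declaring 99 as missing: mean = (20+21+22+23)/4 = 21.5.
MISSING VALUES age (99).
DESCRIPTIVES VARIABLES = age
  /STATISTICS = MEAN STDDEV MIN MAX.
```

A single undeclared code is enough to shift the mean from 21.5 to 37 here, and the standard deviation is inflated in the same way.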


Example: Number of children born → mean without/with missing values assigned (population census data)

[Figure: two histograms of „Children Ever-Born (Live-Births)“ with Frequency on the vertical axis; panels: „No missing values defined“ and „Missing values defined“. Statistics printed with one histogram: Mean = 1,79, Std. Dev. = 1,327, N = 1 135 847.]


Some notes about Missing Values (1)

• It can happen that we have assigned some missing values which in fact do not occur in the frequency table. This is logical, because an analysis (such as the MV section of Frequencies) shows only the values (including missing ones) that actually occur. (SPSS nevertheless keeps our definition of potential missing values.) The missing values actually assigned can be viewed with DISPLAY:

DISPLAY DICTIONARY /VARIABLES = age.

• Note also the situation in which a certain value appears several times in the Frequencies table, e.g. 1 1 1. These are in fact different values, e.g. 0.6, 0.9 and 1 (0.6 and 0.9 are rounded to 1 because the variable format has no decimal places, so they all display as 1) → change the format of the variable:

FORMATS friends (F8.1).

[Two tables: Number of friends]


Some notes about Missing Values (2)

• If there are several values that do not form a continuous range, we need to recode them first and then use a range definition. For example, Income per month (in thousands) with the illogical values „-9 -7“ and also „8888888 9999999“ can be treated in this way:

RECODE Income (8888888 = -8888888) (9999999 = -9999999) (ELSE = COPY).
MISSING VALUES Income (LOW THRU -1).

• It is reasonable to use negative values for coding/recording missing values (-9 -8 instead of 9 8) → they are more visible.


Missing values: How to treat/analyse them - rule of thumb

• If the relative number of missing values is less than ca 5%, we can mostly ignore them (in a „large enough“ sample). But be careful about how they combine in bivariate (and higher-order) analysis: 5% missing on var1 and 5% missing on var2 can result in 5% missing in total (if the same cases are missing on both) or in as much as 10% (if different cases are missing on each).

• If the share of missing values exceeds this threshold (>5%), then an analysis of the missing values, i.e. of their dependence on other variables, is necessary (→ causes of MV). We should ask: „Who doesn't answer our questions?“ and perhaps „Is the form of our question valid?“

• An incidence of MV above 5% is not necessarily random (i.e. randomly distributed in the population with almost no harm to our results); this needs to be verified (and, where appropriate, we consider imputation of the missing values).
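
How the missing values of two variables actually combine can be checked directly. The following is a minimal sketch with the hypothetical variables var1 and var2; the helper variable both_valid is introduced only for this illustration.

```spss
* Hypothetical variables var1 and var2.
* NMISS counts the system- and user-missing values among its arguments.
COMPUTE both_valid = (NMISS(var1, var2) = 0).
FORMATS both_valid (F1.0).
VALUE LABELS both_valid
  0 'Missing on var1 and/or var2'
  1 'Valid on both variables'.

* The share of category 0 is the total loss of cases in the bivariate analysis.
FREQUENCIES VARIABLES = both_valid.
```

With 5% missing on each variable, the share of category 0 lies between 5% and 10%, depending on how much the missing cases overlap.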


Missing values

Step 3 – their analysis


Inspecting the Structure and Patterns of Missing Data

• The first step in the analysis of incomplete data is to inspect the data.

• If most of the missing values concern only one specific variable (e.g., household or personal income) and such variable is not central to the analysis, we may decide to delete it.

• The same applies to a single respondent with many missing values.

• However, missing values are usually scattered throughout the entire data matrix.

• In that case, we should know if the missing data form a pattern and if missingness is related to some other variables.

Source: [Lavrakas 2008: 469]


Analysis of dependency & mutual interdependence of missing values

We are dealing with two issues:

a) How the missing values are intertwined among (dependent) variables (e.g. in an item battery)

b) Whether they depend on sorting factors (e.g. age, education, income or a filtering question)

1. The simplest procedure: „switching off“ the missing-value definition (so that the values are included) and analysing the relevant categories, e.g. in a contingency table.

2. MVA (Missing Value Analysis) (only in the advanced version of SPSS)

3. Construction of a new variable carrying information about the missing value (or values, for several variables) and its separate analysis (or inclusion in a model): a dichotomous variable indicating missing vs. valid value, or a count variable indicating how many missing values occurred within a set of variables (e.g. in an item battery).
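
Procedures 1 and 3 can be sketched in Base SPSS syntax roughly as follows. All variable names (income_cat, income, education, item1 to item5) are hypothetical, and item1 TO item5 assumes that the items are stored next to each other in the data file.

```spss
* Procedure 1: keep the user-missing categories visible in a contingency table.
CROSSTABS /TABLES = income_cat BY education
  /MISSING = INCLUDE
  /CELLS = COUNT COLUMN.

* Procedure 3a: indicator with 1 = income is missing (system or user) and 0 = valid.
COMPUTE inc_miss = MISSING(income).

* Procedure 3b: number of missing answers within the item battery.
COUNT n_miss = item1 TO item5 (MISSING).

* Who does not answer: share of missing income within each level of education.
CROSSTABS /TABLES = education BY inc_miss
  /CELLS = COUNT ROW.

* Distribution of the number of unanswered items.
FREQUENCIES VARIABLES = n_miss.
```

The new variables inc_miss and n_miss can then be analysed like any other variable, for example as a dependent variable in a model.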


MVA – Missing Value Analysis

• MVA can reveal patterns of missing values occurring simultaneously across a set of variables

Note: MVA is not available in the Base version of SPSS (but we can help ourselves with some tricks).


MVA - notes

• Don't use weighting: if a weight is switched on, switch it off first → WEIGHT OFF.

• Basic features: a description of the missing values + missing values patterns

• It distinguishes numeric and categorical variables

MVA age income gender region /CATEGORICAL gender region.


MVA Output (1)

Basic output (with request for categorical variables)

MVA age income gender region /CATEGORICAL gender region.


MVA Output (2)

• Patterns of missing values across a set of variables → How many respondents did not answer how many items from the battery of questions?

5 = Don't know

6 = No answer

This table only shows the coding of MV across a set of variables, not the occurrence of patterns as such.

There are many other settings and outputs in MVA.


Missing data can be:

• Missing completely at random (MCAR) → missingness is unrelated to any variable, observed or not; the ideal situation, the results are not biased

• Missing at random (MAR) → missingness may depend on other, observed variables (e.g. age or education), but not on the missing value itself

• Not missing at random (NMAR) → missingness depends on the unobserved value itself (e.g. income is withheld more often by people with high incomes) → the problem of biased results
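
A simple first check of whether missingness is related to an observed variable can be sketched as follows. The variables income and age and the indicator miss_income are hypothetical. A clear difference between the two groups speaks against MCAR; the check cannot separate MAR from NMAR, because that distinction concerns values we did not observe.

```spss
* Hypothetical variables: income (with missing values) and age (observed).
* Indicator: 1 = income is missing (system or user) and 0 = income is valid.
COMPUTE miss_income = MISSING(income).

* Compare the mean age of respondents with valid and with missing income.
T-TEST GROUPS = miss_income (0 1)
  /VARIABLES = age.
```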


Intersection of valid cases: restricting the analysis to complete cases with valid data

Analyses in a text (e.g. a report or thesis) should be based on a consistent subset with the same number of valid cases across variables.

→ Successive bivariate analyses should share the same base of valid cases, following the LISTWISE principle: missing values are intersected across all the variables (i.e. in a survey, only those who answered all the questions are reported).

But this can be highly problematic, especially if there are many missing values (>5%) and/or they occur in different cases for different variables (not overlapping). In effect, it discards a lot of information.
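
One possible way to obtain such a consistent base in syntax is sketched below. The variables v1, v2 and v3 are hypothetical; the filter variable complete is introduced only for this illustration.

```spss
* Hypothetical variables v1, v2 and v3.
* Flag the complete cases, those with valid values on all three variables.
COMPUTE complete = (NMISS(v1, v2, v3) = 0).
FILTER BY complete.

* All analyses now share the same base of valid cases.
CROSSTABS /TABLES = v1 BY v2.
CROSSTABS /TABLES = v1 BY v3.

* Switch the filter off to return to all cases.
FILTER OFF.

* For comparison: correlations with listwise and with pairwise deletion.
CORRELATIONS /VARIABLES = v1 v2 v3 /MISSING = LISTWISE.
CORRELATIONS /VARIABLES = v1 v2 v3 /MISSING = PAIRWISE.
```

A filter is used instead of deleting cases so that the excluded respondents stay in the data file; comparing the N reported by the two CORRELATIONS runs shows how much information listwise deletion discards.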
