Copyright © SAS Institute Inc. All rights reserved.
Missing Data? Two SAS Procedures to the Rescue
HPIMPUTE and SURVEYIMPUTE
Melodie Rush
Customer Success Principal Data Scientist
Connect with me:
LinkedIn: https://www.linkedin.com/in/melodierush
Twitter: @Melodie_Rush
AGENDA
• Introduction: What, Why and How
• Proc HPIMPUTE: Syntax, Imputation Options, Examples
• Proc SURVEYIMPUTE: Syntax, Imputation Options, Examples
What is Missing Data? Definition
In statistics, missing data, or missing values, occur when no data value is stored for the variable in an observation. Missing data are a common occurrence and can have a significant effect on the conclusions that can be drawn from the data. - Wikipedia
What is Missing Data? SAS
A missing value is a value that indicates that no data value is stored for the variable in the current observation. There are three kinds of missing values:
• numeric
• character
• special numeric
By default, SAS prints a missing numeric value as a single period (.) and a missing character value as a blank space. See Creating Special Missing Values for more information about special numeric missing values.
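A minimal sketch of the three kinds of missing values in a DATA step. The data set and variable names (missing_demo, x, name, y) are made up for illustration.

```sas
data missing_demo;
   length name $8;
   x    = .;      /* ordinary numeric missing value */
   name = ' ';    /* character missing value (blank) */
   y    = .A;     /* special numeric missing value (.A to .Z and ._) */

   n_missing = nmiss(x, y);        /* counts missing numeric arguments */
   c_missing = cmiss(x, name, y);  /* counts missing numeric or character arguments */
   put n_missing= c_missing=;      /* writes n_missing=2 c_missing=3 to the log */
run;
```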
What should you do about missing values?
• Replace with a constant or zero
• Replace with the mean or mode
• Replace using an imputation method
• Remove observation(s)
Proc HPIMPUTE
Proc HPIMPUTE
1. Syntax
2. Imputation Options
3. Other Options
Proc HPIMPUTE
The HPIMPUTE procedure executes high-performance numeric variable imputation.
• Takes only numeric variables.
• Runs in either single-machine mode or distributed mode.
HPIMPUTE Procedure Documentation
Proc HPIMPUTE: Syntax
proc hpimpute options;
   input variables;
   impute variables </ options>;
   performance <performance options>;
   id variables;
   freq variable;
   code <options>;
   <…>
run;
Proc HPIMPUTE: Imputation Methods
• VALUE
– Replaces missing values with the specified value
• MEAN
– Replaces missing values with the algebraic mean of the variable
• RANDOM
– Replaces missing values with a random value that is drawn between the minimum and the maximum of the variable
• PMEDIAN
– Replaces missing values with the pseudomedian of the variable
Proc HPIMPUTE: Example Data
• 6 variables
• The first 4 have missing values
• The fifth is the frequency variable
• The last is an index variable
Example Code and Documentation
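The example data set itself appears only as a screenshot in the deck. A data set with the same layout could be built as sketched below. The values are made up for illustration and are not the presenter's data. The names a, b, c, d and id follow the code shown later in the deck, and the name freq is an assumption.

```sas
/* Hypothetical stand-in for the example data: a-d contain missing values (.), */
/* freq is the frequency variable, and id is an index variable.                */
data ex1;
   input a b c d freq id;
   datalines;
2 3 1 1 2 1
. 2 2 2 2 2
3 . 1 . 1 3
2 3 . 1 1 4
1 4 5 3 2 5
;
run;
```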
Proc HPIMPUTE: Example Code – Value Method
Replaces missing values with the specified value
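The code on this slide is not preserved in the text. A minimal sketch of the VALUE method, modeled on the CODE statement example later in the deck, might look like this. The output data set name out_value is an assumption.

```sas
/* Replace missing values of a with the constant 0.1 */
proc hpimpute data=ex1 out=out_value;
   id id;
   input a;
   impute a / value=0.1;
run;
```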
Proc HPIMPUTE: Example Results – Value Method
Results table columns: Variable Name, Indicator Name, Imputed Variable Name, Number Imputed
Proc HPIMPUTE: Example Output Data – Value Method
Proc HPIMPUTE: Example Code – Mean Method
Replaces missing values with the algebraic mean of the variable
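A minimal sketch of the MEAN method, again modeled on the CODE statement example later in the deck. The output data set name out_mean is an assumption.

```sas
/* Replace missing values of d with the mean of its nonmissing values */
proc hpimpute data=ex1 out=out_mean;
   id id;
   input d;
   impute d / method=mean;
run;
```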
Proc HPIMPUTE: Example Results – Mean Method
Results table columns: Variable Name, Indicator Name, Imputed Variable Name, Number Imputed
Proc HPIMPUTE: Example Output Data – Mean Method
Proc HPIMPUTE: Example Code – Random Method
Replaces missing values with a random value that is drawn between the minimum and the maximum of the variable
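A minimal sketch of the RANDOM method, modeled on the CODE statement example later in the deck. The output data set name out_random is an assumption.

```sas
/* Replace missing values of c with a random draw between min(c) and max(c) */
proc hpimpute data=ex1 out=out_random;
   id id;
   input c;
   impute c / method=random;
run;
```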
Proc HPIMPUTE: Example Results – Random Method
Results table columns: Variable Name, Indicator Name, Imputed Variable Name, Number Imputed
Proc HPIMPUTE: Example Output Data – Random Method
Proc HPIMPUTE: Example Code – Pseudomedian Method
Replaces missing values with the pseudomedian of the variable
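A minimal sketch of the PMEDIAN method, modeled on the CODE statement example later in the deck. The output data set name out_pmedian is an assumption.

```sas
/* Replace missing values of b with the pseudomedian of b */
proc hpimpute data=ex1 out=out_pmedian;
   id id;
   input b;
   impute b / method=pmedian;
run;
```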
Proc HPIMPUTE: Example Results – Pseudomedian Method
Results table columns: Variable Name, Indicator Name, Imputed Variable Name, Number Imputed
Proc HPIMPUTE: Example Output Data – Pseudomedian Method
Proc HPIMPUTE: ID Statement
• The optional ID statement lists one or more variables from the input data set that are transferred to the output data set.
• The ID statement accepts numeric and character variables.
Proc HPIMPUTE: FREQ Statement
• The variable in the FREQ statement identifies a numeric variable in the data set that contains the frequency of occurrence for each observation.
• PROC HPIMPUTE treats each observation as if it appeared n times, where n is the value of the FREQ variable for the observation.
• If the frequency value is not an integer, it is truncated to an integer.
• If the frequency value is less than 1 or missing, the observation is not used in the analysis.
• When the FREQ statement is not specified, each observation is assigned a frequency of 1.
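The rules above can be sketched as follows, assuming a data set ex1 with a numeric variable d, an index variable id, and a frequency variable named freq (the name freq is an assumption). The second step is only a hand check of the frequency-weighted mean under those rules. It is not how the procedure is implemented.

```sas
/* Mean imputation in which each observation counts freq times */
proc hpimpute data=ex1 out=out_freq;
   id id;
   input d;
   impute d / method=mean;
   freq freq;
run;

/* Hand check: truncate freq to an integer, skip rows with freq < 1 or missing */
data _null_;
   set ex1 end=last;
   if not missing(freq) then n = int(freq);
   else n = 0;
   if n >= 1 and not missing(d) then do;
      wsum + n * d;   /* sum statement: running total of n*d */
      wcnt + n;       /* running total of frequencies */
   end;
   if last and wcnt > 0 then do;
      freq_mean = wsum / wcnt;
      put freq_mean=;
   end;
run;
```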
Proc HPIMPUTE: FREQ Statement Results
[Side-by-side results: with the FREQ statement and without it]
Proc HPIMPUTE: Syntax - CODE Statement
proc hpimpute data=ex1 out=out1;
id id;
input a b c d;
impute a / value=0.1;
impute b / method=pmedian;
impute c / method=random;
impute d / method=mean;
code file='c:/temp/hpimpute.sas';
run;
The CODE statement generates SAS DATA step code that mimics the computations that are performed when the IMPUTE statement runs in single-machine mode and uses a single thread.
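One way to apply the generated code to a data set is to include the file inside a DATA step. This is a sketch: the file path is the one used in the CODE statement above, and the output data set name out_from_code is an assumption.

```sas
/* Apply the generated imputation code to the same (or new) data */
data out_from_code;
   set ex1;
   %include 'c:/temp/hpimpute.sas';
run;
```

The resulting data set then carries the M_ indicator and IM_ imputed variables that the generated code creates.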
Proc HPIMPUTE: Results - CODE Statement
%let HPDM_seed=5;
/* Variable a: VALUE=0.1 */
if a = . then do;
   M_a = 1;
   IM_a = 0.1;
end;
else do;
   M_a = 0;
   IM_a = a;
end;
length M_a IM_a 8;
/* Variable b: METHOD=PMEDIAN */
if b = . then do;
   M_b = 1;
   IM_b = 3;
end;
else do;
   M_b = 0;
   IM_b = b;
end;
length M_b IM_b 8;
Proc HPIMPUTE: Results - CODE Statement
/* Variable c: METHOD=RANDOM */
HPDM_vmin = 1;
HPDM_vmax = 10;
if c = . then do;
   M_c = 1;
   IM_c = HPDM_vmin + (HPDM_vmax - HPDM_vmin)*ranuni(&HPDM_seed);
end;
else do;
   M_c = 0;
   IM_c = c;
end;
length M_c IM_c 8;
/* Variable d: METHOD=MEAN */
if d = . then do;
   M_d = 1;
   IM_d = 5.5;
end;
else do;
   M_d = 0;
   IM_d = d;
end;
length M_d IM_d 8;
Proc HPIMPUTE: Syntax - PERFORMANCE Statement
proc hpimpute data=ex1 out=out1;
id id;
input a b c d;
impute a / value=0.1;
impute b / method=pmedian;
impute c / method=random;
impute d / method=mean;
performance nodes=0;
run;
• Defines performance parameters for multithreaded and distributed computing, passes variables that describe the distributed computing environment, and requests detailed results about the performance characteristics of the HPIMPUTE procedure.
• Also use the PERFORMANCE statement to control whether the HPIMPUTE procedure executes in single-machine or distributed mode.
Performance Statement Documentation
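As a further sketch, the PERFORMANCE statement can also set the thread count and request timing details in single-machine mode. The value NTHREADS=4 is an arbitrary choice for illustration.

```sas
/* Single-machine mode with 4 threads and a timing breakdown */
proc hpimpute data=ex1 out=out1;
   id id;
   input a b c d;
   impute d / method=mean;
   performance nthreads=4 details;
run;
```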
Proc HPIMPUTE: Results – Performance Statement
Proc HPIMPUTE: Syntax – Performance Statement
Running in a high-performance environment
option set=GRIDHOST="&GRIDHOST";
option set=GRIDINSTALLLOC="&GRIDINSTALLLOC";
proc hpimpute data=ex1 out=out1;
id id;
input a b c d;
impute a / value=0.1;
impute b / method=pmedian;
impute c / method=random;
impute d / method=mean;
performance nodes=2 details
host="&GRIDHOST" install="&GRIDINSTALLLOC";
run;
Proc HPIMPUTE: Results – Performance Statement
Running in a high-performance environment
Proc SURVEYIMPUTE
SAS/STAT SURVEY Procedures
➢ SURVEYSELECT: Sample selection
➢ SURVEYIMPUTE: Imputation
➢ SURVEYMEANS: Descriptive statistics
➢ SURVEYFREQ: Frequency tables
➢ SURVEYREG: Linear models
➢ SURVEYLOGISTIC: Logistic regression
➢ SURVEYPHREG: Proportional hazards
Proc SURVEYIMPUTE
1. Syntax
2. Imputation Options
3. Analyzing Results
Handling Missing Values in Survey Data: Analysis of Imputed Data
• How are the data collected?
• How are the missing values imputed?
Different imputation methods require different analysis techniques.
Handling Missing Values in Survey Data: The Nonresponse Problem
ID   Income   Tax Return
1        40           42
2       120          116
3        60           55
4        80           84
5         .          410
6       370          320
7       210          230
The unobserved income for ID 5 is 450.
Average household income = 147 (six respondents, ID 5 missing)
Average household income = 190 (all seven households, including the 450 for ID 5)
• Prevention is the best solution for nonresponse
• Information is the best tool for imputation
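The nonresponse example can be reproduced with a small data set, sketched below. The variable name TaxReturn and the definition of cell2 are assumptions. The later slides use a variable named cell2 as the imputation cell but do not show how it was built.

```sas
/* Seven households; income is missing for ID 5 */
data work.surveyimpute;
   input ID Income TaxReturn;
   cell2 = (TaxReturn > 200);   /* hypothetical imputation cell: 1 = high tax return */
   datalines;
1 40 42
2 120 116
3 60 55
4 80 84
5 . 410
6 370 320
7 210 230
;
run;

/* Mean of the six observed incomes: 880/6, about 146.67 */
proc means data=work.surveyimpute n nmiss mean;
   var Income;
run;
```

With this cell definition, ID 5 shares a cell with IDs 6 and 7, so auxiliary information (the tax return) steers the choice of donors.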
PROC SURVEYIMPUTE
The SURVEYIMPUTE procedure imputes missing values of an item in a sample survey by replacing them with observed values from the same item.
Imputation methods include
• Single and Multiple Hot-Deck Imputation
• Approximate Bayesian Bootstrap (ABB) Imputation
• Fully Efficient Fractional Imputation (FEFI)
• Fractional Hot-Deck Imputation (FHDI)
PROC SURVEYIMPUTE Documentation
Handling Missing Values in Survey Data: PROC SURVEYIMPUTE Syntax
proc surveyimpute options;
   cluster variables;
   repweights variables;
   strata variables;
   weight variable;
   cells variables;
   var variables;
   by variables;
   class variables;
   id variable;
   output options;
   <…>
run;
Proc SURVEYIMPUTE: Syntax - Method=HotDeck
Imputation techniques that use observed values from the sample to impute (fill in) missing values are known as hot-deck imputation.
proc surveyimpute data=work.surveyimpute;
var income;
output out=hotdeck;
run;
Proc SURVEYIMPUTE: Example Results – Method=HotDeck
Proc SURVEYIMPUTE: Example Output Data – Method=HotDeck
Proc SURVEYIMPUTE: Hot-Deck Imputation
[Diagram labels: Data, Imputation Cells, Donors, Recipients]
Proc SURVEYIMPUTE: Syntax, Method=HotDeck
proc surveyimpute data=work.surveyimpute
      method=hotdeck(selection=SRSWOR)
      ndonors=1 seed=8523;
   cells cell2;
   var income;
   id ID;
   output out=hotdeck donorid;
run;
The SELECTION= option modifies the donor selection.
Proc SURVEYIMPUTE: Example Results – Method=HotDeck
Proc SURVEYIMPUTE: Example Output Data – Method=HotDeck
Proc SURVEYIMPUTE: Syntax, Method=HotDeck Selection=ABB
proc surveyimpute data=work.surveyimpute
      method=hotdeck(selection=abb)
      ndonors=1 seed=8523;
   cells cell2;
   var income;
   id ID;
   output out=hotdeckb donorid;
run;
The SELECTION= option modifies the donor selection.
SELECTION=ABB requests hot-deck donor selection by using the approximate Bayesian bootstrap method. For more information, see the section Approximate Bayesian Bootstrap.
Proc SURVEYIMPUTE: Approximate Bayesian Bootstrap
[Diagram labels: Imputation Cells, Donors, Donor Pool, Recipients; selection steps marked SRSWR]
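The two SRSWR steps of the approximate Bayesian bootstrap can be sketched by hand for a single imputation cell. This is an illustration of the idea only, not the procedure's implementation. The donor values and the number of recipients are made up.

```sas
data abb_sketch;
   call streaminit(8523);
   array donor[4] _temporary_ (40 60 80 120);   /* hypothetical observed incomes in one cell */
   array pool[4]  _temporary_;

   /* Step 1: draw a donor pool of the same size from the donors, with replacement */
   do i = 1 to dim(donor);
      pool[i] = donor[ceil(dim(donor) * rand('uniform'))];
   end;

   /* Step 2: draw one donor per recipient from the pool, with replacement */
   do recipient = 1 to 2;
      imputed_income = pool[ceil(dim(pool) * rand('uniform'))];
      output;
   end;
   keep recipient imputed_income;
run;
```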
Proc SURVEYIMPUTE: Example Results – Method=HotDeck Selection=ABB
Proc SURVEYIMPUTE: Example Output Data – Method=HotDeck Selection=ABB
Proc SURVEYIMPUTE: Fully Efficient Fractional Imputation (FEFI)
• Uses multiple donor units for a recipient unit.
• The number of donor units for a recipient unit is equal to the number of observed levels for the missing items.
• Each donor donates a fraction of the original weight of the recipient unit such that the sum of the fractional weights from all the donors is equal to the original weight of the recipient.
• Does not introduce additional variability that is caused by the selection of donor units.
• One disadvantage is that it can greatly increase the size of the imputed data set.
Fully Efficient Fractional Imputation Documentation
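The weight-splitting idea can be sketched by hand for one recipient. The sketch assumes that each observed level receives a share of the recipient's weight proportional to the summed donor weights at that level. The levels and weights are made up, and this is an illustration rather than the procedure's algorithm.

```sas
data fefi_sketch;
   recipient_weight = 1;                        /* hypothetical original weight */
   array level[3]    _temporary_ (40 60 120);   /* hypothetical observed income levels */
   array donor_wt[3] _temporary_ (2 1 1);       /* hypothetical summed donor weights per level */

   total = 0;
   do i = 1 to dim(donor_wt);
      total = total + donor_wt[i];
   end;

   /* One imputed row per observed level; the fractional weights sum to the original weight */
   do i = 1 to dim(level);
      income = level[i];
      frac_weight = recipient_weight * donor_wt[i] / total;   /* 0.5, 0.25, 0.25 */
      output;
   end;
   keep income frac_weight;
run;
```

One recipient row becomes three imputed rows here, which is why FEFI can greatly increase the size of the imputed data set.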
Proc SURVEYIMPUTE: Syntax, Method=FEFI (Fully Efficient Fractional Imputation)
proc surveyimpute data=work.surveyimpute
      method=FEFI;
   cells cell2;
   var income;
   class income;
   id ID;
   output out=FEFI;
run;
The CLASS statement is required for FEFI.
Handling Missing Values in Survey Data: Fully Efficient Fractional Imputation
[Diagram labels: Imputation Cells, Donors, Recipients, Imputed Data]
Proc SURVEYIMPUTE: Example Results – Method=FEFI
Proc SURVEYIMPUTE: Example Output Data – Method=FEFI
Proc SURVEYIMPUTE: Fractional Hot-Deck Imputation (FHDI)
• Uses multiple donor units for a recipient unit.
• Each donor donates a fraction of the original weight of the recipient unit such that the sum of the fractional weights from all the donors is equal to the original weight of the recipient.
• The fraction of the recipient weight that a donor unit contributes to the recipient unit is known as the fractional weight.
• The donors are selected by using probability proportional to size (PPS) selection in which the two-stage FEFI weights are used as the size measure.
• FHDI is useful for reducing the size of the imputed data when two-stage FEFI creates many imputed rows.
– FHDI follows the same imputation steps as those of two-stage FEFI, but FHDI selects a subset of second-stage donor cells from all possible second-stage donor cells for the imputation.
Fractional Hot-Deck Imputation Documentation
Proc SURVEYIMPUTE: Syntax, Method=FHDI (Fractional Hot-Deck Imputation)
proc surveyimpute data=work.surveyimpute2
      method=FHDI ndonors=3 seed=8523;
   cells cell2;
   var income age (clevvar=agegroup);
   class income;
   id ID;
   output out=FHDI;
run;
At least 2 missing values for each row (one continuous with a binned version)
Handling Missing Values in Survey Data: Data - Method=FHDI
Proc SURVEYIMPUTE: Method=FHDI
Proc SURVEYIMPUTE: Example Results – Method=FHDI
Proc SURVEYIMPUTE: Example Output Data – Method=FHDI
Handling Missing Values in Survey Data: Hot-Deck Analysis Statements
Ignore the imputation variance
proc surveymeans data=hotdeck3;
   var income;
   repweights RepWt_: / jkcoefs=0.857;
run;
Handling Missing Values in Survey Data: FEFI Analysis Statements
Use the WEIGHT and REPWEIGHTS statements
proc surveymeans data=fefi;
var income;
weight ImpWt;
repweights ImpRepWt_: / jkcoefs=0.857;
run;
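The deck does not say where JKCOEFS=0.857 comes from. One plausible reading, stated here as an assumption, is the delete-one jackknife coefficient (R - 1)/R with R = 7 replicates, one per household in the example data.

```sas
/* Assumed origin of the jackknife coefficient: (R - 1)/R with R = 7 */
data _null_;
   n_replicates = 7;
   jkcoef = (n_replicates - 1) / n_replicates;
   put jkcoef= 5.3;   /* writes jkcoef=0.857 to the log */
run;
```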
Handling Missing Values in Survey Data: Comparing the Estimates
Estimates for Average Income
Imputation Method   Estimate   Standard Error
No Missing            190.00            61.10
No Imputation         146.70            50.97
Hot-Deck              178.57            53.60
FEFI                  159.04            54.43
FHDI*                 167.71            27.25
▪ Same analysis but different results
* FHDI is based on a different data set with 20 rows, versus 7 rows for the other methods
Handling Missing Values in Survey Data: Handling Nonresponse in SAS/STAT
• PROC SURVEYIMPUTE is the tool for imputing missing values from complex surveys
• FEFI introduces no additional variability from the imputation and is the preferred method for survey data
• FHDI is the preferred method for continuous data
• The analysis technique should be tailored to both the survey design and the imputation method
Resources: Where to learn more
Where to learn more? SAS Documentation
• Working with Missing Data in SAS
• Proc HPIMPUTE Documentation
• Proc SURVEYIMPUTE Documentation
• Handling Missing Values in Survey Data (Video)
• Proc SURVEYIMPUTE References
Where to learn more? Papers
• Mukhopadhyay, P. K. (2016). “Survey Data Imputation with PROC SURVEYIMPUTE” In Proceedings of the SAS Global Forum 2016 Conference. Cary, NC: SAS Institute Inc.
• Stokes, Maura (and Statistical R&D Staff). “SAS/STAT 14.1: Methods for Massive, Missing, or Multifaceted Data” In Proceedings of the SAS Global Forum 2015 Conference. Cary, NC: SAS Institute Inc.
• Cutler, D. Richard. “Machine Learning and Predictive Analytics in SAS® Enterprise Miner™ and SAS/STAT® Software” In Proceedings of the SAS Global Forum 2019 Conference. Cary, NC: SAS Institute Inc.
Where to learn more? Book
Complex Survey Data Analysis with SAS
FIND YOUR USER GROUP
sas.com/usersgroups
You should do the following (if you’re not already):
◊ Tap into local resources
◊ Learn from other SAS Users’ experiences
◊ Connect with the local SAS Users’ network
ARE YOU AN EXPLORER?
Whether you’re a modeler, programmer, or administrator, everyone is welcome on SAS Analytics Explorers!
More ways to:
◊ Learn SAS
◊ Get support
◊ Connect with users across the US
Ready to become an explorer? Got questions? explorers.sas.com
DON’T BE SHY, ASK THE EXPERT
Tips & tricks webinars on a variety of SAS topics, plus get all your questions answered by the SAS expert, live.
sas.com/asktheexpert
sas.com
Thank you for your time and attention!
Questions?
Connect with me:
LinkedIn: https://www.linkedin.com/in/melodierush
Twitter: @Melodie_Rush