
Editing and Imputation for Quantitative Survey Data


RODERICK J. A. LITTLE and PHILIP J. SMITH*

This article develops a three-stage strategy for cleaning survey data with missing and outlying values: (a) detection of outlying cases, (b) detection of outlying values within outlying cases, and (c) imputation of likely values for missing and/or outlying and edited values. Methodological tools include distance measures, graphical procedures, and maximum likelihood and robust estimation for incomplete multivariate normal data. Data from the Annual Survey of Manufactures (ASM) are used to illustrate the method.

KEY WORDS: Sweep operator; EM algorithm; Graphical methods; Incomplete data; Mahalanobis distance; Maximum likelihood estimation; Multivariate data; Outlier detection; Robust estimation.

1. INTRODUCTION

Hartley (1980) noted that nonsampling errors create the most important problems in the analysis of data from sample surveys. In particular, in multivariate surveys all variables are potentially subject to (a) spurious responses and (b) missing data. This article presents methodology for dealing with these problems in the context of multivariate data with continuous variables. The article considers (a) robust estimation of a vector mean and correlation matrix from data containing outliers and missing data (Sec. 2.3), (b) identification of cases containing outlying values (Sec. 2.4), (c) identification of outlying values within these cases (Sec. 2.5), and (d) the creation of imputed values for missing data resulting from nonresponse or from deletion in the editing process (Sec. 3). With regard to (d), imputation is commonly adopted by federal statistical agencies to provide rectangular data sets for public use microdata files; for justifications of the method see Kalton and Kasprzyk (1982) and Rubin (1987).

In developing our methods, we combine ideas from three areas of statistical research: multivariate robust estimation, multivariate outlier detection, and the analysis of incomplete data. The multivariate robust estimation problem has been discussed by Maronna (1976), Huber (1977), Campbell (1980), and Devlin, Gnanadesikan, and Kettenring (1981). Research on outlier methods has been recently reviewed by Barnett and Lewis (1978) and Beckman and Cook (1983). The literature on incomplete data was reviewed by Hartley and Hocking (1971), Orchard and Woodbury (1972), Dempster, Laird, and Rubin (1977), and Little and Rubin (1987).

* Roderick J. A. Little is Associate Professor of Biomathematics, School of Medicine, University of California, Los Angeles, CA 90024. Philip J. Smith is Biometrician, International Pacific Halibut Commission, Seattle, WA 98145-2009. Little's contribution to this article was supported in part by the U.S. Bureau of the Census and by a grant from the National Science Foundation while he was an ASA Research Fellow at the U.S. Bureau of the Census and in part by National Institute of Mental Health Grant MH 37188. Smith's contribution was conducted while he was a Mathematical Statistician in the Statistical Research Division of the U.S. Bureau of the Census. The authors wish to thank Brian Greenberg, Preston J. Waite, an associate editor, two referees, and in particular the editor for constructive criticisms of earlier drafts of the article.

Imputation methods for survey data were discussed by Kalton and Kasprzyk (1982), Little (1982), Sande (1982), Sedransk and Titterington (1980), Chapman (1976), Little and Rubin (1987), and Rubin (1987).

Shih and Weisberg (1983) discussed the combination of outlier detection methods and methods for incomplete data in the regression context in which one variable is dependent on the remaining variables. In contrast, we are concerned with methods that treat all of the variables symmetrically. Our approach builds on the work of Frane (1976, 1978), in which outlying cases are identified by large values of the Mahalanobis distance, with the mean and covariance matrix estimated by maximum likelihood for incomplete multivariate normal data, using the expectation-maximization (EM) algorithm. The use of the Mahalanobis distance in editing has been previously suggested by Wilks (1963), Gnanadesikan and Kettenring (1972), and Hawkins (1974, 1980). The EM algorithm for normal data was discussed by Orchard and Woodbury (1972), Beale and Little (1975), and Dempster et al. (1977); the algorithm was stated incorrectly in Frane's 1976 article. We extend Frane's work by (a) providing a graphical procedure for detecting outlying cases; (b) modifying the EM algorithm to provide robust estimates of the mean and covariance matrix analogous to those given by Campbell (1980); and (c) providing an efficient stepwise procedure for identifying outlying values within outlying cases. Our outlier identification procedure differs from (and in our view improves on) the discriminant analysis suggested by Frane. Some of Frane's procedures are available in the BMDPAM computer program (Dixon 1983).

Our methods are illustrated using 152 cases for a particular industry (not specified for purposes of confidentiality) from the Annual Survey of Manufactures (U.S. Department of Commerce, Bureau of the Census 1981). Fourteen variables were selected for analysis, seven current year variables and the seven corresponding variables from the previous year. The variables are the number of production workers (PW), the number of all other employees (OE), legally required fringe benefits paid (LE), voluntary payments to fringe benefit programs (VP), total man-hours worked by production workers (MH), production workers' wages (WW), and all other salaries and wages (OW). In this article, when these variable labels are prefaced by an "A" they refer to current year information, and when prefaced by a "B" they refer to prior year information.

Two important limitations of our methods should be mentioned. First, they are intended for use with quantitative variables and thus do not in general apply when categorical variables are included in the data set. Second, our methods require modification to deal with logical editing constraints, such as the requirement that a total equals the sum of its parts. Such constraints are an important feature of the Annual Survey of Manufactures (ASM) data used in our illustration and need to be taken into account in a comprehensive editing system for these data. The analysis of a complex set of editing constraints was discussed by Fellegi and Holt (1976), Greenberg (1981), and Greenberg and Surdi (1984). The adaptation of our methodology to incorporate simple editing constraints in the ASM context was discussed by Little and Smith (1983). A general synthesis of logical or mathematical approaches to editing with the statistical methods we propose is a challenging task.

2. ESTIMATION AND OUTLIER DETECTION

2.1 Preliminaries

As a preliminary to editing and imputation, rows and columns of the data matrix X are rearranged to group similar patterns of missing data together. Figure 1 illustrates the results of this operation for all records in the data set with missing items. This step clarifies the pattern of missing data and reduces the computing time required in subsequent calculations. Another useful preliminary step is displaying the marginal distributions of the observed values to detect asymmetry and identify clearly incorrect (e.g., negative) values. Figure 2 presents the marginal distributions of two of the ASM variables, AWW and APW.

One might consider applying methods for univariate outlier detection to the marginal distributions of the variables. But there are two reasons why this is not appropriate for economic data. First, the distributions of many of these variables are skewed. Thus standard methods assuming normality cannot be applied without a preliminary transformation of the data. Second, even if an appropriate transformation can be applied, it is the lack of association between variables that will indicate that a value is outlying rather than the value's location in the variable's marginal distribution.

In particular, current methods for editing ASM data are based on ratio edits of the form

$l \le a/b \le u, \quad (1)$

where a and b are the values of two variables A and B (e.g., A = production workers' wages and B = number of production workers) and l and u are lower and upper limits for the ratio as specified by subject matter specialists. Ratios also form the basis for imputing missing values. For example, when A is missing and B is present for the current year, the imputed value of A is

$\hat{a} = (a'/b')b, \quad (2)$

where b is the current year value of B and a', b' are the prior year values of A and B, respectively.
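To make (1) and (2) concrete, here is a minimal sketch of a ratio edit check and a ratio imputation; the function names and the edit limits are hypothetical illustrations, not the ASM edit specifications.

```python
import numpy as np

def ratio_edit_ok(a, b, lower, upper):
    """Ratio edit (1): accept the pair if lower <= a/b <= upper."""
    return lower <= a / b <= upper

def ratio_impute(a_prior, b_prior, b_current):
    """Ratio imputation (2): carry the prior-year ratio a'/b' forward
    to the current-year value of B."""
    return (a_prior / b_prior) * b_current

# Hypothetical wage (WW) and worker-count (PW) values and limits:
print(ratio_edit_ok(a=10951.0, b=81.0, lower=50.0, upper=500.0))   # True
print(ratio_impute(a_prior=9250.0, b_prior=83.0, b_current=81.0))  # imputed current-year wage
```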

[Figure 1. Missing Data Pattern Analysis. Each row is a record with missing items; "M" denotes a missing value. Column codes: A = BPW, B = BWW, C = BLE, D = BVP, E = BMH, F = APW, G = AWW, H = BOW, I = ALE, J = AVP, K = AMH, L = BOE, M = AOE, N = AOW.]

The procedure just described is based on predetermined bivariate relationships between pairs of variables. Our objective is to replace them by multivariate relationships determined empirically from the data. This objective is promoted by transforming variables to the log scale. Taking logarithms in (1) and (2) yields bivariate linear relationships between the variables on the log scale. As we shall see, our procedures involve linear combinations of two or more observed variables and hence are more closely related to (1) and (2) when variables are transformed to the log scale. A more direct justification for the log transformation is that our methods are based on multivariate normality, and we believe that this assumption is more reasonable when our variables have been transformed by the log function. There is some evidence of this in the distributions in Figure 2, which indicate that the log transformation reduces skewness in the marginal distributions.

2.2 Identification of Outlying Cases: μ and Σ Known

Let $y_i = (y_{i1}, \ldots, y_{ip})$ denote the vector of true values for case i (ln transformed in our application) in the absence of missing values or contamination by response errors. Let $x_i = (x_{i1}, \ldots, x_{ip})$ denote the corresponding vector of (ln-transformed) values for a case in which all of the components are present but are subject to response error.


[Figure 2. Histograms of Original Scale and ln Transformed AWW and APW. Panels show AWW/10000, APW, Ln(AWW/10000), and Ln(APW), with frequency on the vertical axes.]

Suppose that the $y_i$ are iid with mean $\mu = (\mu_1, \ldots, \mu_p)$ and covariance matrix $\Sigma = \{\sigma_{jk}\}$. The squared Mahalanobis distance

$\Delta_i^2 = (x_i - \mu)' \Sigma^{-1} (x_i - \mu) \quad (3)$

is a natural measure of the distance of case i from the mean of the multivariate distribution. If $y_i$ is multivariate normal and $x_i = y_i$, $\Delta_i^2$ has a chi-squared distribution with p df. If case i is incomplete, let $x_{(p_i)}$ denote the vector of X variables present in case i, and let $p_i$ denote their number. An obvious adaptation of (3) is to restrict the distance $\Delta_i^2$ to the present variables and their related parameters only. That is, define

$\Delta_i^2 = (x_{(p_i)} - \mu_{(p_i)})' \Sigma_{(p_i)}^{-1} (x_{(p_i)} - \mu_{(p_i)}), \quad (4)$

where $\mu_{(p_i)}$ and $\Sigma_{(p_i)}$ are the $(p_i \times 1)$ vector of means and the $(p_i \times p_i)$ covariance matrix corresponding to the present variables $x_{(p_i)}$. If $y_i$ is multivariate normal, the observed values are not contaminated ($x_{(p_i)} = y_{(p_i)}$), and the missing data are missing completely at random (MCAR) in the sense defined by Rubin (1976), then $\Delta_i^2 \sim \chi^2_{p_i}$, and large values of $\Delta_i^2$ suggest contamination in the observed components of $x_{(p_i)}$.
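A minimal sketch of (4), computing the squared distance from the observed components only; missing entries are coded as NaN, and `mu` and `sigma` stand for the (here assumed known) μ and Σ.

```python
import numpy as np

def sq_distance_present(x, mu, sigma):
    """Squared Mahalanobis distance (4), restricted to the present (non-NaN)
    components of the case vector x and the corresponding submatrix of sigma."""
    present = ~np.isnan(x)                    # indicator of observed components
    diff = x[present] - mu[present]           # deviations for present variables
    sub = sigma[np.ix_(present, present)]     # (p_i x p_i) covariance submatrix
    return float(diff @ np.linalg.solve(sub, diff))
```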

Our strategy is to replace μ and Σ by estimates and to use (4) to determine outlying cases. In the next section we describe a method of obtaining robust estimates of μ and Σ in the presence of missing and contaminated data. An alternative method of estimation is given in Little and Rubin (1987, sec. 10.5).

2.3 Estimation of μ and Σ

Estimates of μ and Σ from incomplete data are found by a modification of the iterative algorithm of Orchard and Woodbury (1972), given as an example of an EM algorithm by Dempster et al. (1977). In unmodified form, the algorithm consists of the following E and M steps applied at each iteration t. Letting $\theta^{(t)} = (\mu^{(t)}, \Sigma^{(t)})$ denote estimates of μ and Σ at iteration t, the E step calculates the expected values of the complete-data sufficient statistics given the observed data and current estimates $\theta^{(t)}$:

$E\{\textstyle\sum_{i=1}^n x_{ij} \mid x_{(p_i)}, \theta^{(t)}\} = \textstyle\sum_{i=1}^n x_{ij}^{(t)}, \quad j = 1, \ldots, p,$

$E\{\textstyle\sum_{i=1}^n x_{ij} x_{ik} \mid x_{(p_i)}, \theta^{(t)}\} = \textstyle\sum_{i=1}^n \{x_{ij}^{(t)} x_{ik}^{(t)} + c_{jk}^{(t)}(i)\}, \quad j, k = 1, \ldots, p, \quad (5)$

where

$x_{ij}^{(t)} = x_{ij}$ if $x_{ij}$ is present, $= E\{x_{ij} \mid x_{(p_i)}, \theta^{(t)}\}$ if $x_{ij}$ is missing, (6)

and

$c_{jk}^{(t)}(i) = 0$ if $x_{ij}$ or $x_{ik}$ are present, $= \mathrm{Cov}\{x_{ij}, x_{ik} \mid x_{(p_i)}, \theta^{(t)}\}$ if $x_{ij}$ and $x_{ik}$ are missing.


The imputed values $E\{x_{ij} \mid x_{(p_i)}, \theta^{(t)}\}$ and the adjustments $c_{jk}^{(t)}(i)$ are easily found by applying the sweep operator (Beaton 1964; Clarke 1982; Goodnight 1979) to $\Sigma^{(t)}$ (Beale and Little 1975). The M step of the algorithm computes new estimates $\theta^{(t+1)} = (\mu^{(t+1)}, \Sigma^{(t+1)})$, where

$\mu_j^{(t+1)} = n^{-1} \textstyle\sum_{i=1}^n x_{ij}^{(t)},$

$\sigma_{jk}^{(t+1)} = n^{-1} \textstyle\sum_{i=1}^n \{(x_{ij}^{(t)} - \mu_j^{(t+1)})(x_{ik}^{(t)} - \mu_k^{(t+1)}) + c_{jk}^{(t)}(i)\}. \quad (7)$

Iteration between the E step (5) and the M step (7) proceeds until the estimates $\theta^{(t)}$ converge.
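A minimal sketch of one E plus M iteration, (5)-(7), written with explicit conditional-normal regressions in place of the sweep operator the paper uses; this is an illustrative reimplementation under that substitution, not the authors' code.

```python
import numpy as np

def em_step(X, mu, sigma):
    """One EM iteration for incomplete multivariate normal data.
    X is an (n, p) array with NaN marking missing values."""
    n, p = X.shape
    Xhat = X.copy()
    C = np.zeros((p, p))                              # accumulated adjustments c_jk(i)
    for i in range(n):
        m = np.isnan(X[i])                            # missing indicator for case i
        if not m.any():
            continue
        o = ~m
        B = sigma[np.ix_(m, o)] @ np.linalg.inv(sigma[np.ix_(o, o)])
        Xhat[i, m] = mu[m] + B @ (X[i, o] - mu[o])    # E-step imputations (6)
        C[np.ix_(m, m)] += sigma[np.ix_(m, m)] - B @ sigma[np.ix_(m, o)].T
    mu_new = Xhat.mean(axis=0)                        # M-step means (7)
    D = Xhat - mu_new
    sigma_new = (D.T @ D + C) / n                     # M-step covariances (7)
    return mu_new, sigma_new
```

Iterating `em_step` from starting values (e.g., complete-case estimates) until `mu` and `sigma` stabilize gives the ML estimates described above.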

If the data are uncontaminated and multivariate normal and the missing data are missing at random (Rubin 1976), then this algorithm yields maximum likelihood (ML) estimates of μ and Σ (Orchard and Woodbury 1972). Moreover, if the data are uncontaminated and the data are missing completely at random (Rubin 1976), the algorithm yields consistent estimates of μ and Σ for any distribution with finite fourth moments. These properties clearly fail to hold, however, if the data are contaminated by spurious values.

In the presence of contamination, Frane (1976) proposed calculating for each case i the estimated squared Mahalanobis distance

$D_i^2 = (x_{(p_i)} - \hat{\mu}_{(p_i)})' \hat{\Sigma}_{(p_i)}^{-1} (x_{(p_i)} - \hat{\mu}_{(p_i)}), \quad (8)$

where $\mu_{(p_i)}$ and $\Sigma_{(p_i)}$ in (4) have been replaced by estimates $\hat{\mu}_{(p_i)}$ and $\hat{\Sigma}_{(p_i)}$ from the EM algorithm. The parameters (μ, Σ) are then reestimated with cases with large values of $D_i^2$ excluded. In contrast, we estimate μ and Σ just once, but modify the M step of the EM algorithm to downweight extreme observations. Specifically, we propose the expectation-robust (ER) estimation algorithm defined by combining the E step (5) with the following robust modification of (7):

$\mu_j^{(t+1)} = \textstyle\sum_{i=1}^n w_i x_{ij}^{(t)} \big/ \textstyle\sum_{i=1}^n w_i \quad (9)$

and

$\sigma_{jk}^{(t+1)} = n^{-1} \textstyle\sum_{i=1}^n \{w_i^2 [x_{ij}^{(t)} - \mu_j^{(t+1)}][x_{ik}^{(t)} - \mu_k^{(t+1)}] + c_{jk}^{(t)}(i)\}, \quad (10)$

where $w_i = \omega(d_i)/d_i$, and

$d_i^2 = [x_{(p_i)}^{(t)} - \mu_{(p_i)}^{(t)}]' [\Sigma_{(p_i)}^{(t)}]^{-1} [x_{(p_i)}^{(t)} - \mu_{(p_i)}^{(t)}] \quad (11)$

is the squared Mahalanobis distance for the present variables at iteration t.

Here ω denotes a two-parameter bounded-influence function (Hampel 1974) defined by

$\omega(d_i) = d_i$ if $d_i \le d_{0i}$,
$\omega(d_i) = d_{0i} \exp\{-(d_i - d_{0i})^2/2b_2^2\}$ if $d_i > d_{0i}$,

where $d_{0i} = \sqrt{p_i} + b_1/\sqrt{2}$, $p_i$ denotes the number of present variables for case i, and $b_1$ and $b_2$ are quantities to be specified by the data analyst.
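The weight function is easy to state in code; the following sketch is a direct transcription of the definition above, with b1 and b2 as the analyst's tuning constants.

```python
import numpy as np

def omega(d, p_i, b1=2.0, b2=1.25):
    """Hampel-type bounded-influence function: identity up to d0i, then a
    Gaussian-type decay; the case weight in (9)-(10) is w_i = omega(d)/d."""
    d0 = np.sqrt(p_i) + b1 / np.sqrt(2.0)
    if d <= d0:
        return d
    return d0 * np.exp(-0.5 * (d - d0) ** 2 / b2 ** 2)
```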

Observations with distances $d_i$ that exceed the specified cutoff, $d_{0i}$, receive decreasing weights in estimating the means and covariance matrix. The choice of $b_1$ determines the cutoff, and $b_2$ specifies how rapidly the weights decrease beyond $d_{0i}$. Hampel (1973) motivated the choice of $b_1$ and $b_2$ and suggested that $b_1 = 2$ and $b_2 = 1.25$ are good choices. A similar robust covariance matrix estimate (HUB) performed well in simulations by Devlin et al. (1981) directed at comparing estimates of eigenvalues of the correlation matrix from contaminated normal data.

The ER algorithm iterates between the E step (5) and the R step (9) and (10) until the estimates converge. In the case of complete data ($p_i = p$, $i = 1, \ldots, n$), the E step of the algorithm is redundant and the R step corresponds to one step of the robust estimation procedure proposed by Campbell (1980). Insofar as the EM algorithm is guaranteed to converge and, in the absence of missing values, Campbell's procedure is known to converge to robust estimates of μ and Σ, we conjecture that the ER algorithm converges. When applied to the ASM data, it converged for a variety of choices of $b_1$ and $b_2$.

Finally, we suggest that initial estimates of μ and Σ be obtained from complete observations. Table 1 shows the results of applying the ER algorithm with $b_1 = 3.09$, $b_2 = 1.25$ to the ASM data. Estimates of the variances of the marginal distributions are 4%-14% lower than those obtained from the EM algorithm, reflecting the downweighting of more outlying cases in the R step of the ER algorithm.

2.4 Identification of Outlying Cases: μ and Σ Estimated

Estimates $\hat{\mu}_{(p_i)}$ and $\hat{\Sigma}_{(p_i)}$ from the ER algorithm are substituted in (8) to provide Mahalanobis distances for each case; large values indicate potential contamination. To identify outlying cases, we propose an informal graphical procedure that extends the method of Gnanadesikan (1977) to incomplete multivariate observations.

If case i is uncontaminated, the data are normal, and missing values are missing completely at random, then asymptotically $D_i^2 \sim \chi^2_{p_i}$. The Wilson-Hilferty transformation of the chi-squared distribution (see Kendall and Stuart 1969, chap. 16) yields $(D_i^2/p_i)^{1/3} \sim N(1 - 2/(9p_i),\ 2/(9p_i))$. Consequently, a probability plot of

$z_i = [(D_i^2/p_i)^{1/3} - 1 + 2/(9p_i)] \big/ [2/(9p_i)]^{1/2} \quad (12)$

versus standard normal order statistics should reveal atypical observations. Figure 3 gives a probability plot of the $z_i$ for the ASM data. For useful refinements of (12) when $p_i$ is small, see Hawkins and Wixley (1986).
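A minimal sketch of (12) and the plot coordinates, assuming SciPy for the normal quantile function.

```python
import numpy as np
from scipy.stats import norm

def wilson_hilferty_z(d2, p_i):
    """Transform a squared distance to an approximate normal score, as in (12)."""
    c = 2.0 / (9.0 * p_i)
    return ((d2 / p_i) ** (1.0 / 3.0) - 1.0 + c) / np.sqrt(c)

def probability_plot_coords(z):
    """Coordinates for the probability plot: sorted z_i versus the standard
    normal order-statistic approximations Phi^{-1}[(i - 1/2)/n]."""
    z = np.sort(np.asarray(z, dtype=float))
    n = len(z)
    q = norm.ppf((np.arange(1, n + 1) - 0.5) / n)
    return q, z
```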

In the interval between .01 and .90 of this figure, the $z_i$'s plotted versus $\Phi^{-1}[(i - 1/2)/n]$ exhibit a strong linear trend typical of normal data. Beyond the expected normal statistic corresponding to the cumulative probability of .90, however, the $z_i$'s begin to deviate greatly from the line formed by the other cases.


Table 1. Estimated Means, Variances, and Correlations of ln Transformed Data, From the ER Algorithm (ASM data)

Variables (in order): APW AOE ALE AVP AMH AWW AOW BPW BOE BLE BVP BMH BWW BOW

Means:     5.12 3.18 5.82 5.65 5.77 7.70 6.28 5.14 3.19 5.77 5.55 5.79 7.67 6.20
Variances:  .91 1.16 1.03 1.83  .77  .97 1.15  .95 1.14 1.09 1.67  .84 1.03 1.05

Correlations (lower triangle, same variable order):
APW 1.0
AOE .68 1.0
ALE .91 .74 1.0
AVP .77 .66 .84 1.0
AMH .97 .69 .92 .78 1.0
AWW .95 .66 .95 .88 .95 1.0
AOW .66 .91 .78 .72 .68 .69 1.0
BPW .98 .67 .89 .75 .97 .93 .65 1.0
BOE .66 .98 .72 .64 .67 .64 .89 .64 1.0
BLE .89 .72 .97 .84 .91 .94 .74 .89 .70 1.0
BVP .79 .66 .85 .97 .78 .88 .72 .79 .64 .86 1.0
BMH .96 .68 .90 .76 .98 .94 .66 .98 .65 .91 .78 1.0
BWW .94 .64 .93 .86 .94 .99 .67 .95 .62 .94 .87 .96 1.0
BOW .64 .91 .75 .72 .66 .68 .96 .63 .90 .73 .72 .65 .65 1.0

In fact, the Z scores of cases beyond .90 exceed 3, suggesting that these cases are atypical and outlying. Using this criterion, we conclude that 15 of the 152 cases contain potentially outlying data.

A referee felt that rejecting 10% of the sample as outlying was excessive and, to avoid overediting, was inclined to truncate at p = .95 and plot again. We note, however, that 10% of the data are not being discarded, since only particular values within the outlying cases are edited at the next stage.

Because many large-scale survey organizations require further automation of editing procedures, we describe in the Appendix a formal testing procedure for detecting outlying cases. It is also important to emphasize that cases identified as outliers need not be automatically edited.

Given sufficient resources, they may be subject to review by subject matter experts to assess their potential validity. Even in an automated procedure, one might reserve the option of leaving values in a mildly outlying case alone if the search for outlying values within the case is indecisive. This search is the subject of the next section.

2.5 Selecting Outlying Values Within Cases: The Variable Selection Procedure

In the variable selection procedure each present variable in an outlying case is ranked according to the marginal decrease in Mahalanobis distance obtained by removing that variable and all more influential variables from the computation of the distance.

[Figure 3. Normal Probability Plot of Transformed Distances Calculated From Contaminated Data. The plot shows the $z_i$ against cumulative probabilities from .01 to .99; circled cases are outlying.]


Table 2. Variable Selection Procedure for Case 65

            Recorded value          Imputed   Incremental decrease   Distance    Z
Variable    (raw scale)      Rank   value     in distance            remaining   value

BWW         473              1      925       56.68                  59.03       5.15
BPW         143              2      83        75.17                  33.84       3.17
ALE         209              3      115       83.51                  22.47       2.03
BOE         3                4                90.96                  12.32       .63
BOW         58               5                95.91                  5.57        -.78
BVP         36               6                97.47                  3.45        -1.30
AMH         140              7                98.06                  2.64        -1.38
AVP         49               8                98.25                  2.38        -1.19
BLE         99               9                98.39                  2.20        -.92
AOW         117              10               98.52                  2.01        -.63
AOE         6                11               99.28                  .97         -.88
AWW         887              12               99.56                  .60         -.66
BMH         162              13               99.57                  .58         .12
APW         81               14               100.00                 .00

NOTE: Total distance = 136.27; Z value = 9.14.

The ranking is obtained by the following algorithm, with Mahalanobis distances calculated using robust estimates $(\hat{\mu}, \hat{\Sigma})$ of (μ, Σ):

Step 1. For outlying case i, compute for each present variable k the Mahalanobis distance $D_i^{2(k)}$ with variable k omitted. As in (4), $D_i^{2(k)}$ is restricted to present variables and their related parameters only.

Step 2. Find $j_1$ such that $D_i^{2(j_1)} \le D_i^{2(k)}$ for all present k. Variable $j_1$ is the single most influential variable contributing to the extremity of observation i.

Step 3. Compute $D_i^{2(k,j_1)}$, the Mahalanobis distance with variables $j_1$ and k removed, for all present variables $k \ne j_1$.

Step 4. Find $j_2$ such that $D_i^{2(j_2,j_1)} \le D_i^{2(k,j_1)}$ for all present $k \ne j_1$.

Variable $j_2$ is the next most influential variable, conditional on the removal of variable $j_1$ in Step 2. The algorithm then proceeds to find $j_3$, the next most influential variable, and so on until all of the present variables in observation i are exhausted.
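The greedy deletion can be sketched compactly; this reuses the sq_distance_present() helper from the Section 2.2 sketch and returns, for each rank, the deleted variable and the distance remaining.

```python
import numpy as np

def variable_selection(x, mu, sigma):
    """Steps 1-4: repeatedly delete the present variable whose omission most
    reduces the squared Mahalanobis distance of the case."""
    work = x.copy().astype(float)
    ranking = []
    while (~np.isnan(work)).any():
        present = np.where(~np.isnan(work))[0]
        best_k, best_d2 = None, np.inf
        for k in present:
            trial = work.copy()
            trial[k] = np.nan                      # omit candidate variable k
            d2 = sq_distance_present(trial, mu, sigma)
            if d2 < best_d2:
                best_k, best_d2 = k, d2
        ranking.append((best_k, best_d2))          # (variable, distance remaining)
        work[best_k] = np.nan                      # delete it and continue
    return ranking
```

A table such as Table 2 then follows by converting each distance remaining to a Z value with the transformation (12), with $p_i$ set to the number of unedited values.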

Table 2 gives a summary of this algorithm for one outlying case, i = 65. The total distance computed using robust estimates for μ and Σ for case 65 is $D_{65}^2 = 136.27$, and the sixth column of the table shows the reduced $D^2$ values at each step of the variable selection procedure. In the last column these distances are converted to normal Z values by the transformation (12), with $p_i$ replaced by the number of unedited values. Z values in the tail of the standard normal distribution suggest that the remaining values still contain outliers. Because the deleted variables were not randomly selected, the tail area p corresponding to each Z value (e.g., p = .05 for Z = 1.96) does not have the usual formal p value interpretation, even if the normality assumptions are accepted. Thus we are limited to viewing the Z values as informal measures of the extent to which a case contains outlying values. We construct tables like Table 2 for each outlying case and then identify as outliers the set of variables of highest rank that, when removed, reduce the Z value to acceptable levels (say, below 3). In Table 2, for example, if three variables (BWW, BPW, and ALE) are removed, the Z value is reduced from its large initial value, 9.14, to an acceptable value, 2.03. The table shows imputed values for these outlying variables, computed by the method to be described in Section 3.

Our procedure can be contrasted with Frane's (1976) method, which applies a two-group backward-elimination discriminant analysis, where the first group consists of the outlier and the second group is defined by the vector of means $\hat{\mu}$. Our procedure eliminates outliers first, whereas the last variable removed in Frane's discriminant analysis is the least inlying. The latter is an outlier with respect to its marginal distribution, rather than its conditional distribution given other variables present, which forms the basis of our method. Consequently, in determining the most outlying variable Frane's procedure fails to exploit the associations between the edited variables and other observed variables, which as we noted in Section 2 are a key feature of the problem. Elaborations of our procedure to stepwise selection, in which previously deleted variables are allowed to reenter if they no longer add significantly to $D^2$, or to all possible subset selection, might be feasible for small data sets. We believe, however, that in most practical applications our approach should reveal distinctly outlying values.

Case 65 illustrates the typical efficacy of our outlier identification procedure. Table 3 gives values of important ratios involving the three variables in case 65 that have been edited, namely, BWW, BPW, and ALE. The three prior year ratios that are changed by our method, average annual hours worked by production workers (1,132 hours per year), average annual wage for production workers ($3,308 per year), and average hourly wage rate for production workers ($2.92 per hour), are all very low indeed, both in absolute terms and in comparison with industry averages. On the other hand, current year values of these ratios are not changed by the editing, which is appropriate because they are in line with industry averages. The current year value of VP/LE (.2) is changed; it is low compared with both prior year data and the industry average.


Table 3. Important Ratios Involving Edited Variables From Case 65

                                                 Current year values              Prior year values
Ratio         Interpretation                     Original value  Imputed value    Original value  Imputed value

1,000 MH/PW   Average annual hours worked       1,728 hrs./yr.   Not imputed      1,132 hrs./yr.  1,952 hrs./yr.
              by a production worker
1,000 WW/PW   Average annual wage for           $10,951/yr.      Not imputed      $3,308/yr.      $11,145/yr.
              production worker
WW/MH         Average hourly wage rate paid     $6.34/hr.        Not imputed      $2.92/hr.       $5.71/hr.
              to a production worker
VP/LE         Ratio of voluntary to legally     .2               .4               .4              Not imputed
              required pension payments

Note that this example shows strong evidence for editing the values of prior year variables, which may not have been carefully edited previously. In general, current year data can provide new evidence about the plausibility of prior year values, and it is quite possible that inconsistencies between years are attributable to errors in the prior year data not detectable when the data were originally edited.

The adjusted (imputed) ratios corresponding to variables selected as outliers for case 65 are also given in Table 3 and will be discussed in Section 3.1.

The choice of an acceptable level for the Z values of edited cases is somewhat arbitrary and determines how severely the data are edited. Some guidance about the suitability of a given choice is obtained by repeating the plot in Figure 3 with the values designated as outliers removed and checking whether the points lie on a 45° line. Figure 4 shows this plot for the ASM data when Z values below 3.00 are considered acceptable. The plot lies on the 45° line up to a cumulative probability of .95 but thereafter falls below the line, suggesting slight overediting of the data.

3. IMPUTATION

3.1 Imputed Values for Outlying Variables

Imputed values for the missing and edited variables may be readily computed by reapplying the EM algorithm to the data matrix with the detected outlying variables edited. By sweeping on the final estimates of μ and Σ from the E step (5) of the algorithm, parameter estimates of the regression of the missing and edited variables on the present and unedited variables in each case are obtained. Missing and edited values may then be imputed from these regressions and a complete data set produced.

Alternatively, by applying the sweep operator to the robust estimates of μ and Σ obtained previously from the ER algorithm, regression equations of missing and edited variables on present and unedited variables may be obtained for each case and imputations obtained from these regressions. This alternative has the advantage that it does not require the iterative reestimation of μ and Σ.
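A minimal sketch of this regression imputation for a single case, using explicit matrix algebra in place of the sweep operator; `mu` and `sigma` stand for the robust estimates from the ER algorithm.

```python
import numpy as np

def conditional_mean_impute(x, mu, sigma):
    """Fill the missing/edited (NaN) entries of one case with their conditional
    means, i.e., the regression of the missing on the present variables."""
    x = x.copy().astype(float)
    m = np.isnan(x)
    if m.any() and (~m).any():
        o = ~m
        B = sigma[np.ix_(m, o)] @ np.linalg.inv(sigma[np.ix_(o, o)])  # regression coefficients
        x[m] = mu[m] + B @ (x[o] - mu[o])                             # predicted means
    return x
```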

[Figure 4. Normal Probability Plot of Transformed Distances Calculated From Cleaned Data. The plot shows the $z_i$ against cumulative probabilities from .01 to .99.]


Because both these methods are based on estimates of μ and Σ that are protected from distortion by outliers, they should yield similar answers.

In the absence of reinterview data, our evaluation of these imputation methods is limited to looking at the data before and after editing and imputation and checking whether the imputed values are plausible. We illustrate the results of applying our method to one case from the ASM data set.

Table 2 shows observed and imputed values for case number 65, and resulting values of four important ratios are presented in Table 3. In this example the imputed ratio of the annual average hours worked by production workers in the previous year, BMH/BPW, is 1,952 hours. This imputed ratio corresponds approximately to full-time employment throughout the year.

The imputed ratio for the prior year average hourly wage rate, BWW/BMH, is $5.71 per hour. This rate is very much in line with the industry average, is well above the unedited $2.92 per hour figure, and corresponds closely to the current year hourly wage rate of $6.34. Similarly, the imputed prior year average salary is BWW/BPW = $11,145, which is in line with the industry average.

This example and others not shown suggest that our edit/imputation method identifies cases with implausible ratios and imputes values that restore the ratios to reasonable amounts. Nevertheless, it is always possible that outlying values are in fact correct. Thus the option of retaining outlying cases after review by subject matter specialists may be attractive in particular applications.

3.2 Addition of Random Residuals

The imputation method described thus far fills in a case's missing and edited values by their conditional means given the case's present and unedited variable values. Kalton (1981), Santos (1981), Kalton and Kish (1981), Sedransk and Titterington (1980), and Kalton and Kasprzyk (1982) noted that this type of mean value imputation leads to efficient estimation of univariate item means but distorts the distribution of the item; the concentration of imputed values at their conditional means creates spikes in the distribution and an artificial reduction of item variance. These properties make conditional mean imputation unsuitable when the objective is to produce a single clean data set for various and diverse statistical analyses, an important activity in government statistical agencies.

In our context, the distortions created by imputing conditional means for missing and edited variables can be corrected by adding perturbations to the predicted means. If the normality assumptions underlying the EM algorithm are accepted, then the perturbations for an observation with m missing or edited items should have a zero-centered m-variate normal distribution. The appropriate dispersion matrix is the residual covariance matrix of the m missing or edited items given the p - m present and unedited items. An estimate of this matrix is already available in the swept covariance matrix calculated in the final E step of the EM algorithm.

When these perturbations are added, the resulting filled-in data provide consistent estimates of the variances and covariances of the variables. The distributions of imputed items may still be distorted because of departures from multivariate normality.
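A minimal sketch of the perturbed imputation for one case, under the same explicit-matrix substitution for the sweep operator as before; the residual draw comes from the conditional covariance of the missing items given the present ones.

```python
import numpy as np

rng = np.random.default_rng(12345)   # seeded generator for reproducible draws

def impute_with_noise(x, mu, sigma):
    """Conditional-mean imputation plus a zero-centered multivariate normal
    perturbation whose dispersion is the residual covariance of the missing
    items given the present items."""
    x = x.copy().astype(float)
    m = np.isnan(x)
    if m.any() and (~m).any():
        o = ~m
        B = sigma[np.ix_(m, o)] @ np.linalg.inv(sigma[np.ix_(o, o)])
        cond_mean = mu[m] + B @ (x[o] - mu[o])                    # regression prediction
        cond_cov = sigma[np.ix_(m, m)] - B @ sigma[np.ix_(o, m)]  # residual covariance
        x[m] = rng.multivariate_normal(cond_mean, cond_cov)       # add random residual
    return x
```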

An alternative procedure that places less reliance on the normality assumption involves matching each incomplete case with a complete case and calculating for the complete case a vector of residuals from the regression of missing and edited variables on present and unedited variables. This vector then serves as a set of perturbations for the incomplete case in the match. In the context of univariate nonresponse, Little and Samuhel (1983) argued for matching on the predicted mean from the regression of the missing item on the observations. In our multivariate setting, a natural generalization is to match on the vector of predicted means for the set of missing items, scaled by their residual covariance matrix. Colledge, Johnson, Paré, and Sande (1978) presented a simpler matching scheme for a particular data set.

The addition of noise to the predicted means has an attractive feature in our problem, where the variables are measured on the log scale. Exponentiating the predicted means yields estimates on the original scale that are downward biased, whereas exponentiating the imputations with noise added yields consistent estimates. Simple adjustments for the bias in the exponentiated means can be developed (see, e.g., Eddy and Kadane 1982; Little and Samuhel 1983) but are not included in our illustrative example.

4. ASSUMPTIONS ABOUT THE MISSING DATA MECHANISM

Any missing data analysis makes assumptions about the missing data mechanisms. Rubin (1976) formalized these mechanisms in terms of the conditional distribution of missing value indicators given the hypothetical complete data matrix X. If this distribution depends on observed values of X but not missing values, the missing data are missing at random (MAR). If the distribution of indicators does not depend on observed or missing data, the data are missing completely at random (MCAR).

In applying Rubin's conditions to our problem, two missing data mechanisms need to be distinguished, the mechanism leading to incompleteness in the raw data and the process of deleting outliers in the editing process. The EM algorithm applied to the edited data provides true maximum likelihood estimates of the underlying parameters if (a) the editing process has worked, in the sense that the remaining values are correct; (b) the data are MAR, that is, both the nonresponse and the deletion mechanisms do not depend on the true values of variables that are missing or deleted; and (c) the distributional assumptions are correct. Formally, let X denote the complete data matrix of contaminated data and Y the complete data matrix of true values. Write $X = (X_{obs}, X_{mis})$, where $X_{obs}$ represents data remaining after the editing process and $X_{mis}$ represents the true values of missing variables and the deleted values of edited variables. Write $Y = (Y_{obs}, Y_{mis})$ for the corresponding partition of true values. Condition (a) then implies that $X_{obs} = Y_{obs}$. Let R, D, and


I be indicator matrices with the same dimensions as X and Y, such that

$R_{ij} = 1$ if $X_{ij}$ is observed, $R_{ij} = 0$ if $X_{ij}$ is missing;

$D_{ij} = 1$ if $X_{ij}$ is not deleted, $D_{ij} = 0$ if $X_{ij}$ is deleted;

and $I_{ij} = R_{ij} D_{ij}$; thus $I_{ij} = 1$ if $X_{ij}$ belongs to $X_{obs}$ and $I_{ij} = 0$ if $X_{ij}$ belongs to $X_{mis}$. The nonresponse and deletion processes can be characterized by distributions $p(R \mid Y)$ and $p(D \mid Y)$, respectively, where these distributions are conditioned on the true data Y. These distributions imply a distribution $p(I \mid Y)$ for I. Then, applying the theory of Rubin (1976), the MAR condition (b) is satisfied when $X_{obs} = Y_{obs}$ and

$p(I \mid Y) = p(I \mid Y_{obs}, Y_{mis}) = p(I \mid Y_{obs});$

that is, inclusion does not depend on the unobserved values $Y_{mis}$.

These two conditions cannot be tested without external information on the values of Y, which is not generally available. Hence their validity (or approximate validity) is simply a matter of conjecture. With respect to the editing process, it should be noted that a deletion of a value $X_{ij}$ depends very directly on the observed value of $X_{ij}$. The key assumption is that deletion of $X_{ij}$ does not depend on the true value $Y_{ij}$, conditional on the unedited data in that case. The delicacy of this assumption reflects the difficulty of the editing process.

In defense of our procedures, it should be noted that similar or stronger assumptions are needed for other edit/imputation methods. Our procedure makes full use of all of the observed values in an incomplete case, both to detect outliers and to supply imputations. Procedures based on ratio edits and imputations only use the observed variables that are involved in the ratios. Moreover, many missing data methods, including the common practice of restricting analyses to the completely recorded cases, assume that the data are MCAR, a stronger assumption than the MAR assumption underlying our method. The MCAR assumption can be tested from the observed data, and such tests applied to the ASM data indicate mild departures from the MCAR assumption.

5. A SIMULATION STUDY

The primary purpose of editing and imputation is to reduce nonsampling biases to produce a "clean" data set that fairly reflects important features of the underlying distribution. Because many analyses require means, variances, and correlations, we describe here a limited simulation study that assesses the performance of our methodology in reducing bias and mean squared error for these statistics.

In these simulations, 1,000 replications of four distinct sets of multivariate data were generated. Each set may be characterized as having a multivariate normal or multivariate t distribution that is either contaminated or uncontaminated with outlying values. Each data set consisted of n = 100 cases with four variables generated from a multivariate distribution with mean 0 and covariance matrix taken from the estimated covariance matrix of the first four variables in Table 1. Multivariate normal data were generated using subroutine GGNML from International Mathematical and Statistical Libraries, Inc. (1982), whereas multivariate t data were obtained by dividing each of the generated multivariate normal vectors by $(1.5s/6)^{1/2}$, where s denotes a random chi-squared deviate with 6 df and 1.5 is a scaling factor to leave variances unchanged. For data sets contaminated with outliers, the first 15 cases were contaminated by 18 outliers: variable 1 for cases 1-6, variable 2 for cases 7-12, and both variables 1 and 3 for cases 13-15. The contaminated values were independent normal with mean 0 and variance four times that of the uncontaminated variables. Moreover, to create data sets with missing values, 15% of all items not subject to contamination were deleted in an identical pattern for each data set.
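A minimal sketch of one replicate of this design, assuming NumPy's generator in place of the IMSL subroutine GGNML used in the paper; the 1.5 scaling constant follows the reconstruction above and is an assumption.

```python
import numpy as np

rng = np.random.default_rng(2024)

def simulate_replicate(sigma, n=100, t_data=False, contaminate=False):
    """Generate n cases from N(0, sigma) or a variance-matched multivariate t
    (6 df), optionally planting the 18 contaminated values described above."""
    p = sigma.shape[0]
    X = rng.multivariate_normal(np.zeros(p), sigma, size=n)
    if t_data:
        s = rng.chisquare(6, size=n)
        X /= np.sqrt(1.5 * s / 6.0)[:, None]     # assumed scaling; leaves variances unchanged
    if contaminate:
        sd = 2.0 * np.sqrt(np.diag(sigma))       # 4x variance => 2x standard deviation
        pattern = [[0]] * 6 + [[1]] * 6 + [[0, 2]] * 3   # cases 1-6, 7-12, 13-15 (0-based rows)
        for i, cols in enumerate(pattern):
            for j in cols:
                X[i, j] = rng.normal(0.0, sd[j])
    return X
```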

Despite the modest fourfold increase in variance of the contaminated values, transformed Mahalanobis distance plots of the contaminated data showed extreme deviations from the line with slope 1 for the upper 15% of cases, reflecting the contamination of the correlation structure. Table 4 summarizes the typical performance of three estimation procedures: (a) the normal EM algorithm with no adjustment for outliers (EM); (b) the robust EM algorithm described in Section 2.3 with $b_1 = 1.96$, $b_2 = 1.25$ (ER); and (c) the normal EM algorithm applied to data with outliers edited until the transformed Mahalanobis distance is less than 3, as described in Section 2.5. We refer to procedure (c) as REM. Table 4 displays average bias and mean squared error (MSE) for three sets of parameters (means, variances, and correlations) averaged over parameters as well as the 1,000 replications. Our conclusions are as follows:

1. ER and REM perform similarly (as expected), with a slight edge in favor of REM.

2. For uncontaminated normal data, the methods are all very similar.

3. For uncontaminated t data, there were more differences between the methods. There was little to distinguish between the ER and EM for estimates of means and correlations. Moreover, the REM performed similarly to the ER and EM for means, but the bias of REM was larger for correlations. For variances, ER and REM display a noticeable negative bias, reflecting overediting of the tails of the distributions. Despite the bias, the average MSE of the estimated variances was considerably lower for ER and REM, indicating a marked reduction in variance for these methods.

4. For both normal and t data with contaminated values, EM overestimates variances and underestimates correlations, an expected pattern for a method that does not allow for the contamination.


Table 4. Summary of Typical Performance of the EM, ER, and REM Estimation Procedures on Four Types of Multivariate Data

                      Average bias (x 10,000)          Average MSE (x 10,000)
                   Means  Variances  Correlations    Means  Variances  Correlations

Uncontaminated normal data
  EM                 -42        -18            -8      116        311             7
  ER                 -39       -266           -11      117        316             7
  REM                -44       -147            -2      117        324             7

Uncontaminated t data
  EM                 -31         76             7      124        879            12
  ER                 -25     -2,083             7      111        804             9
  REM                -32     -1,831            51      115        759            10

Contaminated normal data
  EM                 -14      1,632        -1,875      145        966           504
  ER                  -2       -196          -145      131        361            13
  REM                -15         19          -100      123        344            11

Contaminated t data
  EM                 -11      1,651        -1,908      148      1,454           538
  ER                  -3     -1,964          -157      117        805            18
  REM                 13     -1,723           -93      115        741            15

The ER and REM methods again underestimate variances, but they have small mean bias for other parameters. The ER and REM methods have about 50% lower MSE for variances, and 95% lower MSE for correlations. Thus in terms of MSE our methods are highly effective, particularly for correlations. Their main defect is a tendency to understate variances, particularly when the data have longer than normal tails.

6. DISCUSSION

This article draws together statistical methodology in robust estimation, graphical procedures, outlier detection, and multivariate analysis and extends and applies these methods to the problems of editing and imputation for survey data. In many well-established surveys, edits and imputations are based on ratios of variables, as in Equations (1) and (2). These ratios are often obtained from subjective notions or historical data. One drawback to this procedure is that it makes aspects of the current survey data conform to the constraints imposed by subjective and historical information. In contrast, the methodology we have presented lets the current survey data "tell its own story" within the context of our models; that is, current survey information is used to obtain edits and imputed values from robustly estimated characteristics of the current survey information itself.

The methodology we present may be used for related edit/imputation operations. For example, the Mahalanobis distance may be used to rank cases for deciding which firms should be called back to confirm their survey data. The Mahalanobis distance measures the "extremeness" of an observation in all variables simultaneously, rather than selecting cases as extreme that have exceptionally high or low values for "favorite" variables only.

We feel that our procedures should be applied to the problems of editing and imputation with appropriate sensitivity toward the data itself. We do not advocate a "black box" procedure that edits and fills in missing values without reference to what the data mean. Consequently, where possible our methodology should be used in conjunction with subject-matter expert knowledge. For example, ratio ranges such as Equation (1) could be used to review and revise output from our procedures.

Data taken from the ASM are used to illustrate our methods. Although we believe that our methods are useful for these data, certain important features were ignored to simplify the exposition. In particular, certain linear constraints, such as the requirement that the variables OW and WW sum to another recorded variable, the wages and salary of all workers, were bypassed by omitting the latter variable from the analysis. Our methods can be modified to handle such constraints, as discussed by Little and Smith (1983), but more work is required in combining logical constraints with statistical outlier detection methods such as those described in this article.

APPENDIX: APPROXIMATION TO THE DISTRIBUTION OF THE SQUARED DISTANCE IN SMALL SAMPLES

In the absence of missing or contaminated values and under normal assumptions, the squared Mahalanobis distance $D_i^2$ in (8) has the property that $(n - p)nD_i^2/[(n - 1)(n + 1)p]$ has an F distribution with p and n - p df (Anderson 1958; Hawkins 1974). If the data are incomplete, the exact distribution of $D_i^2$ in (8) is unknown, but it is clear that since the ML estimates of μ and Σ are consistent, then asymptotically $D_i^2 \sim \chi^2_{p_i}$. Taking into account the robust estimation of μ and Σ, we conjecture that $(n_c - p_i)n_c D_i^2/[(n_c - 1)(n_c + 1)p_i]$ ($= F_i$, say) has approximately an F distribution with $p_i$ and $n_c - p_i$ df, where $n_c$ is the number of the completely recorded cases. This conjecture has little theoretical justification at present, but the use of the number of complete cases to determine degrees of freedom has done well in simulation studies for a related problem in Little (1979).
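A minimal sketch of the conjectured test, assuming SciPy's F distribution for the tail area.

```python
from scipy.stats import f as f_dist

def outlier_p_value(d2, p_i, n_c):
    """Refer F_i = (n_c - p_i) n_c D_i^2 / [(n_c - 1)(n_c + 1) p_i] to an
    F distribution with p_i and n_c - p_i df and return the upper tail area."""
    F_i = (n_c - p_i) * n_c * d2 / ((n_c - 1.0) * (n_c + 1.0) * p_i)
    return float(f_dist.sf(F_i, p_i, n_c - p_i))
```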

Using $F_i$, we may compute a p value for each case. Those cases with a p value less than a specified significance level (e.g., .01) may be designated as outlying. Because the choice of the significance level is somewhat arbitrary and a correction for simultaneous inferences for all cases is needed, however, we advocate the more informal graphical procedure in Section 2.4.

[Received December 1984. Revised April 1986.]

REFERENCES

Anderson, T. W. (1958), An Introduction to Multivariate Statistical Analysis, New York: John Wiley.
Barnett, V., and Lewis, T. (1978), Outliers in Statistical Data, New York: John Wiley.
Beale, E. M. L., and Little, R. J. A. (1975), "Missing Values in Multivariate Analysis," Journal of the Royal Statistical Society, Ser. B, 37, 129-146.
Beaton, A. E. (1964), "The Use of Special Matrix Operators in Statistical Calculus," Research Bulletin 64-51, Princeton, NJ: Educational Testing Service.
Beckman, R. J., and Cook, R. D. (1983), "Outlier..........s," Technometrics, 25, 119-149.
Campbell, N. A. (1980), "Robust Procedures in Multivariate Analysis I: Robust Covariance Estimation," Applied Statistics, 29, 231-237.
Chapman, D. W. (1976), "A Survey of Nonresponse Imputation Procedures," in Proceedings of the Social Statistics Section, American Statistical Association, pp. 245-251.
Clarke, M. R. B. (1982), "The Gauss-Jordan Sweep Operator With Detection of Collinearity," Applied Statistics, 31, 166-168.
Colledge, M. F., Johnson, J. H., Paré, R., and Sande, I. G. (1978), "Large Scale Imputation of Survey Data," in Proceedings of the Survey Research Methods Section, American Statistical Association, pp. 431-436.
Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977), "Maximum Likelihood From Incomplete Data Via the EM Algorithm," Journal of the Royal Statistical Society, Ser. B, 39, 1-38.
Devlin, S. J., Gnanadesikan, R., and Kettenring, J. R. (1981), "Robust Estimation of Dispersion Matrices and Principal Components," Journal of the American Statistical Association, 76, 354-362.
Dixon, W. J. (ed.) (1983), BMDP Statistical Software (3rd ed.), Berkeley, CA: University of California Press.
Eddy, W. F., and Kadane, J. B. (1982), "The Cost of Drilling for Oil and Gas: An Application of Constrained Robust Regression," Journal of the American Statistical Association, 77, 262-269.
Fellegi, I. P., and Holt, D. (1976), "A Systematic Approach to Automatic Edit and Imputation," Journal of the American Statistical Association, 71, 17-35.
Frane, J. W. (1976), "A New BMDP Program for the Identification and Parsimonious Description of Multivariate Outliers," in Proceedings of the Statistical Computing Section, American Statistical Association, pp. 160-162.
Frane, J. W. (1978), "Methods in BMDP for Dealing With Ill-Conditioned Data: Multicollinearity and Multivariate Outliers," in Proceedings of the Symposium on the Interface of Computer Science and Statistics, eds. A. R. Gallant and T. M. Gerig, Raleigh, NC: North Carolina State University Press.
Gnanadesikan, R. (1977), Methods for Statistical Data Analysis of Multivariate Observations, New York: Wiley Interscience.
Gnanadesikan, R., and Kettenring, J. R. (1972), "Robust Estimates, Residuals, and Outlier Detection With Multiresponse Data," Biometrics, 28, 81-124.
Goodnight, J. H. (1979), "A Tutorial on the SWEEP Operator," The American Statistician, 33, 149-158.
Greenberg, B. (1981), "Developing an Edit System for Industry Statistics," in Computer Science and Statistics: Proceedings of the 13th Symposium on the Interface of Computer Science and Statistics, New York: Springer-Verlag, pp. 11-26.
Greenberg, B., and Surdi, R. (1984), "A Flexible and Interactive Edit and Imputation System for Ratio Edits," in Proceedings of the Survey Research Methods Section, American Statistical Association, pp. 421-426.
Hampel, F. R. (1973), "Robust Estimation: A Condensed Partial Survey," Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 27, 87-104.
Hampel, F. R. (1974), "The Influence Curve and Its Role in Robust Estimation," Journal of the American Statistical Association, 69, 383-393.
Hartley, H. O. (1980), "Statistics as a Science and as a Profession," Journal of the American Statistical Association, 75, 1-7.
Hartley, H. O., and Hocking, R. R. (1971), "The Analysis of Incomplete Data" (with discussion), Biometrics, 27, 783-808.
Hawkins, D. M. (1974), "The Detection of Errors in Multivariate Data Using Principal Components," Journal of the American Statistical Association, 69, 340-344.
Hawkins, D. M. (1980), Identification of Outliers, London: Chapman & Hall.
Hawkins, D. M., and Wixley, R. A. J. (1986), "A Note on the Transformation of Chi-Squared Variables to Normality," The American Statistician, 40, 296-298.
Huber, P. J. (1977), Robust Statistical Procedures, Philadelphia: Society for Industrial and Applied Mathematics.
International Mathematical and Statistical Libraries, Inc. (1982), IMSL Library (Vols. 1-4), Houston: Author.
Kalton, G. (1981), "Compensating for Missing Survey Data," Income Survey Development Program (Dept. of Health and Human Services) report, University of Michigan Survey Research Center.
Kalton, G., and Kasprzyk, D. (1982), "Imputing for Missing Survey Responses," in Proceedings of the Survey Research Methods Section, American Statistical Association, pp. 22-31.
Kalton, G., and Kish, L. (1981), "Two Efficient Random Imputation Procedures," in Proceedings of the Survey Research Methods Section, American Statistical Association, pp. 146-151.
Kendall, M. G., and Stuart, A. (1969), The Advanced Theory of Statistics (Vol. 1), New York: Hafner Press.
Little, R. J. A. (1979), "Maximum Likelihood Inference for Multiple Regression With Missing Values: A Simulation Study," Journal of the Royal Statistical Society, Ser. B, 41, 76-87.
Little, R. J. A. (1982), "Models for Nonresponse in Sample Surveys," Journal of the American Statistical Association, 77, 237-250.
Little, R. J. A., and Rubin, D. B. (1987), Statistical Analysis With Missing Data, New York: John Wiley.
Little, R. J. A., and Samuhel, M. E. (1983), "Alternative Models for CPS Income Imputation," in Proceedings of the Survey Research Methods Section, American Statistical Association, pp. 415-420.
Little, R. J. A., and Smith, P. J. (1983), "Multivariate Edit and Imputation for Economic Data," in Proceedings of the Survey Research Methods Section, American Statistical Association, pp. 518-522.
Maronna, R. A. (1976), "Robust M-Estimators of Multivariate Location and Scatter," The Annals of Statistics, 4, 51-67.
Orchard, T., and Woodbury, M. A. (1972), "Missing Information Principle: Theory and Applications," in Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, 1, pp. 697-715.
Rubin, D. B. (1976), "Inference and Missing Data," Biometrika, 63, 581-592.
Rubin, D. B. (1987), Using Multiple Imputations to Handle Nonresponse in Sample Surveys, New York: John Wiley.
Sande, I. G. (1982), "Imputation in Surveys: Coping With Reality," The American Statistician, 36, 145-152.
Santos, R. L. (1981), "Effects of Imputation on Complex Statistics," Income Survey Development Program (Dept. of Health and Human Services) report, University of Michigan, Survey Research Center.
Sedransk, J., and Titterington, D. M. (1980), "Nonresponse in Sample Surveys," technical report, Washington, DC: U.S. Bureau of the Census.
Shih, W. J., and Weisberg, S. (1983), "Assessing Influence in Multiple Linear Regression With Incomplete Data," unpublished paper presented at the IMS/ENAR Biometric Society Meeting in Nashville.
U.S. Department of Commerce, Bureau of the Census (1981), Annual Survey of Manufactures, Washington, DC: U.S. Government Printing Office.
Wilks, S. S. (1963), "Multivariate Statistical Outliers," Sankhyā, Ser. A, 25, 407-426.
