
Statistica Neerlandica (1995) Vol. 49, nr. 2, pp. 139-151

Data editing and imputation from a computational point of view

L. C. R. J. Willenborg*

Department of Statistical Methods, Statistics Netherlands, P.O. Box 959, 2270 AZ Voorburg, The Netherlands

Data editing and imputation are important activities in the processing of survey data. Several of their computational aspects are discussed. The logical structure of a questionnaire, as defined in Willenborg (1988), is the key structure considered in this paper. It is used as a starting point to describe these activities. The present paper is a continuation of Willenborg (1995).

Key Words & Phrases: survey data processing, computational complexity.

1 Introduction

Data collected in a survey will generally, sooner or later, be checked and, if necessary, corrected. This checking and correcting, or, synonymously, data editing and imputation, may take place while the data are being collected, i.e. during the interviews, or at a later stage, after they have been collected. Which of these happens depends on the kind of survey. 'Interactive' data editing and imputation generally requires CAPI (computer assisted personal interviewing) or CATI (computer assisted telephone interviewing) type surveys, which use computers to collect the requested information. These computers are supplied with questionnaires in the guise of computer programs. The programs contain information (the routing structure) that can be used to navigate through the questionnaire on the basis of the answers provided by respondents. They may also contain information (in the form of edits) to check whether the answers provided by respondents are not conflicting or implausible. If a conflicting or implausible set of answers is noted by the questionnaire program, the interviewee can be confronted with it, and the problem can possibly be resolved by correcting one or more of the answers the interviewee has given, answers that may be wrong as a result of misunderstood questions, careless or sloppy answering, etc. In traditional PAPI (pencil and paper interviewing) type surveys the data are first captured (in written form) and then fed into a computer. This can for instance be done batchwise, without checking the data when they are keyed in, or interactively, while

*The views expressed in this paper are those of the author and do not necessarily reflect the policies of Statistics Netherlands.


checking them immediately when keyed in. (In some cases an optical scanner could be used to convert the written data into a machine-readable form, but traditionally human beings have been, and continue to be, used to carry out this conversion instead.) The final step in the processing of the data may be, if considered at all necessary, that 'faulty' data are corrected so as to suit the logical structure of the questionnaire and even to yield a complete data set, either by calling back respondents (which is not always possible) or by invoking imputation procedures. A complete data set has the advantage that it is easier to handle. Imputation procedures, i.e. the substitution of regular values for missing ones, may be controversial to users who do not believe in the underlying model used. When imputation is applied, the least that should be guaranteed is that the data are consistent. About the 'truth' of imputed data (i.e. their agreement with reality) often no firm claims can be made.

In the present paper we consider the problem of checking records (each containing the answers provided by a respondent) and identifying possible errors they may contain. An error is understood here as ‘not in agreement with the logical structure of the questionnaire’. The logical structure of a questionnaire consists of its routing and edit structure (cf. WILLENBORG, 1988, or 1995). The intention is to replace the incorrect values by others that yield a correct record, i.e. one that is in agreement with the logical structure of the questionnaire (which one hopes to be in agreement with reality as well). The objective of this process is to correct the original record with as few changes as possible.

The emphasis in this paper is on procedures used in the data editing and imputation process. In particular the efficiency, or computational complexity, of these procedures is considered. This paper assumes a knowledge of only the basics of the theory of computational complexity, at the same level as required for WILLENBORG (1995). Notation and terminology not explained or defined in the present paper are drawn from WILLENBORG (1988) and WILLENBORG (1995). At some points below the notation and terminology defined in those publications is augmented. As in these references we assume that the routing structure of any questionnaire is Markovian.

The remainder of the paper is organized as follows. In section 2 data editing, i.e. the localization and the partial correction of records, is considered. The rather general discussion in this section is elaborated in subsequent sections. In section 3 the problem of localizing and correcting routing errors is considered, which is formulated as a recognition problem in the theory of formal languages. The problem can be solved, in polynomial time, by an algorithm due to Wagner, based on dynamic programming. In section 4 we formulate the problem of localizing and (partially) correcting edit errors as an optimization problem. The formulation presented there is a somewhat simplified version of one of the formulations given in WILLENBORG (1988, section 3.4.1). But even in this formulation the localization of edit errors is NP-hard. In section 5 we discuss a heuristic method, due to Fellegi and Holt, for localizing edit errors. It can be applied in case the domains of the variables are finite


and the edits are CP-edits. In section 6 some computational aspects of imputation are discussed. We shall not discuss the (statistical) pros and cons of imputation (see for instance LITTLE and RUBIN, 1987), and simply assume that it has to be applied. The issues discussed in section 6 are related to the consistency of a set of edits and the order in which they can be applied. Section 7 concludes the paper with a brief discussion.

2 Data editing and imputation: an overview

In this paper we assume that the error localization and correction process is carried out in four consecutive steps:

1. domain checks and corrections,
2. routing checks and corrections,
3. edit checks and corrections,
4. imputations.

It should be stressed that this assumption is made on heuristic grounds. Our method is therefore a heuristic rather than an exact one. The idea behind the method can be described as follows (cf. figure 1). First each variable in a record r is subjected to a domain check. If such a variable is found to have a value outside its domain, this value is replaced by the set of all possible regular values for this variable at this point, i.e. its domain. This latter set can be interpreted as a missing value. This yields a partially corrected record r'. Then r' is subjected to a routing check and possible routing errors are corrected. The outcome of this procedure is another partially corrected record r'', which belongs to the set REC. This means that the values of the variables in r'' are either regular, i.e. elements of the corresponding domains, or are subsets thereof, viz. transition sets. These (transition) sets can be interpreted as missing values. Finally, r'' is subjected to the edit checks. This produces a partially corrected record r''' ∈ REC, which has the same routing structure as r'' and which obeys all edits that are applicable to it. r''' is obtained from r'' by replacing the regular values of certain of its variables by corresponding transition sets (i.e. missing values). If one does not desire a data file with completed records one could terminate the process after this step. However, if the objective is to obtain a data set with complete records, the last step, consisting of imputations, has to be made. This

r → r' → r'' → r''' → r''''
(domain checks & corrections → routing checks & corrections → edit checks & corrections → imputations)

Fig. 1. Schematic view of a data editing and imputation process applied to a record r.


yields a complete record r''''. It should be remarked that it is possible to combine the last two steps, as will be shown when discussing the Fellegi-Holt method.

At each step in figure 1 the record is modified if errors are found. In the first and third type of checks regular values considered to be incorrect are replaced by certain subsets of the corresponding domains. These subsets can be interpreted as missing values. In the second type of check variables might be added to or deleted from r', or values (either of missing or of regular type) of existing variables in r' might be replaced by appropriate transition sets. In the last step certain missing values are replaced by regular values. Each of these regular values is an element of the subset of the domain representing the original missing value. A sketch of the overall process is given below.
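The following minimal Python sketch summarizes the four steps. The record representation and all function names are illustrative assumptions, not taken from the paper; the three procedures are passed in as functions, since sections 3-6 describe how they can be realized.

```python
# Illustrative sketch of the pipeline of figure 1. A record is a list of
# (variable, value) pairs; a missing value is represented by a set of
# values (a domain or a transition set) instead of a regular value.

def edit_and_impute(r, domains, correct_routing, localize_edit_errors, impute):
    # step 1: domain checks -- an out-of-domain value is replaced by the
    # whole domain, which is interpreted as a missing value (record r')
    r1 = [(v, a) if isinstance(a, (set, frozenset)) or a in domains[v]
          else (v, domains[v]) for (v, a) in r]
    # step 2: routing checks and corrections (section 3), yielding r''
    r2 = correct_routing(r1)
    # step 3: edit checks -- idle the violated edits by introducing
    # missing values, changing as few variables as possible (sections 4-5)
    r3 = localize_edit_errors(r2)
    # step 4: imputations -- replace each remaining missing value by a
    # regular value from the corresponding set (section 6), yielding r''''
    return impute(r3)
```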

Routing checks (and corrections) are considered in section 3, whereas edit checks (and corrections) are elaborated in sections 4 and 5. Domain checking (and correcting) does not require any further discussion because it is rather simple and should be clear from the discussion above.

3 Localizing and correcting routing errors

Let a questionnaire Q be given containing n questions $v_1, \ldots, v_n$. Let the routing graph of Q be denoted by $G = (V, E)$. Assume that the routing structure of a partially corrected record $r' = (v_{i_1}, a_{i_1}) \cdots (v_{i_k}, a_{i_k})$ generated by Q is subjected to a routing test, where $a_{i_j}$ is either a regular value in the domain of $v_{i_j}$ or the domain $R_{i_j}$ itself. If r' fails the routing check then it is to be replaced by an incomplete record r'' with a correct routing structure, i.e. $r'' \in \mathrm{REC}$. Recalling from section 2 in WILLENBORG (1995), a record $r'' = (v_{i_1}, a_{i_1}) \cdots (v_{i_k}, a_{i_k})$ has a correct routing structure if the following conditions are satisfied:

1. $v_{i_1}$ is the first question in the questionnaire.
2. $v_{i_k}$ is the last question in the questionnaire.
3. Either $a_{i_j} = R_{i_j, i_{j+1}}$ or $a_{i_j} \in R_{i_j, i_{j+1}}$, for $j = 1, \ldots, k-1$ if $k > 0$.
4. Either $a_{i_k} \in R_{i_k}$ or $a_{i_k} = R_{i_k}$.

We furthermore require r'' to resemble r' as much as possible. How can such an r'' be calculated?

Note first that for the routing structure the exact value of a regular value $a_i$ appearing in a pair $(v_i, a_i)$ is unimportant. Only the transition set $R_{ij}$ to which $a_i$ belongs matters here. Therefore we map r' to a uniquely defined object $\tau_{r'}$, which we shall call a string, and which is obtained from r' by replacing each of its regular values by the corresponding transition set. Values that are domains remain unchanged.

Next, we observe that the path set $\Pi$ can be viewed in a natural way as a so-called regular language (cf. HOPCROFT and ULLMAN, 1969, section 3). Characteristic for a regular language is that its elements, consisting of certain strings of symbols belonging to a finite set I, called the input alphabet, can be recognized by a finite state automaton (FSA). An FSA operates as follows. Starting in its initial state the FSA reads the symbols in a string one by one, from left to right. On the basis of its current


state and the last symbol read it jumps to a uniquely determined state. The jump behaviour of the FSA can be described by a labeled digraph, called a state diagram (the labels are the symbols that enable the various transitions). If by the time the FSA has read the whole string it is in one of a few distinguished states, called final states, then the string is accepted as a member of the regular language that the FSA is supposed to recognize. If, however, the FSA is not in a final state when the string has been read, the string is not accepted as an element of this language ('the string is not recognized by the FSA').

That $\Pi$ can be viewed as a regular language is a result of the following interpretation. The variables $v_1, \ldots, v_n$ in Q are the states of the FSA, the source $v_1$ of G is the initial state of the FSA, and the sink $v_n$ of G is its unique final state. Each edge in G indicates a possible transition between states of the FSA. Furthermore, for every state of the FSA it is possible to remain there. The input alphabet I consists of all pairs $(v_i, R_{ij})$ of variables and transition sets (such that $(v_i, v_j) \in E$), as well as all pairs $(v_i, R_i)$ of variables and domains. Note that $|I| = |V| + |E|$. The jump behaviour of the FSA is described as follows. Suppose the FSA is in state $v_i$ and it reads a symbol $(v_i, R_{ij})$ for some j. In that case it jumps to state $v_j$. In all other cases the FSA remains in state $v_i$. Initially the FSA is in state $v_1$. If the FSA has read an input string and it is in the final state $v_n$, then (and only then) the string is an element of $\Pi$, as one easily verifies. (It is assumed that in any string read by the FSA any variable appears at most once.)
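A minimal Python sketch of this automaton is given below. The representation of symbols as (variable, frozenset) pairs and the function names are illustrative assumptions, not taken from the paper.

```python
# Sketch of the routing FSA. States are the question indices 1..n, with
# 1 the source and n the sink of the routing graph. transition_sets maps
# an edge (i, j) to the transition set R_ij (a frozenset of values).
# Symbols are pairs (variable index, frozenset of values).

def make_fsa(transition_sets):
    def step(state, symbol):
        var, values = symbol
        if var == state:
            for (i, j), R in transition_sets.items():
                if i == state and values == R:
                    return j   # jump along the edge (v_i, v_j)
        return state           # in all other cases: remain in place
    return step

def in_path_set(step, string, n):
    """The string belongs to Pi iff the FSA ends in the final state v_n."""
    state = 1
    for symbol in string:
        state = step(state, symbol)
    return state == n
```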

REMARK 1. The observation that $\Pi$ can be viewed as a regular language is in itself not remarkable, because any finite set of strings can be viewed as a regular language (cf. HOPCROFT and ULLMAN, 1969, p. 36, theorem 3.7). In the case of $\Pi$, however, there is a very natural interpretation of the routing graph G as (essentially) the state diagram of an FSA recognizing the elements of $\Pi$.

With this observation about $\Pi$ at hand, we can apply an algorithm due to WAGNER (1974), which changes an incorrect string $\tau_{r'}$ into a string $\tau'' \in \Pi$ with as few mutations as possible. These changes, called elementary repairs, can be grouped into the following three types (for a variable $v_i$):

1. changing: $(v_i, a_i)$, with $a_i = R_{ij}$ for some j or $a_i = R_i$, is replaced by $(v_i, R_{ik})$ for some k; if $a_i = R_{ij}$ then $k \neq j$,
2. inserting: $(v_i, R_{ij})$ is added for some j,
3. deleting: $(v_i, a_i)$, with $a_i = R_{ij}$ for some j or $a_i = R_i$, is deleted.

Wagner's algorithm is based on dynamic programming. It has a polynomial time and space complexity. After $\tau''$ has been calculated, the record $r'' \in \mathrm{REC}$ is obtained by replacing each pair $(v_i, R_{ij})$ in $\tau''$, for which $(v_i, a_i)$ appears in r' and for which $a_i \in R_{ij}$, by $(v_i, a_i)$. See WILLENBORG (1988, chapter 3) for more details. The dynamic programming idea is sketched below.
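The following Python sketch illustrates the dynamic programming idea behind Wagner's algorithm in its generic form: computing the minimum number of elementary repairs needed to turn a string into one accepted by a given automaton. The edge-list representation and function names are illustrative assumptions; WAGNER (1974) gives the actual algorithm (which also recovers the repaired string itself).

```python
# cost[q] holds, for the prefix of the string read so far, the minimum
# number of repairs needed to turn that prefix into a string driving the
# automaton from the start state to state q.

import math

def min_repairs(edges, start, final, string):
    """edges: list of (state, symbol, next_state) triples."""
    states = {q for (q, _, q2) in edges} | {q2 for (q, _, q2) in edges}
    states |= {start, final}

    def close_insertions(cost):
        # inserting a symbol moves the automaton along one edge at cost 1;
        # relax until no improvement (a shortest-path computation)
        changed = True
        while changed:
            changed = False
            for (q, _, q2) in edges:
                if cost[q] + 1 < cost[q2]:
                    cost[q2] = cost[q] + 1
                    changed = True
        return cost

    cost = close_insertions({q: (0 if q == start else math.inf)
                             for q in states})
    for sym in string:
        new = {q: cost[q] + 1 for q in states}          # delete sym
        for (q, label, q2) in edges:
            # match sym (cost 0) or change sym into label (cost 1)
            c = cost[q] + (0 if label == sym else 1)
            if c < new[q2]:
                new[q2] = c
        cost = close_insertions(new)
    return cost[final]
```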


4 Localizing and correcting edit errors

The localization of edit errors in records can be defined as an optimization problem in several ways. Here we consider one such formulation, which, though it is a simplification, shows the essence of the problem.

Let $r'' \in \mathrm{REC}$ be a partially corrected record, which has been produced after a routing check. Suppose that r'' activates a subset $E_a$ of the set E of edits. Remember that r'' can only activate an edit e in E if the variables involved in e appear in r''. Suppose furthermore that r'' violates a subset $E_{av}$ of $E_a$. We should like to replace r'' by a partially corrected record r''' such that

1. r''' idles all edits in $E_{av}$, because otherwise r''' would still contain an error,
2. r''' has the same routing structure as r'',
3. r''' is repairable, i.e. the transition sets in r''' can be replaced by regular values in these sets such that a complete and correct record is obtained,
4. r''' differs from r'' in as few variables as possible; in other words, the objective is to find and correct as few errors in r'' as possible.

That such an r''' can indeed be found can be seen as follows. Condition 1 can be met by introducing extra missing values in r'' such that (at least) the edits in $E_{av}$ are idled. Condition 2 can be met by replacing the (regular) values of the variables in r'' found to be in error by the transition sets to which the respective values belong. Property 1 in section 2.4 in WILLENBORG (1995) guarantees the existence of a repairable record, as requested in condition 3. To illustrate that condition 3 is not superfluous, consider the following example.

EXAMPLE 1. Let a record r'' contain six variables, denoted by $v_1, \ldots, v_6$. We assume that the corresponding questions occur in this order in the questionnaire, i.e. the routing structure is assumed to be trivial (there is no branching). Let $R_i$ be the domain of $v_i$, $1 \le i \le 6$. Suppose $R_1 = R_3 = \{0,1\}$, $R_2 = R_5 = \{0,1,2\}$, $R_4 = R_6 = \{0,1,2,3\}$. Furthermore suppose that five CP-edits $e_i$ have been defined, to which the following edit sets $E_i$ correspond:

$E_1 = R_1 \times \{0,1\} \times \{0\} \times R_4 \times R_5 \times R_6$
$E_2 = \{1\} \times R_2 \times \{1\} \times \{0,1\} \times R_5 \times \{2,3\}$
$E_3 = \{0\} \times \{1,2\} \times R_3 \times \{1,2,3\} \times R_5 \times R_6$
$E_4 = R_1 \times \{0,2\} \times R_3 \times R_4 \times R_5 \times \{0,1\}$
$E_5 = \{1\} \times R_2 \times R_3 \times \{0\} \times \{1,2\} \times R_6$    (1)

Suppose we have the record $r'' = (v_1, 1)(v_2, 0)(v_3, 0)(v_4, 0)(v_5, 1)(v_6, 0)$. This record activates all five edits in (1). An inspection shows that it violates the edits $e_1$, $e_4$ and $e_5$, or, alternatively expressed, $r'' \in E_1 \cap E_4 \cap E_5$. Evidently there is no single variable that idles all violated edits. It is easy to verify that the variables $v_2$ and $v_5$ together are involved in


all the violated edits ($v_2$ in $e_1$ and $e_4$, $v_5$ in $e_5$). However, it is impossible to impute values from $R_2$ and $R_5$ for $v_2$ and $v_5$ respectively in order to obtain a correct record: to avoid $E_1$ the value of $v_2$ must lie outside $\{0,1\}$, and to avoid $E_4$ it must also lie outside $\{0,2\}$, which is impossible in $R_2 = \{0,1,2\}$.

The problem of checking whether a partially corrected record is repairable is essentially the same as the edit cluster problem for CP-edits, which is known to be NP-complete (cf. theorem 2 in WILLENBORG, 1995). This implies that the following theorem holds.

THEOREM 1. Edit error localization as formulated above is NP-hard.

In the formulation of data editing given above, all variables involved in the violated edits are equally suspected of being in error. This formulation can be modified so as to allow values to be in error in various degrees (cf. WILLENBORG, 1988, section 3.4).

5 The Fellegi-Holt method

Suppose a questionnaire Q only contains questions with finite domains and only contains CP-edits. Furthermore to localize edit errors in an incorrect partially corrected record it is only necessary to consider the variables involved in the activated edits. One of the aims is to idle the violated edits by assigning missing values to one or more variables involved in these edits. As the example in section 4 shows, idling the violated edits only may not yield a repairable partially corrected record. As FELLEGI and HOLT (1976) have shown, such a record will be produced, however, if the set of edits is a so-called complete set. This set consists of the original set of activated edits, supplemented with so-called implied edits. How the implied edits are calculated is explained below. The basic idea of the Fellegi-Holt method is therefore to determine such a complete set of edits for a set of activated edits. Edit error localization is then carried out with this complete set of edits instead of with the original set. We shall show that the edit error localization problem is in fact a set covering problem.

In the notation of the previous section, a partially corrected record r''' should be found satisfying the conditions 1, 2 and 4 formulated there. Any solution then automatically corresponds to a repairable partially corrected record, i.e. satisfies condition 3. We now show that this is a set covering problem. Let $e_1, \ldots, e_k$ be the edits activated and violated by r'', and let $W_i$ denote the set of variables involved in $e_i$. Put $S = W_1 \cup \cdots \cup W_k$ and $C = \{W_1, \ldots, W_k\}$. Recall that an edit $e_i$ is idled if the value of a variable in r'', which is also involved in $e_i$, is replaced by a missing value. Calculating an r''' that satisfies conditions 1, 2 and 4 is equivalent to calculating a subcover $C' \subseteq C$ of S, i.e. with $S = \bigcup_{c \in C'} c$, which also has minimum size, i.e. C' is a minimum cover of S. In GAREY and JOHNSON (1979, p. 222) it is shown that the decision problem corresponding to this set covering problem ('minimum cover') is NP-complete. This implies that the set covering problem itself is NP-hard.
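Since minimum cover is NP-hard, practical systems typically resort to heuristics. The sketch below shows the standard greedy set-covering heuristic applied to this formulation; it is an illustration, not the method of the paper, and it does not guarantee a minimum cover.

```python
# Greedy heuristic for the set covering problem of the localization step:
# S is the union of the variable sets W_i of the violated edits, C the
# collection {W_1, ..., W_k} of those variable sets.

def greedy_cover(S, C):
    uncovered = set(S)
    cover = []
    while uncovered:
        # pick the set that covers the most still-uncovered variables
        best = max(C, key=lambda W: len(W & uncovered))
        if not best & uncovered:
            raise ValueError("C does not cover S")
        cover.append(best)
        uncovered -= best
    return cover
```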


REMARK 2. It is known (cf. GAREY and JOHNSON, 1979, p. 222) that the minimum cover problem remains NP-complete even if $|c| \le 3$ for all $c \in C$. But it is solvable in polynomial time if $|c| = 2$ for all $c \in C$.

In order to apply the Fellegi-Holt method it is important that each CP-edit is written as a disjunction of normal CP-edits, i.e. CP-edits for which the corresponding edit sets are Cartesian product sets. In the remainder of this section we shall therefore assume that all CP-edits are normal. Furthermore, our attention is restricted to the edits in the set $E_a$ that have been activated by r''. As in section 4, $E_{av}$ denotes the subset of $E_a$ consisting of the edits violated by r''.

An edit e is said to be (logically) implied by the edits $e_1$ and $e_2$ if the corresponding edit set $E_e$ of e is derived as follows from the edit sets $E_i$ of $e_i$, for $i = 1, 2$. Let $E_i = A_{i1} \times \cdots \times A_{in}$ $(i = 1, 2)$ for non-empty subsets $A_{ik} \subseteq R_k$. Now choose an index j and define an edit e with edit set

$E_e = (A_{11} \cap A_{21}) \times (A_{12} \cap A_{22}) \times \cdots \times (A_{1,j-1} \cap A_{2,j-1}) \times (A_{1j} \cup A_{2j}) \times (A_{1,j+1} \cap A_{2,j+1}) \times \cdots \times (A_{1n} \cap A_{2n})$,

where $A_{1k} \cap A_{2k} \neq \emptyset$ by assumption, for $k = 1, \ldots, n$ and $k \neq j$. It is easy to verify that if a partially corrected record violates e then it also violates $e_1$ or $e_2$. Hence any implied edit is redundant. The variable $v_j$ is called the generating variable for e. We shall write

$e = e_1 *_j e_2$

to indicate that e is logically implied by $e_1$ and $e_2$, using variable $v_j$ as the generating variable. It is easy to verify that the following identities hold: $e *_j e = e$; $e_1 *_j e_2 = e_2 *_j e_1$, for all j; in general $e_1 *_j (e_2 *_k e_3) \neq (e_1 *_j e_2) *_k e_3$, for $j \neq k$; $e_1 *_j (e_2 *_j e_3) = (e_1 *_j e_2) *_j e_3$, for all j.

EXAMPLE 2. Consider the edits $e_1$ and $e_2$, with corresponding edit sets $E_1 = [3,9] \times [2,6]$ and $E_2 = [1,4] \times [4,9]$. Let $e_3 = e_1 *_1 e_2$ and $e_4 = e_1 *_2 e_2$ denote the two edits logically implied by $e_1$ and $e_2$. These implied edits correspond to the respective edit sets $E_3 = [1,9] \times [4,6]$ and $E_4 = [3,4] \times [2,9]$.
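The $*_j$ operation is straightforward to implement for normal CP-edits. The sketch below encodes edit sets as tuples of Python sets and reproduces example 2, with the intervals encoded as finite integer sets (an illustrative simplification).

```python
# The *_j operation on normal CP-edits: union on the generating
# coordinate, intersection elsewhere. The intersections on the other
# coordinates are assumed to be non-empty, as in the text.

def implied_edit(E1, E2, j):
    """Return the edit set of e1 *_j e2 (j is 0-based here)."""
    return tuple(A1 | A2 if k == j else A1 & A2
                 for k, (A1, A2) in enumerate(zip(E1, E2)))

# Example 2, with the interval [a, b] encoded as set(range(a, b + 1)):
E1 = (set(range(3, 10)), set(range(2, 7)))   # [3,9] x [2,6]
E2 = (set(range(1, 5)),  set(range(4, 10)))  # [1,4] x [4,9]
E3 = implied_edit(E1, E2, 0)                 # [1,9] x [4,6]
E4 = implied_edit(E1, E2, 1)                 # [3,4] x [2,9]
assert E3 == (set(range(1, 10)), set(range(4, 7)))
assert E4 == (set(range(3, 5)),  set(range(2, 10)))
```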

An implied CP-edit e is called essentially new if the set $A_{1j} \cup A_{2j}$ of the generating variable $v_j$ equals the domain $R_j$, and both $A_{1j}$ and $A_{2j}$ are proper subsets of $R_j$. This means that the j-th variable is not involved in e. In case an implied edit $e = e_1 *_j e_2$ is not essentially new, we may discard it for the purpose of edit error localization. The reason is that the set of variables involved in e is the union of the sets of variables in $e_1$ and $e_2$, and furthermore it holds that a record violates e if and only if it violates $e_1$ and $e_2$. However, an implied edit that itself is not essentially new may be used to generate an essentially new implied edit. Therefore we cannot discard such edits when generating a so-called complete set of edits. This is a set of CP-edits such that no essentially new implied edits can be derived from it.


Let $\Omega$ be a complete set of CP-edits defined for $E_a$. Furthermore let $\Omega_k$ $(1 \le k \le n)$ be the maximal subset of $\Omega$ containing all edits in $\Omega$ in which only the variables $v_1, \ldots, v_k$ are involved. That is, the edit sets corresponding to the edits in $\Omega_k$ are of the form $A_1 \times \cdots \times A_k \times R_{k+1} \times \cdots \times R_n$, where $\emptyset \neq A_i \subseteq R_i$ $(i = 1, \ldots, k)$ and $R_j$ is the domain of the j-th variable $(j = 1, \ldots, n)$. We have $\Omega_1 \subseteq \cdots \subseteq \Omega_n = \Omega$. The following theorem holds (cf. FELLEGI and HOLT, 1976, theorem 1). For simplicity, but without loss of generality, assume for the moment that r'' is a complete record. The order in which the variables appear in the record is not necessarily derived from their topological ordering.

THEOREM 2. Let $r'' = (v_1, a_1) \cdots (v_{k-1}, a_{k-1})(v_k, *) \cdots (v_n, *)$, where the a's are arbitrary regular values, be a record that does not violate the edits in $\Omega_{k-1}$. Then there is a value $a_k \in R_k$ such that $r''(a_k) = (v_1, a_1) \cdots (v_{k-1}, a_{k-1})(v_k, a_k)(v_{k+1}, *) \cdots (v_n, *)$ does not violate the edits in $\Omega_k$.

PROOF. See WILLENBORG (1988, pp. 76-77).

Note that the Fellegi-Holt method, as described above, would yield a complete record to replace r''. The method can easily be modified so as to yield a partially corrected record r''', as was originally intended. The Fellegi-Holt method does not necessarily use the minimum number of corrections in r'' to produce a corrected (or a repairable and partially corrected) record r'''. The sequential completion guaranteed by theorem 2 is sketched below.
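The sketch illustrates the completion procedure suggested by theorem 2: given a complete set of edits, missing variables are filled in one at a time, each time choosing a value that keeps the record outside every edit set involving only the variables fixed so far. The data representation is an illustrative assumption, and, as in the theorem, the missing values are assumed to occupy the tail of the record (so that variables can be treated left to right).

```python
# Sequential completion per theorem 2 (illustrative sketch). Edits are
# tuples of sets (the edit sets of normal CP-edits over variables
# 0..n-1); a record is a list of values with None marking a missing
# value. `edits` is assumed to be complete in the Fellegi-Holt sense and
# the filled-in prefix is assumed to violate none of the relevant edits.

def complete_record(record, domains, edits):
    rec = list(record)
    n = len(rec)
    for k in range(n):
        if rec[k] is not None:
            continue
        # Omega_{k+1}: edits involving only variables v_0..v_k, i.e.
        # whose edit sets equal the full domain beyond position k
        omega = [E for E in edits
                 if all(E[m] == domains[m] for m in range(k + 1, n))]
        for candidate in domains[k]:
            rec[k] = candidate
            # a record violates an edit when it lies inside the edit set;
            # beyond position k the factors are full domains, so checking
            # the prefix suffices
            if not any(all(rec[m] in E[m] for m in range(k + 1))
                       for E in omega):
                break
        else:
            rec[k] = None
            raise ValueError("no admissible value: is the edit set complete?")
    return rec
```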

6 Imputation: computational aspects

The Fellegi-Holt method described in the previous section is generally used to produce complete records. But, as was remarked in the last paragraph of that section, it can be modified so as to yield only partially corrected records. In this section we suppose this is the case. A reason not to use the Fellegi-Holt method to obtain complete records is that statistical considerations may lead to specific imputation models that one wants to use instead of the rather mechanical Fellegi-Holt method. In the present section we consider some computational aspects that arise when such imputation procedures are applied to partially corrected records in order to obtain completed ones. Such imputations can be either of a deterministic nature (in which case they are often called derivations) or of a stochastic nature. From our formal point of view there is no need to lay much stress on this distinction.

The present section concentrates on certain aspects of imputation in relation to the logical structure imposed by a questionnaire. We shall not deal with the statistical side of imputation. This is dealt with in the extensive statistical literature on this subject.

Now consider a partially corrected record r''', which has been produced after the edit checks. In the present section we assume that r''' is fed into an imputation system that will try to complete it.


The completion of a partially corrected record r''' by an imputation procedure is either uniquely determined by r''' or it is not. It is uniquely determined if all missing values in r''' can be imputed using deterministic imputations, or, synonymously, derivations. Formally, a derivation of $v_{k+1}$ from $v_1, \ldots, v_k$, with $v_1 < \cdots < v_k$ ('$<$' is the strict order derived from a topological sort '$\le$' of the associated routing graph), is a partial function $g: R_1 \times \cdots \times R_k \to R_{k+1}$, where $R_i$ is the domain of $v_i$, $i = 1, \ldots, k+1$. In WILLENBORG (1988, section 6.3) a method is suggested to calculate the values in the domain of g for which it is defined.
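A partial function is naturally encoded as a dictionary over the argument tuples on which it is defined. The sketch below is an illustrative assumption, not the representation of WILLENBORG (1988).

```python
# A hypothetical derivation g: R_1 x R_2 -> R_3, defined on only part of
# R_1 x R_2. The dictionary lists the argument tuples on which g is
# defined; elsewhere the derivation cannot be applied.
g = {
    (0, 0): 0,
    (0, 1): 1,
    (1, 0): 1,
    # (1, 1) omitted: g is not defined there
}

def derive(g, args):
    """Apply a derivation; returns None where it is undefined."""
    return g.get(tuple(args))
```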

If there are several possible completions of r''', as is usually the case, one or more (stochastic) imputations should be applied. These stochastic imputations use a statistical model to replace missing values by regular ones. In this case g is a partially defined random variable, i.e. a random variable defined on a subset of $R_1 \times \cdots \times R_k$.

We have assumed that the codomains of a stochastic imputation are univariate. This is the simplest case in practice, but for the theory as it is presented here it is of little importance.

In order to investigate whether a set of imputations is correctly defined, we introduce the following formalism. With each imputation that has been specified we associate an imputation triple $(g_i, W_i, V_i)$, where $g_i$ denotes the i-th imputation, $W_i$ is the set of auxiliary variables for this imputation, and $V_i$ is the set containing the imputation variable of this imputation. For our purposes below it is not strictly necessary that such a set $V_i$ contains exactly one element, but we shall assume it, in view of the convention adopted above with respect to imputations.

An evident requirement for an imputation triple is that $W_i \cap V_i = \emptyset$. Furthermore, like edits, imputations should be activatable, which means that the variables in $W_i \cup V_i$ should lie on a path in the routing graph. Another requirement, which does not have a counterpart for edits, is that if there are two imputation triples $(g_i, W_i, V)$ and $(g_j, W_j, V)$ with the same imputation variable, then they should not be simultaneously activatable, in order to avoid ambiguities. That means that, in this case, there should be no path in the routing graph that cuts $W_i \cup W_j \cup V$.

We can define a partial order on the set of imputation triples associated with a questionnaire, as follows: $(g_i, W_i, V_i) < (g_j, W_j, V_j)$

if and only if

1. $(g_i, W_i, V_i)$ and $(g_j, W_j, V_j)$ are simultaneously activatable,
2. $V_i \cap W_j \neq \emptyset$.

If $(g_i, W_i, V_i) < (g_j, W_j, V_j)$ then $g_i$ should be applied prior to $g_j$. This order structure defines a directed graph J on the set of imputation triples. Another requirement for a set of imputation triples is that J is acyclic. Otherwise certain imputations can never be carried out, because the necessary background information is (partly) lacking and will never be supplied by the application of some other imputations. Note that this


acyclicity requirement for J implies that $W_i \cap V_j = \emptyset$ for any pair of imputation triples satisfying $(g_i, W_i, V_i) < (g_j, W_j, V_j)$.

Acyclicity of J can easily be tested by applying a topological sort to J. If the topological sort cannot be completed, J is cyclic; otherwise J is acyclic.
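A standard way to do this is Kahn's algorithm, sketched below: if the sort consumes every node, J is acyclic, and the resulting order is also a valid order in which to apply the imputations. The graph representation is an illustrative assumption.

```python
from collections import deque

def topological_order(nodes, arcs):
    """arcs: list of (u, w) meaning imputation u must precede w.
    Returns a topological order of J, or None if J contains a cycle."""
    indegree = {v: 0 for v in nodes}
    succ = {v: [] for v in nodes}
    for u, w in arcs:
        succ[u].append(w)
        indegree[w] += 1
    queue = deque(v for v in nodes if indegree[v] == 0)
    order = []
    while queue:
        v = queue.popleft()
        order.append(v)
        for w in succ[v]:
            indegree[w] -= 1
            if indegree[w] == 0:
                queue.append(w)
    return order if len(order) == len(nodes) else None
```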

Now suppose that a partially corrected record with several missing values is entered into an imputation system. The first thing that has to be considered is whether an imputation process can be started at all, i.e. whether any of the necessary imputations can indeed be carried out. Therefore it should be verified that the required imputations have indeed been defined for all imputation variables. If not, the record cannot be fully completed by the system. Of course the imputation system can still proceed by completing as much of the record as possible. Afterwards it can transfer such a record to a special file for inspection by a subject matter expert, or it may drop the record because too much information is lacking.

The order in which the imputations should be applied to a partially corrected record r''' is determined by the partial order structure embodied in J. To test whether r''' can be made complete by applying imputations, and if so, in what order which imputations should be applied, proceed as follows. First identify the variables with missing values in the record. Then check that for each of these variables an imputation triple $(g_i, W_i, V_i)$ is available. If not, the record cannot be completed. Otherwise it can be completed, and then any imputation triple $(g_i, W_i, V_i)$ has exactly one of the following properties:

1. all variables in $W_i$ have regular values in r''', or
2. there is at least one variable in $W_i$ with a missing value in r'''.

If an imputation $g_i$ has the first property it can be applied; in the second case it cannot. Carry out all imputations that have the first property, and check that the resulting values satisfy the constraints imposed by the logical structure. Assuming that this can be carried out without any problems, we have created either a completed record or a new partially corrected record s. If we repeat the arguments above, but now for s instead of r''', and so on, we see that the process finally yields a completed record r''''. This iteration is sketched below.
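A minimal sketch of this iteration, under the illustrative assumption that a record is a dictionary from variables to values, with None marking a missing value (the consistency check against the logical structure is omitted here):

```python
# Repeatedly apply every imputation whose auxiliary variables all have
# regular values, until the record is complete or no further imputation
# is applicable. Triples are (g, W, V) with W an ordered tuple of
# auxiliary variables (matching the parameters of g) and V a one-element
# tuple containing the imputation variable.

def impute_record(record, triples):
    progress = True
    while progress:
        progress = False
        for g, W, V in triples:
            (v,) = V
            if record[v] is None and all(record[w] is not None for w in W):
                record[v] = g(*(record[w] for w in W))
                progress = True
    # any variable still None could not be imputed (no triple available,
    # or a chain of missing auxiliary variables)
    return record
```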

7 Discussion

In this paper we have considered several computational aspects of data editing and imputation, with an emphasis on the role played by the logical structure of a questionnaire. The perspective of the discussion was that of micro-editing, to be contrasted with macro-editing. In micro-editing individual records are inspected and, if necessary, corrected. In macro-editing certain aggregate data derived from the collected microdata are the focus of attention. These aggregates are checked for plausibility against corresponding ones from a previous period. In certain applications, for instance when the objective is to publish only a few tables with limited


detail, the macro-editing approach is a more efficient and cost-effective way to obtain 'cleaned data' than a micro-editing one. Furthermore, macro-editing catches different errors in the data than micro-editing does. Therefore these methods must be viewed as supplementing each other rather than as mutual substitutes (of which macro-editing would be the more attractive one from an efficiency point of view). In case a reliable microdata set is required for detailed econometric or sociometric analyses, some form of micro-editing is indispensable.

In the paper a heuristic method for data editing was presented and the computational complexity of its various parts was discussed. The data editing method considered consists of three steps, which are carried out consecutively: checks and corrections of 1. the domains of the variables in a record, 2. the routing structure, and 3. the restrictions imposed by the edits in the questionnaire (if any). The corrections in these steps consist of replacing regular values by certain missing ones, interpreted as specific subsets of the domains of the corresponding variables. As it turns out, the first step is trivial, the second is solvable in polynomial time, and the third is likely to be intractable.

We also briefly discussed the Fellegi-Holt imputation method, which can be used to obtain complete records. As such the method is claimed to have been successfully applied in various surveys in Canada, Spain and Brazil. Statistical offices in these countries employ computer programs (GEIS in Canada, DIA in Spain and Brazil, LINCE in Spain) based on this method.

The Fellegi-Holt method can be modified so as to obtain only partially corrected records. As such it is a heuristic method to solve the third problem mentioned above. When it is applied and one is not satisfied with the incomplete data set, one has to apply imputation procedures, which can be based on certain statistical considerations. The discussion was closed with some general observations concerning computational aspects of imputation methods. The emerging formalism is useful to describe non-Markovian questionnaires, which generalize the Markovian questionnaires considered in this paper.

Acknowledgements

This paper, as well as its companion WILLENBORG (1995), is based on a portion of my Ph.D. thesis, published as WILLENBORG (1988). I wish to thank the referees whose remarks helped improve the exposition of the originally submitted version of this paper.

References

FELLEGI, I. P. and D. HOLT (1976), A systematic approach to automatic edit and imputation, Journal of the American Statistical Association 71, 17-35.

GAREY, M. R. and D. S. JOHNSON (1979), Computers and intractability: a guide to the theory of NP-completeness, W. H. Freeman & Co, San Francisco.

HOPCROFT, J. E. and J. D. ULLMAN (1969), Formal languages and their relation to automata, Addison-Wesley, Reading (Mass.).

LITTLE, R. J. A. and D. B. RUBIN (1987), Statistical analysis with missing data, John Wiley and Sons, New York.

WAGNER, R. A. (1974), Order-n correction for regular languages, Communications of the ACM 17, 265-268.

WILLENBORG, L. C. R. J. (1988), Computational aspects of survey data processing, CWI tract 54, Centre for Mathematics and Computer Science, Amsterdam.

WILLENBORG, L. C. R. J. (1995), Testing the logical structure of questionnaires, Statistica Neerlandica 49, 95-109.

Received: March 1989. Revised: January 1994.