Automatic Editing with Soft Edits Sander Scholtus (Statistics Netherlands)

Page 1: Automatic Editing with Soft Edits

Automatic Editing with Soft Edits
Sander Scholtus (Statistics Netherlands)

Page 2: Automatic Editing with Soft Edits

Automatic editing

• Goal: Detect and correct errors and missing values without human intervention

• Data is made consistent with respect to a set of edits

• Two steps:
  • detecting erroneous and missing values (error localisation)
  • imputation of new values

Page 3: Automatic Editing with Soft Edits

Automatic editing (2)

• Fellegi-Holt paradigm for error localisation: Find the smallest subset of the variables that can be imputed to satisfy all edits

• Generalised version uses confidence weights

• At Statistics Netherlands: SLICE software

Page 4: Automatic Editing with Soft Edits

SLICE

• Branch-and-bound algorithm:

[Figure: binary tree. The root node x1 splits into the branches "x1 correct" and "x1 erroneous"; each resulting node x2 splits into "x2 correct" and "x2 erroneous", giving four nodes x3.]

Page 5: Automatic Editing with Soft Edits

SLICE

• Branch-and-bound algorithm:

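[Figure: binary tree. The root node x1 splits into the branches "fix x1" and "eliminate x1"; each resulting node x2 splits into "fix x2" and "eliminate x2", giving four nodes x3.]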

Page 6: Automatic Editing with Soft Edits

SLICE (2)

• Leaf nodes of the tree:
  • all variables have been either fixed or eliminated
  • interpretation: eliminated variables are incorrect

• Associated sets of edits:
  • contain no variables
  • either empty or contain only trivial statements

• Theorem (De Waal and Quere, 2003): A leaf node corresponds to a feasible solution of the error localisation problem if and only if the associated set of edits contains no contradictions
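
To make the link between leaf nodes and candidate solutions concrete, here is a minimal sketch that lists every leaf of the full tree for three variables. It is an illustration only, not the SLICE implementation: the function name `enumerate_leaves` is hypothetical, and a real branch-and-bound search would prune branches rather than visit all leaves.

```python
from itertools import product

def enumerate_leaves(variables):
    """Yield (assignment, eliminated) for every leaf of the full binary tree.

    Each variable is either fixed (treated as correct) or eliminated
    (treated as erroneous), so n variables give 2**n leaves.
    """
    for choices in product(("fix", "eliminate"), repeat=len(variables)):
        assignment = dict(zip(variables, choices))
        eliminated = [v for v in variables if assignment[v] == "eliminate"]
        yield assignment, eliminated

leaves = list(enumerate_leaves(["x1", "x2", "x3"]))
print(len(leaves))     # 8 leaves for 3 variables
print(leaves[0][1])    # []: every variable fixed
print(leaves[-1][1])   # ['x1', 'x2', 'x3']: every variable eliminated
```

Each leaf is a candidate set of variables to impute; by the theorem, it is a feasible solution exactly when the edits left over at that leaf contain no contradictions.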

Page 7: Automatic Editing with Soft Edits

SLICE (3)

• Application of SLICE in the production process:
  • automatic editing of micro data for the Dutch structural business statistics
  • approximately 100 variables and 100 edits
  • evaluation studies: sometimes large differences between automatic and manual editing

Page 8: Automatic Editing with Soft Edits

Hard edits and soft edits

• Examples of edits:
  1. Profit = Turnover - Costs
  2. Profit < 0.6 × Turnover

• First example:
  • hard edit
  • has to hold by definition

• Second example:
  • soft edit
  • can also be failed by correct values
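
The distinction can be illustrated with a small check of the two example edits. This is a minimal sketch: the record values, the helper `failed_edits`, and the rounding tolerance `TOL` are assumptions made for the illustration, not part of the presentation.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, float]
Edit = Tuple[str, Callable[[Record], bool]]

# Tolerance for the equality edit (an assumption, to absorb rounding).
TOL = 1e-9

HARD_EDITS: List[Edit] = [
    ("Profit = Turnover - Costs",
     lambda r: abs(r["Profit"] - (r["Turnover"] - r["Costs"])) <= TOL),
]
SOFT_EDITS: List[Edit] = [
    ("Profit < 0.6 x Turnover",
     lambda r: r["Profit"] < 0.6 * r["Turnover"]),
]

def failed_edits(record: Record, edits: List[Edit]) -> List[str]:
    """Return the names of the edits that the record does not satisfy."""
    return [name for name, holds in edits if not holds(record)]

# Hypothetical record: internally consistent, but unusually profitable.
record = {"Turnover": 100.0, "Costs": 30.0, "Profit": 70.0}
print(failed_edits(record, HARD_EDITS))  # []
print(failed_edits(record, SOFT_EDITS))  # ['Profit < 0.6 x Turnover']
```

The record satisfies the hard edit but fails the soft edit, which is what a correct but unusual observation can look like.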

Page 9: Automatic Editing with Soft Edits

Hard edits and soft edits (2)

• Manual editing uses both hard and soft edits

• Current methods for automatic editing can only handle hard edits

• Practical solutions:
  • ignore all soft edits
  • treat soft edits as hard edits

• Can this be improved?

Page 10: Automatic Editing with Soft Edits

Error localisation with soft edits

• Current error localisation problem: Minimise, among subsets of variables that can be imputed to satisfy all edits, the sum of the confidence weights

• Suggested new error localisation problem: Minimise, among subsets of variables that can be imputed to satisfy all hard edits, the sum of the confidence weights plus a cost term for failed soft edits
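
The two problems can be written side by side as follows. The notation is assumed for this sketch and does not appear on the slides: y_j = 1 if variable j is imputed, w_j is its confidence weight, z_k = 1 if soft edit k is failed by the imputed record, and s_k is the cost of failing it. A fixed cost per failed soft edit is only one possible form of the cost term.

```latex
% Assumed notation: y_j, w_j, z_k, s_k as described in the lead-in.
\begin{align*}
\text{current:} \quad & \min_{y} \; \sum_{j} w_j y_j
  && \text{s.t. the variables with } y_j = 1
     \text{ can be imputed to satisfy all edits} \\
\text{new:} \quad & \min_{y,\,z} \; \sum_{j} w_j y_j + \sum_{k} s_k z_k
  && \text{s.t. the variables with } y_j = 1
     \text{ can be imputed to satisfy all hard edits}
\end{align*}
```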

Page 11: Automatic Editing with Soft Edits

Error localisation with soft edits (2)

• The new error localisation problem can be solved by an extended version of the SLICE algorithm

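[Figure: binary tree. The root node x1 splits into the branches "fix x1" and "eliminate x1"; each resulting node x2 splits into "fix x2" and "eliminate x2", giving four nodes x3.]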

Page 12: Automatic Editing with Soft Edits

Example

• Variables: Turnover (T), Profit (P), Costs (C), Number of Employees (N)

• Edits:
  Hard edits: $T - C - P = 0$; $T \geq 0$; $C \geq 0$; $N \geq 0$; $550N - T \geq 0$
  Soft edits: $0.5T - P \geq 0$; $P + 0.1T \geq 0$

• Confidence weights: Turnover: 2; Profit: 1; Costs: 1; Number of Employees: 3

• Contribution of each failed soft edit: 2

Page 13: Automatic Editing with Soft Edits

Example (2)

• Original data and edits:
  T = 100; P = 40000; C = 60000; N = 5
  Hard edits: $T - C - P = 0$; $T \geq 0$; $C \geq 0$; $N \geq 0$; $550N - T \geq 0$
  Soft edits: $0.5T - P \geq 0$; $P + 0.1T \geq 0$

Page 14: Automatic Editing with Soft Edits

Example (3)

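• Original data and edits:
  T = 100; P = 40000; C = 60000; N = 5
  Hard edits: $T - C - P = 0$; $T \geq 0$; $C \geq 0$; $N \geq 0$; $550N - T \geq 0$
  Soft edits: $0.5T - P \geq 0$; $P + 0.1T \geq 0$

• Eliminate P from the original edits:
  Implied hard edits: $T \geq 0$; $C \geq 0$; $N \geq 0$; $550N - T \geq 0$
  Implied soft edits: $0.5T - C \leq 0$; $1.1T - C \geq 0$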
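
Because P occurs in an equality, it can be eliminated by substitution. The sketch below assumes the edits read T - C - P = 0 (hard) and 0.5T - P ≥ 0, P + 0.1T ≥ 0 (soft); the data layout and the helper `substitute` are hypothetical and chosen only for this illustration.

```python
from fractions import Fraction as F

# An edit (coeffs, const) is read as: sum(coeffs[v] * v) + const >= 0.
SOFT = [
    ({"T": F(1, 2), "P": F(-1)}, F(0)),   # 0.5T - P >= 0
    ({"P": F(1), "T": F(1, 10)}, F(0)),   # P + 0.1T >= 0
]
# Balance edit T - C - P = 0 solved for P: P = T - C.
P_EXPR = ({"T": F(1), "C": F(-1)}, F(0))

def substitute(edit, var, expr):
    """Replace `var` in `edit` by the linear expression `expr`."""
    coeffs, const = edit
    a = coeffs.get(var, F(0))
    out = {v: c for v, c in coeffs.items() if v != var}
    for v, c in expr[0].items():
        out[v] = out.get(v, F(0)) + a * c
    # Drop variables whose coefficients cancelled out.
    return {v: c for v, c in out.items() if c != 0}, const + a * expr[1]

implied = [substitute(edit, "P", P_EXPR) for edit in SOFT]

# Evaluate the implied soft edits on the remaining observed values.
record = {"T": F(100), "C": F(60000), "N": F(5)}
failed = [
    i for i, (coeffs, const) in enumerate(implied)
    if sum(c * record[v] for v, c in coeffs.items()) + const < 0
]
print(failed)  # indices (0-based) of implied soft edits failed by the record
```

The first implied edit comes out as C - 0.5T ≥ 0, which is the same constraint as 0.5T - C ≤ 0, and the second as 1.1T - C ≥ 0.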

Page 15: Automatic Editing with Soft Edits

Example (4)

• According to the theory, P can be imputed to satisfy all hard edits, but the second soft edit is failed

• Imputing only P is a feasible solution to the error localisation problem

• The value of the target function equals 1 + 2 = 3

Page 16: Automatic Editing with Soft Edits

Example (5)

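• Data and edits after eliminating P:
  T = 100; C = 60000; N = 5
  Implied hard edits: $T \geq 0$; $C \geq 0$; $N \geq 0$; $550N - T \geq 0$
  Implied soft edits: $0.5T - C \leq 0$; $1.1T - C \geq 0$

• Eliminate C from these edits:
  Implied hard edits: $T \geq 0$; $N \geq 0$; $550N - T \geq 0$
  Implied soft edits: $1.1T \geq 0$; $0.6T \geq 0$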
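
The elimination of C can be worked out by pairing every lower bound on C with every upper bound, as in Fourier-Motzkin elimination. This derivation is a sketch that assumes the implied edits after eliminating P read C ≥ 0, 0.5T - C ≤ 0 and 1.1T - C ≥ 0.

```latex
\begin{align*}
\text{lower bounds on } C: &\quad C \geq 0, \quad C \geq 0.5T \\
\text{upper bound on } C:  &\quad C \leq 1.1T \\
0 \leq C \leq 1.1T    &\;\Rightarrow\; 1.1T \geq 0 \\
0.5T \leq C \leq 1.1T &\;\Rightarrow\; 1.1T - 0.5T = 0.6T \geq 0
\end{align*}
```

With T = 100 both resulting edits hold, so a value for C exists that satisfies the hard and soft edits once P is imputed as well.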

Page 17: Automatic Editing with Soft Edits

Example (6)

• According to the theory, P and C can be imputed to satisfy all hard and soft edits

• Imputing P and C is a feasible solution to the error localisation problem

• The value of the target function equals 1 + 1 = 2

• This turns out to be the optimal solution

• Possible corrected version of the record: T = 100; P = 40; C = 60; N = 5
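
The optimality claim can be checked by brute force on this small example. The sketch below is not the extended SLICE algorithm: it tries every subset of variables and every set of soft edits allowed to fail, and tests feasibility with Fourier-Motzkin elimination. The edits are assumed to read T - C - P = 0, T ≥ 0, C ≥ 0, N ≥ 0, 550N - T ≥ 0 (hard) and 0.5T - P ≥ 0, P + 0.1T ≥ 0 (soft); all function names are hypothetical.

```python
from fractions import Fraction as F
from itertools import combinations

# A constraint (coeffs, const) means: sum(coeffs[v] * v) + const >= 0.
HARD = [
    ({"T": F(1), "C": F(-1), "P": F(-1)}, F(0)),   # T - C - P >= 0
    ({"T": F(-1), "C": F(1), "P": F(1)}, F(0)),    # T - C - P <= 0
    ({"T": F(1)}, F(0)),                           # T >= 0
    ({"C": F(1)}, F(0)),                           # C >= 0
    ({"N": F(1)}, F(0)),                           # N >= 0
    ({"N": F(550), "T": F(-1)}, F(0)),             # 550N - T >= 0
]
SOFT = [
    ({"T": F(1, 2), "P": F(-1)}, F(0)),            # 0.5T - P >= 0
    ({"P": F(1), "T": F(1, 10)}, F(0)),            # P + 0.1T >= 0
]
WEIGHTS = {"T": 2, "P": 1, "C": 1, "N": 3}
SOFT_COST = 2
RECORD = {"T": F(100), "P": F(40000), "C": F(60000), "N": F(5)}

def substitute(constraint, values):
    """Plug in the fixed values; the remaining variables stay free."""
    coeffs, const = constraint
    free = {v: a for v, a in coeffs.items() if v not in values}
    fixed_part = sum(a * values[v] for v, a in coeffs.items() if v in values)
    return free, const + fixed_part

def eliminate(constraints, var):
    """Fourier-Motzkin elimination of one variable from >= 0 constraints."""
    pos = [c for c in constraints if c[0].get(var, 0) > 0]
    neg = [c for c in constraints if c[0].get(var, 0) < 0]
    out = [c for c in constraints if c[0].get(var, 0) == 0]
    for pc, pk in pos:
        for nc, nk in neg:
            a, b = -nc[var], pc[var]  # positive multipliers cancel var
            coeffs = {}
            for v in set(pc) | set(nc):
                coef = a * pc.get(v, 0) + b * nc.get(v, 0)
                if v != var and coef != 0:
                    coeffs[v] = coef
            out.append((coeffs, a * pk + b * nk))
    return out

def feasible(constraints, free_vars):
    """True if the free variables can take values satisfying everything."""
    for var in free_vars:
        constraints = eliminate(constraints, var)
    return all(const >= 0 for _, const in constraints)

def best_solution():
    """Return (cost, imputed variables, indices of failed soft edits)."""
    best = None
    names = sorted(WEIGHTS)
    for k in range(len(names) + 1):
        for imputed in combinations(names, k):
            fixed = {v: RECORD[v] for v in names if v not in imputed}
            for n_failed in range(len(SOFT) + 1):
                for failed in combinations(range(len(SOFT)), n_failed):
                    kept = [s for i, s in enumerate(SOFT) if i not in failed]
                    system = [substitute(c, fixed) for c in HARD + kept]
                    if feasible(system, imputed):
                        cost = (sum(WEIGHTS[v] for v in imputed)
                                + SOFT_COST * n_failed)
                        if best is None or cost < best[0]:
                            best = (cost, imputed, failed)
    return best

print(best_solution())  # (2, ('C', 'P'), ())
```

Under these assumptions the search returns cost 2 for imputing C and P with no failed soft edit, in agreement with the slide. Exhaustive enumeration grows exponentially with the number of variables, which is why a branch-and-bound search is used for realistic records.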

Page 18: Automatic Editing with Soft Edits

Example (7)

• Imputing only P is the optimal solution if the soft edits are ignored

• Corrected version of the record: T = 100; P = -59900; C = 60000; N = 5

Page 19: Automatic Editing with Soft Edits

Discussion

• Future work:
  • Implementation of the algorithm in R (in progress)
  • Test on realistic data (Dutch structural business statistics)
  • How to model the costs of failed soft edits

Thank you for your attention!