Protecting the Confidentiality of Tables by Adding Noise to the Underlying Microdata Paul Massell and Jeremy Funk Statistical Research Division U.S. Census Bureau Washington, DC 20233 [email protected]




Talk Outline

1. Overview of EZS Noise

2. Measuring Effectiveness of Perturbative Protection

3. Noise Applied to Weighted Data

4. Noise Applied to Unweighted Data: Random vs. Balanced Noise

5. Conclusions and Future Research


The EZS Noise Method (Evans, Zayatz, Slanta)

Developed by Tim Evans, Laura Zayatz, and John Slanta in the 1990s

Multiplicative noise is added to the underlying microdata, before table creation

A noise factor or multiplier is randomly generated for each record


The distribution of the multipliers should produce unbiased estimates, and ensure that no multipliers are too close to 1

Weights both known and unknown to users are combined with the noise factors to obtain ‘noisy’ values for all records

When tabulated, in general, sensitive cells are changed quite a bit and non-sensitive cells are changed only by a small amount



Attractive Features of EZS

Tables with noisy data are created in the same way as the original tables: simply replace variable X with variable X-noisy

Tables are automatically additive

An approximate value could be released for every cell (depends on agency policy)

No complementary suppressions


Linked tables and special tabs are automatically protected consistently

EZS allows for protection at the company level (Census requirement)

Ease of implementation compared to methods such as cell suppression



Measuring Effectiveness of the EZS Method

Step 1: Determine which cells in a table are sensitive – e.g., using p% Sensitivity Rule

Step 2: Measure level of protection to sensitive cells (using protection multipliers)

Step 3: Measure amount of perturbation to non-sensitive cells (via % change graph)


The p% Sensitivity Rule

Unweighted Data:

Let T = cell total; x1, x2 = top 2 contributions

Let ‘rem’ denote the remainder: rem = T – (x1 + x2)

Let ‘prot’ denote the suggested protection: prot = (p/100) * x1 – rem

If prot > 0, then when Contributor 2 tries to estimate x1, rem does NOT provide enough uncertainty; additional protection is needed, and noise may provide this uncertainty
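The unweighted rule can be sketched in a few lines of Python. This is only an illustration of the rem and prot arithmetic on this slide; the function names, the example cell, and the choice p = 10 are assumptions made for the example, not values from the talk.

```python
def p_percent_protection(contributions, p):
    """Suggested protection 'prot' for one unweighted cell.

    contributions: non-negative respondent values in the cell.
    p: the p% threshold (hypothetical value chosen by the agency).
    """
    ordered = sorted(contributions, reverse=True)
    x1 = ordered[0]                                # largest contribution
    x2 = ordered[1] if len(ordered) > 1 else 0.0   # second largest, if any
    total = sum(ordered)                           # cell total T
    rem = total - (x1 + x2)                        # remainder
    return (p / 100.0) * x1 - rem


def is_sensitive(contributions, p):
    """A cell is sensitive when the remainder gives too little uncertainty."""
    return p_percent_protection(contributions, p) > 0


# Illustrative cell: rem = 7 and 10% of x1 is 10, so prot = 3 and the cell is sensitive.
example_cell = [100.0, 50.0, 4.0, 3.0]
print(p_percent_protection(example_cell, p=10), is_sensitive(example_cell, p=10))
```

A cell with a single contributor has rem = 0, so it is always sensitive under this rule for any p > 0.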


p% Sensitivity Rule

Weighted Data:

TA = Fully Weighted Cell Estimate

X1 = Largest Cell Respondent Contribution

X2 = 2nd Largest Cell Contribution

wkn = Known Weights

wun = Unknown Weights


Extended p% rule with weights and rounding

rem = TA – (X1 * wkn1 + X2 * wkn2 )

prot = ( (p/100) * X1 * wkn1 ) – rem
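As a hedged sketch, the two weighted formulas translate directly into Python. The function name and the example numbers are assumptions, and the rounding adjustment mentioned in the slide title is not spelled out in the slides, so it is left out here.

```python
def weighted_protection(ta, x1, wkn1, x2, wkn2, p):
    """Suggested protection for a weighted cell, per the extended p% rule.

    ta:   fully weighted cell estimate (TA)
    x1:   largest respondent contribution, with known weight wkn1
    x2:   second largest contribution, with known weight wkn2
    p:    the p% threshold (hypothetical value)
    Rounding is NOT modelled; the slides do not give that detail.
    """
    rem = ta - (x1 * wkn1 + x2 * wkn2)
    return (p / 100.0) * x1 * wkn1 - rem


# Illustrative numbers only: rem = 305 - 300 = 5, prot = 20 - 5 = 15 (sensitive).
print(weighted_protection(ta=305.0, x1=100.0, wkn1=2.0, x2=50.0, wkn2=2.0, p=10))
# A large weighted remainder makes prot negative (not sensitive).
print(weighted_protection(ta=1000.0, x1=100.0, wkn1=2.0, x2=50.0, wkn2=2.0, p=10))
```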


Measuring the Effectiveness of a Perturbative Protection Method

Protection of Sensitive Cells: Define Protection Multiplier (PM)

PM = abs (perturbation) / prot

Find how many (or what %) have PM < 1

Data Quality:

Important: % change for non-sensitive cells

Less important: % over-perturbation for sensitive cells
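A minimal sketch of the protection measure, assuming each sensitive cell is given as a triple (original value, noisy value, prot) with prot > 0. The function names and sample cells are hypothetical.

```python
def protection_multiplier(original, noisy, prot):
    """PM = abs(perturbation) / prot, defined for sensitive cells (prot > 0)."""
    return abs(noisy - original) / prot


def count_underprotected(sensitive_cells):
    """Return (count, percent) of sensitive cells with PM < 1.

    sensitive_cells: list of (original, noisy, prot) triples.
    """
    pms = [protection_multiplier(o, n, p) for o, n, p in sensitive_cells]
    under = sum(1 for pm in pms if pm < 1)
    return under, 100.0 * under / len(pms)


# Illustrative cells: the first is fully protected (PM > 1), the second is not.
cells = [(157.0, 170.0, 3.0), (200.0, 201.0, 4.0)]
print(count_underprotected(cells))
```

A cell counts as fully protected when PM >= 1, that is, when the noise moved the cell value at least as far as the p% rule asks.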


EZS Noise Factors for Unweighted Data

Let X = original microdata value

Let Y = perturbed value

Let M = noise multiplier; i.e. a draw from a specified noise distribution of EZS type

Y = X * M


Noise Distribution used for all examples: (a = 1.05, b = 1.15), i.e. 5% to 15% noise
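The plot of the noise distribution did not survive on this slide, so its exact shape is unknown. The sketch below assumes one simple possibility: the direction (up or down) is chosen with equal probability and the size is uniform between a and b, which puts M in [0.85, 0.95] or [1.05, 1.15]. That choice is symmetric about 1 (hence unbiased) and never too close to 1, but it is an assumption, not the authors' stated distribution; the function names are hypothetical as well.

```python
import random


def draw_multiplier(a=1.05, b=1.15, rng=random):
    """Draw one noise multiplier M (ASSUMED uniform within each band)."""
    magnitude = rng.uniform(a, b)   # lies in [a, b], e.g. [1.05, 1.15]
    if rng.random() < 0.5:
        return magnitude            # perturb upward
    return 2.0 - magnitude          # mirror about 1, e.g. [0.85, 0.95]


def add_noise(values, a=1.05, b=1.15, rng=random):
    """Apply Y = X * M with an independent multiplier per record."""
    return [x * draw_multiplier(a, b, rng) for x in values]


# Seeded generator so the illustration is repeatable.
rng = random.Random(2024)
print(add_noise([120.0, 45.0, 8.0], rng=rng))
```

In a production setting each company would keep one direction across all its records and tables; that bookkeeping is omitted here.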


Noise Applied to Weighted Data

Key idea: weights (e.g., sample weights) provide protection to microdata since users typically “know” weights only roughly (except when close to 1)

Not necessary to apply full M factor to X unless w = 1


EZS Noise Factor for Weighted Data

Weighted Data:

For a simple weight w with associated uncertainty interval at least as wide as 2*b*w, the noise factor M can be combined with w to form the Joint Noise-Weight Factor

JNW = M + (w – 1)

Y = X * JNW


Noise Formula for Known and Unknown Weights

Calculation of Perturbed Values:

wkn is the known weight

wun is the unknown weight.

Y = X * wkn * ( M + (wun – 1) )
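A small sketch of the perturbed-value calculation, reading the slide's formula as Y = X * wkn * (M + (wun – 1)). With wkn = 1 this reduces to the Joint Noise-Weight form Y = X * (M + (w – 1)). The function name and example numbers are illustrative assumptions.

```python
def noisy_weighted_value(x, m, w_kn=1.0, w_un=1.0):
    """Perturbed weighted value: Y = X * wkn * (M + (wun - 1)).

    x:    original microdata value
    m:    noise multiplier for the record
    w_kn: weight known to users (applied in full)
    w_un: weight unknown to users (absorbs part of the protection)
    """
    return x * w_kn * (m + (w_un - 1.0))


# With no weights the record gets the full 10% noise: 100 -> 110.
print(noisy_weighted_value(100.0, 1.10))
# With an unknown weight of 20, the weighted value 2000 moves to about 2010 (0.5%).
print(noisy_weighted_value(100.0, 1.10, w_un=20.0))
```

The second call shows why weighted data are distorted far less: the same multiplier changes the weighted contribution by M – 1 units out of w, not out of 1.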


Noise for Weighted Data: Commodity Flow Survey (CFS)

Measures flow of goods via transport system in U.S.

Estimates volume and value of each commodity shipped: by origin, destination, modes of transport

Used for transport modeling, planning, ...

Some users have objected to disclosure suppressions


Effect of Noise on High Level Aggregate Cells

CFS Table: National 2-Digit Commodity

Data Quality Measure: 43 cells; 0 are sensitive

41 cells change by [0 - 1] %

2 cells change by [1 - 2] %
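The data quality measure used here and in the charts that follow is the absolute percent change of each cell, counted in 1-point bins. A minimal sketch, assuming non-zero original cell values and using hypothetical function names and sample cells:

```python
import math
from collections import Counter


def percent_change(original, noisy):
    """Absolute percent change of a cell value (original must be non-zero)."""
    return 100.0 * abs(noisy - original) / original


def bin_label(pct):
    """Map a percent change to a bin label: [0-1], (1-2], (2-3], ..."""
    if pct <= 1.0:
        return "[0-1]"
    lower = math.ceil(pct) - 1      # bins are closed on the right
    return f"({lower}-{lower + 1}]"


def change_histogram(cell_pairs):
    """Count cells per bin; cell_pairs is a list of (original, noisy)."""
    return Counter(bin_label(percent_change(o, n)) for o, n in cell_pairs)


# Illustrative cells: changes of 0.5%, 1.5% and 0%.
print(change_histogram([(100.0, 100.5), (200.0, 203.0), (50.0, 50.0)]))
```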


CFS Test Table

(Origin State by Destination State by 2 digit Commodity)

61,174 cells of which 230 are sensitive

Data Quality and Protection Assessments

(following slides)


CFS Noise Results: Data Quality Assessment

While some cells may receive large doses of noise, the vast majority get less than 1% or 2%

[Figure: NON-SENSITIVE CELLS. Bar chart of Percent of Cells (axis 0 to 80) by Percent Change Interval, in 1-point bins from [0-1] to (14-15]]


CFS Random Noise: Protection Assessment

Most sensitive cells receive significant noise, i.e. 5% to 11%

Only 2 out of 230 sensitive cells do not receive full protection from noise, as measured by Protection Multipliers (PM)

[Figure: SENSITIVE CELLS. Bar chart of Percent of Cells (axis 0 to 30) by Percent Change Interval, in 1-point bins from [0-1] to (14-15]]


Noise for Unweighted Data: Non-Employers Statistics

Special Features of Microdata

Unweighted administrative data

Only 1 variable to protect: receipts

Many small integers (after rounding to $1000)

Special Features of Key Table

Many cells have a small number of contributors; these include many safe cells

Many sensitive cells with only 1 or 2 contributors


NE Noise Results: Data Quality Assessment

Lack of weights results in much more distortion to non-sensitive cells than occurs for CFS

[Figure: NON-SENSITIVE CELLS. Bar chart of Percent of Cells (axis 0 to 30) by Percent Change Interval, in 1-point bins from [0-1] to (14-15]]


NE Noise Results: Protection Assessment

The percent-change chart resembles the noise factor distribution, due to the prevalence of 1-respondent cells in the NE test table and the absence of weights

[Figure: SENSITIVE CELLS. Bar chart of Percent of Cells (axis 0 to 20) by Percent Change Interval, in 1-point bins from [0-1] to (14-15]]


Noise Balancing

Is there a way to improve data quality in this situation?

Yes, if one can focus on one key table T

Idea: balance noise at each cell in ‘balancing sub-table B of T ’ (defn: every micro value is in at most one cell of B)

Choose noise directions to maximize noise cancellation for each cell of B
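The slides do not say how the noise directions are chosen, so the following is only one plausible sketch: a greedy rule that, within a single cell of B, processes records from largest to smallest and always picks the direction that pulls the running net noise back toward zero. The greedy rule, the uniform noise size between a and b, and all names are assumptions made for illustration.

```python
import random


def balanced_multipliers(values, a=1.05, b=1.15, rng=random):
    """Assign a multiplier to each record of ONE balancing cell.

    Greedy sketch (an assumption, not the authors' algorithm): largest
    records first, each pushed opposite to the current net noise.
    """
    order = sorted(range(len(values)), key=lambda i: -values[i])
    multipliers = [1.0] * len(values)
    net = 0.0                                  # running sum of (Y - X)
    for i in order:
        size = rng.uniform(a, b) - 1.0         # noise size, e.g. 5% to 15%
        if net > 0:
            sign = -1                          # cell is too high: deflate
        elif net < 0:
            sign = 1                           # cell is too low: inflate
        else:
            sign = rng.choice([-1, 1])         # first record: random direction
        multipliers[i] = 1.0 + sign * size
        net += sign * size * values[i]
    return multipliers


# Illustrative cell with eight contributors.
rng = random.Random(7)
cell = [40.0, 35.0, 30.0, 25.0, 20.0, 10.0, 5.0, 5.0]
ms = balanced_multipliers(cell, rng=rng)
noisy_total = sum(x * m for x, m in zip(cell, ms))
print(sum(cell), noisy_total)
```

Every record still receives noise of full size, so a single-contributor cell is perturbed exactly as under random noise, while the net change of a many-contributor cell stays no larger than the noise on its largest record.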


Noise Balancing: Supportive NE Characteristics

Balancing works especially well for NE because a high % of microdata is single unit

After balancing interior cells, need to check noise effect on aggregate cells in same table

Also need to check noise effect in higher and lower tables; these we call “trickle up” and “trickle down” effects

For NE, there are few of these other tables; this makes the balancing decision easier


NE – Balanced Noise: Data Quality Assessment

Vast improvement in data quality

Resembles that of weighted data in CFS

[Figure: NON-SENSITIVE CELLS. Bar chart of Percent of Cells (axis 0 to 80) by Percent Change Interval, in 1-point bins from [0-1] to (14-15]]


NE – Balanced Noise: Protection Assessment

Very similar to Random Noise application

91.7% of sensitive cells fully protected

[Figure: SENSITIVE CELLS. Bar chart of Percent of Cells (axis 0 to 20) by Percent Change Interval, in 1-point bins from [0-1] to (14-15]]


Random Noise vs. Balanced Noise: Non-Employer Test Data

Data Quality is greatly improved

Protection Level is not significantly reduced

Thus Balanced Noise is a Good Choice Here

Percent Fully Protected ( PM >= 1 )

Random 92.14%

Balanced 91.70%

PM density curves on [0,1] are nearly identical for 2 methods


Conclusions


1. EZS Noise is a useful method for protecting tables from a variety of economic programs

2. There are now several variations of the basic EZS method; which is best for a survey depends on both microdata and table characteristics


Future Research

1. Should some sensitive cells be suppressed; high noise cells flagged?

2. How to handle multiple variables?

3. What is the most that users can be told about noise process without compromising data protection?

4. How to handle company dynamics (births, deaths, mergers, …)?

5. How to coordinate survey protection?
