October 2011

Linda Fardell, Cross Portfolio Data Integration Secretariat

The secret lives of us: data confidentiality

What is it & why should you care?

• It’s about obligations – legal/ethical

• Aim – protect identity and release useful data

• It’s more than removing name & address

• Trust of providers is essential to get good stats

Information is power

• A banker in Maryland obtained a list of patients with cancer,

• compared it with a list of clients with outstanding loans, and

• called in the loans of the clients with cancer.

Source: Data confidentiality: a review of methods for statistical disclosure limitation and methods for assessing privacy, Statist. Surv. Volume 5 (2011), 1-29.

Legislative obligations

• Privacy Act

• Specific legislation governing collection & use of information, e.g.
  • Social Security (Administration) Act 1999
  • Taxation Administration Act 1953

Other obligations

• Principles-based obligations, e.g. High Level Principles for Data Integration Involving Commonwealth Data for Statistical and Research Purposes

How agencies meet these obligations

• Implement procedures to address all aspects of data protection

• To ensure that identifiable information:
  • is not released publicly;
  • is available on a ‘need to know’ basis;
  • can’t be derived from disseminated data; and
  • is maintained and accessed securely.

Understand your obligations

Establish policies and procedures

De-identify the data

Assess potential identification risks

Manage the risks of identification - confidentialise

Test and evaluate to mitigate risks

Provide safe access to data

Managing identification risk

Access to other information

• Keep track of all information released from the dataset.

When should a cell be confidentialised?

• Common confidentiality rules:
  • frequency (threshold) rule
  • cell dominance (cell concentration) rule

• Keep specific confidentiality procedures secret (e.g. the particular value chosen when applying the threshold rule)

Two general methods

• Data reduction

• Data modification (perturbation)

Example: frequency rule with a threshold of 5

Age      Income
         Low    Med    High   Total
15–19    20     0      0      20
20–29    14     11     8      33
30–39    8      12     7      27
40–49    6      18     24     48
50–59    4      5      14     23
60+      12     9      7      28
Total    64     55     60     179

Before

Example: cont.

Age      Income
         Low    Med    High   Total
15–19    20     0      0      20
20–29    14     11     8      33
30–39    8      12     7      27
40–49    n.p.   n.p.   24     48
50–59    n.p.   n.p.   14     23
60+      12     9      7      28
Total    64     55     60     179

After
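The frequency rule in this example can be sketched in Python. This is a minimal illustration under stated assumptions, not the procedure any agency actually uses: the names `unsafe_cells` and `recover_single_suppressed` are hypothetical, and treating zero counts as safe is an assumption based on the zeros in the 15–19 row staying published. The sketch flags cells below the threshold, then shows why withholding only the flagged cell would not be enough while the row total is published.

```python
# Counts from the "Before" table: rows are age groups, columns are income bands.
COLUMNS = ["Low", "Med", "High"]
TABLE = {
    "15-19": [20, 0, 0],
    "20-29": [14, 11, 8],
    "30-39": [8, 12, 7],
    "40-49": [6, 18, 24],
    "50-59": [4, 5, 14],
    "60+": [12, 9, 7],
}


def unsafe_cells(table, threshold):
    """Return (row, column) pairs whose count is non-zero but below the threshold.

    Assumption: zero cells are treated as safe, as in the slide's example.
    """
    return [
        (row, COLUMNS[i])
        for row, counts in table.items()
        for i, count in enumerate(counts)
        if 0 < count < threshold
    ]


def recover_single_suppressed(published, row_total):
    """Recover the one suppressed cell (None) in a row from the published total."""
    known = sum(c for c in published if c is not None)
    return row_total - known


flagged = unsafe_cells(TABLE, threshold=5)
print(flagged)  # [('50-59', 'Low')]

# If only the flagged cell were withheld, simple subtraction would reveal it.
print(recover_single_suppressed([None, 5, 14], row_total=23))  # 4
```

Because the withheld value can be recovered from the row and column totals, further cells in the same rows and columns also have to be withheld (often called secondary or consequential suppression).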

Alternative: concealing totals

Age      Income
         Low    Medium   High   Total
15–19    20     0        0      20
20–29    14     11       8      33
30–39    8      12       7      27
40–49    6      18       24     48
50–59    n.p.   5        14     >19
60+      12     9        7      28
Total    >60    55       60     >175
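The concealed totals in this table appear to be the sums of the cells that remain published, shown with a ">" prefix. A minimal sketch of that arithmetic follows; the function name `published_total` is hypothetical, and the ">" only holds on the assumption that every suppressed count is at least 1.

```python
def published_total(cells):
    """Sum the published cells; prefix '>' when any cell is suppressed (None).

    Assumption: a suppressed cell holds a count of at least 1, so the true
    total is strictly greater than the sum of the visible cells.
    """
    visible = sum(c for c in cells if c is not None)
    suppressed = any(c is None for c in cells)
    return f">{visible}" if suppressed else str(visible)


print(published_total([None, 5, 14]))              # >19  (50-59 row)
print(published_total([20, 14, 8, 6, None, 12]))   # >60  (Low column)
print(published_total([14, 11, 8]))                # 33   (20-29 row, nothing hidden)
```

Concealing the affected totals removes the need to withhold additional interior cells, at the cost of less precise marginal totals.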

E.g. 2 – the cell dominance (n,k) rule

Widget brand   Profit ($m)
A              150
B              93
C              21
D              13
E              8
F              8
G              6
H              1
Total          300

• Cell unsafe if combined contributions of the ‘n’ largest members of the cell represent more than ‘k’% of the total value of the cell

• n & k values are set by data custodian

• Example: (2, 75) rule
  • A & B contribute 81% of total profit, so profit needs protecting
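The (n, k) check described above can be sketched as follows. The function name `is_dominated` is hypothetical, and returning `False` for a non-positive cell total is an assumption made only to keep the sketch safe to run; real n and k values are chosen by the data custodian.

```python
def is_dominated(contributions, n, k):
    """Return True if the n largest contributors exceed k% of the cell total."""
    total = sum(contributions)
    if total <= 0:
        return False  # assumption: an empty or zero cell is not treated as dominated
    top_n = sum(sorted(contributions, reverse=True)[:n])
    return 100 * top_n / total > k


# Profit ($m) by widget brand, from the slide.
profits = {"A": 150, "B": 93, "C": 21, "D": 13, "E": 8, "F": 8, "G": 6, "H": 1}

# (2, 75) rule: A and B together hold 243 of 300, i.e. 81%.
print(is_dominated(list(profits.values()), n=2, k=75))  # True
```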

Data modification methods

Age      Income
         Low    Med    High   Total
15–19    20     0      0      20
20–29    14     11     8      33
30–39    8      12     7      27
40–49    6      18     24     48
50–59    4      5      14     23
60+      12     9      7      28
Total    64     55     60     179

Before rounding (RR3)

Data modification methods

Age      Income
         Low    Med    High   Total
15–19    21     0      0      21
20–29    15     12     9      33
30–39    9      12     6      27
40–49    6      18     24     48
50–59    3      6      15     24
60+      12     9      6      27
Total    63     54     60     180

After rounding (RR3)
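RR3 is taken here to mean random rounding to base 3: every count is replaced by a neighbouring multiple of 3, and counts that are already multiples of 3 stay unchanged. The sketch below is one common unbiased variant, not necessarily the scheme used to produce the slide. The function name `random_round` is hypothetical, and the rounding probabilities are an assumption, since the slide does not state them.

```python
import random


def random_round(value, base=3, rng=random):
    """Randomly round a non-negative count to an adjacent multiple of `base`.

    Assumption: round up with probability remainder/base, so the expected
    value of the rounded count equals the original count.
    """
    remainder = value % base
    if remainder == 0:
        return value  # already a multiple of the base: left unchanged
    lower = value - remainder
    if rng.random() < remainder / base:
        return lower + base
    return lower


rng = random.Random(2011)  # fixed seed so the sketch is repeatable
low_column = [20, 14, 8, 6, 4, 12]
rounded = [random_round(v, base=3, rng=rng) for v in low_column]
print(rounded)  # each value is a multiple of 3 within 2 of the original
```

Cells and totals are rounded independently, so a rounded table need not add up. In the slide's rounded table, for instance, the Low column cells sum to 66 while the published column total is 63.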

Microdata

• Valuable resource

• 2 key types of disclosure risk:

1. spontaneous recognition

2. deliberate (malicious) attempt

Microdata – managing risks

• confidentialising

• deterrents

• restricting access

• educating data users about their obligations

• safe environment for access

Microdata – methods to assess risks

• cross-tabulation of variables;

• comparing sample data with population data to see if the unique characteristics in the sample are unique in the population; and

• acquiring knowledge of other datasets & publicly available information that could be used for list matching.
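The cross-tabulation check can be illustrated by counting how many records share each combination of characteristics. This is a minimal sketch: the function name `unique_records`, the field names and the four sample records are all invented for illustration. A record that is unique in the sample would still need to be compared with population data before concluding it is unique in the population.

```python
from collections import Counter


def unique_records(records, keys):
    """Return the records whose combination of `keys` values occurs only once."""
    def combo(record):
        return tuple(record[k] for k in keys)

    counts = Counter(combo(r) for r in records)
    return [r for r in records if counts[combo(r)] == 1]


# Invented sample microdata with direct identifiers already removed.
sample = [
    {"age_group": "30-39", "sex": "F", "postcode": "2600"},
    {"age_group": "30-39", "sex": "F", "postcode": "2600"},
    {"age_group": "60+", "sex": "M", "postcode": "2600"},
    {"age_group": "20-29", "sex": "M", "postcode": "2612"},
]

risky = unique_records(sample, ["age_group", "sex", "postcode"])
print(len(risky))  # 2
```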

Protecting microdata

• 1st level of protection: remove direct identifiers

• Common ways to protect microdata are:

1. confidentialising; and/or

2. restricting access to the file

Confidentialising microdata

• Same principles as protecting aggregate data:

• limit variables

• introduce small amounts of random error (e.g. data swapping)

• combine categories (e.g. age in 5 year ranges)

• top/bottom code

• suppress particular values/records that can’t otherwise be protected.
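Two of the techniques listed above, combining categories and top-coding, can be sketched for an age variable. The function name `age_band` and the top-code cut-off of 85 are assumptions chosen for illustration. In practice the cut-off depends on how rare the extreme values are in the data.

```python
def age_band(age, width=5, top_code=85):
    """Map a single year of age to a range, collapsing high ages into one group.

    Assumptions: 5-year ranges as in the bullet above; the cut-off of 85
    is illustrative only.
    """
    if age >= top_code:
        return f"{top_code}+"  # top-coding: rare high ages share one category
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


print(age_band(37))  # 35-39
print(age_band(84))  # 80-84
print(age_band(92))  # 85+
```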

Restricting access to microdata

What affects the risk of identification?

• motivation

• level of detail

• presence of rare characteristics

• accuracy of the data

• age of the data

• coverage of the data (completeness)

• presence of other information

A note on terminology…

• Confusion between de-identification and confidentialisation

More information – www.nss.gov.au

http://nss.gov.au/nss/home.nsf/pages/Data+Integration+Landing+Page?OpenDocument
