Sharing Sensitive Data Securely

© 2014 MapR Technologies

Managing Security – Sharing Data
Ted Dunning

Agenda
• Two kinds of security failure
  – Buried treasure
• But what could go wrong?
  – Horror stories
• Sharing into controlled environments
  – Views, masking and fine-grained control
• Sharing without sharing
  – When masking is not sufficient
• Summary

Locked Up Tight – The Cheapside Hoard
• Between 1640 and 1666 somebody hid a cache of jewels under the floor of 30-32 Cheapside Road
• They never came back for them …
• The hoard was found by workmen in 1912
• Did the owners forget where they were?
• Why didn’t their heirs or partners recover them?

The Other Kind of Security Failure
• Security can fail when there is a leak
  – Enigma decryption
  – Retail data compromise
  – Klaus Fuchs
• Security also fails when data is not shared
  – AKA siloing
  – The many threads of 9/11
  – The Cheapside hoard
  – Invisible technological opportunity cost

Netflix
• Shared anonymized data
• Huge boost in state of the art for some kinds of recommendations
• Anonymization shown to be weak barrier
• Lawsuit, security clamp-down everywhere

Reference Data Attack

The Moral

• If there is something to correlate, anonymization may fail

• When I say “may”, you should read “will”
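
A reference data attack can be sketched in a few lines. This is only an illustrative toy, not the method used against the Netflix data: the record layout, the names and the "most shared (item, day) pairs wins" rule are all assumptions made for the example.

```python
from collections import Counter

def link(anonymized, reference):
    """Guess a name for each anonymous id by counting shared (item, day) pairs.

    anonymized: iterable of (anon_id, item, day)  -- the "masked" release
    reference:  iterable of (name, item, day)     -- public data with names
    Both layouts are hypothetical and chosen only for this sketch.
    """
    # Index the public reference data by the fields both datasets share.
    index = {}
    for name, item, day in reference:
        index.setdefault((item, day), set()).add(name)

    # Every shared (item, day) pair is one vote for that name.
    votes = {}
    for anon_id, item, day in anonymized:
        for name in index.get((item, day), ()):
            votes.setdefault(anon_id, Counter())[name] += 1

    # Keep the best-supported name for each anonymous id.
    return {anon_id: c.most_common(1)[0][0] for anon_id, c in votes.items()}

# Made-up data: ids 17 and 42 carry no names in the release.
anonymized = [
    (17, "Film A", "2005-03-01"), (17, "Film B", "2005-03-04"),
    (42, "Film A", "2005-03-01"), (42, "Film C", "2005-04-10"),
    (42, "Film D", "2005-04-11"),
]
reference = [
    ("alice", "Film A", "2005-03-01"), ("alice", "Film B", "2005-03-04"),
    ("bob", "Film C", "2005-04-10"), ("bob", "Film D", "2005-04-11"),
]
matches = link(anonymized, reference)  # {17: "alice", 42: "bob"}
```

No identifier was ever published here; the overlap in ordinary fields was enough to put names back on the records.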

NY Cab
• Hack license and medallion number hashed using MD5
• No correlation data to work with
• But cab (medallion) numbers have only a few forms
• So we can generate hashes for all 20 million (or so) medallions
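
The enumeration idea can be sketched as below. The identifier pattern used here (digit, letter, digit, digit) is an assumption picked to keep the example small; the slide only says the numbers take "a few forms", and a real attack would enumerate every form.

```python
import hashlib
from itertools import product
from string import ascii_uppercase, digits

def build_lookup():
    """Map MD5 hex digest -> plaintext for every id of one assumed form.

    Assumed form (illustration only): digit, letter, digit, digit.
    """
    table = {}
    for d1, letter, d2, d3 in product(digits, ascii_uppercase, digits, digits):
        plain = f"{d1}{letter}{d2}{d3}"
        table[hashlib.md5(plain.encode("ascii")).hexdigest()] = plain
    return table

lookup = build_lookup()          # 10 * 26 * 10 * 10 = 26,000 candidates

# A "masked" value as it might appear in a released file.
masked = hashlib.md5(b"7B82").hexdigest()
recovered = lookup[masked]       # the unsalted hash is undone by a dict lookup
```

Because the input space is tiny and the hash is unsalted, the one-way function gives no protection: the whole table builds in well under a second.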

So What?
• What correlations are there?
• NYC medallions are public information anyway
• Taxis operate in the public realm

Paparazzo + Timestamp + Taxi = Who and Where

See
http://gawker.com/the-public-nyc-taxicab-database-that-accidentally-track-1646724546
http://research.neustar.biz/2014/09/15/riding-with-the-stars-passenger-privacy-in-the-nyc-taxicab-dataset/

Extended Moral
• Correlations are more common than we thought
• Masking PII is not sufficient for public datasets
• Theoretically, no solution is possible
• Pragmatically, never bet against cleverness
• Must change the game

Alternative Strategies

Public disclosure + Simple masking

Key Elements of Masking
• Opaque or format preserving?
• Random or reversible or one-way?
• Simple omission?
• Right to be forgotten?
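
These choices can be made concrete with a rough sketch. None of it comes from the talk: the function and class names are hypothetical, and the format-keeping function is a naive digit shuffle rather than real format-preserving encryption.

```python
import hashlib
import hmac
import secrets

def mask_one_way(value: str, key: bytes) -> str:
    """Opaque, one-way: keyed hash. Without the key the inputs cannot be enumerated."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()

class TokenVault:
    """Random but reversible: the data owner keeps the lookup table."""

    def __init__(self):
        self._forward = {}
        self._reverse = {}

    def mask(self, value: str) -> str:
        if value not in self._forward:
            token = secrets.token_hex(8)
            self._forward[value] = token
            self._reverse[token] = value
        return self._forward[value]

    def unmask(self, token: str) -> str:
        return self._reverse[token]

def mask_keep_format(value: str) -> str:
    """Format preserving (naive): random digits, punctuation and length unchanged."""
    return "".join(
        secrets.choice("0123456789") if ch.isdigit() else ch for ch in value
    )

key = secrets.token_bytes(32)
vault = TokenVault()
token = vault.mask("4111-1111")
shaped = mask_keep_format("4111-1111")      # e.g. "5820-3379", varies per call
digest = mask_one_way("4111-1111", key)
```

In this sketch, a "right to be forgotten" request would amount to deleting the vault entry, after which the token can no longer be reversed.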

Releasing Public Data
• Why?
  – Required
  – For research
  – For support
• How?
  – New technology based on KPI-preserving random data
• Three use cases

Secure Development is Hard

Outside collaborators are outside the security perimeter

They can’t see the data and they can’t tune new algorithms to fit reality

How To Make Realistic Data

Parametric Simulation

Parametric matching of failure signatures allows emulation of complex data properties

Matching on KPIs and failure modes guarantees practical fidelity

The Method
• Pick realistic and important KPIs and failure measures
  – False positive rate
  – Scale invariant score distribution
  – Internal performance metrics (# of candidates searched, similar)
• Build emulation roughly based on real system
• Tune data spec to match KPIs using real models
• Export data spec to alternative models
• Re-tune data spec to match on alternative models
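
The tuning step might look like the loop below. Everything in it is an assumption for illustration: one knob (a skew exponent), one KPI (share of records on the most common key), a made-up target of 0.30 standing in for a value measured on the real system, and a plain grid search.

```python
import random
from collections import Counter

def generate(skew, n=5000, keys=50, seed=0):
    """Synthetic key column; larger skew concentrates records on low-numbered keys."""
    rng = random.Random(seed)
    weights = [1.0 / (k + 1) ** skew for k in range(keys)]
    return rng.choices(range(keys), weights=weights, k=n)

def kpi_top_key_share(data):
    """Example KPI: fraction of records that fall on the most common key."""
    return Counter(data).most_common(1)[0][1] / len(data)

def tune(target, candidates):
    """Pick the knob setting whose synthetic KPI lands closest to the target."""
    return min(
        candidates,
        key=lambda s: abs(kpi_top_key_share(generate(s)) - target),
    )

target = 0.30                                 # assumed, as if measured on the real system
best_skew = tune(target, [i / 10 for i in range(0, 21)])
achieved = kpi_top_key_share(generate(best_skew))
```

Only the KPI value crosses the security boundary; the data used for tuning is synthetic throughout.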

Example #1 – Query failure
• Performance index is query failure with particular stack signature
• Tuning knobs include
  – Table sizes
  – Data distributions
  – (potentially) field value realism
  – (potentially) field cross correlations

The Original Conversation
Them: Hive broke, fix it.
Us: Sure! Can I see the data?
Them: No.
Us: OK. Can I see the stack trace?
Them: No.
Us: Can I log in to the system?
Them: No.
Us: What do you want me to do?
Them: Fix it.

The Broken Query

A Simpler Example Schema

A Simpler Example

[
  {"name":"customer_id", "class":"id"},
  {"name":"name", "class":"name", "type":"first_last"},
  {"name":"street", "class":"address"},
  {"class":"flatten", "value": {"class":"zip", "fields":"city,state,zip"}}
]

[
  {"name":"sales_id", "class":"id"},
  {"name":"customer_id", "class":"foreign-key", "size":"$customers"},
  {"name":"time_id", "class":"foreign-key", "size":"$times"},
  {"name":"store_id", "class":"foreign-key", "size":"$stores"},
  {"name":"item_id", "class":"foreign-key", "size":"$items"},
  {"name":"quantity", "class":"int", "skew":0.5},
  {"name":"unit_price", "class":"gamma", "dof":1, "scale":10},
  {"name":"discount", "class":"uniform", "min":0, "max":20},
  {"name":"exact_time", "class":"event", "start": "2014-01-01", "format":"yyyy-MM-dd HH:mm:ss", "rate": "10/d"}
]
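
To show how a spec like this can drive a generator, here is a toy interpreter for a reduced version of the sales schema. It is not the tool used in the talk: the handling of each class is a guess (sequential ids, uniform foreign keys, "dof" read as the gamma shape parameter), and the "$customers" placeholder is replaced by a literal size of 100.

```python
import json
import random

# Reduced spec; "size": 100 stands in for the "$customers" placeholder.
SPEC = json.loads("""
[
  {"name": "sales_id", "class": "id"},
  {"name": "customer_id", "class": "foreign-key", "size": 100},
  {"name": "unit_price", "class": "gamma", "dof": 1, "scale": 10},
  {"name": "discount", "class": "uniform", "min": 0, "max": 20}
]
""")

def make_rows(spec, n, seed=0):
    """Generate n rows (as dicts) from the spec, using guessed class semantics."""
    rng = random.Random(seed)
    rows = []
    for i in range(n):
        row = {}
        for field in spec:
            name, kind = field["name"], field["class"]
            if kind == "id":
                row[name] = i                               # sequential key
            elif kind == "foreign-key":
                row[name] = rng.randrange(field["size"])    # uniform, an assumption
            elif kind == "gamma":
                row[name] = rng.gammavariate(field["dof"], field["scale"])
            elif kind == "uniform":
                row[name] = rng.uniform(field["min"], field["max"])
            else:
                raise ValueError(f"unsupported class: {kind}")
        rows.append(row)
    return rows

rows = make_rows(SPEC, 5)
```

Changing table sizes or distribution parameters in the spec is then enough to explore the tuning knobs listed for Example #1.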

Data Flow

Sample Data

customer_id,name,street,zip,city,state
0,"Mark Long","8578 Pied River Flats","02630","BARNSTABLE","MA"
1,"Chris Lanier","90018 Lost Treasure Corner","06083","ENFIELD","CT"
2,"Bryant Brandon","30712 Bright Shadow Stroll","93922","CARMEL","CA"
3,"Norman Horn","66871 Dewy Bird Shoal","59727","DIVIDE","MT"
4,"Carmen Nowell","6053 Velvet Barn Glen","29329","CONVERSE","SC"

Results
• We had to match size, number of records, rough levels of skew
• Bug was in query planner
  – For particular values of relative table size, planner messed up
• Once we had the fault, we could slim down the tables
  – Final example had 3 tables, 1000 records in largest

Common Point of Compromise
• Scenario:
  – Merchant 0 is compromised, leaks account data during compromise
  – Fraud committed elsewhere during exploit
  – High background level of fraud
  – Limited detection rate for exploits
• Goal:
  – Find merchant 0
• Meta-goal:
  – Screen algorithms for this task without leaking sensitive data

Simulation Setup

Simulation Strategy
• For each consumer
  – Pick consumer parameters such as transaction rate, preferences
  – Generate transactions until end of sim-time
    • If merchant 0 during compromise time, possibly mark as compromised
    • For all transactions, possibly mark as fraud, probability depends on history
    • Merchants are selected using hierarchical Pitman-Yor
• Restate data
  – Flatten transaction streams
  – Sort by time
• Tunables
  – Compromise probability, transaction rates, background fraud, detection probability
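
A much simplified version of this strategy is sketched below. It departs from the slide in ways worth stating: a single shared Pitman-Yor process replaces the hierarchical one, consumer "preferences" are dropped, "history" is reduced to one compromised flag, and detection probability is omitted. All parameter names and default values are invented for the example.

```python
import random

class PitmanYor:
    """Chinese-restaurant form of a Pitman-Yor process; tables play the merchants."""

    def __init__(self, alpha, discount, rng):
        self.alpha, self.discount, self.rng = alpha, discount, rng
        self.counts = []          # visits per merchant so far
        self.total = 0

    def sample(self):
        k = len(self.counts)
        u = self.rng.random() * (self.alpha + self.total)
        self.total += 1
        new_mass = self.alpha + self.discount * k
        if u < new_mass:          # open a new merchant
            self.counts.append(1)
            return k
        u -= new_mass
        for i, c in enumerate(self.counts):
            u -= c - self.discount
            if u < 0:
                self.counts[i] += 1
                return i
        self.counts[-1] += 1      # guard against floating-point round-off
        return k - 1

def simulate(consumers=200, sim_days=100.0, compromise=(30.0, 40.0),
             p_compromise=0.5, p_background=0.001, p_exploit=0.05, seed=1):
    """Return time-sorted (time, consumer, merchant, is_fraud) tuples."""
    rng = random.Random(seed)
    merchants = PitmanYor(alpha=5.0, discount=0.3, rng=rng)
    events = []
    for consumer in range(consumers):
        rate = rng.uniform(0.2, 2.0)          # transactions per day
        compromised = False
        t = rng.expovariate(rate)
        while t < sim_days:
            m = merchants.sample()
            # Fraud odds depend on what happened to this consumer earlier.
            p_fraud = p_exploit if compromised else p_background
            events.append((t, consumer, m, rng.random() < p_fraud))
            if (m == 0 and compromise[0] <= t < compromise[1]
                    and rng.random() < p_compromise):
                compromised = True
            t += rng.expovariate(rate)
    events.sort()                 # flatten the per-consumer streams by time
    return events

events = simulate()
```

The keyword arguments of `simulate` correspond to the tunables listed on the slide, minus detection probability.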

Performance Indicators to Match
• User and merchant population
• Transaction count/consumer
• Merchant propensity skew
• Level of detected fraud
• Spectrum of meta-model scores

Real bad guys

Results
• We matched general mechanism, rough transaction rates
• Model was tuned on synthetic data, tested on live data
• We found real bad guys on the first try

Summary
• Security can fail through too much and too little access
• Sharing widely can have significant benefits and substantial risks
• New levels of control available for masking and filtering of big data via Drill views
• Synthetic data with KPI matching provides sharing of realistic data without risk

Questions

Thank You

Ted Dunning, Chief Application Architect
tdunning@mapr.com
tdunning@apache.org

@mapr, maprtech, MapRTechnologies, mapr-technologies
