© 2014 MapR Technologies
Managing Security – Sharing Data
Ted Dunning
Agenda
• Two kinds of security failure
  – Buried treasure
• But what could go wrong?
  – Horror stories
• Sharing into controlled environments
  – Views, masking and fine-grained control
• Sharing without sharing
  – When masking is not sufficient
• Summary
Locked Up Tight – The Cheapside Hoard
• Between 1640 and 1666 somebody hid a cache of jewels under the floor of 30-32 Cheapside
• They never came back for them …
• The hoard was found by workmen in 1912
• Did the owners forget where the jewels were?
• Why didn’t their heirs or partners recover them?
The Other Kind of Security Failure
• Security can fail when there is a leak
  – Enigma decryption
  – Retail data compromise
  – Klaus Fuchs
• Security also fails when data is not shared
  – AKA siloing
  – The many threads of 9/11
  – The Cheapside hoard
  – Invisible technological opportunity cost
Netflix
• Shared anonymized data
• Huge boost in state of the art for some kinds of recommendations
• Anonymization shown to be weak barrier
• Lawsuit, security clamp-down everywhere
Reference Data Attack
The Moral
• If there is something to correlate, anonymization may fail
• When I say “may”, you should read “will”
NY Cab
• Hack license and medallion number hashed using MD5
• No correlation data to work with
• But cab (medallion) numbers have only a few forms
• So we can generate hashes for all 20 million (or so) medallions
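The enumeration attack described above can be sketched in a few lines. This is a minimal illustration, not the code used in the actual incident: it assumes a single hypothetical ID format (one digit, one letter, two digits, such as "5X55"), and the helper names are made up for this example. Because the space of valid IDs is tiny, hashing every candidate once gives a lookup table that reverses the "anonymization" with no cryptanalysis at all.

```python
import hashlib
import itertools
import string


def md5_hex(text: str) -> str:
    """Unsalted MD5 of an ID, as a hex string."""
    return hashlib.md5(text.encode("ascii")).hexdigest()


def candidate_ids():
    """Yield every ID of the assumed form digit-letter-digit-digit (hypothetical format)."""
    for d1, letter, d2, d3 in itertools.product(
        string.digits, string.ascii_uppercase, string.digits, string.digits
    ):
        yield f"{d1}{letter}{d2}{d3}"


def build_lookup() -> dict:
    """Map each MD5 digest back to the ID that produced it."""
    return {md5_hex(candidate): candidate for candidate in candidate_ids()}


lookup = build_lookup()

# A "masked" value as it might appear in a released record.
hashed = md5_hex("7B42")

# Reversing the hash is a dictionary lookup.
recovered = lookup[hashed]
```

The same approach scales to tens of millions of candidates on a laptop, which is why an unsalted hash over a small, structured ID space offers little protection.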
So What?
• What correlations are there?
• NYC medallions are public information anyway
• Taxis operate in the public realm
So What?
Paparazzo + Timestamp + Taxi = Who and Where
See
http://gawker.com/the-public-nyc-taxicab-database-that-accidentally-track-1646724546
http://research.neustar.biz/2014/09/15/riding-with-the-stars-passenger-privacy-in-the-nyc-taxicab-dataset/
Extended Moral
• Correlations are more common than we thought
• Masking PII is not sufficient for public datasets
• Theoretically, no solution is possible
• Pragmatically, never bet against cleverness
• Must change the game
Alternative Strategies
Public disclosure + Simple masking
Key Elements of Masking
• Opaque or format preserving?
• Random or reversible or one-way?
• Simple omission?
• Right to be forgotten?
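The choices listed above can be made concrete with a small sketch. Everything here is an illustrative assumption rather than any product's API: `SECRET_KEY`, `mask_one_way`, `RandomMasker` and `mask_preserving_format` are hypothetical names, and the format-preserving function is a toy, not a standardized format-preserving encryption scheme. The sketch contrasts a keyed one-way mask, a random mask that is reversible only through a mapping table (deleting an entry is one way to honor a request to be forgotten), and a mask that keeps the shape of the input.

```python
import hashlib
import hmac
import secrets

# Assumption: a key known only to the data owner. Without it, the
# one-way mask cannot be rebuilt by enumerating candidate inputs.
SECRET_KEY = b"demo-key-not-for-production"


def mask_one_way(value: str) -> str:
    """Opaque, deterministic, one-way: equal inputs give equal tokens."""
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


class RandomMasker:
    """Opaque random tokens, reversible only via the stored mapping."""

    def __init__(self):
        self._forward = {}
        self._reverse = {}

    def mask(self, value: str) -> str:
        if value not in self._forward:
            token = secrets.token_hex(8)
            self._forward[value] = token
            self._reverse[token] = value
        return self._forward[value]

    def unmask(self, token: str) -> str:
        return self._reverse[token]

    def forget(self, value: str) -> None:
        """Drop the mapping so the token can no longer be reversed."""
        token = self._forward.pop(value, None)
        if token is not None:
            del self._reverse[token]


def mask_preserving_format(value: str) -> str:
    """Toy format-preserving mask: digits become other digits, everything else is kept."""
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).digest()
    out = []
    for i, ch in enumerate(value):
        if ch.isdigit():
            out.append(str(digest[i % len(digest)] % 10))
        else:
            out.append(ch)
    return "".join(out)
```

Simple omission needs no code at all: the field is dropped before release. Which option fits depends on whether downstream users need joins on the masked value (deterministic masks), round trips (reversible masks), or software that validates field shapes (format-preserving masks).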
Releasing Public Data
• Why?
  – Required
  – For research
  – For support
• How?
  – New technology based on KPI-preserving random data
• Three use cases
Secure Development is Hard
Outside collaborators are outside the security perimeter
They can’t see the data and they can’t tune new algorithms to fit reality
How To Make Realistic Data
Parametric Simulation
Parametric matching of failure signatures allows emulation of complex data properties
Matching on KPIs and failure modes guarantees practical fidelity
The Method
• Pick realistic and important KPIs and failure measures
  – False positive rate
  – Scale invariant score distribution
  – Internal performance metrics (number of candidates searched and similar)
• Build emulation roughly based on real system
• Tune data spec to match KPIs using real models
• Export data spec to alternative models
• Re-tune data spec to match on alternative models
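The "tune data spec to match KPIs" step can be illustrated with a deliberately tiny example. This is a sketch under stated assumptions, not the method's actual tooling: the data spec is reduced to one hypothetical knob (`noise_scale`), the "real model" is a stand-in threshold rule, the only KPI is the false positive rate, and the target value is invented. The tuner bisects on the knob until the synthetic data reproduces the target KPI.

```python
import random


def generate_scores(noise_scale: float, n: int = 20000, seed: int = 1):
    """Synthetic scores for legitimate events: Gaussian noise around zero."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, noise_scale) for _ in range(n)]


def false_positive_rate(scores, threshold: float = 1.0) -> float:
    """Stand-in model: flag any legitimate event scoring above the threshold."""
    return sum(s > threshold for s in scores) / len(scores)


def tune(target_fpr: float, lo: float = 0.1, hi: float = 5.0, steps: int = 30) -> float:
    """Bisect on noise_scale until the synthetic FPR matches the target.

    With a fixed seed the FPR is non-decreasing in noise_scale, so
    bisection converges.
    """
    for _ in range(steps):
        mid = (lo + hi) / 2
        if false_positive_rate(generate_scores(mid)) < target_fpr:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Hypothetical KPI as it might be measured on the real, private system.
TARGET_FPR = 0.05

tuned_scale = tune(TARGET_FPR)
achieved_fpr = false_positive_rate(generate_scores(tuned_scale))
```

In practice there are several knobs and several KPIs, so the search is multi-dimensional, but the loop has the same shape: generate, measure with the model, adjust the spec. Re-tuning against an alternative model amounts to swapping out `false_positive_rate` and running the loop again.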
Example #1 – Query failure
• Performance index is query failure with particular stack signature
• Tuning knobs include
  – Table sizes
  – Data distributions
  – (potentially) field value realism
  – (potentially) field cross correlations
The Original Conversation
Them: Hive broke, fix it.
Us: Sure! Can I see the data?
Them: No.
Us: OK. Can I see the stack trace?
Them: No.
Us: Can I log in to the system?
Them: No.
Us: What do you want me to do?
Them: Fix it.
The Broken Query
A Simpler Example Schema
A Simpler Example
[
  {"name":"customer_id", "class":"id"},
  {"name":"name", "class":"name", "type":"first_last"},
  {"name":"street", "class":"address"},
  {"class":"flatten", "value": {"class":"zip", "fields":"city,state,zip"}}
]

[
  {"name":"sales_id", "class":"id"},
  {"name":"customer_id", "class":"foreign-key", "size":"$customers"},
  {"name":"time_id", "class":"foreign-key", "size":"$times"},
  {"name":"store_id", "class":"foreign-key", "size":"$stores"},
  {"name":"item_id", "class":"foreign-key", "size":"$items"},
  {"name":"quantity", "class":"int", "skew":0.5},
  {"name":"unit_price", "class":"gamma", "dof":1, "scale":10},
  {"name":"discount", "class":"uniform", "min":0, "max":20},
  {"name":"exact_time", "class":"event", "start": "2014-01-01", "format":"yyyy-MM-dd HH:mm:ss", "rate": "10/d"}
]
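To show how a declarative schema like the one above can drive data generation, here is a minimal sketch of an interpreter for four of the field classes. It is not the generator used in the talk. The semantics are assumptions made for illustration: `foreign-key` draws uniformly from `0..size-1`, `dof` is treated as the gamma shape parameter, and the `"$customers"` template variable is replaced by a concrete size of 100. The `name`, `address`, `zip`, `int` with skew, and `event` classes are omitted.

```python
import json
import random

# A cut-down version of the sales schema, with a literal table size
# standing in for the "$customers" template variable (assumption).
SCHEMA = json.loads("""
[
  {"name": "sales_id", "class": "id"},
  {"name": "customer_id", "class": "foreign-key", "size": 100},
  {"name": "unit_price", "class": "gamma", "dof": 1, "scale": 10},
  {"name": "discount", "class": "uniform", "min": 0, "max": 20}
]
""")


def sample_field(field: dict, row_index: int, rng: random.Random):
    """Draw one value for a field according to its class (assumed semantics)."""
    kind = field["class"]
    if kind == "id":
        return row_index  # sequential primary key
    if kind == "foreign-key":
        return rng.randrange(field["size"])  # uniform reference into another table
    if kind == "gamma":
        return rng.gammavariate(field["dof"], field["scale"])
    if kind == "uniform":
        return rng.uniform(field["min"], field["max"])
    raise ValueError(f"unsupported field class: {kind}")


def generate(schema: list, n: int, seed: int = 0) -> list:
    """Generate n rows, each a dict keyed by field name."""
    rng = random.Random(seed)
    return [
        {field["name"]: sample_field(field, i, rng) for field in schema}
        for i in range(n)
    ]


rows = generate(SCHEMA, 50)
```

Keeping the schema as data means the tuning knobs (table sizes, distributions, skew) can be changed and shared without touching code, and without exposing any real record.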
Data Flow
Sample Data
customer_id,name,street,zip,city,state
0,"Mark Long","8578 Pied River Flats","02630","BARNSTABLE","MA"
1,"Chris Lanier","90018 Lost Treasure Corner","06083","ENFIELD","CT"
2,"Bryant Brandon","30712 Bright Shadow Stroll","93922","CARMEL","CA"
3,"Norman Horn","66871 Dewy Bird Shoal","59727","DIVIDE","MT"
4,"Carmen Nowell","6053 Velvet Barn Glen","29329","CONVERSE","SC"
Results
• We had to match size, number of records, rough levels of skew
• Bug was in query planner
  – For particular values of relative table size, planner messed up
• Once we had the fault, we could slim down the tables
  – Final example had 3 tables, 1000 records in largest
Common Point of Compromise
• Scenario:
  – Merchant 0 is compromised, leaks account data during compromise
  – Fraud committed elsewhere during exploit
  – High background level of fraud
  – Limited detection rate for exploits
• Goal:
  – Find merchant 0
• Meta-goal:
  – Screen algorithms for this task without leaking sensitive data
Simulation Setup
Simulation Strategy
• For each consumer
  – Pick consumer parameters such as transaction rate, preferences
  – Generate transactions until end of sim-time
    • If merchant 0 during compromise time, possibly mark as compromised
    • For all transactions, possibly mark as fraud, probability depends on history
    • Merchants are selected using hierarchical Pitman-Yor
• Restate data
  – Flatten transaction streams
  – Sort by time
• Tunables
  – Compromise probability, transaction rates, background fraud, detection probability
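The strategy above can be sketched as a small simulation. This is an illustration under explicit assumptions, not the simulator from the talk: it uses a single-level Pitman-Yor process shared by all consumers instead of the hierarchical version, and every parameter value and name (`simulate`, `p_compromise`, `exploit_fraud`, and so on) is hypothetical. The first merchant ever drawn gets ID 0 and plays the role of the compromised merchant.

```python
import random


class PitmanYor:
    """Single-level Pitman-Yor sampler over integer IDs (Chinese-restaurant form)."""

    def __init__(self, alpha: float, discount: float, rng: random.Random):
        self.alpha, self.discount, self.rng = alpha, discount, rng
        self.counts = []  # counts[k] = number of times ID k has been drawn
        self.total = 0

    def sample(self) -> int:
        k = len(self.counts)
        new_weight = self.alpha + self.discount * k
        u = self.rng.random() * (self.alpha + self.total)
        if u < new_weight:
            self.counts.append(1)  # a previously unseen ID
            self.total += 1
            return k
        u -= new_weight
        for i, c in enumerate(self.counts):
            u -= c - self.discount  # existing ID i has weight count - discount
            if u < 0:
                break
        self.counts[i] += 1  # falls through to the last ID on rounding error
        self.total += 1
        return i


def simulate(n_consumers=200, end_time=100.0, window=(40.0, 60.0),
             p_compromise=0.5, background_fraud=0.01, exploit_fraud=0.2,
             p_detect=0.5, seed=42):
    rng = random.Random(seed)
    merchants = PitmanYor(alpha=5.0, discount=0.5, rng=rng)
    transactions = []
    for consumer in range(n_consumers):
        rate = rng.uniform(0.1, 0.5)  # consumer parameter: transactions per time unit
        t, compromised = 0.0, False
        while True:
            t += rng.expovariate(rate)
            if t >= end_time:
                break
            merchant = merchants.sample()
            # Fraud probability depends on history up to this transaction.
            p_fraud = exploit_fraud if compromised else background_fraud
            fraud = rng.random() < p_fraud
            detected = fraud and rng.random() < p_detect
            # Visiting merchant 0 inside the window may compromise the account.
            if merchant == 0 and window[0] <= t <= window[1] and rng.random() < p_compromise:
                compromised = True
            transactions.append((t, consumer, merchant, fraud, detected))
    transactions.sort()  # restate: flatten the per-consumer streams, sort by time
    return transactions


txns = simulate()
```

The keyword arguments of `simulate` correspond to the tunables listed above, so they are the knobs one would adjust when matching the simulation to observed performance indicators.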
Performance Indicators to Match
• User and merchant population
• Transaction count/consumer
• Merchant propensity skew
• Level of detected fraud
• Spectrum of meta-model scores
Real bad guys
Results
• We matched general mechanism, rough transaction rates
• Model was tuned on synthetic data, tested on live data
• We found real bad guys on the first try
Summary
• Security can fail through too much and too little access
• Sharing widely can have significant benefits and substantial risks
• New levels of control available for masking and filtering of big data via Drill views
• Synthetic data with KPI matching provides sharing of realistic data without risk
Questions
Thank You
Ted Dunning, Chief Application Architect, MapR Technologies