PRIVACY USC CSCI430 Dr. Genevieve Bartlett USC/ISI

  • Slide 1
  • PRIVACY USC CSCI430 Dr. Genevieve Bartlett USC/ISI
  • Slide 2
  • Critter Lab Project http://steel.isi.edu/Projects/critter/install/index.html Windows 7 and 10.10: installer OK; Linux: try from source; Mac: results may be unpredictable. Set the proxy setting for your browser, surf the web, and tune out ;-)
  • Slide 3
  • Privacy The state or condition of being free from observation.
  • Slide 4
  • Privacy The state or condition of being free from observation. Not really possible today, at least not on the Internet.
  • Slide 5
  • Privacy The right of people to choose freely under what circumstances and to what extent they will reveal themselves, their attitude, and their behavior to others.
  • Slide 6
  • Privacy is not black and white Lots of grey areas and points for discussion. What seems private to you may not seem private to me. Three examples to start us off: HTTP cookies, Google Street View, Facebook
  • Slide 7
  • HTTP cookies: What are they? Cookies = small text files received from a server and stored on your machine (usually by your web browser). Purpose: HTTP is stateless, so cookies maintain state for the HTTP protocol, e.g. keeping the contents of your shopping cart while you browse a site.
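A minimal sketch of the state-keeping idea using Python's standard http.cookies module; the cookie name "cart" and its value format are invented for illustration:

```python
# Sketch: a cookie carries state across otherwise-stateless HTTP requests.
# The cookie name "cart" and its value format are invented for illustration.
from http.cookies import SimpleCookie

# Server side: the first response sets the cookie.
response = SimpleCookie()
response["cart"] = "item42|item17"            # current cart contents
response["cart"]["path"] = "/"
print(response.output())                      # emits a "Set-Cookie: cart=..." header

# Server side, next request: the browser sent the cookie back, so the
# server can recover the cart without keeping any session state itself.
returned = SimpleCookie("cart=item42|item17")
print(returned["cart"].value)                 # item42|item17
```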
  • Slide 8
  • HTTP cookies: 3rd-party cookies You visited your favorite site unicornsareawesome.com. unicornsareawesome.com pulls ads from lameads.com. You get a cookie from lameads.com, even though you never visited lameads.com. lameads.com can track your browsing habits every time you visit any page with ads from lameads.com, and those might be a lot of pages.
  • Slide 9
  • HTTP cookies: Grey Area? 3rd-party cookies allow ad servers to personalize your ads = more useful to you. Good! But: You choose to go to unicornsareawesome.com = OK with unicornsareawesome.com knowing how you use their site. Nowhere did you choose to let lameads.com monitor your browsing habits.
  • Slide 10
  • Short Discussion: Collusion: a tool to track these 3rd-party cookies. TED talk on Tracking the Trackers: http://www.ted.com/talks/gary_kovacs_tracking_the_trackers.html
  • Slide 11
  • Google Street View: What is it? Google cars drive around and take 360° panoramic pictures. Images are stitched together and can be browsed on the Internet.
  • Slide 12
  • Google Street View: Me
  • Slide 13
  • Google Street View: Lots to See
  • Slide 14
  • Google Street View: Grey Area Expectation of privacy? I'm in public, so I can expect people will see me. Expectations? Picture linked to location. Searchable. Widely available. Available for a long time to come.
  • Slide 15
  • Facebook: What is it? Social networking site Connect with friends Share pictures, interests (likes)
  • Slide 16
  • Facebook: Grey Area Who uses Facebook data, and how is it used? 4.7 million liked a page about health conditions or treatments. Insurance agents? 4.8 million shared information about dates of vacations. Burglars? 2.6 million discussed recreational use of alcohol. Employers?
  • Slide 17
  • Facebook: More Grey Security issues with Facebook. Confusion over privacy settings. Sudden changes in default privacy settings. Facebook tracks browsing habits, even if a user isn't logged in (third-party cookies). Facebook sells user information to ad agencies and behavioral trackers.
  • Slide 18
  • Slide 19
  • Why start with these examples? 3 examples: HTTP cookies, Google Street View, Facebook. Lots more everyday examples. Users gain benefits by sharing data. Tons of data generated, widely shared, accessible, and stored (for how long?). Are users really aware of how and by whom?
  • Slide 20
  • Today's Agenda Privacy; Privacy & Security; How do we safely share private data?; Privacy and Inferred Information; Privacy and Social Networks; How do we design a system with privacy in mind?
  • Slide 21
  • Privacy; Privacy & Security; How do we safely share private data?; Privacy and Inferred Information; Privacy and Social Networks; How do we design a system with privacy in mind?
  • Slide 22
  • Examples of private information Tons of information can be gained from Internet use: Behavior, e.g. Person X reads reddit.com at work. Preferences, e.g. Person Y likes high heel shoes and uses Apple products. Associations, e.g. Person X and Person Y are friends. PPI (private, personal/protected information): credit card #s, SSNs, nicknames, addresses. PII (personally identifying information), e.g. your age + your address = I know who you are, even if I'm not given your name.
  • Slide 23
  • How do we achieve privacy? policy + security mechanisms + law + ethics + trust Anonymity & anonymization mechanisms: make each user indistinguishable from the next; remove PPI & PII; aggregate information.
  • Slide 24
  • Who wants private info? Governments: surveillance. Businesses: targeted advertising, following trends. Attackers: monetize information or cause havoc. Researchers: medical, behavioral, social, computer.
  • Slide 25
  • Who has private info? You and me: end-users, customers, patients. Businesses: protecting mergers, product plans, investigations. Government & law enforcement: national security, criminal investigations.
  • Slide 26
  • Privacy and Security Security enables privacy. Data is only as safe as the system it's on. Sometimes security is at odds with privacy, e.g. security requires authentication, but privacy is achieved through anonymity; e.g. the TSA pat-down at the airport.
  • Slide 27
  • Privacy; Privacy & Security; How do we safely share private data?; Privacy and Inferred Information; Privacy and Social Networks; How do we design a system with privacy in mind?
  • Slide 28
  • Why do we want to share? Share existing data sets: research; companies buy data from each other, or check out each other's assets before mergers/buyouts. Start a new dataset: mutually beneficial relationships, i.e. share data with me and you can use this service.
  • Slide 29
  • Sharing everything? Easy, but what are the ramifications? Legal/policy may limit what can be shared/collected: IRBs (Institutional Review Boards), HITECH (Health Information Technology for Economic and Clinical Health Act) & HIPAA (Health Insurance Portability and Accountability Act). Future use and protection of data?
  • Slide 30
  • Mechanisms for limited sharing Remove the really sensitive stuff (sanitization): PPI & PII (private/personal and personally identifying information); without a crystal ball, this is hard. Anonymization: replace information to limit the ability to tie entities to meaningful identities. Aggregation: remove PII by only collecting/releasing statistics.
  • Slide 31
  • Anonymization Example Network trace: PAYLOAD
  • Slide 32
  • Anonymization Example Network trace: PAYLOAD All sorts of PII and PPI in there!
  • Slide 33
  • Anonymization Example Network trace: PAYLOAD Routing information: IP addresses, TCP flags/options, OS fingerprinting
  • Slide 34
  • Anonymization Example Network trace: PAYLOAD Remove IPs? Anonymize IPs?
  • Slide 35
  • Anonymization Example Network trace: PAYLOAD Removing IPs severely limits what you can do with the data. Replace with something identifying, but not the same data: IP1 = A, IP2 = B, etc.
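A small sketch of that replacement idea (IP1 = A, IP2 = B, ...); the trace format and label scheme are made up for illustration:

```python
# Sketch: replace each real address with a consistent pseudonym so flows can
# still be correlated with one another, but the original hosts are not named.
from itertools import count

_labels = count(1)
_pseudonyms = {}          # real IP -> stable pseudonym

def pseudonymize(ip: str) -> str:
    if ip not in _pseudonyms:
        _pseudonyms[ip] = f"host-{next(_labels)}"
    return _pseudonyms[ip]

trace = [("192.0.2.7", "198.51.100.5"), ("192.0.2.7", "203.0.113.9")]
sanitized = [(pseudonymize(src), pseudonymize(dst)) for src, dst in trace]
print(sanitized)   # [('host-1', 'host-2'), ('host-1', 'host-3')]
```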
  • Slide 36
  • Aggregation Examples "Fewer U.S. Households Have Debt, But Those Who Do Have More, Census Bureau Reports" 3 people in class got As on the final
  • Slide 37
  • Methods can be bad or good Just because someone uses aggregation or anonymization doesn't mean the data is safe. 87% of the population of the United States can be uniquely identified by gender, date of birth, and 5-digit zip code. Even if a dataset sanitizes names, if it includes zip, gender & birthdate, the data is not preserving privacy.
  • Slide 38
  • Formalizing anonymization for better privacy K-anonymity: "A release provides k-anonymity protection if the information for each person contained in the release cannot be distinguished from at least k-1 individuals whose information also appears in the release." L-diversity: each group contains at least L different values of the sensitive information; guards against the homogeneity attack. And others (which we won't cover): t-closeness, m-invariance, delta-presence...
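A toy sketch of how k and l could be measured for a release like the one in the example slides below; the column names and the four records are illustrative, not the slide's actual data:

```python
# Sketch: group records by the quasi-identifier columns, then take the
# smallest group size (k) and the least-diverse sensitive column (l).
from collections import defaultdict

records = [
    {"birth": "196*", "zip": "0019*", "gender": "Male",   "result": "A"},
    {"birth": "196*", "zip": "0019*", "gender": "Male",   "result": "B+"},
    {"birth": "197*", "zip": "0027*", "gender": "Female", "result": "C"},
    {"birth": "197*", "zip": "0027*", "gender": "Female", "result": "C"},
]

groups = defaultdict(list)
for r in records:
    groups[(r["birth"], r["zip"], r["gender"])].append(r["result"])

k = min(len(results) for results in groups.values())         # smallest group size
l = min(len(set(results)) for results in groups.values())    # least-diverse group
print(f"k = {k}, l = {l}")   # k = 2, l = 1 for this toy release
```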
  • Slide 39
  • Example: [table of individual records: Birth Year, Zip Code, Gender, Test Result]
  • Slide 40
  • Example: [same table; the Test Result column is marked SENSITIVE!]
  • Slide 41
  • Example: Anonymized data [table with Birth Year generalized to 196*/197*/198* and Zip Code generalized to 0019*/0029*/0027*/0035*]
  • Slide 42
  • Example: k-anonymity [same generalized table]
  • Slide 43
  • Example: k = ? [same generalized table]
  • Slide 44
  • Example: k=1 [same generalized table; the smallest group contains only one record]
  • Slide 45
  • Example: getting to k=2 [same generalized table] This k=1 group can be merged with another group.
  • Slide 46
  • Example: k=2 anonymized data [table with some zip codes further generalized to 002**] Anonymize so these groups can be merged (by removing an extra digit in the zip).
  • Slide 47
  • Example: k=2 anonymized data [merged table] After merging, the smallest group has size 2. This now meets k=2 anonymization.
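A sketch of that generalization step: masking one more zip digit makes two previously separate (k=1) groups identical. The zip values are illustrative:

```python
# Sketch of the merge step above: drop one more zip digit so two small
# groups fall into the same equivalence class.
def generalize_zip(zipcode: str, keep: int) -> str:
    """Keep the first `keep` digits and replace the rest with '*'."""
    return zipcode[:keep] + "*" * (len(zipcode) - keep)

print(generalize_zip("00296", 4), generalize_zip("00275", 4))  # 0029* 0027*  -> two groups
print(generalize_zip("00296", 3), generalize_zip("00275", 3))  # 002** 002**  -> one merged group
```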
  • Slide 48
  • Example: l-diversity [same k=2 table] k=2, l = ?
  • Slide 49
  • Example: l = 2 [same k=2 table]
  • Slide 50
  • Example: l = ? [same k=2 table]
  • Slide 51
  • Example: l = ? [modified table]
  • Slide 52
  • Example: l = 1 [modified table] How does l=1 affect privacy? Was l=2 better?
  • Slide 53
  • l-diversity Not always possible, e.g. gender: l can't ever be more than 2. Can be difficult to achieve: data is not always that diverse.
  • Slide 54
  • Differential privacy The presence/absence of a record in the database doesn't affect the result of the data release, or the effect is negligible. Whether your information is in the database or not, the data analysis result will not be affected. How to achieve this? Adding noise. The data release is no longer deterministic, but falls within an error range. The level of protection and accuracy is controlled by the level of noise added.
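A sketch of the noise-adding idea using the Laplace mechanism for a simple count query; the dataset, epsilon, and sensitivity values are illustrative:

```python
# Sketch: a differentially private count. Laplace noise with scale
# sensitivity/epsilon hides whether any single person is in the data.
import random

def dp_count(values, predicate, epsilon=0.5, sensitivity=1.0):
    true_count = sum(1 for v in values if predicate(v))
    # Difference of two exponentials ~ Laplace(scale = sensitivity / epsilon).
    lam = epsilon / sensitivity
    noise = random.expovariate(lam) - random.expovariate(lam)
    return true_count + noise

ages = [23, 31, 45, 52, 38, 29]
print(dp_count(ages, lambda a: a > 30))   # true answer is 4; output varies run to run
```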
  • Slide 55
  • Privacy; Privacy & Security; How do we safely share private data?; Privacy and Inferred Information; Privacy and Social Networks; How do we design a system with privacy in mind?
  • Slide 56
  • What is Inferred? Take 2 sources of information and correlate the data: X + Y = something private. Example: Google Street View + what my car looks like + where I live = you know where I was back in November.
  • Slide 57
  • Example: Netflix & IMDB Netflix Prize: released an anonymized dataset. Correlated with IMDB: undid the anonymization (University of Texas).
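A toy sketch of this kind of linkage attack: join a pseudonymized dataset with a public one on shared attributes. All records and field names are invented, not the actual Netflix or IMDB data:

```python
# Sketch: re-identify a pseudonym by matching (movie, date) pairs that
# appear in both a "sanitized" dataset and a public review site.
anonymized_ratings = [("user_83", "MovieX", "2005-03-12"),
                      ("user_83", "MovieY", "2005-04-02")]
public_reviews     = [("alice",   "MovieX", "2005-03-12"),
                      ("alice",   "MovieY", "2005-04-02")]

matches = {}
for pid, movie, date in anonymized_ratings:
    for name, m, d in public_reviews:
        if (movie, date) == (m, d):
            matches.setdefault(pid, set()).add(name)

print(matches)   # {'user_83': {'alice'}} -- the pseudonym is re-identified
```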
  • Slide 58
  • K-anonymity, l-diversity & inferred information High k and l values don't guard against inference.
  • Slide 59
  • Privacy; Privacy & Security; How do we safely share private data?; Privacy and Inferred Information; Privacy and Social Networks; How do we design a system with privacy in mind?
  • Slide 60
  • What is social networking data? Associations Not what you say, but whom you talk to (e.g. "OMG NEW BOYFRIEND")
  • Slide 61
  • Why is social data interesting? From a privacy point of view: guilt by association. E.g. governments are very interested: phone records (US), Facebook activity (Iran).
  • Slide 62
  • Computer Communication Computer communication = social network. What sites/servers you visit/use = information on your relationship with those sites/servers. Never mind the content: how often you visit and whom you visit may reveal a lot! You ↔ Unicornsareawesome.com
  • Slide 63
  • How do we provide privacy? Of course, encrypt content (payload)! But: network/transport layer = no encryption (for now). Anyone along the path can see source and destination, so now what?
  • Slide 64
  • Onion Routing General idea: bounce connection through a bunch of machines
  • Slide 65
  • Don't we bounce around already? Not actually what happens.
  • Slide 66
  • Don't we bounce around already? Closer to what actually happens.
  • Slide 67
  • Don't we bounce around already? Yes, we route packets through a series of routers, BUT this doesn't protect the privacy of who's talking to whom. Why? PAYLOAD
  • Slide 68
  • Don't we bounce around already? Yes, we route packets through a series of routers, BUT this doesn't protect the privacy of who's talking to whom. Why? Contains routing information. ENCRYPTED
  • Slide 69
  • Yes, we bounce, but: Everyone along the way can see src & dst. Routes are easy to figure out. Contains routing information = can't encrypt. Everyone along the path (routers and observers) can see who is talking to whom. ENCRYPTED
  • Slide 70
  • Onion routing saves us Each router only knows about the last/next hop. Routes are hard to figure out: they change frequently and are chosen by the source.
  • Slide 71
  • The Onion part of Onion Routing Layers of encryption: PAYLOAD wrapped in the last hop's key, then the second hop's key, then the first hop's key.
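A sketch of the layering only, using symmetric keys as stand-ins; real Tor negotiates per-hop keys and uses fixed-size cells, so this shows just the wrapping order, not the protocol:

```python
# Sketch: wrap the payload in the last hop's key first, then the middle
# hop's, then the first hop's. Fernet is a stand-in for Tor's real crypto.
from cryptography.fernet import Fernet

hop_keys = [Fernet.generate_key() for _ in range(3)]   # 1st, 2nd, 3rd hop

def build_onion(payload: bytes, keys) -> bytes:
    onion = payload
    for key in reversed(keys):       # innermost layer uses the last hop's key
        onion = Fernet(key).encrypt(onion)
    return onion

onion = build_onion(b"GET / HTTP/1.1 ...", hop_keys)
```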
  • Slide 72
  • Onion Routing Example: Tor You Unicornsareawesome.com
  • Slide 73
  • Onion Routing Example: Tor You Tor directory Get a list of Tor routers from the publicly known Tor directory: Tor router IPs + a public key for each router.
  • Slide 74
  • Onion Routing Example: Tor You Unicornsareawesome.com Tor Routers
  • Slide 75
  • Onion Routing Example: Tor You Unicornsareawesome.com Choose a set of Tor routers to use 1st 2nd 3rd
  • Slide 76
  • Onion Routing Example: Tor You Unicornsareawesome.com Packets are now encrypted with 3 keys 1st 2nd 3rd
  • Slide 77
  • Onion Routing Example: Tor You Unicornsareawesome.com 1st 2nd 3rd Source: YOU, Dest: 1st Tor router
  • Slide 78
  • Onion Routing Example: Tor You Unicornsareawesome.com 1st 2nd 3rd Decrypts 1st layer
  • Slide 79
  • Onion Routing Example: Tor You Unicornsareawesome.com 1st 2nd 3rd Source: 1st Tor router, Dest: 2nd Tor router
  • Slide 80
  • Onion Routing Example: Tor You Unicornsareawesome.com 1st 2nd 3rd Decrypts 2nd layer
  • Slide 81
  • Onion Routing Example: Tor You Unicornsareawesome.com 1st 2nd 3rd Source: 2nd Tor router, Dest: 3rd Tor router
  • Slide 82
  • Onion Routing Example: Tor You Unicornsareawesome.com 1st 2nd 3rd Decrypts last layer
  • Slide 83
  • Onion Routing Example: Tor You Unicornsareawesome.com 1st 2nd 3rd Original (unencrypted) packet sent to server. Source: 3rd Tor router, Dest: Unicornsareawesome.com
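A self-contained sketch of the walkthrough above: each relay strips exactly one layer with its own key and forwards the rest, so only the exit recovers the original request. Keys and payload are illustrative; real Tor cells also carry next-hop addressing:

```python
# Sketch of the hop-by-hop peeling shown in the slides above.
from cryptography.fernet import Fernet

keys = [Fernet.generate_key() for _ in range(3)]   # 1st, 2nd, 3rd relay keys

onion = b"GET / HTTP/1.1 Host: unicornsareawesome.com"
for k in reversed(keys):                           # client wraps: 3rd key innermost
    onion = Fernet(k).encrypt(onion)

for k in keys:                                     # relays peel in path order
    onion = Fernet(k).decrypt(onion)               # each relay removes one layer
print(onion)                                       # plaintext only after the 3rd (exit) relay
```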
  • Slide 84
  • What does our attacker see? Encrypted traffic from You, to 1st Tor router You
  • Slide 85
  • What does our attacker see? Other viewpoints? Not easily traceable to you. You
  • Slide 86
  • What does our attacker see? Global viewpoints? Very unlikely... but if so, trouble!
  • Slide 87
  • What does our attacker see? Also unlikely: an observer of both ends can perform end-to-end correlation.
  • Slide 88
  • Reliance on multiple users What would happen here if You were the only one using Tor? You
  • Slide 89
  • Side note: Tor is an overlay Tor routers are often just someone's regular machine. Traffic is still routed over regular routers too.
  • Slide 90
  • Onion Routing: Things to Note Not perfect, but pretty nifty. The end host (unicornsareawesome.com) does not need to know about the Tor protocol (good for wide usage and acceptance). Data is encrypted all the way to the last Tor router. If the end-to-end application (like HTTPS) is using encryption, the payload is doubly encrypted along the Tor route.
  • Slide 91
  • Privacy; Privacy & Security; How do we safely share private data?; Privacy and Inferred Information; Privacy and Social Networks; How do we design a system with privacy in mind?
  • Slide 92
  • Designing privacy-preserving systems Aim for the minimum amount of information needed to achieve goals. Think through how info can be gained and inferred; inferred is often a gotcha! x + y = something private, but x and y by themselves don't seem all that special. Think through how information can be gained: On the wire? Stored in logs? At a router? At an ISP?
  • Slide 93
  • Privacy and Stored Information Data is only as safe as the system. How long the data is stored affects privacy: longer term = bigger privacy risk (in general). Longer time frame, more data to correlate & infer. Longer opportunity for data theft. Increased chances of mistakes, lapsed security, etc.
  • Slide 94
  • Bringing it all together Example from current research at ISI: The Critter Project Critter@home is a continuously updated archive of content-rich network data, contributed by volunteer users. Data contributors join the Critter overlay whenever online, offering their data to interested researchers.
  • Slide 95
  • Critter: Why? Networking and cybersecurity research critically needs publicly available, fresh and diverse application-level data, for data mining and for validation. There are very few publicly available network traces that contain application-level data; those that exist are outdated or contain very specific data useful only to some researchers. Content-rich network data has enormous privacy risks for sharing, because it is rich with personal and private information (PPI) that Internet criminals can monetize, e.g., human names, social security numbers, phone numbers, usernames, passwords, credit card numbers, etc.
  • Slide 96
  • Critter: Architecture
  • Slide 97
  • Critter: Key designs Users can host their own data locally. A PPI-sanitization process replaces all personal and private information (PPI). Data is always stored and transmitted in an encrypted format. No human apart from the contributor will ever access the raw, PPI-sanitized data. Instead, researchers access data via a query system which only returns aggregate statistics. All contact with a contributor is at her discretion and is done via an anonymizing network where contributor identities are hidden both from researchers and from the Internet at large. Contributors (if they so desire) can have full, fine-grained control over their data at all times via policy settings.
  • Slide 98
  • Accessing collected data (1) A researcher submits a query via the public portal. (2) Critter clients connect and poll for new queries via an anonymizing network. (3) The researcher's stored query is sent to clients. (4) Patrol processes the query if the Query Policy permits, and returns encrypted results along with information on how a contributor wants its response aggregated. (5) Aggregated results are stored and can be retrieved.
  • Slide 99
  • http://steel.isi.edu/critter/examples.html
  • Slide 100
  • Querying Critter Data in beta version: http://steel.isi.edu/critter/examples.html UI interface: steel.isi.edu/critter Types of queries: Boolean (1 or 0, e.g. 5 users said yes), Histogram (e.g. 2 users said 3, 3 users said 1), Sum (e.g. user 1 says 1, user 2 says 3, answer is 4)
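A sketch of the three aggregation styles listed above over hypothetical per-contributor answers; the values and variable names are invented, and none of Critter's actual API or wire format is shown:

```python
# Aggregate per-contributor answers the three ways described above.
from collections import Counter

answers = [1, 0, 1, 3, 1]                      # one (illustrative) value per contributor

boolean_count = sum(1 for a in answers if a)   # "N users said yes"
histogram = Counter(answers)                   # "2 users said 3, 3 users said 1" style
total = sum(answers)                           # sum of all contributed values

print(boolean_count, dict(histogram), total)
```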