Lecture 17: 590.03 Fall 12 1
CompSci 590.03
Instructor: Ashwin Machanavajjhala
Publishing Search Query logs
Outline
• Uses of search query logs
• Privacy and search logs
• K-Anonymity
• Differentially private algorithms
Search Query Log
• <anonid, query, querytime, itemrank, clickurl>
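To make the five-field record concrete, here is a minimal Python sketch of one way to represent and parse it. The field names come from the slide; the tab delimiter, the sample values, and the choice to leave the click fields empty when no result was clicked are assumptions made only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LogRecord:
    anonid: str
    query: str
    querytime: str
    itemrank: Optional[int]  # None when no result was clicked (assumption)
    clickurl: Optional[str]  # None when no result was clicked (assumption)


def parse_record(line: str) -> LogRecord:
    """Parse one log line; the tab delimiter is an assumption of this sketch."""
    fields = line.rstrip("\n").split("\t")
    fields += [""] * (5 - len(fields))  # pad missing click fields
    anonid, query, querytime, itemrank, clickurl = fields[:5]
    return LogRecord(
        anonid,
        query,
        querytime,
        int(itemrank) if itemrank else None,
        clickurl or None,
    )


# Hypothetical sample lines, not taken from any real log.
record = parse_record("1234\tflu symptoms\t2006-03-01 10:15:00\t1\thttp://example.org")
no_click = parse_record("1234\tflu\t2006-03-01 10:16:00")
```

Note that even this tiny record ties every query to a persistent anonid, which is what the privacy slides later build on.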
Uses of search query logs
• Search result caching
• Query Recommendation
• Synonym identification
• Reranking search results
• Search advertising and Keyword popularity estimation
Uses of search query logs [Silvestri FnT ‘10]
Google Flu
“We've found that certain search terms are good indicators of flu activity. Google Flu Trends uses aggregated Google search data to estimate current flu activity around the world in near real-time.” http://www.google.org/flutrends/
• Predictions by Google Flu are 1-2 weeks ahead of CDC’s ILI (Influenza-like illness) surveillance reports
[Ginsberg Nature ‘09]
Google Flu
• Identify single ILI-related search queries that could most accurately model the CDC ILI visit percentages in 9 regions
– P: probability of an ILI-related physician visit in a region (based on CDC data)
– Q: ILI-related query fraction
• Pick the 45 highest-scoring queries, and fit a linear model to predict ILI visit rates.
[Ginsberg Nature ‘09]
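The slide does not spell out the linear model. Ginsberg et al. fit it on the log-odds scale, logit(P) = β0 + β1·logit(Q) + ε, and the sketch below assumes that form. The weekly values are made up for illustration, and the closed-form least-squares fit is a generic stand-in, not the authors' code.

```python
import math


def logit(p: float) -> float:
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def fit_line(xs, ys):
    """Ordinary least squares for y = b0 + b1 * x (closed form)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


def predict_visit_rate(q: float, b0: float, b1: float) -> float:
    """Map a query fraction Q to an estimated ILI visit probability P."""
    z = b0 + b1 * logit(q)
    return 1.0 / (1.0 + math.exp(-z))


# Made-up weekly data: ILI-related query fraction and CDC ILI visit rate.
query_fraction = [0.001, 0.002, 0.004, 0.008]
visit_rate = [0.010, 0.018, 0.033, 0.060]

b0, b1 = fit_line(
    [logit(q) for q in query_fraction],
    [logit(p) for p in visit_rate],
)
estimate = predict_visit_rate(0.003, b0, b1)
```

Fitting in log-odds space keeps every prediction inside (0, 1), which a plain linear fit on the raw rates would not guarantee.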
Google Flu [Ginsberg Nature ‘09]
Flu Trends
[Figure: Google Flu estimate compared with national data, for the U.S. and Australia]
Outline
• Uses of search query logs
• Privacy and search logs
• K-Anonymity
• Differentially private algorithms
Privacy and Search Logs [NYTimes 2006]
Sensitive Information
• Obtain sensitive information directly from queries (user1)
• Identifying users via demographic attributes (user2)
• Identifying users by following urls (user3)
• Identification leads to learning sensitive queries (user2)
[Chen et al FnT ‘09]
Challenges
• Not clear which queries are identifying and which queries are sensitive
• Users' queries are almost always unique
• Adversaries may launch active attacks
– Create many queries from different accounts to test whether some user searched for sensitive queries.
Outline
• Uses of search query logs
• Privacy and search logs
• Privacy-Enhancing Techniques
• Differentially private algorithms
Identifier Deletion
• Delete personally identifying information like IP addresses and cookies, names, social security numbers …
• Even if we remove age, gender, zip code from search logs, one can estimate these from the remaining log [Jones et al ‘07]
Hashing Queries
• Replace queries with hash values
• One can estimate the words based on co-occurrence analysis if token-based hashing schemes are used [Kumar et al WWW ‘07]
• Utility is lost …
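A small Python sketch of why token-based hashing leaks: each word is hashed on its own, so token frequencies and co-occurrences survive unchanged and can be matched against statistics from a public corpus. The salt, the 8-character truncation, and the sample queries are assumptions chosen for the demonstration.

```python
import hashlib
from collections import Counter
from itertools import combinations


def hash_token(token: str, salt: str = "demo-salt") -> str:
    """Hash a single word; salt and truncation length are illustrative."""
    digest = hashlib.sha256((salt + token).encode("utf-8")).hexdigest()
    return digest[:8]


def hash_query(query: str) -> str:
    """Token-based hashing: hash every word of the query independently."""
    return " ".join(hash_token(t) for t in query.lower().split())


queries = ["cheap flights", "cheap hotels", "cheap flights boston"]
hashed = [hash_query(q) for q in queries]

# Frequencies and co-occurrences are identical to those of the plaintext.
token_counts = Counter(t for h in hashed for t in h.split())
pair_counts = Counter(
    pair
    for h in hashed
    for pair in combinations(sorted(set(h.split())), 2)
)
```

The most frequent hashed token here is the hash of "cheap", so an attacker who knows that "cheap" is a common query word can guess the mapping without ever inverting the hash.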
K-anonymity and deleting infrequent queries
• Unlikely that many people search for a specific individual’s identifiers.
• Algorithm: Suppress all queries which are posed by at most K users.
• … But a combination of frequent queries can still identify an individual …
• Solution: Split a user's log into smaller ones … based on query sessions
– A query session is a set of queries that are related to each other.
[Adar WWW 07]
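The suppression rule on this slide can be sketched in a few lines of Python. The (anonid, query) pair representation and the sample log are assumptions; the threshold follows the slide's wording, so a query posed by at most K distinct users is dropped.

```python
from collections import defaultdict


def suppress_infrequent(log, k):
    """Drop every record whose query was posed by at most k distinct users."""
    users_per_query = defaultdict(set)
    for anonid, query in log:
        users_per_query[query].add(anonid)
    return [(a, q) for a, q in log if len(users_per_query[q]) > k]


# Hypothetical log of (anonid, query) pairs.
log = [
    ("u1", "weather"),
    ("u2", "weather"),
    ("u3", "weather"),
    ("u1", "jane doe 12 elm street"),
    ("u2", "flu symptoms"),
    ("u3", "flu symptoms"),
]
kept = suppress_infrequent(log, k=2)
```

Counting distinct users rather than raw occurrences matters: one user repeating an identifying query many times should not make it look popular.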
TrackMeNot
• Users send noise queries in addition to real queries
• TrackMeNot is a browser plugin which posts queries to search engines
Problems:
• Distribution of noisy queries is different from distribution of actual queries … so noise can be removed
• Imposes load on the search engine
• Query log loses utility …
[Howe & Nissenbaum 08]
Outline
• Uses of search query logs
• Privacy and search logs
• Privacy-Enhancing Techniques
• Differentially private algorithms
Differential Privacy and Search Logs
• Consider two databases that differ in the log of one user
• In the worst case all queries are by the same user
– Sensitivity = |query log|
– No utility can be guaranteed!
• Pick at most m queries from each user
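A short Python sketch of why the per-user cap helps. It compares how much the query histogram changes when one user's log is removed, with and without the cap. Keeping the first m queries, the removal-based notion of neighboring databases, and the sample logs are assumptions of this sketch.

```python
from collections import Counter


def truncate(logs, m):
    """Keep at most m queries per user (the first m, a choice made for this sketch)."""
    return {user: queries[:m] for user, queries in logs.items()}


def histogram(logs):
    """Count how often each query appears across all users."""
    return Counter(q for queries in logs.values() for q in queries)


def l1_distance(h1, h2):
    """Total absolute difference between two query histograms."""
    return sum(abs(h1[q] - h2[q]) for q in set(h1) | set(h2))


# Hypothetical logs: one user issues almost all of the queries.
logs = {
    "heavy_user": ["q%d" % i for i in range(1000)],
    "light_user": ["weather", "news"],
}
without_heavy = {u: qs for u, qs in logs.items() if u != "heavy_user"}

m = 5
raw_change = l1_distance(histogram(logs), histogram(without_heavy))
bounded_change = l1_distance(
    histogram(truncate(logs, m)), histogram(truncate(without_heavy, m))
)
```

Without the cap, one user shifts the histogram by the size of their whole log; with it, the shift is at most m, so the noise needed to hide any single user no longer grows with the log.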
Differential Privacy and Search Logs
• Domain of search terms is very large. Hence no differentially private algorithm is “useful”.
• Consider the problem of publishing all queries posted by at least τ users each.
Theorem: [Gotz et al TKDE 2012] For a sufficiently large domain size, the accuracy of any differentially private algorithm is worse than that of an algorithm that always returns the empty set!
Probabilistic Differential Privacy
For every pair of inputs D1, D2 that differ in one value:
$$\Pr\left[\, O \;:\; \frac{\Pr[D_1 \to O]}{\Pr[D_2 \to O]} < e^{\varepsilon} \,\right] > 1 - \delta$$
• The bound on the ratio holds for every probable output O.
• Adversary may distinguish between D1 and D2 based on a set of unlikely outputs with probability at most δ.
Publishing Frequent queries/clicks
[Korolova et al WWW 2009, Gotz et al TKDE 2012]
Privacy
• The algorithm presented in the previous slide guarantees (ε,δ)-probabilistic differential privacy if [condition not preserved in this transcript]
– Where U is the number of users, m is the maximum number of queries per user, λ is the Laplace noise parameter, and τ, τ’ are the two thresholds used by the algorithm
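The algorithm slide itself did not survive as text, so the following Python sketch is pieced together from the parameters named here (m, λ, τ, τ’) and from the publication rule on the Utility slide. The step order, the use of distinct queries per user, and all parameter values are assumptions; the values are illustrative and are not calibrated to any particular (ε, δ).

```python
import random
from collections import Counter


def laplace(scale, rng):
    """Zero-mean Laplace noise: the difference of two i.i.d. exponentials
    with mean `scale` is Laplace-distributed with that scale."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def publish_frequent_queries(logs, m, tau, tau_prime, lam, rng):
    """Two-threshold publication sketch (assumed step order)."""
    counts = Counter()
    for queries in logs.values():
        # Step 1: each user contributes at most m distinct queries.
        distinct = list(dict.fromkeys(queries))[:m]
        counts.update(distinct)

    published = {}
    for query, count in counts.items():
        # Step 2: discard queries whose true count is below tau.
        if count < tau:
            continue
        # Step 3: add Laplace noise and publish only above tau_prime.
        noisy = count + laplace(lam, rng)
        if noisy > tau_prime:
            published[query] = noisy
    return published


# Hypothetical input: 200 users share one query, one user has a unique query.
logs = {"user%d" % i: ["weather"] for i in range(200)}
logs["loner"] = ["rare query"]

rng = random.Random(0)
result = publish_frequent_queries(
    logs, m=3, tau=10, tau_prime=50, lam=5.0, rng=rng
)
```

The unique query is filtered before any noise is drawn, so it can never be published, while the popular query sits far enough above τ’ that the noise almost never suppresses it.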
Utility
• Let ξ = (τ’ – τ)/3, and let τ* = τ + ξ
Any query that appears with frequency < τ* – ξ …
• Has frequency less than τ
• Is published in the output with probability 0.
Any query that appears with frequency > τ* + ξ …
• Is published if τ* + ξ + Lap(λ) > τ’
• That is, noise > ξ
• That is, query is published with probability 1 – 0.5*e^{–ξ/λ}.
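The publication rule above (publish when count + Lap(λ) > τ’, never publish below τ) can be turned into an exact probability with the Laplace tail formula. This is a minimal sketch; the function names and the numeric values are assumptions chosen for illustration.

```python
import math


def laplace_tail(t, lam):
    """Pr[Lap(lam) > t] for zero-mean Laplace noise with scale lam."""
    if t >= 0:
        return 0.5 * math.exp(-t / lam)
    return 1.0 - 0.5 * math.exp(t / lam)


def publish_probability(count, tau, tau_prime, lam):
    """Probability that a query with the given true count is published."""
    if count < tau:
        return 0.0  # removed before any noise is added
    return laplace_tail(tau_prime - count, lam)


# Illustrative values only.
tau, tau_prime, lam = 10, 40, 5.0
xi = 10

p_below_tau = publish_probability(5, tau, tau_prime, lam)
p_at_threshold = publish_probability(tau_prime, tau, tau_prime, lam)
p_above = publish_probability(tau_prime + xi, tau, tau_prime, lam)
```

A count exactly at τ’ is published half the time, and a count ξ above τ’ is published with probability 1 – 0.5*e^{–ξ/λ}, so a smaller λ or a larger margin pushes the probability toward 1.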
Utility [Gotz et al TKDE 2012]
Distributions are significantly different
Web Caching scenario
• Speed up web search, by storing the results for most frequent queries.
• Each keyword is given a score based on frequency in the (anonymous) log.
• Top few keywords are maintained in memory …
[Gotz et al TKDE 2012]
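The caching scenario can be sketched as a top-k lookup table built from the scores in the anonymized log. The function names, the stand-in backend, and the scores (shown as noisy, non-integer counts) are assumptions for illustration.

```python
from collections import Counter


def build_cache(keyword_scores, capacity, fetch_results):
    """Precompute results for the `capacity` highest-scoring keywords."""
    top = Counter(keyword_scores).most_common(capacity)
    return {keyword: fetch_results(keyword) for keyword, _ in top}


def search(keyword, cache, fetch_results):
    """Return (results, cache_hit) for a keyword."""
    if keyword in cache:
        return cache[keyword], True
    return fetch_results(keyword), False


def fake_backend(keyword):
    """Stand-in for the real search backend (hypothetical)."""
    return ["result for " + keyword]


# Hypothetical scores taken from an anonymized log.
scores = {"weather": 120.4, "news": 95.1, "maps": 60.2, "flu symptoms": 51.7}
cache = build_cache(scores, capacity=2, fetch_results=fake_backend)

hit = search("weather", cache, fake_backend)
miss = search("maps", cache, fake_backend)
```

Because only the ranking of the top keywords matters here, the cache stays useful as long as anonymization preserves the order of the most frequent queries, even if the exact counts are perturbed.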
Summary
• Publishing search logs can lead to very useful applications
– Web
– Social Science
– …
• Search logs contain very sensitive information, and individuals are easily identifiable.
• Simple techniques do not provide sufficient protection
• Differentially private techniques throw away a significant amount of data
– Only m queries per person
– All tail queries (with low frequency) are thrown away
References
F. Silvestri, “Mining Query Logs: Turning Search Usage Data into Knowledge”, Foundations and Trends 4 (1-2), 2010
J. Ginsberg, M. Mohebbi, R. Patel, L. Brammer, M. Smolinski, L. Brilliant, “Detecting influenza epidemics using search engine query data”, Nature, vol. 457, Feb 2009
B.-C. Chen, D. Kifer, K. LeFevre, A. Machanavajjhala, “Privacy-Preserving Data Publishing”, Foundations and Trends® in Databases, Vol. 2, No. 1-2, pp 1-167, 2009
R. Jones, R. Kumar, B. Pang, A. Tomkins, “I know what you did last summer: query logs and user privacy”, CIKM 2007
R. Kumar, J. Novak, B. Pang, A. Tomkins, “On anonymizing query logs via token-based hashing”, WWW 2007
E. Adar, “User 4xxxx9: Anonymizing query logs”, WWW 2007
D. Howe, H. Nissenbaum, “TrackMeNot: Resisting surveillance in web search”, 2008
A. Korolova, K. Kenthapadi, N. Mishra, A. Ntoulas, “Releasing Search Queries and Clicks Privately”, WWW 2009
M. Gotz, A. Machanavajjhala, G. Wang, X. Xiao, J. Gehrke, “Publishing Search Logs”, TKDE 2012