Web-Based Inference Detection


Jessica Staddon 1, Philippe Golle 1, Bryce Zimny 2

1 Palo Alto Research Center, {staddon, pgolle}@parc.com

2 University of [email protected] (This work was done while a co-op student at the Palo Alto Research Center.)

Abstract

Newly published data, when combined with existing public knowledge, allows for complex and sometimes unintended inferences. We propose semi-automated tools for detecting these inferences prior to releasing data. Our tools give data owners a fuller understanding of the implications of releasing data and help them adjust the amount of data they release to avoid unwanted inferences.

Our tools first extract salient keywords from the private data intended for release. Then, they issue search queries for documents that match subsets of these keywords, within a reference corpus (such as the public Web) that encapsulates as much of relevant public knowledge as possible. Finally, our tools parse the documents returned by the search queries for keywords not present in the original private data. These additional keywords allow us to automatically estimate the likelihood of certain inferences. Potentially dangerous inferences are flagged for manual review.

We call this new technology Web-based inference control. The paper reports on two experiments which demonstrate early successes of this technology. The first experiment shows the use of our tools to automatically estimate the risk that an anonymous document allows for re-identification of its author. The second experiment shows the use of our tools to detect the risk that a document is linked to a sensitive topic. These experiments, while simple, capture the full complexity of inference detection and illustrate the power of our approach.

    1 Introduction

Information has never been easier to find. Search engines allow easy access to the vast amounts of information available on the Web. Online data repositories, newspapers, public records, personal webpages, blogs, etc., make it easy and convenient to look up facts, keep up with events and catch up with people.

On the flip side, information has never been harder to hide. With the help of a search engine or web information integration tool [45], one can easily infer facts, reconstruct events and piece together identities from fragments of information collected from disparate sources. Protecting information requires hiding not only the information itself, but also the myriad of clues that might indirectly lead to it. Doing so is notoriously difficult, as seemingly innocuous information may give away one's secret.

To illustrate the problem, consider a redacted biography [8] (shown in the left-hand side of figure 6) that was released by the FBI. Prior to publication, the biography was redacted to protect the identity of the person whom it describes. All directly identifying information, such as first and last names, was expunged from the biography. The redacted biography contains only keywords that apply to many individuals, such as "half-brother", "Saudi", "magnate" and "Yemen". None of these keywords is particularly identifying on its own, but in aggregate they allow for near-certain identification of Osama Bin Laden. Indeed, a Google search for the query "Saudi magnate half-brother" returns, in its top 10 results, pages that are all related to the Bin Laden family. This inference, as well as potentially many others, should be anticipated and countered in a thorough redaction process.

The need to protect secret information from unwanted inferences extends far beyond the FBI. In addition to intelligence agencies and the military, numerous government agencies, businesses and individuals face the problem of insulating their secrets from the information they disclose publicly. In the litigation industry, for example, information protected by client-attorney privilege must be redacted from documents prior to disclosure. In the healthcare industry, it is common practice, and mandated by some US state laws, to redact sensitive information


(such as HIV status, drug or alcohol abuse and mental health conditions) from medical records prior to releasing them. Among individuals, anonymous bloggers are a good example of people who seek to ensure that their posts do not disclose their secret (their identity). This is made challenging by the fact that in some cases very little personal information may suffice to infer the blogger's identity. For example, if the second author of this paper were to reveal his first name (Philippe) and mention the first name of his wife (Sanae), then his last name (or at least, a strong candidate for his last name) can be inferred from the first hit returned by the Google query, "Philippe Sanae wedding".

In all these instances, the problem is not access control, but inference control. Assuming the existence of mechanisms to control access to a subset of information, the problem is to determine what information can be released publicly without compromising certain secrets, and what subset of the information cannot be released. What makes this problem difficult is the quantity and complexity of inferences that arise when published data is combined with, and interpreted against, the backdrop of public knowledge and outside data.

This paper breaks new ground in considering the problem of inference detection not in a restricted setting (such as, e.g., database tables), but in all its generality. We propose the first all-purpose approach to detecting unwanted inferences. Our approach is based on the observation that the combination of search engines and the Web, which is so well suited to detecting inferences, works equally well defensively as offensively. The Web is an excellent proxy for public knowledge, since it encapsulates a large fraction of that knowledge (though certainly not all). Furthermore, the dynamic nature of the Web reflects the dynamic nature of human knowledge and means that the inferences detected today may be different from those drawn yesterday. The likelihood of certain inferences can thus be estimated automatically, at any point in time, by issuing search queries to the Web. Returning to the example of the biography redacted by the FBI, a simple search query could have flagged the risk of re-identification coming from the keywords "Saudi", "magnate" and "half-brother".

The Web is an ideal resource for identifying inferences because keyword search allows for efficient detection of the information that is associated with an individual. Such associations can be just as important in identifying someone as their personal attributes. As an example, consider the fact that the top 2 hits returned by the Google query "pop singer vogueing" have nothing to do with the singer Madonna, whereas the top 3 hits returned by the Google query "gay pop singer vogueing" all pertain to Madonna. The attribute "gay" helps to focus the results not because it is an attribute of Madonna (at least not as it is used in the top 3 hits) but rather because it is an attribute associated with a large subset of her fan base. Similarly, the entire first page of hits returned by the query "naltrexone acamprosate" all pertain to alcoholism, not because these are alcoholism symptoms or in some other way part of the definition of alcoholism, but rather because they are drugs commonly used in its treatment.

We propose generic tools for detecting unwanted inferences automatically using the Web. These tools first extract salient keywords from the private data intended for release. Then, they issue search queries for documents that match subsets of these keywords, within a reference corpus (such as the public Web) that encapsulates as much of relevant public knowledge as possible. Finally, our tools parse the documents returned by the search queries for keywords not present in the original private data. These additional keywords allow us to automatically estimate the likelihood of certain inferences. Potentially dangerous inferences are flagged for manual review. We call this new technology Web-based inference control.

We demonstrate the success of our inference detection tools with two experiments. The first experiment shows the use of our tools to automatically estimate the risk that an anonymous document allows for re-identification of its author. The second experiment shows the use of our tools to detect the risk that a document is linked to a sensitive topic. These experiments, while simple, capture the full complexity of inference detection and illustrate the power of our approach.

OVERVIEW. We discuss related work in section 2. We define our models and tools, as well as our basic algorithm for Web-assisted inference detection, in section 3. We list a number of potential applications of Web-assisted inference control in section 4. Section 5 describes two experiments that demonstrate the success of our inference control tools. Section 6 provides an example using Web-based inference detection to improve the redaction process. We conclude in section 7.

    2 Related Work

Our work can be viewed both as a new technique for inference detection and as a new way of leveraging Web search to understand content. There is substantial existing work in both areas, but ours is the first Web-based approach to inference detection. We discuss the most closely related work in these areas below.

INFERENCE DETECTION. Most of the previous work on inference detection has focused on database content (see, for example, [33, 21, 43, 19]). Work in this area takes as input the database schema, the data themselves and,


sometimes, relations amongst the attributes of the database that are meant to model the outside knowledge a human may wield in order to infer sensitive information. To the best of our understanding, no systematic method has been demonstrated for integrating this outside knowledge into an inference detection system. Our work seeks to remedy this by demonstrating the use of the Web for this purpose. When coupled with simple keyword extraction, this general technique allows us to detect inferences in a variety of unstructured documents.

A particular type of inference allows the identification of an individual. Sweeney looks for such inferences using the Web in [35], where inferences are enabled by numerical values and other attributes characterizable by regular expressions, such as SSNs, account numbers and addresses. Sweeney does not consider inferences based on English language words. We use the indexing power of search engines to detect when words, taken together, are closely associated with an individual.

The closely related problem of author identification has also been extensively studied by the machine learning community (see, for example, [25, 11, 24, 34, 20]). The techniques developed generally rely on a training corpus of documents and use specific attributes like self-citations [20] or writing style [25] to identify authors. Our work can be viewed as exploiting a previously unstudied method of author identification: using information authors reveal about themselves to identify them.

Atallah et al. [2] describe how natural language processing can potentially be used to sanitize sensitive information when the sanitization rules are already known. Our work is focused on using the Web to identify the sanitization rules.

WEB-ASSISTED QUERY INTERPRETATION. There is a large body of work on using the Web to improve query results (see, for example, [16, 32, 10]). One of the fundamental ideas that has come out of this area is to use overlap in query results to establish a connection between distinct queries. In contrast, we analyze the content of the query results in order to detect connections between the query terms and an individual or topic.

WEB-BASED SOCIAL NETWORK ANALYSIS. Recently, the Web has been used to detect social networks (e.g., [1, 23]). A key idea in this work is using the Web to look for co-occurrences of names and using these to infer a link in a social network. Our techniques can support this type of analysis, for example when the names in a network, entered as a Web query, yield a name that is not already in the network. However, our techniques are aimed at a broader goal, that is, understanding all inferences that can be drawn from a document.

WEB-ASSISTED CONTENT ANALYSIS AND ANNOTATION. There is a large body of work on using the Web to understand and analyze content. Nakov and Hearst [30] have shown the power of using the Web as training data for natural language analysis. Web assistance for extracting keywords for the purposes of content indexing and annotation is studied in [12, 37, 26]. This work is focused on automated, Web-based tools for understanding the meaning of the text as written, as opposed to the inferences that can be drawn based on the text. That said, in our work we use very simple content analysis tools, and improvements to our approach could involve more sophisticated content analysis tools, including Web-based tools such as those developed in these works.

WEB-BASED DATA AGGREGATION. Finally, we note that the commercial world is beginning to offer Web-based data aggregation tools (see, for example, [14, 13, 31]) for the purposes of tracking competitor behavior, doing market analysis and intelligence gathering. We are not aware of support for pre-production inference control in these offerings, as is the focus of this paper.

    3 Model and Generic Algorithm

Let C denote a private collection of documents that is being considered for public release, and let R denote a collection of reference documents. For example, the collection C may consist of the blog entries of a writer, and the collection R may consist of all documents publicly available on the Web.

Let K(C) denote all the knowledge that can be computed from the private collection C. The set K(C) informally represents all the statements and facts that can be logically derived from the information contained in the collection C. The set K(C) could in theory be computed with a complete and sound theorem prover given all the axioms in C. In practice, such a computation is impossible and we will instead rely on approximate representations of the set K(C). Similarly, let K(R) denote all the knowledge that can be computed from the reference collection R.

Informally stated, the problem of inference control comes from the fact that the knowledge that can be extracted from the union of the private and reference collections, K(C ∪ R), is typically greater than the union K(C) ∪ K(R) of what can be extracted separately from C and R. The inference control problem is to understand and control the difference:

Diff(C, R) = K(C ∪ R) \ (K(C) ∪ K(R)).

Returning to the Osama Bin Laden example discussed in the introduction, consider the case where the collection C consists of the single declassified FBI document [8], and where R consists of all information publicly available on the Web. Let S denote the statement:


"The declassified FBI document is a biography of Osama Bin Laden." Since the identity of the person to whom the document pertains has been redacted, it is impossible to learn the statement S from C alone, and so S ∉ K(C). The statement S is clearly not in K(R) either, since it is impossible to compute from R alone a statement about a document that is in C but not in R. It follows that S does not belong to K(C) ∪ K(R). But, as shown earlier, the statement S belongs to K(C ∪ R). Indeed, we learn from C that the document pertains to an individual characterized by the keywords "Saudi", "magnate", "half-brothers", "Yemen", etc. We learn from R that these keywords are closely associated with Osama Bin Laden. If we combine these two sources of information, we learn that the statement S is true with high probability.

It is critical to understand Diff(C, R) prior to publishing the collection C of private documents, to ensure that the publication of C does not allow for unwanted inferences. The owner of C may choose to withhold from publication parts or all of the documents in the collection based on an assessment of the difference Diff(C, R). Sometimes, the set of sensitive knowledge K that should not be leaked is explicitly specified. In this case, the inference control problem consists more precisely of ensuring that the intersection Diff(C, R) ∩ K is empty.

    3.1 Basic Approach

In this work, we consider the case in which C can be an arbitrary collection of documents. In particular, contrary to prior work on inference control in databases, we do not restrict ourselves to private documents formatted according to a well-defined structure. We assume that the collection R of public documents consists of all publicly available documents, and that the public Web serves as a good proxy for this collection. Our generic approach to inference detection is based on the following two steps:

1. UNDERSTANDING THE CONTENT OF THE DOCUMENTS IN THE PRIVATE COLLECTION C. We employ automated content analysis in order to efficiently extract keywords that capture the content of the documents in the collection C. A wide array of NLP tools are possible for this process, ranging from simple text extraction to deep linguistic analysis. For the proof-of-concept demonstrations described in section 5, we employ keyword selection via a term frequency-inverse document frequency (TF.IDF) calculation, but we note that a deeper linguistic analysis may produce better results.

2. EFFICIENTLY DETERMINING THE INFERENCES THAT CAN BE DRAWN FROM THE COMBINATION OF C AND R. We issue search queries for documents that match subsets of the keywords extracted in step 1, within a reference corpus (such as the public Web) that encapsulates as much of relevant public knowledge as possible. Our tools then parse the documents returned by the search queries for keywords not present in the original private data. These additional keywords allow us to automatically estimate the likelihood of certain inferences. Potentially dangerous inferences are flagged for manual review.

    3.2 Inference Detection Algorithm

In this section, we give a generic description of our inference detection algorithm. This description emphasizes conceptual understanding. Specific instantiations of the inference detection algorithm, tailored to two particular applications, are given in section 5. These instantiations do not realize the full complexity of this general algorithm, partly for efficiency reasons and partly because of the attributes of the application. We start with a description of the inputs, outputs and parameters of our generic algorithm.

INPUTS: A private collection of documents C = {C_1, ..., C_n}, a collection of reference documents R, and a list of sensitive keywords K that represent sensitive knowledge.

OUTPUT: A list L of inferences that can be drawn from the union of C and R. Each inference is of the form:

(W_1, ..., W_k) → K_0,

where W_1, ..., W_k are keywords extracted from documents in C, and K_0 ⊆ K is a subset of sensitive keywords. The inference (W_1, ..., W_k) → K_0 indicates that the keywords (W_1, ..., W_k), found in the collection C, together with the knowledge present in R, allow for inference of the sensitive keywords K_0. The algorithm returns an empty list if it fails to detect any sensitive inference.

PARAMETERS: The algorithm is parameterized by a value α that controls the depth of the NLP analysis of the documents in C, by two values β and γ that control the search depth for documents in R that are related to C, and finally by a value δ that controls the depth of the NLP analysis of the documents retrieved by the search algorithm. The values α, β, γ and δ are all positive integers. They can be tuned to achieve different trade-offs between the running time of the algorithm and the completeness and quality of inference detection.

UNDERSTANDING THE DOCUMENTS IN C. Our basic algorithm uses TF.IDF (term frequency-inverse document frequency, see [28] and section 5.1) to extract from each document C_i in the collection C the top α keywords that are most representative of C_i. Let S_i denote the set of the top keywords extracted from document C_i, and let S = S_1 ∪ ... ∪ S_n.

INFERENCE DETECTION. The list L of inferences is initially empty. We consider in turn every subset S' ⊆ S of size |S'| ≤ β. For every such subset S' = (W_1, ..., W_k), with k ≤ β, we do the following:

1. We use a search engine to retrieve from the collection R of reference documents the top γ documents that contain all the keywords W_1, ..., W_k.

2. With TF.IDF, we extract the top δ keywords from this collection of documents. Note that these keywords are extracted from the aggregate collection of documents (as if all these documents were concatenated into a single large document), not from each individual document.

3. Let K_0 denote the intersection of the keywords from step 2 with the set K of sensitive keywords. If K_0 is non-empty, we add to L the inference (W_1, ..., W_k) → K_0.

The algorithm outputs the list L and terminates.
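To make the control flow of the generic algorithm concrete, the sketch below restates its three steps in code. It is a minimal illustration under stated assumptions, not the authors' implementation: the SearchEngine and KeywordExtractor interfaces, the class names and the parameter names alpha, beta, gamma and delta (mirroring α, β, γ and δ above) are stand-ins for whatever Web search API and TF.IDF extractor are actually used.

import java.util.*;

// Minimal sketch of the generic inference detection loop of section 3.2.
// SearchEngine and KeywordExtractor are hypothetical interfaces standing in
// for a real Web search API and a TF.IDF-based keyword extractor.
interface SearchEngine {
    // Text of the top 'limit' reference documents containing all query terms.
    List<String> topDocuments(Set<String> queryTerms, int limit);
}

interface KeywordExtractor {
    // The 'limit' highest-ranked keywords of a document (e.g. by TF.IDF).
    List<String> topKeywords(String document, int limit);
}

class Inference {
    final Set<String> triggerKeywords;   // keywords (W_1, ..., W_k) found in C
    final Set<String> inferredSensitive; // sensitive keywords K_0 they lead to
    Inference(Set<String> t, Set<String> s) { triggerKeywords = t; inferredSensitive = s; }
}

class InferenceDetector {
    // alpha: keywords kept per private document; beta: maximum subset size queried;
    // gamma: documents retrieved per query; delta: keywords kept per result set.
    List<Inference> detect(List<String> privateDocs, Set<String> sensitive,
                           SearchEngine engine, KeywordExtractor extractor,
                           int alpha, int beta, int gamma, int delta) {
        // Step 1: understand the documents in C.
        Set<String> s = new LinkedHashSet<>();
        for (String doc : privateDocs) s.addAll(extractor.topKeywords(doc, alpha));

        // Steps 2-3: query every subset of size <= beta and check for sensitive keywords.
        List<Inference> result = new ArrayList<>();
        for (Set<String> subset : subsetsUpTo(new ArrayList<>(s), beta)) {
            List<String> docs = engine.topDocuments(subset, gamma);
            String aggregate = String.join("\n", docs); // treat results as one large document
            Set<String> found = new HashSet<>(extractor.topKeywords(aggregate, delta));
            found.retainAll(sensitive);                 // K_0 = extracted keywords intersected with K
            if (!found.isEmpty()) result.add(new Inference(subset, found));
        }
        return result;
    }

    // All non-empty subsets of 'items' with at most 'maxSize' elements.
    private List<Set<String>> subsetsUpTo(List<String> items, int maxSize) {
        List<Set<String>> subsets = new ArrayList<>();
        subsets.add(new LinkedHashSet<>());
        for (String item : items) {
            int n = subsets.size();
            for (int i = 0; i < n; i++) {
                if (subsets.get(i).size() >= maxSize) continue;
                Set<String> extended = new LinkedHashSet<>(subsets.get(i));
                extended.add(item);
                subsets.add(extended);
            }
        }
        subsets.remove(0); // drop the empty set
        return subsets;
    }
}

Note that the exhaustive enumeration of subsets is exponential in the number of extracted keywords; the instantiations in section 5 avoid this by querying only small, fixed-size subsets (pairs of keywords, or prefixes of the ranked keyword list).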

    3.3 Variants of the Algorithm

The algorithm of section 3.2 can be tailored to a variety of applications. Two such applications are discussed in exhaustive detail in section 5. Here, we briefly discuss other possible variants of the basic algorithm.

DETECTING ALL INFERENCES. In some applications, the set of sensitive knowledge K may not be known or may not be specified. Instead, the goal is to identify all possible inferences that arise from knowledge of the collection of documents C and the reference collection R. A simple variation of the algorithm given in section 3.2 handles this case. In step 3 of the inference detection phase, we record all inferences instead of only inferences that involve keywords in K. Note that this is equivalent to assuming that the set K of sensitive knowledge consists of all knowledge. The algorithm may also track the number of occurrences of each inference, so that the list L can be sorted from most to least frequent inference.

ALTERNATIVE REPRESENTATION OF SENSITIVE KNOWLEDGE. The algorithm of section 3.2 assumes that the sensitive knowledge K is given as a set of keywords. Other representations of sensitive knowledge are possible. In some applications, for example, sensitive knowledge may consist of a topic (e.g. "alcoholism" or "sexually transmitted diseases") instead of a list of keywords. To handle this case, we need a pre-processing step which converts a sensitive topic into a list of sensitive keywords. One way of doing so is to issue a search query for documents in the reference collection R that contain the sensitive topic, then use TF.IDF to extract from these documents an expanded set of sensitive keywords; a sketch of this step follows.
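One possible shape for that pre-processing step is sketched below. The functional parameters stand in for a Web search call and a TF.IDF keyword extractor; the method name and the choice to merge the retrieved documents before extraction are illustrative assumptions, not details taken from the paper.

import java.util.*;
import java.util.function.BiFunction;

// Sketch of the topic-expansion pre-processing step: retrieve reference
// documents that mention a sensitive topic, then extract keywords from them
// as an expanded set of sensitive keywords.
class TopicExpansion {
    static Set<String> expandTopic(String topic,
                                   BiFunction<Set<String>, Integer, List<String>> searchTopDocs,
                                   BiFunction<String, Integer, List<String>> topKeywords,
                                   int docLimit, int keywordLimit) {
        // Documents in the reference collection R that contain the topic.
        List<String> docs = searchTopDocs.apply(Set.of(topic), docLimit);
        // Treat the results as one large document and keep its top keywords.
        String aggregate = String.join("\n", docs);
        Set<String> expanded = new LinkedHashSet<>(topKeywords.apply(aggregate, keywordLimit));
        expanded.add(topic); // the topic itself remains a sensitive keyword
        return expanded;
    }
}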

    4 Example Applications

This section describes a wide array of potential applications for Web-based inference detection. All these applications are based on the fundamental algorithm of section 3. The first two applications are the subjects of the experiments described in detail in section 5. Experimenting with other applications will be the subject of future work.

REDACTION OF MEDICAL RECORDS. Medical records are often released to third parties such as insurance companies, research institutions or legal counsel in the case of malpractice lawsuits. State and federal legislation mandates the redaction of sensitive information from medical records prior to release. For example, all references to drugs and alcohol, mental health and HIV status must typically be redacted. This redaction task is far more complex than it may initially appear. Extensive and up-to-date knowledge of diseases and drugs is required to detect all clues and combinations of clues that may allow for inference of sensitive information. Since this medical information is readily available on public websites, the process of redacting sensitive information from medical records can be partially automated with Web-based inference control. Section 5.3 reports on our experiments with Web-based inference detection for medical redaction.

PRESERVING INDIVIDUAL ANONYMITY. Intelligence and other governmental agencies are often forced by law (such as the Freedom of Information Act) to publicly release documents that pertain to a particular individual or group of individuals. To protect the privacy of those concerned, the documents must be released in a form that does not allow for unique identification. This problem is notoriously difficult, because seemingly innocuous information may allow for unique identification, as illustrated by the poorly redacted Osama Bin Laden biography [8] discussed in the introduction. Web-based inference control is perfectly suited to the detection of indirect inferences based on publicly available data. Our tools can be used to determine how much information can be released about a person, entity or event while preserving k-anonymity, i.e. ensuring that the person, entity or event remains hidden in a group of like entities of size at least k, and cannot be identified any more precisely within the group. Section 5.2 reports on our experiments with Web-based inference detection for preserving individual anonymity.


FORMULATION OF REDACTION RULES. Our Web-based inference detection tools can also be used to pre-compute a set of redaction rules that is later applied to a collection of private documents. For a large collection of private documents, pre-computing redaction rules may be more efficient than using Web-based inference detection to analyze each and every document. In 1995, for example, executive order 12958 mandated the declassification of large amounts of government data [9] (hundreds of millions of pages). Sensitive portions of documents were to be redacted prior to declassification. The redaction rules were exceedingly complex and formulating them was reportedly nearly as time-consuming as applying them. Web-based inference detection is an appealing approach to automatically expanding a small set of seed redaction rules. For example, assuming that the keyword "missile" is sensitive, Web-based inference detection could automatically retrieve other keywords related to missiles (e.g. "guidance system", "ballistics", "solid fuel") and add them to the redaction rule.

PUBLIC IMAGE CONTROL. This application considers the problem of verifying that a document conforms to the intentions of its author, and does not accidentally reveal private information or information that could easily be misinterpreted or understood in the wrong context. This application, unlike the others, does not assume that the set of unwanted inferences is known or explicitly defined. Instead, the goal of this application is to design a broad, general-purpose tool that helps contextualize information and may draw an author's attention to a broad array of potentially unwanted inferences. For example, Web-based inference detection could alert the author of a blog to the fact that a particular posting contains a combination of keywords that will make the blog appear prominently in the results of some search query. This problem is related to other approaches to public image management, such as [13, 31]. Few technical details have been published about these other approaches, but they do not appear focused on inference detection and control.

LEAK DETECTION. This application helps a data owner avoid accidental releases of information that was not previously public. In this application of Web-based inference control, the set of sensitive knowledge K consists of all information that was not previously public. In other words, the release of private data should not add anything to public knowledge. This application may have helped prevent, for example, a recent incident in which Google accidentally released confidential financial information in the notes of a PowerPoint presentation distributed to financial analysts [22].

    5 Experiments

Our experiments focus on exploring the first two privacy monitor applications of section 4: redaction of medical records and preserving individual anonymity. In testing these ideas, we faced two main challenges that constrained our experimental design. First, and most challenging, was designing relevant experiments that we could execute given available data. The second, more pragmatic, challenge was getting the right tools in place and executing the experiments in a time-efficient manner. We describe each of these challenges, and our approach to meeting them, in more detail below.

5.1 Experimental Design Challenges and Tools

Ideally, our idea of Web-based inference detection would be tested on authentic documents for which privacy is a chief concern. For example, a corpus of medical records being prepared for release in response to a subpoena would be ideal for evaluating the ability of our techniques to identify sensitive topics. However, such a corpus is hard to come by for obvious reasons. Similarly, a collection of anonymous blogs would be ideal for testing the ability of our techniques to identify individuals, but such blogs are hard to locate efficiently. Indeed, the excitement over the recently released AOL search data, as illustrated by the quick appearance of tools for mining the data (see, for example, [44, 4]), demonstrates the widespread difficulty in finding data appropriate for vetting data mining technologies, of which our inference detection technology is an example.

Given the difficulties of finding unequivocally sensitive data on which to test our algorithms, we instead used publicly available information about an individual, which we anonymized by removing the individual's first and last names. In most cases, the public information about the individual, thus anonymized, appeared to be a decent substitute for text that the individual might have authored on their blog or Web page.

All of our experiments rely on Java code we wrote for extracting text from html, on calculation of an extended form of TF.IDF (see definition below) for identifying keywords in documents, and on the Google SOAP search API [18] for making Web queries based on those keywords.

Our code for extracting text from html uses standard techniques for removing html tags. Because our experiments involved repeated extractions from similarly formatted html pages (e.g. Wikipedia biographies), it was most expedient to write our own code, customized for those pages, rather than retrofitting existing text extraction code such as is available in [3].


As mentioned above, in order to determine if a word is a keyword we use the well-known TF.IDF metric (see, for example, [28]). The TF.IDF rank of a word in a document is defined with respect to a corpus, C. We state the definition next.

Definition 1. Let D be a document that contains the word W and is part of a corpus of documents, C. The term frequency (TF) of W with respect to D is the number of times W occurs in D. The document frequency (DF) of W with respect to the corpus, C, is the total number of documents in C that contain the keyword W. The TF.IDF value associated with W is the ratio TF/DF.

Our code implements a variant of TF.IDF in which we first use the British National Corpus (BNC) [27] to stem lexical tokens (e.g. the tokens "accuse", "accused", "accuses" and "accusing" would all be mapped to the stem "accuse"). We then use the BNC again to associate with each token the DF of the corresponding stem (i.e. "accuse" in the earlier example).

As with text extraction from html, there are open source (and commercial) offerings for calculating TF.IDF based on a reference corpus. We did not, however, have a reference corpus on which to base our calculations, and thus opted to write our own code to compute TF.IDF based on the DF values reported in the BNC (which is an excellent model for the English language as a whole, and thus presumably also for text found on the Web).
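The following sketch shows one way to implement Definition 1 with the stemming variant just described. It is an illustrative assumption rather than the authors' code: the token-to-stem map and the per-stem document frequencies are passed in as plain tables, standing in for the British National Corpus data used in the experiments.

import java.util.*;

// Sketch of the TF.IDF variant of section 5.1: tokens are mapped to stems,
// term frequencies are counted in the input document, and each stem's
// document frequency (DF) is looked up in an external table (the BNC in the paper).
class TfIdfRanker {
    private final Map<String, String> stemOf;    // token -> stem, e.g. "accused" -> "accuse"
    private final Map<String, Integer> dfOfStem; // stem -> document frequency in the reference corpus

    TfIdfRanker(Map<String, String> stemOf, Map<String, Integer> dfOfStem) {
        this.stemOf = stemOf;
        this.dfOfStem = dfOfStem;
    }

    // Return the top 'limit' stems of 'text', ranked by TF/DF (Definition 1).
    List<String> topKeywords(String text, int limit) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.toLowerCase().split("[^a-z]+")) {
            if (token.isEmpty()) continue;
            String stem = stemOf.getOrDefault(token, token);
            tf.merge(stem, 1, Integer::sum);
        }
        List<String> ranked = new ArrayList<>(tf.keySet());
        // Sort by descending TF/DF; unknown stems get DF = 1 to avoid division by zero.
        ranked.sort(Comparator.comparingDouble(
                (String stem) -> -tf.get(stem) / (double) dfOfStem.getOrDefault(stem, 1)));
        return ranked.subList(0, Math.min(limit, ranked.size()));
    }
}

Any stemmer and DF table with this shape would work; only the TF/DF ranking itself comes from Definition 1.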

Our final challenge was experimental run-time. Although we did not invest time optimizing our text extraction code for speed, it nevertheless proved remarkably efficient in comparison with the time needed to execute Google queries and download Web pages. In addition, Google states that they place a constraint of 1,000 queries per day for each registered developer on the Google SOAP Search API service [18]. This constraint required us to amass enough Google registrations to ensure our experiments could run uninterrupted; in our case, given the varying running times of our experiments, 17 registrations proved enough. The delay caused by query execution and Web page download caused us to modify our algorithms to do a less thorough search for inferences than we had originally intended. These modifications almost certainly cause our algorithms to generate an incomplete set of inferences. However, it is also important to note that despite our efforts, our results contain some links that should have been discarded because they either don't represent new information (e.g. scrapes of the site from which we extracted keywords) or don't connect the keywords in our query to the sensitive words in a meaningful way (e.g. an online dictionary covering a broad swath of the English language). Hence, it is possible to improve upon our results by changing the parameters of our basic experiments to either do more filtering of the query results, or to analyze more of the query results and require that a majority contain the sensitive word(s).

    We describe each experiment in detail below.

5.2 Web-based De-anonymization

As discussed in section 4, one of our goals is to demonstrate how keyword extraction can be used to warn the end-user of impending identification. Our inference detection technology accomplishes this by constantly amassing keywords from online content proposed for posting by the user (e.g. blog entries) and issuing Web queries based on those keywords. The user is alerted when the hits returned by those queries contain their name, and is thus warned about the risk of posting the content.

We simulated this setting with Wikipedia biographies standing in for user-authored content. We removed the biography subject's name from the biography and viewed the personal content in the biography as a condensed version of the information an individual might reveal over many posts to their blog, for example. From these anonymized biographies we extracted keywords. Subsets of keywords formed queries to Google. A portion of the returned hits were then searched for the biography subject's name, and a flag was raised when a hit that was not a Wikipedia page contained a mention of the biography's subject. For efficiency reasons, we limited the portion and number of Web pages that were examined. In more detail, our experiment consists of the following steps:

Input: a Wikipedia biography, B.

1. Extract the subject, N, of the biography, B, and parse N into a first name, N_1, an optional middle name or middle initial, N_1', and a last name, N_2 (where a component is empty if a name in that position is not given in the biography).

2. Extract the top 20 keywords from the Wikipedia biography, B, forming the set S_B, through the following steps:

(a) Extract the text from the html.

(b) Calculate the enhanced TF.IDF ranking of each word in the extracted text (section 5.1). If present, remove N_1, N_1' and N_2 from this list, and select the top 20 words from the remaining text as the ordered set, S_B.

3. For x = 20, 19, ..., 1, issue a Google query on the top x keywords in S_B. Denote this query by Q_x. For example, if W_1, W_2, W_3 are the top 3 keywords, the Google query Q_3 is "W_1 W_2 W_3", with no additional punctuation. Let H_x be the set of hits


returned by issuing query Q_x to Google, with the restrictions that the hits consist solely of html or text and that no hits from the en.wikipedia.org Web site be returned.

4. Let H_{x,1}, H_{x,2}, H_{x,3} ∈ H_x be the first, second and third hits (respectively) resulting from query Q_x. For x = 20, 19, ..., 1, determine if H_{x,1}, H_{x,2} and H_{x,3} contain references to the subject, N, by searching for contiguous occurrences of N_1, N_1' and N_2 (meaning, no words appear in between the words in a name) within the first 5000 lines of html in each of H_{x,1}, H_{x,2} and H_{x,3}. Record any such occurrences.

Output: S_B, each query Q_x that contains N_1, N_1' and N_2 contiguously in at least one of the three examined hits, and the URL of the particular hit(s).

[Figure 1: Using 20 keywords per person, extracted from each resident's Wikipedia biography, the percentage of individuals who were identifiable based on x keywords or fewer, for x = 1, ..., 20. The graph on the left ("Identifiability of California individuals with Wikipedia biographies") shows results for the 234 biographies of California residents in Wikipedia, and the graph on the right ("Identifiability of Illinois individuals with Wikipedia biographies") shows the results for the 106 biographies of Illinois residents in Wikipedia. Each plot shows the maximum number of keywords used to identify an individual on the horizontal axis and the percentage of individuals identified on the vertical axis.]
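The query loop of steps 3 and 4 above can be sketched as follows. This is a simplified illustration under assumptions, not the experiment code: WebSearch and PageFetcher stand in for the Google SOAP Search API and the html download plus text-extraction step, and the contiguous-name check is approximated by a simple substring match.

import java.util.*;

// Sketch of steps 3-4 of the de-anonymization experiment: issue queries on
// shrinking prefixes of the ranked keyword list and scan the top hits for a
// contiguous occurrence of the subject's name.
interface WebSearch {
    // URLs of the top hits for the query, restricted to html/text pages
    // and excluding the given domain (en.wikipedia.org in the experiment).
    List<String> topHits(String query, int count, String excludedDomain);
}

interface PageFetcher {
    // The first 'maxLines' lines of the page at 'url', as plain text.
    String fetchFirstLines(String url, int maxLines);
}

class DeAnonymizationCheck {
    // Returns, for each prefix length x that identified the subject, one URL that did so.
    Map<Integer, String> run(List<String> rankedKeywords, List<String> nameParts,
                             WebSearch search, PageFetcher fetcher) {
        Map<Integer, String> identifying = new LinkedHashMap<>();
        // "Contiguous" occurrence: the name parts with no words in between
        // (whitespace normalization omitted for brevity).
        String contiguousName = String.join(" ", nameParts).toLowerCase();
        for (int x = rankedKeywords.size(); x >= 1; x--) {
            String query = String.join(" ", rankedKeywords.subList(0, x));
            for (String url : search.topHits(query, 3, "en.wikipedia.org")) {
                String text = fetcher.fetchFirstLines(url, 5000).toLowerCase();
                if (text.contains(contiguousName)) {
                    identifying.put(x, url);
                    break; // one confirming hit per query is enough
                }
            }
        }
        return identifying;
    }
}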

We ran this test on the 234 biographies of California residents and the 106 biographies of Illinois residents contained in Wikipedia. The results for both states are shown in Figure 1 and are very similar. In each case, 10 or fewer keywords (extracted from the Wikipedia biography) suffice to identify almost all the individuals. Note that the statistics in Figure 1 are based solely on the output of the code, with no human review.

We also include example results (keywords, URL, biography subject) in Figure 2. These results illustrate that the associations a person has may be as useful for identifying them as their personal attributes. To highlight one example from the figure, 50% of the first page of hits returned from the Google query "nfl nicole goldman francisco pro" are about O. J. Simpson (including the top 3 hits), but there is no reference to O. J. Simpson in any of the first page of hits returned by the query "nfl francisco pro". Hence, the association of O. J. Simpson with his wife (Nicole) and his wife's boyfriend (Goldman) is very useful for identifying him in the pool of professional football players who were once members of the San Francisco 49ers.

PERFORMANCE. In our initial studies, there was wide variation, from a few minutes to over an hour, in the total time it took to process a single biography, B, depending on the length of the Web pages returned and the number of hits. Hence, in order to efficiently process a sufficiently large number of biographies, we restricted the code to only examining the first 5000 lines of html in the returned hits from a given query, and to only searching the first 3 hits returned from any given query. With these restrictions, each biography took around 20 minutes to process, with some variation due to differences in biography length. In total, our California experiments took around 78 hours and our Illinois experiments took about 35 hours. Our experimental code does not keep track of the number of queries issued per registration, and doing so may yield better performance, because switching between registrations occurred only upon receiving a Google SOAP error and so caused some delay.

Our code was not optimized for performance and improvements are certainly possible. In particular, our main slowdown came from the text extraction step. One improvement would be to cache Web sites to avoid repeat extractions.


Keywords | URL of Top Hit | Name of Person
campaigned soviets | http://www.utexas.edu/features/archive/2004/election policy.html | Ronald Reagan
defense contra reagan | http://www.pbs.org/wgbh/amex/reagan/peopleevents/pande08.html | Caspar Weinberger
reagan attorney edit pornography | http://www.sourcewatch.org/index.php?title=Edwin Meese III | Edwin Meese
nfl nicole goldman francisco pro | http://www.brainyhistory.com/years/1997.html | O. J. Simpson
kung fu actors | http://www.amazon.com/Kung-Fu-Complete-Second-Season/dp/B0006BAWYM | David Carradine
medals medal raid honor aviation | http://www.voicenet.com/ lpadilla/pearl.html | Jimmy Doolittle
fables chicago indiana | http://www.indianahistory.org/pop hist/people/ade.html | George Ade
wisconsin illinois chicago architect designed | http://www.greatbuildings.com/architects/Frank Lloyd Wright.html | Frank Lloyd Wright

Figure 2: Excerpts from our de-anonymization experiments. Each row lists keywords extracted from the Wikipedia biography of an individual (categorized under California or Illinois), a hit returned by a Google query on those keywords that is one of the top three hits returned and contains the individual's name, and the name of the individual.

    5.3 Web-based Sensitive Topic Detection

Another application of Web-based inference detection is the redaction of medical records. As discussed earlier, it is common practice to redact all information about diseases such as HIV/AIDS, mental illness, and drug and alcohol abuse prior to releasing medical records to a third party (such as, e.g., a judge in malpractice litigation). Implementing such protections today relies on the thoroughness of the redaction practitioner to keep abreast of all the medications, physician names, diagnoses and symptoms that might be associated with such conditions and practices. Web-based inference detection can be used to improve the thoroughness of this task by automating the process of identifying the keywords allowing such conditions to be inferred.

To demonstrate how our algorithm can be used in this application, our experiments take as input a page that is viewed as authoritative about a certain disease. In our experiments, we used Wikipedia to supply pages for alcoholism and sexually transmitted diseases (STDs). The text is then extracted from the html, and keywords are identified. To identify keywords that might allow the associated disease to be inferred, we then issued Google queries on subsets of keywords and examined the top hit for references to the associated disease. In general, we counted as a reference any mention of the associated disease. The one exception to this rule is that we filtered out some medical term sites, since such sites list unrelated medical terms together (for indexing purposes) and we didn't want such lists to trigger inference results.

In the event that such a reference was found, we recorded those keywords as being potentially inference-enabling. In practice, a redaction practitioner might then use this output to decide what words to redact from medical records before they are released, in order to preserve the privacy of the patient.

To gain some confidence in our approach, we also used a collection of general medical terms as a control and followed the same algorithm. That is, we made Google queries using these medical terms and looked for references to a sensitive disease (STDs and alcoholism) in the returned links. The purpose of this process was to see if the results would differ from those obtained with keywords from the Wikipedia pages about STDs and alcoholism. We expected a distinct difference, because the Wikipedia pages should yield keywords more relevant to STDs and alcoholism, and indeed the results indicate that this is the case.

The following describes our experiment in more detail.

1. Input: An ordered set of sensitive words, K = {v_1, ..., v_b}, for some positive integer b, and a page, B. B is either the Wikipedia page for alcoholism [40], the Wikipedia page for sexually transmitted diseases (STDs) [41], or a control page of general medical terms.

(a) If B is a Wikipedia page, extract the top 30 keywords from B, forming the set S_B, through the following steps:

i. Extract text from html.

ii. Calculate the enhanced TF.IDF ranking of each word in the extracted text (section 5.1). Select the top 30 words as the ordered set, S_B = {W_1, W_2, ..., W_30}.

(b) If B is a medical terms page, extract the terms using code customized for that Web site and let S_B = {W_1, W_2, ..., W_30} be a subset of 30 terms from that list, where the selection method varies with each run of the experiment (see the results discussion below for the specifics).

(c) For each pair of words {W_i, W_j} ⊆ S_B, let Q_{i,j} be the query consisting of just those two words, with no additional punctuation and with the restrictions that no pages from the domain of the source page B be returned and that all returned pages be text or html (to avoid parsing difficulties). Let H_{i,j} denote the first hit returned after issuing query Q_{i,j} to Google, after known medical terms Web sites were removed from the Google results.

(d) For all i, j ∈ {1, ..., 30}, i ≠ j, and for ℓ ∈ {1, ..., b}, search for the string v_ℓ ∈ K in the first 5000 lines of H_{i,j}. If v_ℓ is found, record v_ℓ, W_i, W_j and H_{i,j} and discontinue the search.

2. Output: All triples (v_ℓ, Q_{i,j}, H_{i,j}) found in step 1, where v_ℓ is in the first 5000 lines of H_{i,j}.
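A compact sketch of this pairwise check is given below; it follows the same pattern as the de-anonymization sketch. The SearchTopHit and PageText interfaces, the Finding class and the early exit on the first sensitive word found are assumptions standing in for the Google SOAP calls and result filtering used in the experiment.

import java.util.*;

// Sketch of the pairwise sensitive-topic check of section 5.3: for every pair
// of extracted keywords, fetch the first remaining hit and scan it for any
// sensitive word.
interface SearchTopHit {
    // URL of the first hit for the query after filtered domains are removed,
    // or null if there is none.
    String firstHit(String query, Set<String> excludedDomains);
}

interface PageText {
    String firstLines(String url, int maxLines);
}

class SensitiveTopicCheck {
    static class Finding {
        final String keyword1, keyword2, sensitiveWord, url;
        Finding(String k1, String k2, String v, String url) {
            keyword1 = k1; keyword2 = k2; sensitiveWord = v; this.url = url;
        }
    }

    List<Finding> run(List<String> keywords, List<String> sensitiveWords,
                      Set<String> excludedDomains, SearchTopHit search, PageText pages) {
        List<Finding> findings = new ArrayList<>();
        for (int i = 0; i < keywords.size(); i++) {
            for (int j = i + 1; j < keywords.size(); j++) {
                String url = search.firstHit(keywords.get(i) + " " + keywords.get(j),
                                             excludedDomains);
                if (url == null) continue;
                String text = pages.firstLines(url, 5000).toLowerCase();
                for (String v : sensitiveWords) {   // stop at the first sensitive word found
                    if (text.contains(v.toLowerCase())) {
                        findings.add(new Finding(keywords.get(i), keywords.get(j), v, url));
                        break;
                    }
                }
            }
        }
        return findings;
    }
}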

RESULTS FOR STD EXPERIMENTS. We ran the above test on the Wikipedia page about STDs [41], B, and a selected set, B', of 30 keywords from the medical term index [29]. The set B' was selected by starting at the 49th entry in the medical term index and selecting every 400th word, in order to approximate a random selection of medical terms. As expected, keyword pairs from input B generated far more hits for STDs (306/435 > 70%) than keyword pairs from B' (108/435 < 25%). The results are summarized in figure 3.

RESULTS FOR ALCOHOLISM EXPERIMENTS. We ran the above test on the Wikipedia page about alcoholism [40], B, and a selected set, B', of 30 keywords from the medical term index [29]. For the run analyzed in Figure 4, the set B' was selected by starting at the 52nd entry in the medical term index and selecting every 100th word until 30 were accumulated, in order to approximate a random selection of medical terms. As expected, keyword pairs from input B generated far more hits for alcoholism (47.82%) than B' (9.43%). In addition, we manually reviewed the URLs that yielded a hit from K_Alc for a seemingly innocuous pair of keywords. These results are summarized in figure 4.

APPLYING THE RESULTS. When redacting medical records, a redaction practitioner might use the results in figures 3 and 4 to choose content to redact. For example, figure 4 indicates that the medications naltrexone and acamprosate should be removed, due to their popularity as alcoholism treatments. The words identified as STD-inference enabling are far more ambiguous (e.g. "transmit", "infected"). However, for some individuals the very fact that even general terms are frequently associated with sensitive diseases may be enough to justify redaction (e.g. a politician may desire the removal of any red-flag words). In general though, we think a redaction practitioner could defensibly not make it a practice to redact such general terms, given their association with other, less sensitive, diseases. This emphasizes that our techniques support semi-automation, but not full automation, of the redaction process.

PERFORMANCE. Amortizing the cost of text extraction from the Wikipedia source page over all the queries, determining whether each keyword pair yielded a top hit containing a sensitive word took approximately 150 seconds. Hence, each of the experiments in figures 3 and 4 took around 36 hours, since 435 pairs from the Wikipedia page were tested along with 435 pairs from the control set of keywords.

As in the de-anonymization experiments, our main time cost was due to the process of text extraction from html. For these experiments, caching is likely to significantly improve performance, as many of the medical resource sites were visited multiple times.

    6 Use Scenario: Iterative Redaction

As mentioned in sections 1 and 4, the process of sanitizing documents by removing obviously identifying information like names and social security numbers can be improved by using Web-based inference detection to identify pieces of seemingly innocuous information that can be used to make sensitive inferences. To illustrate this idea, we return to the poorly redacted FBI document in the left-hand side of figure 6. Algorithms like those presented in sections 3.2 and 5 can be used to identify sets of keywords that allow for undesired inferences. Some or all of those keywords can then be redacted to improve the sanitization process.

We emphasize that the strategy for redacting based upon the inferences detected by our algorithms is a research problem that is not addressed by this paper. Indeed, many strategies are possible. For example, one might redact the minimum set of words (in which case, the redactor seeks to find a minimum set cover for the collection of sets output by the inference detection algorithm). Alternatively, the redactor might be biased in favor of redacting certain parts of speech (e.g. nouns rather than verbs) to enhance readability of the redacted document.


Summary of STD Experiments

Input Web Page, B: Wikipedia STD site [41]

Extracted Keywords, S_B: transmit, sexually, transmitting, transmitted, infection, std, sti, hepatitis, infected, infections, transmission, stis, herpes, viruses, virus, chlamydia, stds, sexual, disease, hiv, membrane, genital, intercourse, diseases, pmid, hpv, mucous, viral, 2006

Input Web Page, B' (control page): Medical Terms Site [29]

Extracted Control Keywords, S_B': Ablation, Ah-Al, Aneurysm, thoracic, Arteria femoralis, Barosinusitis, Bone mineral density, Cancer, larynx, Chain-termination codon, Cockayne syndrome, Cranial nerve IX, Dengue, Disorder, cephalic, ECT, Errors of metabolism, inborn, Fear of nudity, Fracture, comminuted, Gland, thymus, Hecht-Beals syndrome, Hormone, thyroxine, Immunocompetent, Iris melanoma, Laparoscopic, Lung reduction surgery, Medication, clot-dissolving, Mohs surgery, Nasogastric tube, Normoxia, Osteosarcoma, PCR (polymerase chain reaction), Plan B

Sensitive Keywords, K_STD: STD, Chancroid, Chlamydia, Donovanosis, Gonorrhea, Lymphogranuloma venereum, Non-gonococcal urethritis, Syphilis, Cytomegalovirus, Hepatitis B, Herpes, HSV, Human Immunodeficiency Virus, HIV, Human papillomavirus, HPV, genital warts, Molluscum, Severe acute respiratory syndrome, SARS, Pubic lice, Scabies, crabs, Trichomoniasis, yeast infection, bacterial vaginosis, trichomonas, mites, nongonococcal urethritis, NGU, molluscum contagiosm virus, MCV, Herpes Simplex Virus, Acquired immunodeficiency syndrome, aids, pubic lice, HTLV, trichomonas, amebiasis, Bacterial Vaginosis, Campylobacter Fetus, Candidiasis, Condyloma Acuminata, Enteric Infections, Genital Mycoplasmas, Genital Warts, Giardiasis, Granuloma Inguinale, Pediculosis Pubis, Salmonella, Shingellosis, vaginitis

Percentage of words in S_B yielding a top hit containing word(s) in K_STD: 33.33%
Percentage of word pairs in S_B yielding a top hit containing word(s) in K_STD: 70.34%
Percentage of control words in S_B' yielding a top hit containing word(s) in K_STD: 3.33%
Percentage of control word pairs in S_B' yielding a top hit containing word(s) in K_STD: 24.83%

Example keyword pairs from S_B returning a top hit containing a word in K_STD:

Keywords | URL of Top Hit | Sensitive Word in Top Hit
transmit, infected | http://www.rci.rutgers.edu/ insects/aids.htm | HIV
transmit, mucous | http://research.uiowa.edu/animal/?get=empheal | Herpes
transmitting, viruses | http://www.cdc.gov/hiv/resources/factsheets/transmission.htm | Hepatitis B
transmitted, viral | http://www.eurosurveillance.org/em/v10n02/1002-226.asp | Hepatitis B
transmitted, infection | http://www.plannedparenthood.org/sti/ | STD
transmitted, disease | http://www.epigee.org/guide/stds.html | STD
infection, mucous | http://www.niaid.nih.gov/factsheets/sinusitis.htm | HIV
infected, disease | http://www.ama-assn.org/ama/pub/category/1797.html | HIV
infected, viral | http://www.merck.com/mmhe/sec17/ch198/ch198a.html | Cytomegalovirus
infections, viral | http://www.nlm.nih.gov/medlineplus/viralinfections.html | Cytomegalovirus
virus, disease | http://www.mic.ki.se/Diseases/C02.html | Cytomegalovirus

Figure 3: Summary of experiments to identify keywords enabling STD inferences.


Summary of Alcoholism Experiments

Input Web Page, B: Wikipedia Alcoholism site [40]

Extracted Keywords, S_B: alcoholism, alcohol, drunk, alcoholic, alcoholics, naltrexone, drink, addiction, dependence, detoxification, diagnosed, screening, drinks, moderation, abstinence, 2006, disorder, drinking, behavior, questionnaire, cage, treatment, citation, acamprosate, because, pharmacological, anonymous, extinction, sobriety, dsm

Input Web Page, B' (control page): Medical Terms Site [29]

Extracted Control Keywords, S_B': ABO blood group, Alarm clock headache, Ankle-foot orthosis, Ascending aorta, Benign lymphoreticulosis, Breast bone, Carotid-artery stenosis, Chondromalacia patellae, Congenital, Cystic periventricular leukomalacia, Discharge, DX, Enterococcus, Familial Parkinson disease type 5, Fondation Jean Dausset-CEPH, Giant cell pneumonia, Heart attack, Hormone, parathyroid, Impetigo, Itching, Laughing gas, M. intercellulare, Membranous nephropathy, MRSA, Nerve palsy, laryngeal, Oligodendrocyte, Pap Smear, Phagocytosis, Postoperative, Purpura, Henoch-Schonlein

Sensitive Keywords, K_Alc: alcoholism, alcoholic(s), alcohol

Percentage of words in S_B yielding a top hit containing word(s) in K_Alc: 23.33%
Percentage of word pairs in S_B yielding a top hit containing word(s) in K_Alc: 47.82%
Percentage of control words in S_B' yielding a top hit containing word(s) in K_Alc: 0.00%
Percentage of control word pairs in S_B' yielding a top hit containing word(s) in K_Alc: 9.43%

Example word sets from S_B returning a top hit containing a word in K_Alc:

Keywords | URL of Top Hit
naltrexone | http://www.nlm.nih.gov/medlineplus/druginfo/medmaster/a685041.html
acamprosate | http://www.nlm.nih.gov/medlineplus/druginfo/medmaster/a604028.html
dsm, detoxification | http://www.aafp.org/afp/20050201/495.html
dsm, detoxification, dependence | http://www.aafp.org/afp/20050201/495.html

Figure 4: Summary of experiments to identify keywords enabling alcoholism inferences.

Redacted Word(s) | Example Link | Sensitivity of Word(s)
50, 52, 54 | http://multimedia.belointeractive.com/attack/binladen/1004blfamily.html | Having 50 or more siblings is very characteristic of Osama Bin Laden.
Boston | http://www.time.com/time/magazine/article/0,9171,1000943,00.html?promoid=googlep | Many of Osama's relatives reside in Boston.
magnate | http://www.outpostoffreedom.com/bin ladin.htm | Osama's father was a building magnate.
denounced, denunciation | http://www.cairnet.org/html/911statements.html | A number of groups (including Bin Laden's family) have denounced his actions.
condemnation | http://www.usnews.com/usnews/politics/whispers/archive/september2001.htm | A number of groups (including Bin Laden's family) have condemned his actions.

Figure 5: Words redacted as a result of Web-based inference detection. Column 1 is the word or words, column 2 is a link using those words output by the algorithm, and column 3 explains why the word(s) are sensitive.


of keywords when looking for inferences. In contrast, if readability is an important concern, then the considered sets might be those favoring certain word types.

What we discuss here is one example of using Web-based inference detection to improve the redaction process. The approach we take is influenced by readability and performance (i.e., the speed of the redaction process) but is by no means an optimal approach with respect to either concern. We began by applying some simple redaction rules to the document [8]. Specifically, we removed all location references, since our example in section 1 indicated those were important to identifying the biography subject; any dates near September 11, 2001, which is clearly a memorable date; and finally, all citation titles, since when paired with the associated publication these enable the cited articles to be easily retrieved. The resulting redacted document is depicted in figure 6, where grey rectangles indicate the redactions resulting from the rules just described.
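As an illustration of such rules (not the implementation we used, which is not specified in code in this paper), the following Python sketch blanks out location references and 2001 dates using a generic named-entity recognizer; the choice of spaCy here is purely an assumption, and citation titles are left to a separate, bibliography-aware pass.

    import re
    import spacy  # assumed available; any NER library could play this role

    nlp = spacy.load("en_core_web_sm")

    def rule_based_redact(text):
        # Blank out location references and dates mentioning 2001 (a crude
        # proxy for "near September 11, 2001"). Citation titles are assumed
        # to be handled by a separate pass that knows the bibliography format.
        redacted = text
        for ent in reversed(nlp(text).ents):  # reversed keeps earlier offsets valid
            if ent.label_ in ("GPE", "LOC") or \
               (ent.label_ == "DATE" and re.search(r"\b2001\b", ent.text)):
                redacted = redacted[:ent.start_char] + "[REDACTED]" + redacted[ent.end_char:]
        return redacted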

Our subsequent redaction proceeded iteratively. At each stage, we extracted the text from the current document, calculated the keywords ordered by the TF.IDF metric, and searched for inferences drawn from subsets of a specified number of the top keywords. We then evaluated the output of the algorithm by checking that the produced links did indeed reflect identifying inferences. If a link did not use all the queried keywords in a discussion about Osama Bin Laden, then it was deemed invalid. A common source of invalid links was news article titles printed in the sidebar of the linked page that did not make use of the keywords found in the main body. For example, the query "condone citing prestigious" yields the top hit [6] (a humor site) because a sidebar links to an article with Osama in the title; however, none of the keywords are used in the description of that article (see note 14).
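The check for invalid links of this kind could be partially automated along the following lines; this is only a sketch of a pre-filter for the manual review described next, and fetch_main_text is a hypothetical helper that would need to strip sidebars and navigation from the returned page.

    def link_is_plausibly_valid(keywords, url, fetch_main_text):
        # Keep a hit only if every queried keyword appears in the main body
        # text of the page, which screens out hits triggered solely by
        # sidebar article titles.
        body = fetch_main_text(url)  # assumed helper: lower-cased body text, sidebars removed
        if body is None:
            return False
        return all(kw.lower() in body for kw in keywords)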

We incorporated manual review of the links because the current form of our algorithms involves too little content analysis to provide confidence that a returned link reflects a strong connection between the associated keywords and Osama Bin Laden. In addition, given the high-security nature of most redaction settings, it is unlikely that a purely automated process will ever be accepted.

For those inferences that were found valid, we made redactions to prevent them and repeated this process on the newly redacted document. The following makes the steps we followed precise; an illustrative sketch of the loop appears after the list.

1. Dates near September 11, 2001, titles of all citations, and location names were removed from the biography [8].

2. For i = 2, ..., 5:

(a) We executed Google queries for each i-tuple in the top n_i keywords in the biography. The n_i values were chosen based on performance constraints, as described in section 5 (see note 15). The (i, n_i) values were: (2, 50), (3, 20), (4, 15) and (5, 13). We concluded with 5-tuples because no valid inferences were found for that run of the algorithm, and only 7% of the links returned by the algorithm run for (i, n_i) = (4, 15) were valid. For each (i, n_i) execution of the algorithm we received a list of sets of keywords that were potentially inference-enabling, and the associated top link leading the algorithm to make this conclusion.

(b) We reviewed the returned links to see if all the corresponding keywords were used in a discussion of Osama Bin Laden. If so, we made a judgement as to which keyword or keywords to remove in order to prevent the inference while preserving readability of the document.

(c) We incremented i and returned to step (a) with the current form of the redacted document.
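The following Python sketch summarizes the loop in step 2. It is illustrative rather than a faithful reproduction of our tool: top_keywords_by_tfidf, web_search_top_hit and manual_review stand in for the TF.IDF ranking, the search interface and the human reviewer, and the filtering our tool applies before presenting candidate links is elided.

    from itertools import combinations

    # (i, n_i) pairs used in our runs; see note 15 for the performance rationale.
    TUPLE_SIZES = [(2, 50), (3, 20), (4, 15), (5, 13)]

    def iterative_redaction(document, top_keywords_by_tfidf,
                            web_search_top_hit, manual_review):
        # top_keywords_by_tfidf(doc, n) -> n highest-ranked keywords (assumed helper).
        # web_search_top_hit(query)     -> URL of the top hit, or None (assumed helper).
        # manual_review(keywords, url, doc) -> doc with any redactions the human
        #                                      reviewer chooses to make (assumed helper).
        for i, n_i in TUPLE_SIZES:
            keywords = top_keywords_by_tfidf(document, n_i)       # step 2(a)
            for tup in combinations(keywords, i):
                url = web_search_top_hit(" ".join(tup))
                if url is not None:
                    document = manual_review(tup, url, document)  # step 2(b)
            # step 2(c): the next round re-extracts keywords from the updated document
        return document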

Figure 5 lists the words that were redacted as a result of our Web-based inference detection algorithm. The table also gives an example link output by the algorithm that motivates the redaction and a brief explanation of why the word is sensitive (gained from the manual review of the link(s)). Note that while our algorithm found some document features to be identifying that are unlikely to have been covered by a generic redaction rule (e.g., Osama Bin Laden's father's attribute of being a building magnate), it left other, seemingly unusual, attributes (such as Osama Bin Laden potentially being one of 20 children). Since the Web is at best a proxy for human knowledge, and our algorithm used the Web in a limited way (i.e., our analysis was limited to a few hits with little NLP use), it seems likely that inferences were missed. Hence, we emphasize that our tool is best used to semi-automate the redaction process.

Finally, we note that the act of redacting information may introduce, as well as remove, privacy problems. For example, as noted by Vern Paxson [39], redacting "Boston" without redacting "Globe" may allow the sensitive term "Boston" to be inferred. Our tool suggests "Boston" for redaction, as opposed to "Boston Globe", because a number of Osama Bin Laden's relatives reside there; however, acting on this recommendation is problematic precisely because of the difference between the nature of the inference and the document's usage of the term. An improved algorithm would understand the use of the term within the document and use this to guide the redaction process.

Our final redacted document is shown in the right-hand side of figure 6.


[Figure 6 image panel: "Output from using the Web-based inference detection algorithm"]

Figure 6: The left picture shows the original FBI-redacted biography. The right-hand side shows the document resulting from using the Web-based inference detection algorithm, where black rectangles represent redactions recommended by the algorithm and grey rectangles are redactions coming from removing dates in 2001, locations, and the titles of cited articles (i.e., the grey and black rectangles are redactions made by the authors of this paper).


    7 Conclusion

We have introduced the notion of using the Web to detect undesired inferences. Our proof-of-concept experiments demonstrate the power of the Web for finding the keywords that are likely to identify a person or topic.

As is to be expected with an initial work, there remains a lot of room for improvement in the algorithms. In particular, to produce an inference detection tool capable of functioning in real time, as is needed in some applications, improvements already discussed, such as Web caching, additional filtering of results to improve precision, and deeper hit analysis to improve recall, are needed. Another avenue for improvement is deeper content analysis (i.e., beyond keyword extraction). For example, employing a tool capable of deeper semantic analysis, such as [15], may allow for both more meaningful extraction of words and phrases for generating queries and improved analysis of the returned hits for more accurate inference detection. In addition, simple improvements to the content analysis, such as better filtering of stop words and HTML syntax, would create more useful keyword lists.

    Acknowledgement

The authors are very grateful to Richard Chow and Vern Paxson for their help in revising earlier versions of this paper.

    References

[1] B. Aleman-Meza, M. Nagarajan, C. Ramakrishnan, L. Ding, P. Kolari, A. Sheth, B. Arpinar, A. Joshi and T. Finin. Semantic analytics on social networks: experiences in addressing the problem of conflict of interest detection. 15th International World Wide Web Conference, 2006.

[2] M. Atallah, C. McDonough, S. Nirenburg, and V. Raskin. Natural Language Processing for Information Assurance. Proc. 9th ACM/SIGSAC New Security Paradigms Workshop (NSPW 00), pp. 51-65, 2000.

    [3] Apache Lucene. http://lucene.apache.org/java/docs/

[4] AOL Keyword Searches. http://dontdelete.com/default.asp

[5] M. Barbaro and T. Zeller. A face is exposed for AOL searcher no. 4417749. The New York Times, August 9, 2006.

    [6] http://www.bongonews.com/layout1.php?event=2315

    [7] W. Broad. U. S. Web Archive is Said to Reveal a Nuclear Primer.The New York Times , November 3, 2006.

    [8] http://www.judicialwatch.org/archive/2005/osama.pdf

[9] Executive Order 12958, Classified National Security Information. http://www.dss.mil/seclib/eo12958.htm

[10] B. Davison, D. Deschenes and D. Lewanda. Finding relevant website queries. Twelfth International World Wide Web Conference, 2003.

[11] O. de Vel, A. Anderson, M. Corney and G. Mohay. Mining email content for author identification forensics. SIGMOD Record, Vol. 30, No. 4, December 2001.

    [12] Mike Dowman, Valentin Tablan, Hamish Cunningham andBorislav Popov. Web-Assisted Annotation, Semantic Indexingand Search of Television and Radio News. WWW , 2005.

    [13] Factiva Insight: Reputation Intelligence. http://www.factiva.com

    [14] Fetch Technologies. http://www.fetch.com

    [15] GATE: General Architecture for Text Engineering.http://gate.ac.uk/projects.html

    [16] N. Glance. Community Search Assistant. IUI , 2001.

[17] P. Golle. Revisiting the Uniqueness of Simple Demographics in the US Population. Workshop on Privacy in the Electronic Society, 2006.

    [18] Google SOAP search API. http://code.google.com/apis/soapsearch/

[19] J. Hale and S. Shenoi. Catalytic inference analysis: detecting inference threats due to knowledge discovery. IEEE Symposium on Security and Privacy, 1997.

[20] S. Hill and F. Provost. The myth of the double-blind review? Author identification using only citations. SIGKDD Explorations, 2003.

    [21] T. Hinke. Database inference engine design approach. DatabaseSecurity II: Status and Prospects , 1990.

[22] D. Jones. Google's PowerPoint blunder was preventable. IR Web Report. http://www.irwebreport.com/perspectives/2006/mar/google blunder.htm

[23] E. Kin, Y. Matsuo, M. Ishizuka. Extracting a social network among entities by web mining. ISWC 06 Workshop on Web Content Mining with Human Language Technologies, 2006.

[24] M. Koppel and J. Schler. Authorship verification as a one-class classification problem. Proceedings of the 21st International Conference on Machine Learning, 2004.

[25] M. Koppel, J. Schler, S. Argamon and E. Messeri. Authorship attribution with thousands of candidate authors. SIGIR 06.

[26] M. Lapata and F. Keller. The Web as a Baseline: Evaluating the Performance of Unsupervised Web-based Models for a Range of NLP Tasks. HLT-NAACL, 2004.

[27] G. Leech, P. Rayson and A. Wilson. Word frequencies in written and spoken English: based on the British National Corpus. Longman, London, 2001.

[28] C. Manning and H. Schutze. Foundations of statistical natural language processing. MIT Press, 1999.

    [29] MedicineNet.com. http://www.medterms.com/script/main/hp.asp

  • 8/14/2019 Web-Based Inference Detection

    16/16

[30] P. Nakov and M. Hearst. Using the Web as an Implicit Training Set: Application to Structural Ambiguity Resolution. HLT-NAACL, 2005.

    [31] Nstein Technologies. http://www.nstein.com/pim.asp

[32] G. Pant, S. Bradshaw and F. Menczer. Search engine-crawler symbiosis: adapting to community interests. 7th European Conference on Digital Libraries, 2003.

[33] X. Qian, M. Stickel, P. Karp, T. Lunt and T. Garvey. Detection and elimination of inference channels in multilevel relational database systems. IEEE Symposium on Security and Privacy, 1993.

[34] M. Steyvers, P. Smyth, M. Rosen-Zvi and T. Griffiths. Probabilistic author-topic models for information discovery. KDD 04.

[35] L. Sweeney. AI Technologies to Defeat Identity Theft Vulnerabilities. AAAI Spring Symposium on AI Technologies for Homeland Security, 2005.

[36] L. Sweeney. Uniqueness of Simple Demographics in the U.S. Population. LIDAP-WP4. Carnegie Mellon University, Laboratory for International Data Privacy, Pittsburgh, PA, 2000.

[37] P. Turney. Coherent Keyphrase Extraction via Web Mining. IJCAI, 2002.

[38] Unified Medical Language System. http://www.umm.edu/glossary/a/index.html

[39] V. Paxson. Personal communication.

    [40] Wikipedia. Alcoholism. http://en.wikipedia.org/wiki/Alcoholism

[41] Wikipedia. Sexually transmitted disease. http://en.wikipedia.org/wiki/Sexually_transmitted_disease

    [42] http://wordweb.info/free/

[43] R. Yip and K. Levitt. Data level inference detection in database systems. IEEE Eleventh Computer Security Foundations Workshop, 1998.

    [44] D. Zhao and T. Sapp. AOL Search Database. http://www.aolsearchdatabase.com/

    [45] http://www.zoominfo.com/

    Notes

    1 http://www.popandpolitics.com/2005/09/06/and-lite-jazz-singers-shall-lead-the-way/, www.popandpolitics.com/2006/10/06/our-paris/

2 http://en.wikipedia.org/wiki/Madonna_and_the_gay_community, http://gaybookreviews.info/review/2807/615, http://www.youtube.com/results?search_type=related&search_query=madonna%20oh%20father

3 Example results from our experiments appear in section 5. Because of the dynamic nature of the Web, issuing the same queries today may yield somewhat different results.

4 The AOL data can potentially be used to demonstrate the Web's ability to de-anonymize ([5] may be one such example), which is one of the goals of our algorithms; however, because our target application is the protection of English language content, we opted not to vet our algorithms with that data.

5 The vast majority of the biographies we used identified their subject by both a first and last name with no middle name or initial. Also, name suffixes (e.g., Jr.) or annotations made by Wikipedia authors regarding profession were ignored.

6 This was done to avoid difficulties parsing non-ASCII pages.

7 These are the first three links that appear on the results page, whether or not one URL is a substring of another.

8 Here "known site" means any site with "medterm" or "medword" in the URL. As this is certainly not sufficient to remove all medical terms sites, we manually reviewed the results before generating the example keyword pairs in Figure 3.

9 Note this extracted non-word indicates a flaw in our text-from-HTML extraction algorithm.

10 In a manual review of the word pairs from W_B yielding a top hit containing word(s) in K_STD, we did not find any hits using the word pair in a meaningful way in relation to a sensitive word. Rather, the hits generally turned out to be medical term lists.

11 Since all of our sensitive words pertain to the same topic, alcoholism, we did not record which particular sensitive word was contained in the top hit (if any).

12 Note this is the 4th returned hit, indicating a change in our search strategy would improve recall.

13 The biography only mentions Boston in a citation, so this is a conservative redaction choice.

14 Alternative metrics for validity are of course possible. For example, a more thorough algorithm might look for a shared topic (e.g., the events of September 11, 2001) amongst links, and retain any links pertaining to the most popular topic as valid.

15 We tended to experience problems communicating with Google when executing algorithm runs that exceeded 1500 queries; hence we chose values of {n_i}_i that yielded query counts in the range of 1000-1500.
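As a quick check of this range: the number of queries issued for each (i, n_i) pair in step 2(a) is the binomial coefficient C(n_i, i), giving C(50, 2) = 1225, C(20, 3) = 1140, C(15, 4) = 1365 and C(13, 5) = 1287, all of which indeed fall between 1000 and 1500.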