8

Click here to load reader

Usable, Secure, Private Search

Embed Size (px)

Citation preview

Page 1: Usable, Secure, Private Search

1540-7993/12/$31.00 © 2012 IEEE Copublished by the IEEE Computer and Reliability Societies September/October 2012 53

Database search

A secure anonymous database search system can provide exact keyword match capability. Using reroutable encryption, this system lets multiple parties efficiently execute exact-match queries over distributed encrypted databases in a controlled manner.

T he ability to securely share sensitive information between untrusting parties is a prerequisite for

many real-world applications. For example, consider a hypothetical criminal investigation database. For obvi-ous reasons, access to details pertaining to on going investigations should be strictly controlled. On the other hand, it might be appropriate to let investigators query the database for other investigations related to their own. To do this, we would need to anonymize the queries’ identifiers and content to ensure that no investigation’s details are compromised. Moreover, the database must truthfully execute all queries, return-ing any and all relevant information on the basis of the given query without revealing information about unrelated investigations.

We describe a secure anonymous database search (SADS) system that provides exact keyword match capability. By using a new primitive, reroutable encryp-tion, along with Bloom filters1 and deterministic encryption,2 SADS lets multiple parties efficiently exe-cute exact-match queries over distributed encrypted databases in a controlled manner. Furthermore, it addresses the more complicated problem of allowing document similarity searches. Although there is a pool of work considering similarity in terms of error toler-ance and Hamming distance,3,4 we capture semantic-level similarity in our definition as well as show how to apply this work in an efficient encrypted search sce-nario—something that hasn’t been done before.

Extending the Secure Anonymous Database SearchExisting systems for encrypted search provide privacy guarantees, but at a provably high cost in efficiency. As such, these systems scale poorly and aren’t suitable for real-world applications with very large databases. Although traditional SADS provides one desirable set of security and efficiency guarantees, it lacks flexibility and semantic awareness of the corpora over which it oper-ates. Our SADS system uses third parties and relaxed definitions of security to circumvent these inherent effi-ciency costs. We extend SADS in two ways: we expand its search capabilities beyond exact keyword match and provide a modular framework for adapting the system to meet varying security and efficiency needs.

The exact-match constraint that traditional SADS imposes doesn’t allow for search capabilities such as gram-matical case and number. To overcome this limitation, we applied the notion of context-specific, semantically aware feature extraction to encrypted search scenarios. Using this abstraction mechanism to pre process both input data and queries, we demonstrated ways to significantly augment the search capabilities, which could apply to all Bloom filter–based querying systems.

Using SADS as the foundational building block, we developed a system capable of creating flexible query systems that deliver strong cryptographic and privacy-preserving guarantees. We demonstrate how such a framework can be parameterized to adapt to a spectrum

Usable, Secure, Private Search

Mariana Raykova | IBM ResearchAng Cui, Binh Vo, Bin Liu, Tal Malkin, Steven M. Bellovin, and Salvatore J. Stolfo | Columbia University

Page 2: Usable, Secure, Private Search

54 IEEE Security & Privacy September/October 2012

Database search

of security and usability requirements. The result is a modular system with a well-defined set of compile-time switches that lets the system designer select the appro-priate combination of matching behavior strictness, flexibility, and computational overhead for a specific document corpus, all without changing the system’s fundamental organization.

Flexible Queries in an Exact-Match WorldCreating accurate, user-friendly query systems that exhibit strong cryptographic and privacy guarantees presents several challenges when we consider the gamut of documents systems encounter. Furthermore, the con-text in which such systems are used significantly affects the desired behavior of the query system, possibly alter-ing the definition of terms such as “false positive” and “false negative.” In short, we need a framework that is

■ flexible, to achieve reasonable usability, ■ parameterizable, to adapt to diverse environments and

privacy requirements, and ■ multilingual, to incorporate documents of different

languages, media, and formats,

while simultaneously providing strong cryptographic and privacy guarantees.

Search over Simple WordsConsider this example dataset of a few words and names: Cat, dog, Bird, Lion, frog, Ang, and Mariana. Nat-ural language conventions can complicate even a simple query system like this. A direct application of secure keyword search requires an exact lexicographical match to yield a positive search result. Consequently, slight variations in capitalization and punctuation between the query and queried terms will yield negative search results. For example, queries such as “birds” or “Mari-ana’s” won’t yield matches unless the system is aug-mented to account for such variations.

Depending on the context in which this query sys-tem is deployed, such restrictions might be desirable. However, awareness of language conventions and a controllable degree of matching flexibility can greatly enhance usability.

To achieve this, we can preprocess the input during insertion and query operations. As we discuss later, the careful construction of preprocessing a feature extractor (FExt)—an algorithm that classifies an input document according to a similarity metric—plays a pivotal role in defining and controlling the flexibility level a secure query system will tolerate. Feature extractors are context sensitive and should be tailored to the specific document corpora and the desired matching flexibility level under which that query system is expected to operate.

Consider the following sets of unique tokens, here-after referred to as 1-grams:

■ FExt A: cat, dog, bird, lion, frog, ang, mariana; ■ FExt B: cat, dog, bird, lion, frog, ang, mariana, Cat, Dog,

Bird, Lion, Frog, Ang, Mariana; ■ FExt C: cat, caT, cAt, cAT, Cat, CaT, CAt, CAT, ... ,

MARIANA; ■ FExt D: cat, dog, bird, lion, frog, Ang, Mariana; and ■ FExt E: cat, dog, bird, lion, frog, cats, dogs, birds, lions,

frogs, Ang, Mariana.

Depending on the encrypted database’s sensitivity and the desired degree of matching flexibility for a given context, one of these feature extraction methods will be more appropriate than the others.

Without diving deeply into a discussion on feature extraction methods and trade-offs, it suffices to say that FExt E is a fairly good choice because it lets users query the database without worrying about capitalization or grammatical number and still distinguishes between common and proper nouns.

Searching over English SentencesConsider the following set of simple sentences and their corresponding 1-grams:

■ “War is peace”—War, is, peace; ■ “Freedom is slavery”—Freedom, is, slavery; and ■ “Ignorance is strength”—Ignorance, is, strength.

In this case, the following two sets of 1-gram queries result from a simple, direct keyword search:

■ set A—war, freedom, ignorance and ■ set B—is.

Only queries containing exact lexicographical matches against a previously inserted 1-gram will yield a positive search result. Consequently, all three queries in set A will match nothing, whereas the query in set B will match all three sentences.

In this case, preprocessing using standard natural-language-processing techniques, such as stemming and tagging parts of speech, can help normalize input terms’ capitalization and grammatical number. It can also potentially remove common terms such as auxiliary verbs, prepositions, and conjunctions.

Now, consider these slightly more complicated queries:

■ “peace is war,” ■ “Ignorance has strengths,” and ■ “War’s peace.”

Page 3: Usable, Secure, Private Search

www.computer.org/security 55

Because our keyword query system operates on 1-grams, we can accommodate the above queries in one of several ways. The naive approach is to treat each query string as a single 1-gram. However, doing so will yield no matches because they are multiword queries and the stored 1-grams are single words. Alternatively, we can break each query string into corresponding subqueries containing individual 1-grams, then apply a Boolean oper-ation on the subsequent results. However, this approach doesn’t let us express any conditionality on the ordering of the queried terms. A better approach is to construct sub-queries containing phrases of the original query.

For example, from the search string “peace is war,” we can construct the following subqueries:

■ “peace is,” ■ “is war,” and ■ “peace is war.”

Doing so lets us carry out order-preserving partial matches using a single query string. We can then use the binary results from these subqueries to calculate an overall match score. The desired matching behavior will directly affect how this score is calculated and interpreted.

This flexibility comes at a price. To match the sub-queries, all possible subphrases of the input corpora must be inserted into the encrypted database. However, because the search time of a query over a single docu-ment is independent of the number of terms generated for that document, our SADS system can easily accom-modate a large number of document terms without per-formance penalties.

Secure SearchTo achieve the desired functionality in a mutually anon-ymous query system, we must provide the following security guarantees:

■ query privacy—the server doesn’t learn information about the query;

■ client anonymity—the server doesn’t learn the query-ing clients’ identity;

■ database privacy—clients learn only results matching the query; and

■ client authorization—the server can control which cli-ents can search the database.

The database privacy requirement assumes that there’s a way to unequivocally determine the match-ing database entries for a given query. If we consider a query at a syntactic level, such as in a keyword search, its matching context is well defined. However, if we want to interpret a query semantically, the meaning of “match-ing” becomes fairly complicated. Overly restrictive

interpretations of matching can result in a query not matching semantically equivalent information, result-ing in a semantic false negative. Similarly, overly relaxed matching can cause unrelated documents to be returned in response to a broad or meaningless query (for exam-ple, words such as “and,” “the,” and so forth), resulting in a semantic false positive. When engineering real-world, secure, private search systems that aim to identify rel-evant content, we need to consider both syntactic and semantic elements of query interpretation.

The Basis of Our SystemWe based our search system on the SADS protocol, which offers a model to achieve balance among the con-flicting requirements of security, anonymity, and practi-cal efficiency.5 In particular, it obtains query efficiency that scales sublinearly with the number of search entries in a document and number of possible search terms. This rules out encrypted search protocols that provide query and database privacy but don’t fulfill anonymity and authorization requirements.

This SADS model requires relaxing strict cryptog-raphy security definitions to meet the minimal security requirements. It also distributes a limited amount of trust to two intermediary parties that mediate the shar-ing process between the client and database owner. This enhances privacy and anonymity for both parties. The two additional parties, the index server (IS) and query router (QR), are neutral parties available to regulate the search process that we assume will act semihonestly in protocol execution.

The Index Server and Query Router RolesThe server computes search structures from its encrypted data, which can be used to outsource the search functionality to the IS without revealing the actual document content. Although the IS can compare the return results for different queries, guarantees for the client’s anonymity will prevent it from associating queries with a particular client.

The QR’s role is to protect the client’s identity as well as to enforce the authorization check on the server’s behalf. The QR receives the queries from all clients and anonymizes identities before submitting them to the IS. It correspondingly returns the results to the client. However, the QR isn’t trusted to learn either the queries or the results.

Building ProtocolsWe used the following building-block protocols to con-struct our SADS system.5

Reroutable encryption. We needed an encryption sys-tem that would allow encryption transformation under

Page 4: Usable, Secure, Private Search

56 IEEE Security & Privacy September/October 2012

Database search

given corresponding keys without leaking information about the encrypted messages. In reroutable encryp-tion, one party is responsible for routing messages between senders and receivers. Although the routing party is trusted to match senders and receivers, it’s not trusted to learn the routed messages. For our system, we augmented this construction by letting the routing party forward only partial information extracted from the transformed ciphertext.

PH-DSAEP+. Whereas the reroutable encryption sys-tem provides a framework to realize the QR functions, we needed an additional property that would facilitate efficient search in the IS. Because the standard cryptog-raphy definitions of security require an encryption sys-tem to be probabilistic, which makes sublinear search complexity impossible,6 we needed to use determinis-tic encryption. This trade-off of security for efficiency follows the idea introduced by Mihir Bellare and col-leagues, who defined deterministic encryption in the public-key setting and showed how to convert a stan-dard public-key encryption to a deterministic one.2 We followed the same approach, adapting it to the secret-key setting. We constructed the private-key encryption system PH-SAEP+ as a combination of the Pohlig- Hellman function7 and the SAEP+ (Simple-OAEP) padding construction introduced in “Simplified OAEP for the RSA and Rabin Functions,”8 and PH-DSAEP+ with the removal of the nondeterministic component.

Bloom filters. The private-key deterministic encryption system PH-DSAEP+ provides search capability over ciphertexts, achieving sublinear complexity. To utilize this capability, we must construct searchable tags from deterministic encryptions. For this, we used Bloom fil-ters, which extend the idea of hashing using multiple hash functions, improving collision probabilities.1 This facilitates efficient search, which requires constant time per Bloom filter (despite the number of terms added), guarantees that there won’t be false negatives, and allows a tunable rate of false positives. For our purposes, we computed a Bloom filter for each document in the database. Each Bloom filter contains all keywords from a single document, and on those keywords we use PH-DSAEP+ to generate hash function indexes.

SADS AlgorithmsWe used the building-block protocols to instantiate a protocol for SADS. Given a server, a client, a QR, and an IS, we defined our SADS system with several algo-rithms (see Figure 1).

Key generation. Both the server and the IS choose an encryption key. The client generates a key for query

submission and a key for retrieving results. To autho-rize the client to search the server, the QR and the cli-ent run a key-exchange protocol to let the QR obtain a ratio key between the server’s encryption key and the client’s query submission key. Also, the IS, client, and QR run a key-exchange protocol so that the QR obtains a ratio key between the IS’s encryption key and the cli-ent’s return result key.

Preprocessing. The server generates a Bloom filter for each of its documents by encrypting all words in the document using PH-DSAEP+ under its own key. The server sends the resulting Bloom filters to the IS.

Query submission. The client encrypts its query with PH-DSAEP+ with its key and sends it to the QR. The QR reencrypts the ciphertext with its transformation key for the client, extracts Bloom filter indexes from the new encryptions, and sends them to the IS.

Search. The IS uses the obtained indexes to execute the Bloom filter search to get the result.

Query return. The query result is returned with a different instantiation of the reroutable encryption protocol. The IS encrypts the result with PH-SAEP+ and sends it to the QR. The QR reencrypts the ciphertext with the return result transformation key and sends it to the client.

A General Flexible, Private, Secure Search FrameworkSimple keyword matching alone isn’t always enough to satisfy real-world queries over complex, structured documents. To overcome the constraints that SADS imposes, we informally preprocessed both the docu-ment corpus and all incoming queries to achieve greater semantic awareness and improved search flexibility while operating under the exact-match constraint. Our previous examples focused on natural language docu-ments. However, the general framework we presented goes beyond textual corpora; the feature extraction pro-cess can be applied to any document format. Recall that a feature extractor is an algorithm FExt that takes an input document D and returns a set of tokens f1, f2, ..., fn, which identify the class to which D belongs according to the similarity metric implemented by the extractor.

We can use the feature tokens produced by various extractors for search or combine them with any sys-tem that utilizes a search structure defined as follows: a search structure B is any object for which there exist efficient algorithms to create an empty search structure B (Init(B)), to insert a token t in the search structure B (Insert(B,t)), and to query whether t is present in B (Query(B,t)).

Page 5: Usable, Secure, Private Search

www.computer.org/security 57

Figure 1. A secure anonymous database search query submission and the results. The server stores an encrypted document index on the IS. The client can then send encrypted and anonymized queries to the IS through the QR.

Client

Server

Query router(QR)

BF_Search (i1, ..., i

k) =

{ri, ... r

n = res_v}

Index server(IS)

res' = PH–DSAEP+ (res_v, IS key)Document Bloom

filters underserver’s key

BF_indices(c')= {i

1, ... i

k}

res'' = PH-DSAEP+(res', transform key for client)

c' = PH-DSAEP+ (c,transform key for server)

c' = PH-DSAEP+ (query, client’s key)

The exact efficiency for the insertion and query will depend on the specific algorithms they implement. We can now define a document search system that instanti-ates these structures with different document features. A document search system SFExt = (Insert_Docu-ment, Query_Token) is defined by the following two algorithms:

■ Insert_Document(Di) creates a search structure that contains the features extracted from the docu-ment as entries, and

■ Query_Token(t) returns all documents that have features matching search token t.

This system lets us search documents with respect to a particular feature, given by the underlying extrac-tor. However, the similarity measure conveyed by just one query feature or one particular feature extractor might not be sufficient to indicate a meaningful match. For this purpose, we sought the ability to combine the search results over several features of the query string, as well as over several different feature extractors.

We created two frameworks that allow tunable defi-nitions for the intended document matches for a query. First, we translated the query string into multiple fea-tures and used the search results for all of them to compute a similarity matching coefficient for each doc-ument with respect to the query.

Let SFExt be a document search system. A weighted search computes matching coefficients for each document in the database on the basis of their sim-ilarity to a string Q. We define an algorithm Weighted_ QuerySFExt (Q) on a set of records R1 ... n as follows:

1. Extract the searchable features of the query F = FExt(Q).

2. Search for each feature fi ∈ F Ri = Query_Token(fi).3. For each document Dj, compute a vector Pj of the

features for which it was a match.4. Assign to each document Dj a matching weight

given by Decision_Function(Pj), where the function Decision_Function can be instanti-ated with any function that assigns weights to a set of features extracted from a string.

5. Return a vector Res of the matching weights for all documents.

The other way to extend similarity matching capabil-ities is to utilize several different feature extractors. We use multiple feature extractors to build multiple seman-tic summaries of the documents. The match results of the documents are then weighted by feature extractor and combined to compute the final search result.

Let SFExt1 , …, SFExtn be document search systems

using different feature extractors that have been initial-ized by inserting the same set of documents in each of them. A threshold search identifies all documents that have aggregate similarity across all features of a string Q greater than a threshold value Thresh. We define threshold query over multiple features, or Thresh-old_Query(Q,Thresh), as follows:

1. Compute the matching weights for the documents with respect to each feature

i n1 S (Q)i FExtiRes∀ ≤ ≤ =Weighted_Query .2. For each document Dj compute a vector Pj of the

matching weights for each feature.3. Return a document Dj, in the result vector Res

if Aggregate_Decision_Function(Pj) > Thresh, where Aggregate_Decision _Function can be instantiated with any func-tion that computes a single weight value from the weights of the different features extracted from the documents.

This allows us to generate a search that incorporates multiple techniques and sums their values into a single result. This method can then be applied to a single docu-ment or to a database of many documents (see Figure 2).

Document Feature ExtractorsUsing SADS and the private secure search framework we described, we engineered an efficient and usable secure private document database over two separate corpora—a collection of 106 RFC documents and all verses of the King James Version of the Bible (KJV). To further demonstrate the proposed framework’s adaptability, we also created a similar document database capable of executing queries over Chinese novels. We used three

Page 6: Usable, Secure, Private Search

58 IEEE Security & Privacy September/October 2012

Database search

different feature extractors—1-gram, stemmer, and term frequency–inverse document frequency (TF–IDF) extractors—to create the final document database for both the RFC and KJV corpora. Adjusting the decision and aggregate decision functions controlling the feature extractors defines the database’s query-matching behav-ior. Therefore, choosing the right feature extraction meth-ods along with suitable decision and aggregate decision functions achieves the desired search behavior, and we can look at efficiency as an optimization problem.

1-GramThe 1-gram feature extractor is the simplest in this trio and does simple-word tokenization, returning a list of 1-grams. Although extremely simple to compute, the 1-gram feature extractor can only be used to execute order-independent exact matches on single words. A 1-gram can also be extended to an n-gram, which returns a list of terms up to n-grams. Each n-gram’s term is equal to N consecutive words in the document.

StemmerStemmer is a full-blown natural language parsing–based feature extraction method. As the name suggests, stem-mer enables queries that are independent of gram-matical variations such as capitalization, verb tense, or grammatical number. More specifically, stemmer uses the National Language Toolkit library (www.nltk.org) and applies paragraph, sentence, and word tokeniza-tion; part of speech tagging and filtering; stemming; capitalization processing; and sentence-scoping of n-gram generation on the input document.

Term Frequency–Inverse Document FrequencyThe TF–IDF feature extractor returns terms that are

unique to each document with respect to the entire cor-pus. Each document in the corpus is partitioned into features containing up to N tokens (we used N = 3). The TF–IDF value is computed for each feature and sorted in decreasing order. The feature extractor outputs the top M terms. In our test results, we used M = 1,000 for the RFC corpus and M = 100 for the KJV corpus.

Decision and Aggregate Decision FunctionsWe present three decision functions in decreasing order of matching strictness. Each decision function imple-ments Decision_Function(Q) and returns a nor-malized score s such that {s : s ∈ R, 0 ≤ s ≤ 1}. Keep in mind that the decision function is used to query the set of features extracted by some feature extractor FExt(Di) on every document Di in the document corpus.

Decision function 1 represents an exact-match deci-sion function that returns 1 if the longest term in the set of features extracted for a query is found and returns 0 otherwise.

Decision function 2 is a relaxed version of decision function 1 and returns 1 if all features extracted from the query are found in the given document and returns 0 otherwise. Depending on the feature extractor used, this decision function exhibits a range of behaviors. Although the output score is binary, decision function 2 can be used to constrain the “in order” requirement of the matching behavior. Suppose the query is a five-word phrase and the feature extractor used returns all the 3-grams derived from the query. Decision func-tion 2 will return a match (a score of 1) if all three of the 3-grams are found somewhere in the document. Clearly, the in-order requirement can be adjusted by changing the n-grams that the feature extractor outputs, where a value of n = 1 represents the degenerate case in

Figure 2. Sample query (a) over a single document and (b) over all documents. The search incorporates multiple techniques and sums their values into a single result.

(a) (b)

Aggregate decision function (ADF)

ADF (document 2)

{0|1}

Decisionfunction

...

F-n...F-2F-1

...

Decisionfunction

1

F-n...F-2F-1

SADS-1/FExt-1

Decisionfunction

mADF (document n)ADF (document 1)

F-n...F-2F-1

SADS-m/FExt-m

Res[2] = [0l1] Res[n] = [0l1]Res[1] = [0l1]

Query term: Q

Page 7: Usable, Secure, Private Search

www.computer.org/security 59

which order isn’t considered. The experimental results, shown in Figure 3a, use n = 2 for this decision function on all feature extractors.

Decision function 3 is the most relaxed decision function. It computes the percentage of features found in a particular document over all the features extracted from an input query string. Consider the vast range of possible matching behaviors with decision function 3 when used in conjunction with different feature extrac-tion methods.

The aggregate decision function is instantiated with a linear threshold function that combines, in a weighted manner, the scores produced by the searches with each of the three feature extractors. Both weights and thresh-old are parameters that we can change to alter the sys-tem’s matching behavior.

Experimental ResultsWe queried 10 sample phrases over the demonstration database under various settings. Figure 3a shows the

Figure 3. Flexible search results for query configuration and aggregate search configuration. (a) Three decision functions of increasing relaxedness used in conjunction with three different feature extractors, demonstrating a variable number of results. (b) Combining the results using weighted thresholds to give a system with a tunable number of return results.

0

20

40

60

80

100

120

Stemmer-DF-1

Stemmer-DF-2

Stemmer-DF-3

TF–IDF-DF-1

TF–IDF-DF-2

TF–IDF-DF-3

n-gram-DF-1

n-gram-DF-2

n-gram-DF-3

Remote Procedure CallSend and receiveUser Datagram ProtocolTransmission Control ProtocolIP address

User authenticationHeader fieldData transferConnection requestUniform resource identifier

0

10

20

30

40

50

60

70

80

90

100

DF-1 �reshold=20

DF-2 �reshold=20

DF-3 �reshold=20

DF-1 �reshold=30

DF-2 �reshold=30

DF-3 �reshold=30

DF-1 �reshold=40

DF-2 �reshold=40

DF-3 �reshold=40

DF-1 �reshold=60

DF-2 �reshold=60

DF-3 �reshold=60

DF-1 �reshold=90

DF-2 �reshold=90

DF-3 �reshold=90

Remote Procedure CallSend and receiveUser Datagram ProtocolTransmission Control ProtocolIP address

User authenticationHeader fieldData transferConnection requestUniform resource identifier

Num

ber o

f doc

umen

ts re

turn

edN

umbe

r of d

ocum

ents

retu

rned

(a)

(b)

Page 8: Usable, Secure, Private Search

60 IEEE Security & Privacy September/October 2012

Database search

number of documents matching each query using each of the three decision functions and each of the feature extractors with a threshold value of 0.01. This demon-strates the matching behavior of each individual deci-sion function when used in conjunction with each of the three decision functions. When combined with each of the feature extractors, the number of results returned by the different decision functions increased in a similar pattern.

The results from the summary TF–IDF extractor are most conservative because it preserves a limited num-ber of searchable terms per document. Most of the time, the stemmer feature extractor returns more results than the n-gram because it increases the matches from one word to include all other words with the same stem. Figure 3b demonstrates the decrease in the number of result documents when we increased the threshold value across all decision functions.

Trade-Offs and UsabilityFigures 3 clearly shows that choice of feature extrac-tion methods, decision functions, and aggregate deci-sion functions directly affects the matching behavior of the private secure search system. Although the under-lying matching mechanism remained the same across all test cases, the final query results exhibited large variations and allude to some deeper trade-offs. Again, when dealing with real-world documents rich in struc-ture and semantic nuances, measuring false positive and false negative error rates becomes a complex, context-specific task. Our implementation incorporates three well-known and currently used decision functions and produces the same results, barring false positive results added by the underlying Bloom filter–based implemen-tation of the search. This is a tunable parameter and can be reduced to negligible levels at the cost of storage space at the system designer’s discretion.

S ystem designers must consider the desired security and privacy requirements at a semantic level and

carefully consider the optimal tolerance for semantic false positive and false negative errors, which is often a nontrivial compromise. We must consider several fac-tors such as the “cost” of a semantic false positive versus a false negative, the potential errors introduced by the algorithms used for feature extraction, and their compu-tational costs. Correctly tuning the matching behavior to comply with high-level security and privacy require-ments at a semantic level is perhaps the most important problem for any private secure search system.

References1. B.H. Bloom, “Space/Time Trade-Offs in Hash Coding

with Allowable Errors,” Comm. ACM, vol. 13, no. 7, 1970, pp. 422–426.

2. M. Bellare, A. Boldyareva, and A. O’Neill, “Deterministic and Efficiently Searchable Encryption,” Proc. Int’l Cryptol-ogy Conf. (Crypto 2007), Springer, 2007, pp. 535–552.

3. H.-A. Park et al., “Secure Similarity Search,” Proc. 2007 IEEE Int’l Conf. Granular Computing (GRC 07), IEEE CS, 2007, p. 598.

4. A. Sahai and B. Waters, “Fuzzy Identity-Based Encryp-tion,” LNCS 3494, Springer, 2005, pp. 457–473.

5. M. Raykova et al., “Secure Anonymous Database Search,” Proc. ACM Cloud Computer Security Workshop (CCSW 09), ACM, 2009, pp. 115–126.

6. S. Goldwasser and S. Micali, “Probabilistic Encryption,” J. Computer and System Sciences, vol. 28, no. 2, 1984, pp. 270–299.

7. S. Pohlig and M. Hellman, “An Improved Algorithm for Computing Logarithms over gf(P) and Its Cryptographic Significance,” IEEE Trans. Information Theory, vol. 24, no. 1, 1978, pp. 106–110.

8. D. Boneh, “Simplified OAEP for the RSA and Rabin Functions,” LNCS 2139, Springer, 2001, pp. 275–291.

Mariana Raykova is a postdoc at IBM Research. She has a PhD in computer science from Columbia Univer-sity. Contact her at [email protected].

Ang Cui is a PhD student of computer science at Colum-bia University. Contact him at [email protected].

Binh Vo is a PhD student of computer science at Columbia University. Contact him at binh@cs. columbia.edu.

Bin Liu is an MS student of computer science at Colum-bia University.

Tal Malkin is an associate professor of computer sci-ence at Columbia University. Contact her at [email protected].

Steven M. Bellovin is a professor of computer sci-ence at Columbia University. His research interests include security, privacy, and networking. Bellovin has a PhD in computer science from the University of North Carolina at Chapel Hill. Contact him at [email protected].

Salvatore J. Stolfo is a professor of computer science at Columbia University. His research interests include security and privacy, intrusion detection systems, and machine learning. Stolfo has a PhD in computer sci-ence from New York University. Contact him at [email protected].