
Space-Constrained Gram-Based Indexing for Efficient Approximate String Search

(Technical Report)
Alexander Behm   Shengyue Ji   Chen Li   Jiaheng Lu

Department of Computer Science, University of California, Irvine, CA 92697, USA
{abehm,shengyuj,chenli,jlu2}@ics.uci.edu

Abstract— Answering approximate queries on string collections is important in applications such as data cleaning, query relaxation, and spell checking, where inconsistencies and errors exist in user queries as well as data. Many existing algorithms use gram-based inverted-list indexing structures to answer approximate string queries. These indexing structures are "notoriously" large compared to the size of their original string collection. In this paper, we study how to reduce the size of such an indexing structure to a given amount of space, while retaining efficient query processing. We first study how to adopt existing inverted-list compression techniques to solve our problem. Then, we propose two novel approaches for achieving the goal: one is based on discarding gram lists, and one is based on combining correlated lists. They are both orthogonal to existing compression techniques, exploit a unique property of our setting, and offer new opportunities for improving query performance. For each approach we analyze its effect on query performance and develop algorithms for wisely choosing lists to discard or combine. Our extensive experiments on real data sets show that our approaches provide applications the flexibility in deciding the tradeoff between query performance and indexing size, and can outperform existing compression techniques. An interesting and surprising finding is that while we can reduce the index size significantly (up to 60% reduction) with tolerable performance penalties, for 20-40% reductions we can even improve query performance compared to original indexes.

I. INTRODUCTION

Many information systems need to support approximate string queries: given a collection of textual strings, such as person names, telephone numbers, and addresses, find the strings in the collection that are similar to a given query string. The following are a few applications. In record linkage [17], we often need to find from a table those records that are similar to a given query string that could represent the same real-world entity, even though they have slightly different representations, such as spielberg versus spielburg. In Web search, many search engines provide the "Did you mean" feature, which can benefit from the capability of finding keywords similar to a keyword in a search query. Other information systems such as Oracle and Lucene also support approximate string queries on relational tables or documents.

Various functions can be used to measure the similarity between strings, such as edit distance (a.k.a. Levenshtein distance), Jaccard similarity, and cosine similarity. Many algorithms are developed using the idea of "grams" of strings. A q-gram of a string is a substring of length q that can be used as a signature for the string. For example, the 2-grams of the string bingo are bi, in, ng, and go. These algorithms rely on an index of inverted lists of grams for a collection of strings to support queries on this collection. Intuitively, we decompose each string in the collection to grams, and build an inverted list for each gram, which contains the ids of the strings with this gram. For instance, Fig. 1 shows a collection of 6 strings and the corresponding inverted lists of their 2-grams.

id  string
1   bingo
2   bioinng
3   bitingin
4   biting
5   boing
6   going

(a) Strings.

gram  string ids
bi → 1, 2, 3, 4
bo → 5
gi → 3
go → 1, 6
in → 1, 2, 3, 4, 5, 6
io → 2
it → 3, 4
ng → 1, 2, 3, 4, 5, 6
nn → 2
oi → 2, 5, 6
ti → 3, 4

(b) Inverted lists of string ids.

Fig. 1. Strings and their inverted lists of 2-grams.

The algorithms answer an approximate string query using the following observation: if a string r in the collection is similar enough to the query string, then r should share a certain number of common grams with the query string. Therefore, we decompose the query string to grams, and locate the corresponding inverted lists in the index. We find those string ids that appear at least a certain number of times on these lists, and these candidates are post-processed to remove the false positives.
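To make the scheme concrete, the following is a minimal sketch we added (not code from the paper) that builds the 2-gram inverted lists of Fig. 1 and finds the string ids appearing at least T times on the query's gram lists; the names build_index and find_candidates are ours.

from collections import defaultdict

def grams(s, q=2):
    # decompose a string into its q-grams (positions omitted for brevity)
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def build_index(strings, q=2):
    # map each gram to the sorted list of ids of strings containing it
    index = defaultdict(set)
    for sid, s in enumerate(strings, start=1):
        for g in grams(s, q):
            index[g].add(sid)
    return {g: sorted(ids) for g, ids in index.items()}

def find_candidates(index, query, T, q=2):
    # count, for every string id, on how many of the query's gram lists it appears
    counts = defaultdict(int)
    for g in grams(query, q):
        for sid in index.get(g, []):
            counts[sid] += 1
    return [sid for sid, c in counts.items() if c >= T]

strings = ["bingo", "bioinng", "bitingin", "biting", "boing", "going"]
index = build_index(strings)              # e.g. index["bi"] == [1, 2, 3, 4]
print(find_candidates(index, "bingo", T=3))   # -> [1, 2, 3, 4, 6]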

Motivation: These gram-based inverted-list indexing structures are "notorious" for their large size relative to the size of their original string data. This large index size causes problems for applications. For example, many systems require a very high real-time performance to answer a query. This requirement is especially important for those applications adopting a Web-based service model. Consider online spell checkers used by email services such as Gmail, Hotmail, and Yahoo! Mail, which have millions of online users. They need to process many user queries each second. There is a big difference between a 10ms response time versus a 20ms response time, since the former means a throughput of 100 queries per second (QPS), while the latter means 50 QPS. Such a high-performance requirement can be met only if the index is in memory. In another scenario, consider the case where these algorithms are implemented inside a database system, which can only allocate a limited amount of memory for the inverted-list index, since there can be many other tasks in the database system that also need memory. In both scenarios, it is very critical to reduce the index size as much as we can to meet a given space constraint.

Contributions: In this paper we study how to reduce the size of such index structures, while still maintaining a high query performance. In Section III we study how to adopt existing inverted-list compression techniques to our setting [32]. That is, we partition an inverted list into fixed-size segments and compress each segment with a word-aligned integer coding scheme. To support fast random access to the compressed lists, we can use synchronization points [24] at each segment, and cache uncompressed segments to improve query performance. Most of these compression techniques were proposed in the context of information retrieval, in which conjunctive keyword queries are prevalent. In order to ensure correctness, lossless compression techniques are usually required in this setting.

The setting of approximate string search is unique in that a candidate result must occur only a certain number of times among all the inverted lists, and not necessarily on all the inverted lists. We exploit this unique property to develop two novel approaches for achieving the goal. The first approach is based on the idea of discarding some of the lists. We study several technical challenges that arise naturally in this approach (Section IV). One issue is how to compute a new lower bound on the number of common grams (whose lists are not discarded) shared by two similar strings, the formula of which becomes technically interesting. Another question is how to decide lists to discard by considering their effects on query performance. In developing a cost-based algorithm for selecting lists to discard, we need to solve several interesting problems related to estimating the different pieces of time in answering a query. For instance, one of the problems is to estimate the number of candidates that share a certain number of common grams with the query. We notice that several related techniques [15], [22], [19] cannot be used directly to solve these problems (see Section IV-A.4). We develop a novel algorithm for efficiently and accurately estimating this number. We also develop several optimization techniques to improve the performance of this algorithm for selecting lists to discard.

The second approach is combining some of the correlated lists (Section V). This approach is based on two observations. First, the string ids on some lists can be correlated. For example, many English words that include the gram "tio" also include the gram "ion". Therefore, we could combine these two lists to save index space. Each of the two grams shares the union list. Notice that we could even combine this union list with another list if there is a strong correlation between them. Second, recent algorithms such as [20], [10] can efficiently handle long lists to answer approximate string queries. As a consequence, even if we combine some lists into longer lists, such an algorithm can still achieve a high performance. We study several technical problems in this approach, and analyze the effects of combining lists on a query. Also, we exploit a new opportunity to improve the performance of existing list-merging algorithms. Based on our analysis we develop a cost-based algorithm for finding lists to combine.

We have conducted extensive experiments on real datasets for the list-compression techniques mentioned above (Section VI). While existing inverted-list compression techniques can achieve compression ratios up to 60%, they considerably increase the average query running time due to the online decompression cost. Our two novel approaches are orthogonal to existing inverted-list-compression techniques, and offer unique optimization opportunities for improving query performance. Note that using our novel approaches we can still compute the exact results for an approximate query without missing any true answers. The experimental results show that (1) the novel techniques can outperform existing compression techniques, and (2) the new techniques provide applications the flexibility in deciding the tradeoff between query performance and indexing size. An interesting and surprising finding is that while we can reduce the index size significantly (up to a 60% reduction) with tolerable performance penalties, for 20-40% reductions we can even improve the query performance compared to the original index. Our techniques work for commonly used functions such as edit distance, Jaccard, and cosine. We mainly focus on edit distance as an example for simplicity.

A. Related Work

In the literature the term approximate string query also means the problem of finding within a long text string those substrings that are similar to a given query pattern. See [25] for an excellent survey. In this paper, we use this term to refer to the problem of finding from a collection of strings those similar to a given query string.

In the field of list compression, many algorithms [23], [31] are developed to compress a list of integers using encoding schemes such as LZW, Huffman codes, and bloom filters. In Section III we discuss in more detail how to adopt these existing compression techniques to our setting. One observation is that these techniques often need to pay a high cost of increasing query time, due to the online decompression operation, while our two new methods could even reduce the query time. In addition, the new approaches and existing techniques can be integrated to further reduce the index size, as verified by our initial experiments.

Many algorithms have been developed for the problem of approximate string joins based on various similarity functions [2], [3], [4], [6], [9], [27], [28], [29], especially in the context of record linkage [17]. Some of them are proposed in the context of relational DBMS systems. Several recent papers focused on approximate selection (or search) queries [10], [20]. The techniques presented in this paper can reduce index sizes, which should also benefit join queries, and the corresponding cost-based analysis for join queries needs future research. Hore et al. [12] proposed a gram-selection technique for indexing text data under space constraints, mainly considering SQL LIKE queries. Other related studies include [7], [16], [26]. There are recent studies on the problem of estimating the selectivity of SQL LIKE substring queries [5], [14], [18], and approximate string queries [22], [15], [19], [11].

Recently a new technique called VGRAM [21], [30] was proposed to use variable-length grams to improve approximate-string query performance and reduce index size. This technique, as it is, can only support edit distance, while the techniques presented in this paper support a variety of similarity functions. Our techniques can also provide the user the flexibility to choose the tradeoff between index size and query performance, which is not provided by VGRAM. Our experiments show that our new techniques outperform VGRAM, and potentially they can be integrated with VGRAM to further reduce the index size (Sections VI-D and VII).

II. PRELIMINARIES

Approximate String Search: Let S be a collection of strings. An approximate string search query includes a string s and a threshold k. It asks for all r ∈ S such that the distance between r and s is within the threshold k. Various distance functions can be used, such as edit distance, Jaccard similarity and cosine similarity. Take edit distance as an example. Formally, the edit distance (a.k.a. Levenshtein distance) between two strings s1 and s2 is the minimum number of edit operations of single characters that are needed to transform s1 to s2. Edit operations include insertion, deletion, and substitution. We denote the edit distance between two strings s1 and s2 as ed(s1, s2). For example, ed("Levenshtein", "Levnshtain") = 2. Using this function, an approximate string search with a query string q and threshold k is finding all s ∈ S such that ed(s, q) ≤ k.

Grams: Let Σ be an alphabet. For a string s of the characters in Σ, we use "|s|" to denote the length of s, "s[i]" to denote the i-th character of s (starting from 1), and "s[i, j]" to denote the substring from its i-th character to its j-th character. We introduce two characters α and β not in Σ. Given a string s and a positive integer q, we extend s to a new string s′ by prefixing q − 1 copies of α and suffixing q − 1 copies of β.1 A positional q-gram of s is a pair (i, g), where g is the substring of length q starting at the i-th character of s′, i.e., g = s′[i, i + q − 1]. The set of positional q-grams of s, denoted by G(s, q), or simply G(s) when the q value is clear in the context, is obtained by sliding a window of length q over the characters of s′. For instance, suppose α = #, β = $, q = 3, and s = irvine. We have: G(s, q) = {(1, ##i), (2, #ir), (3, irv), (4, rvi), (5, vin), (6, ine), (7, ne$), (8, e$$)}. The number of positional q-grams of the string s is |s| + q − 1. For simplicity, in our notations we omit positional information, which is assumed implicitly to be attached to each gram.

1 The results in the paper extend naturally to the case where we do not extend a string to produce grams.
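As a small illustration (our sketch, not from the paper), the extended string s′ and its positional q-grams can be produced as follows; the function name positional_qgrams is ours.

def positional_qgrams(s, q, alpha="#", beta="$"):
    # extend s with q-1 copies of alpha as prefix and q-1 copies of beta as suffix,
    # then slide a window of length q; positions start from 1
    extended = alpha * (q - 1) + s + beta * (q - 1)
    return [(i + 1, extended[i:i + q]) for i in range(len(s) + q - 1)]

print(positional_qgrams("irvine", 3))
# [(1, '##i'), (2, '#ir'), (3, 'irv'), (4, 'rvi'), (5, 'vin'),
#  (6, 'ine'), (7, 'ne$'), (8, 'e$$')]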

Inverted Lists: We assume an index of inverted lists for the grams of strings in the collection S. In the index, for each gram g of the strings in S, we have a list lg of the ids of the strings that include this gram (possibly with the corresponding positional information). All the ids in lg are sorted in increasing order. It is observed in [27] that an approximate query with a string s can be answered by solving the following generalized problem:

T-occurrence Problem: Find the string ids that appear at least T times on the inverted lists of the grams in G(s, q), where T is a constant related to the similarity function, the threshold in the query, and the gram length q.

Take edit distance as an example. For a string r ∈ S that satisfies the condition ed(r, s) ≤ k, it should share at least the following number of q-grams with s:

Ted = (|s| + q − 1) − k × q. (1)

Intuitively, the query has |s| + q − 1 grams, and each edit operation can destroy at most q grams of length q. Thus at least (|s| + q − 1) − k × q grams in G(s) can "survive."
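For example, for a query string s = irvine with q = 3 and an edit-distance threshold k = 2, the query has |s| + q − 1 = 8 grams, so an answer string must share at least Ted = 8 − 2 × 3 = 2 of them with s.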

Several existing algorithms [20], [27] are proposed for answering approximate string queries efficiently. They first solve the T-occurrence problem to get a set of string candidates, and then check their real distance to the query string to remove false positives. Note that if the threshold T ≤ 0, then the entire data collection needs to be scanned to compute the results. We call it a panic case. One way to reduce this scan time is to apply filtering techniques. For example, length filtering and prefix filtering [9], [20] can separate the collection into different groups. Given a query, we only need to scan some of the groups to find the results. To summarize, the following are the pieces of time needed to answer a query:

• If the lower bound T (also called "merging threshold") is positive, then the time includes the time to traverse the lists of the grams in the query to find candidates (called "merging time") and the time to remove the false positives from the candidates (called "post-processing time").

• If the lower bound T is zero or negative, we need to spend the time (called "scan time") to scan the entire data set, possibly using filtering techniques.

In the following sections we adopt existing techniques and develop new techniques to reduce this index size. For simplicity, we first focus on the edit distance function. In Section VII we will see that the results can be naturally extended to other commonly used functions.


III. ADOPTING EXISTING INVERTED-LIST-COMPRESSION TECHNIQUES

There are many techniques [32] focusing on inverted-list compression in the literature, which mainly study the problem of representing integers on inverted lists efficiently to save storage space. In this section we will briefly review these techniques, adopt them to our problem setting, and then discuss the limitations of these techniques.

A. Encoding and Decoding Lists of Integers

Most of the list-compression techniques exploit the fact that ids on an inverted list are monotonically increasing integers. For example, suppose we have l = (id1, id2, . . . , idn), idi < idi+1 for 1 ≤ i < n. If we take the differences of adjacent integers to construct a new list l′ = (id1, id2 − id1, id3 − id2, ..., idn − idn−1), the integers in this new list tend to be smaller than the original ids. This l′ list is called the gapped representation of the inverted list l, since it stores the "gap" between each pair of adjacent integers. Using this gapped form, many integer-compression techniques such as gamma codes, delta codes, and Golomb codes [32] can efficiently encode inverted lists by using shorter representations for small integers. We choose one of the recent techniques called Carryover-12 [1] to adopt in our setting.
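As a simple illustration of the gapped representation (our sketch; the encoding adopted in this paper, Carryover-12, additionally packs the gaps into machine words), a list can be converted to gaps and restored as follows.

def to_gaps(ids):
    # gapped representation: first id, then differences of adjacent ids
    return [ids[0]] + [ids[i] - ids[i - 1] for i in range(1, len(ids))]

def from_gaps(gaps):
    # restore the original ids by summing the gaps
    ids, total = [], 0
    for g in gaps:
        total += g
        ids.append(total)
    return ids

l = [1, 2, 3, 4, 5, 6]
assert to_gaps(l) == [1, 1, 1, 1, 1, 1]
assert from_gaps(to_gaps(l)) == l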

B. Segmenting, Compressing, and Indexing Inverted Lists

An issue arises when using the encoded, gapped representation of a list. Since decoding is usually achieved in a sequential way, a sequential scan on the list might not be affected too much. However, random accesses could become expensive. Even if the compression technique allows us to decode the desired integer directly, the gapped representation still requires restoring all preceding integers. Many efficient list-merging algorithms such as DivideSkip [20] rely heavily on binary search on the inverted lists. This problem can be solved by segmenting the list and introducing synchronization points [24]. Each segment is associated with a synchronization point and decoding can start from any synchronization point, so that only one segment needs to be decompressed in order to read a specific integer. One way to adopt this technique is to have each segment contain the same number of integers. Since different encoded segments could have different sizes, it is necessary to index the starting offset of each encoded segment, so that they can be quickly located and decompressed. Figure 2 illustrates the idea of segmenting inverted lists and indexing compressed segments.

Fig. 2. Inverted-list compression with segmenting and indexing.

A simple way to access elements is by decoding the corresponding segment for each integer access. If multiple integers within the same segment are requested, the segment will be decompressed multiple times. The repeated efforts can be alleviated using caching. Once a segment is decoded, it will remain in the cache for a while. All integer accesses to that segment will be answered using the cache without decoding the segment again, until the segment is replaced from the cache. We allocate a global cache pool for all inverted lists. More details are given in our experiments (Section VI).
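The following sketch (ours; the class name SegmentedList is invented) illustrates fixed-size segments with per-segment decoding and a cache of uncompressed segments; a real implementation would store encoded bytes plus a segment-offset index and would use a bounded, global cache pool shared by all lists.

class SegmentedList:
    # Sketch of an inverted list split into fixed-size segments. Each segment
    # is kept here as a plain Python list; in a real index each segment would
    # be stored in encoded form (e.g. gapped + Carryover-12).
    def __init__(self, ids, segment_size=128):
        self.segment_size = segment_size
        self.segments = [ids[i:i + segment_size]
                         for i in range(0, len(ids), segment_size)]
        self.cache = {}   # segment number -> decoded segment

    def _decode(self, seg_no):
        # placeholder for decompressing one segment starting at its
        # synchronization point
        return list(self.segments[seg_no])

    def get(self, pos):
        # random access decodes (and caches) only the segment containing pos
        seg_no, offset = divmod(pos, self.segment_size)
        if seg_no not in self.cache:
            self.cache[seg_no] = self._decode(seg_no)
        return self.cache[seg_no][offset]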

C. Limitations of Existing Techniques

Most of these existing techniques were initially designed for compressing disk-based inverted indexes. Using a compressed representation, we can not only save disk space, but also decrease the number of disk I/Os. Even with the decompression overhead, these techniques can still improve query performance since disk I/Os are usually the major cost. When the inverted lists are in memory, these techniques require additional decompression operations, compared to non-compressed indexes. Thus, the query performance can only decrease. Another limitation is that these approaches have limited flexibility in trading query performance with space savings. Next we propose two novel methods to reduce the size of inverted lists, while retaining good query performance.

IV. DISCARDING INVERTED LISTS

In this section we study how to reduce the size of an inverted-list index by discarding some of the lists. That is, for all the grams from the strings in S, we only keep inverted lists for some of the grams, while we do not store those of the other grams. A gram whose inverted list has been discarded is called a hole gram, and the corresponding discarded list is called its hole list. Notice that a hole gram is different from a gram that has an empty inverted list. The former means the ids of the strings with this gram are not stored in the index, while the latter means no string in the data set has this gram.

We study the effect of hole grams on query answering. In Section IV-A we analyze how they affect the merging threshold, the list merging and post-processing, and discuss how the new running time of a single query can be estimated. For a single query, we present a dynamic programming algorithm for computing a tight lower bound on the number of nonhole grams shared by two similar strings assuming the edit distance function. Based on our analysis, we propose an algorithm to wisely choose grams to discard in the presence of space constraints, while retaining efficient processing. We develop various optimization techniques to improve the performance of this algorithm (Section IV-B).

A. Effects of Hole Grams on a Query

For an approximate query with a string s, if G(s) does not have hole grams, then the query is unaffected. If it does have hole grams, then the query may be affected in the following ways:

• The lower bound on the number of nonhole grams shared by a similar string in the collection could become smaller.

• As a consequence, the merging time and post-processing time could both increase.

• However, due to having a lower number of elements among the lists to process, the merging time could also decrease. Also, in some cases the modified threshold could decrease the post-processing time.

• If the new lower bound becomes zero or negative, then the query can only be answered by a scan.

1) Effects on Merging Threshold: Consider a string r in the collection S such that ed(r, s) ≤ k. For the case without hole grams, r needs to share at least T = (|s| + q − 1) − k × q common grams in G(s) (Equation 1). To find such an r, in the corresponding T-occurrence problem, we need to find string ids that appear on at least T lists of the grams in G(s).

If G(s) does have hole grams, the id of string r could have appeared on some of the hole lists. But we do not know on how many hole lists this string r could appear, since these lists have been discarded. We can only rely on the lists of those nonhole grams to find candidates. Thus the problem becomes deciding a lower bound on the number of occurrences of string r on the nonhole gram lists.

One simple way to compute a new lower bound is the following. Let H be the number of hole grams in G(s), where |G(s)| = |s| + q − 1. Thus, the number of nonhole grams for s is |G(s)| − H. In the worst case, every edit operation can destroy at most q nonhole grams, and k edit operations could destroy at most k × q nonhole grams of s. Therefore, r should share at least the following number of nonhole grams with s:

T ′ = |G(s)| − H − k × q. (2)

We can use this new lower bound T′ in the T-occurrence problem to find all strings that appear at least T′ times on the nonhole gram lists as candidates.

The following example shows that this simple way to compute a new lower bound is pessimistic, and the real lower bound could be tighter. Consider a query string s = irvine with an edit-distance threshold k = 2. Suppose q = 3. Thus the total number of grams in G(s) is 8. There are two hole grams irv and ine as shown in Figure 3. Using the formula above, an answer string should share at least 0 nonhole grams with string s, meaning the query can only be answered by a scan. This formula assumes that a single edit operation could destroy at most 3 grams, and two edit operations destroy at most 6 grams. However, a closer look at the positions of the hole grams tells us that a single edit operation can destroy at most 2 nonhole grams, and two edit operations can destroy at most 4 nonhole grams. Figure 3 shows two deletion operations that can destroy the largest number of nonhole grams, namely 4. Thus, a tighter lower bound is 2 and we can avoid the panic case. This example shows that we can exploit the positions of hole grams in the query string to compute a tighter threshold in the T-occurrence problem. We develop a dynamic programming algorithm to compute a tight lower bound on the number of common nonhole grams in G(s) an answer string needs to share with the query string s with an edit-distance threshold k (a similar idea is also adopted in an algorithm in [30] in the context of the VGRAM technique [21]).

Fig. 3. A query string irvine with two hole grams. A solid horizontal line denotes a nonhole gram, and a dashed line denotes a hole gram.

Subproblem: Let 0 ≤ i ≤ |s| and 1 ≤ j ≤ k be two integers. Let P(i, j) be an upper bound on the number of grams that can be destroyed by j edit operations that are at positions no greater than i. The overall problem we wish to solve becomes P(|s|, k).

Initialization: Let D_i^{d/s} denote the maximum possible number of grams destroyed by a deletion or substitution operation at position i, and let D_i^{ins} denote the maximum possible number of grams destroyed by an insertion operation after position i. For each 0 ≤ i ≤ |s| we set P(i, 0) = max(D_i^{d/s}, D_i^{ins}).

Recurrence function: Consider the subproblem of computing a value for entry P(i, j). Let gi denote the gram starting from position i, which may either be a hole or a nonhole gram. If it is a hole then we can set P(i, j) = P(i − 1, j) because an edit operation at this position cannot be the most destructive one. Recall that we have already discarded the list belonging to the gram at i. If the gram starting from i is not a hole then we need to distinguish three cases:

• We have a deletion/substitution operation at position i. This will destroy grams up to position i − q + 1 and consume one edit operation. The number of grams destroyed is D_i^{d/s}. Therefore, we can set P(i, j) = P(i − q, j − 1) + D_i^{d/s}.

• We have an insertion operation after position i. This will destroy grams up to position i − q + 2 and consume one edit operation. The number of grams destroyed is D_i^{ins}. Therefore, we can set P(i, j) = P(i − q + 1, j − 1) + D_i^{ins}.

• There is no operation at i. We can set P(i, j) = P(i − 1, j).

The following is a summary of the recurrence function:

P(i, j) = max of:
    gi is a hole:              P(i − 1, j)
    gi is a nonhole gram:
        delete/substitute at i:  P(i − q, j − 1) + D_i^{d/s}
        insertion after i:       P(i − q + 1, j − 1) + D_i^{ins}
        no operation at i:       P(i − 1, j)
                                                              (3)

After we have computed the maximum number of grams destroyed by k edit operations (denoted by Dmax), we can get the new bound using the same idea as in Equation 1, except that the maximum number of grams destroyed is not k × q but Dmax.


2) Effects on List-Merging: The running time of some merging algorithms (e.g. HeapMerge, ScanCount [20]) is insensitive to the merging threshold T and mainly depends on the total number of elements in all inverted lists. Therefore, the running time for these algorithms can only decrease by discarding some of the lists. Other merging algorithms (e.g. MergeOpt, DivideSkip [20]) separate the inverted lists into two groups which are processed separately: long lists and short lists. These algorithms perform a heap-based merging on the short lists to get candidate string ids that might occur T times on all lists. In order to get the final solution to the T-occurrence problem, the algorithms verify those candidates against the long lists using binary search. The performance of these algorithms depends on which lists are put into the group of short lists and long lists. For the algorithms mentioned here, the number of long lists is some function of T. Hence, their performance is sensitive to changes in T. In general, by discarding lists their performance may be affected positively or negatively. Which one is the case depends on the function for separating the lists into groups, and on the choice of lists to discard, i.e., the discarded lists may have belonged to the group of long lists or short lists.

Another class of algorithms utilizes T to skip elements on the inverted lists which cannot be answers to the query. For example, based on T the MergeSkip and DivideSkip [20] algorithms perform a binary search to locate the next element on a list that could potentially be an answer to the query, skipping irrelevant elements. For these algorithms a higher T supports the skipping of elements. Therefore, decreasing T by discarding some lists can negatively affect their performance. However, by discarding some lists conceptually we skip all elements on those lists, which may overcompensate for the decreased T, leading to a performance improvement. Note that the DivideSkip algorithm is sensitive to T in two ways: (1) because it uses the concept of short lists and long lists and (2) because it uses T to skip irrelevant elements.

3) Effects on Post-Processing: Intuitively, for a given query, introducing hole grams may only increase the number of candidates to post-process. We will show that this intuition is correct if we use the simple formula for calculating the new threshold T′. Surprisingly, if we use the dynamic programming algorithm to derive a tighter T′, then the number of candidates for post-processing can even decrease.

Take the example given in Fig. 3. Suppose the edit distance threshold k = 2. Say that some string id i only appears on the inverted lists of irv and ine. Since T = 2 it is a candidate result. If we choose to discard the grams irv and ine as shown in Fig. 3, then each edit operation can destroy at most two nonhole grams and therefore T′ = T = 2. After discarding the lists, the string i is not a candidate anymore since all the lists containing it have been discarded. This reduces the cost for post-processing. Note that any string id which appears only on irv and ine cannot be a result to the query and would have been removed from the results during post-processing. We will now formalize the above notions.

Consider a query with string s, allowing k edit operations, a bag of grams G(s), a bag of hole grams H ⊆ G(s) and an original merging threshold T.

Lemma 1: If Equation 2 is used to compute the new merging threshold, introducing hole grams cannot decrease the number of candidates for post-processing.

Let us examine a string id that occurred e times on all lists before discarding. If the string was a candidate before, then e ≥ T. We may have discarded δ ≤ |H| lists on which e occurred. Therefore, e − δ ≥ T − |H|, and the string will still be a candidate. If the string was not a candidate before, then e < T. The formula e − δ < T − |H| is not satisfiable for all δ ≤ |H|, and therefore the string may become a new candidate. Intuitively, consider the case where we discarded |H| lists on which e did not occur. It may be that e ≥ T − |H| although e < T.

Lemma 2: If the dynamic programming algorithm in Equation 3 is used to compute the new merging threshold, introducing hole grams could sometimes decrease the number of candidates for post-processing.

The example given above illustrates this and we shall now generalize the idea. Due to favorable positions of the hole grams it may be that a tighter bound yields T′ ≥ T − |H| and therefore T − |H| ≤ T′ ≤ T. A string id that occurred e times before discarding with e ≥ T may cease to be a candidate after introducing hole grams. This is because we may have discarded δ ≤ |H| lists containing e and if δ > e − T then e − δ < T′ ≤ T.

4) Estimating the Performance of a Query with Hole Grams: Since we are evaluating whether it is a wise choice to discard a specific list li, we would like to know, by discarding list li, how the performance of a single query Q will be affected using the indexing structure. In the previous sections we have analyzed the effects of introducing hole grams on a query. We now quantify these effects by estimating the running time of a query with hole grams.

Estimating the Merging Time: If the new merging threshold T′ > 0, we need to solve the T-occurrence problem by accessing the inverted lists of the nonhole grams to find string ids that appear at least T′ times. Here, we will consider one of the most efficient merging algorithms, DivideSkip [20], which is sensitive to changes in T. Let the lists l1, l2, . . . , lN be sorted in a decreasing order of their size. Then the running time for DivideSkip can be estimated as:

C1 × (Σ_{j=L+1..N} |lj| / (T − L)) × Σ_{i=1..L} log |li|  +  C2 × Σ_{j=L+1..N} |lj| × log(N − L)    (4)

For C1 = 1 and C2 = 1, the above equation represents the worst cost for the DivideSkip algorithm. The right side of the sum shows the cost of merging the short lists. The left side of the sum represents the cost for verifying the candidates from the short lists against the long lists. The coefficient C1 represents that the actual number of candidates to verify against the long lists can be smaller than the maximum possible one (given in Σ_{j=L+1..N} |lj| / (T − L)). The coefficient C2 captures that skipping operations are performed while merging the short lists, and therefore not all elements need to be pushed and popped from the heap. To estimate the running time of DivideSkip we decide the coefficients C1 and C2 offline by running a subset of possible queries on the data set, collecting their merging times, and doing a linear regression. If filtering techniques are applied, we can decide C1 and C2 for each group separately to increase the accuracy of our estimation.
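To make Formula 4 concrete, here is a sketch (ours) of the estimator; list_lengths holds the sizes of the query's gram lists, L is the number of long lists, and c1, c2 are the regression coefficients described above.

import math

def estimate_merging_time(list_lengths, T, L, c1, c2):
    # Formula 4: lists are sorted in decreasing order of length; the first L
    # lists are the "long" lists, the rest are the "short" lists.
    # Assumes T > L and at least one short list (N > L).
    lengths = sorted(list_lengths, reverse=True)
    short_total = sum(lengths[L:])
    # verifying the (at most short_total / (T - L)) candidates from the short
    # lists against the long lists by binary search
    verify = c1 * (short_total / (T - L)) * sum(math.log(x) for x in lengths[:L])
    # heap-based merging of the short lists
    merge = c2 * short_total * math.log(len(lengths) - L)
    return verify + merge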

Estimating the Post-Processing Time: For each candidate from the T-occurrence problem, we need to compute the corresponding distance to the query to remove the false positives. This time can be estimated as

(number of candidates) × (average edit-distance time).

Therefore, the main problem becomes how to estimate the number of candidates after solving the T-occurrence problem. This problem has been studied in the literature recently. For instance, Mazeika et al. [22] studied how to estimate this number. The methodology in the SEPIA technique [15] in the context of selectivity estimation of approximate string queries could be adopted with minor modifications to solve our problem. The technique in [19] might also be applicable. While these techniques could be used in our context, they have two limitations. First, their estimation is not 100% accurate, and an inaccurate result could greatly affect the accuracy of the estimated post-processing time, thus affecting the quality of the selected nonhole lists. Second, this estimation may need to be done very often when choosing lists to discard, and therefore needs to be efficient.

We develop an efficient, incremental algorithm that can compute a very accurate number of candidates for query Q if list li is discarded. The algorithm is called ISC, which stands for "Incremental-Scan-Count," where its idea comes from an algorithm called ScanCount developed in [20]. Although the original ScanCount is not the most efficient one for the T-occurrence problem, it has the nice property that it can be run incrementally.

Figure 4 shows the intuition behind this ISC algorithm. First, we analyze the query Q on the original indexing structure without any lists discarded. For each string id in the collection, we remember how often it occurs on all the inverted lists of the grams in the query and store these counts in an array C. (Later we will discuss how to reduce the size of this data structure.) Now we want to know, if a list is discarded, how it affects the number of occurrences of each string id. For each string id r on the list l belonging to a gram g to be discarded, we decrease the corresponding value C[r] in the array by the number of occurrences of g in the query string, since this string r will no longer have g as a nonhole gram.

After discarding this list for gram g, we first compute the new merging threshold T′. Then, we can find the new set of candidates by scanning the array C and recording those positions (corresponding to string ids) whose value is at least T′.

Fig. 4. Intuition behind the Incremental-Scan-Count (ISC) algorithm.

In the example in Fig. 5 the hole list includes string ids 0, 2, 5, and 9. For each of the four ids, we decrease the corresponding value in the array by 1 (assuming the hole gram occurs once in the query). Suppose the new threshold T′ is 3. We scan the new array to find those string ids whose occurrence among all nonhole lists is at least 3. These strings, which are 0, 1, and 9 (in bold face in the figure), are candidates for the query using the new threshold after this list is discarded.

Fig. 5. Running the ISC algorithm.
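A minimal sketch (ours) of the ISC idea shown in Figures 4 and 5: the ScanCount array C is computed once for the query on the full index, and discarding a gram's list only decrements the counts of the string ids on that list; string ids are assumed to be 0, 1, ..., n − 1.

def initial_counts(index, query_grams, num_strings):
    # C[r] = number of occurrences of string id r on the query's gram lists
    C = [0] * num_strings
    for g in query_grams:
        for sid in index.get(g, []):
            C[sid] += 1
    return C

def discard_list(C, hole_list, occurrences_in_query):
    # discarding the list of gram g: each id on the hole list loses as many
    # occurrences as g has in the query string
    for sid in hole_list:
        C[sid] -= occurrences_in_query

def candidates(C, T_new):
    # scan the (conceptually sparse) array for ids reaching the new threshold
    return [sid for sid, c in enumerate(C) if c >= T_new]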

Since most of the values in the array are zero, we can keep a vector to store references to those positions in the array whose original values (without any lists discarded) are not zero. Instead of scanning the whole array to find the new set of candidates, we can scan this vector and follow the references to the elements in the array that may be candidates.

Estimating the Scan Time: If the new merging threshold T′ ≤ 0, then the query can only be answered by a scan. The time of a scan can be estimated in the same manner as the post-processing time, except that the number of candidates is the number of strings in the collection. If filtering techniques are applied then the number of candidates is the sum of all distinct string ids of groups that need to be considered by the query.

B. Algorithms for Choosing Inverted-Lists to Discard

In this section, we study how to wisely choose lists to discard in order to satisfy a given space constraint. The following are several simple approaches:

• LongList: Choose the longest lists to discard.
• ShortList: Choose the shortest lists to discard.
• RandomList: Choose random lists to discard.

These naive approaches blindly discard lists without considering the effects on query performance. Clearly, a good choice of lists to discard depends on the query workload. Based on our previous analysis, we present a cost-based algorithm called DiscardLists, as shown in Figure 6. Given the initial set of inverted lists, the algorithm iteratively selects lists to discard, based on the size of a list and its effect on the average query performance for a query workload Q if it is discarded. Q can be obtained from query logs, or from a uniformly distributed workload over the dataset when initially there is no query log available. The algorithm keeps selecting lists to discard until the total size of the remaining lists meets the given space constraint (line 2).

Algorithm: DiscardLists
Input:  Inverted lists L = {l1, . . . , ln}
        Constraint B on the total list size
        Query workload Q = {Q1, . . . , Qm}
Output: A set D of lists in L that are discarded
Method:
1. D = ∅;
2. WHILE (B < (total list size of L)) {
3.   FOR (each list li ∈ L) {
4.     Compute size reduction ∆i_size if discarding li
5.     Compute difference of average query time ∆i_time for queries in Q if discarding li
     }
6.   Use ∆i_size's and ∆i_time's of the lists to decide what lists to discard
7.   Add discarded lists to D
8.   Remove the discarded lists from L
   }
9. RETURN D

Fig. 6. Cost-based algorithm for choosing inverted lists to discard.

In each iteration (lines 3-8), the algorithm needs to evaluate the quality of each remaining list li, based on the expected effect of discarding this list. The effect includes the reduction ∆i_size on the total index size, which is the length of this list. It also includes the change ∆i_time on the average query time for the workload Q after discarding this list.2

In line 6 in Figure 6, in each iteration of the DiscardLists algorithm, we need to use the ∆i_size's and ∆i_time's of the lists to decide what lists should be really discarded. There are many different ways to make this decision. One way is to choose a list with the smallest ∆i_time value (notice that it could be negative). Another way is to choose a list with the smallest ∆i_time/∆i_size ratio.
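As a tiny illustration of the selection step in line 6 (our sketch), the second heuristic picks the list with the smallest time-to-space ratio:

def pick_list_to_discard(deltas):
    # deltas: list of (list_id, delta_time, delta_size) for the remaining lists;
    # delta_time may be negative, meaning discarding the list speeds up queries
    return min(deltas, key=lambda d: d[1] / d[2])[0]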

There are several ways to reduce the computation time of the estimation:

(1) When discarding the list li, those queries whose strings do not have the gram of li will not be affected, since they will still have the same set of nonhole grams as before. Therefore, we only need to re-evaluate the performance of the queries whose strings have this gram of li. In order to find these strings efficiently, we build an inverted-list index structure for the queries, similar to the way we construct inverted lists for the strings in the collection. When discarding the list li, we can just consider those queries on the query inverted list of the gram for li.

2 Surprisingly, ∆i_time can be both positive and negative, since in some cases discarding lists can even reduce the average running time for the queries.

(2) We run the algorithm on a random subset of the strings. As a consequence, (a) we can make sure the entire inverted lists of these sample strings can fit into a given amount of memory. (b) We can reduce the array size in the ISC algorithm, as well as its scan time to find candidates. (c) We can reduce the number of lists to consider initially since some infrequent grams may not appear in the sample strings.

(3) We run the algorithm on a random subset of the queries in the workload Q, assuming this subset has the same distribution as the workload. As a consequence, we can reduce the computation to estimate the scan time, merging time, and post-processing time (using the ISC algorithm).

(4) We do not discard those very short lists, thus we can reduce the number of lists to consider initially.

(5) In each iteration of the algorithm, we choose multiple lists to discard based on the effect on the index size and overall query performance. In addition, for those lists that have very poor time effects (i.e., they affect the overall performance too negatively), we do not consider them in future iterations, i.e., we have decided to keep them in the index structure. In this way we can reduce the number of iterations significantly.

V. COMBINING INVERTED LISTS

In this section, we study how to reduce the size of an inverted-list index by combining some of the lists. Intuitively, when the lists of two grams are similar to each other, using a single inverted list to store the union of the original two lists for both grams could save some space. One subtlety in this approach is that the string ids on a list are treated as a set of ordered elements (without duplicates), instead of a bag of elements. By combining two lists we mean taking the union of the two lists so that space can be saved. Notice that the T lower bound in the T-occurrence problem is derived from the perspective of the grams in the query. (See Equation 1 in Section II as an example.) Therefore, if a gram appears multiple times in a data string in the collection (with different positions), on the corresponding list of this gram the string id appears only once. [CHECK] If we want to use the positional filtering technique (mainly for the edit distance function) described in [9], [20], for each string id on the list of a gram, we can keep a range of the positions of this gram in the string, so that we can utilize this range to do filtering. When taking the union of two lists, we need to accordingly update the position range for each string id.

We will first discuss the data structure and the algorithm for efficiently combining lists in Section V-A, and then analyze the effects of combining lists on query performance in Section V-B. We also show that an index with combined inverted lists gives us a new opportunity to improve the performance of list-merging algorithms (Section V-B.1). Based on our analysis we propose the CombineLists algorithm for choosing lists to combine in the presence of space constraints, while retaining efficient processing (Section V-C).


A. Combining Lists

This section studies the data structure and algorithm used to combine lists. In the original inverted-list structure, each gram g is mapped to a list l of string ids: g → l. Combining two lists l1 and l2 will produce a new list lnew = l1 ∪ l2, which is also sorted by string ids. All grams that were previously mapped to l1 and l2 (there could be several grams due to earlier combining operations) will be mapped to lnew. In this fashion we can support combining more than two lists iteratively. At first glance a reverse mapping from lists to grams may seem to be necessary, because we have to track all grams associated with a list in order to update them. However, a data structure called Disjoint-Set with the algorithm Union-Find [8] can be utilized to efficiently combine more than two lists. The Disjoint-Set data structure is a special forest in which each node holds only the pointer to its parent. To apply the Disjoint-Set structure to this problem, each inverted list corresponds to a tree and each node corresponds to a gram. All nodes in the same tree share the same inverted list, which is stored as a single copy attached to the root. Initially all grams are disjoint tree roots, with their individual inverted lists attached to each of them. As we combine two trees, the root of one of them becomes a child of the root of the other, which takes over all string ids from the list of the first tree to generate the union list. Figure 7 shows the Disjoint-Set structure used with the Union-Find algorithm when combining the list of g2 and the list of g3, while g2 and g1 are already sharing the same list l1 ∪ l2. After the combination of the two trees corresponding to the two lists, the result list is l1 ∪ l2 ∪ l3 and is attached to root g1. The Union-Find algorithm can also shorten the path from nodes to their root to speed up inverted-list lookup operations.

Fig. 7. Combining the list of g2 with the list of g3 using Union-Find.
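A compact sketch (ours) of this bookkeeping: each gram points to a parent gram, only the root of a tree keeps the current (union) list, and find shortens paths as it walks to the root; the class name GramLists is invented.

class GramLists:
    def __init__(self, index):
        # index: gram -> sorted list of string ids
        self.parent = {g: g for g in index}
        self.lists = dict(index)          # after unions, only roots keep a list

    def find(self, g):
        # find the root gram, shortening the path along the way (path halving)
        while self.parent[g] != g:
            self.parent[g] = self.parent[self.parent[g]]
            g = self.parent[g]
        return g

    def combine(self, g1, g2):
        # merge the two trees; the surviving root stores the union of the lists
        r1, r2 = self.find(g1), self.find(g2)
        if r1 == r2:
            return
        self.parent[r2] = r1
        self.lists[r1] = sorted(set(self.lists[r1]) | set(self.lists.pop(r2)))

    def list_of(self, g):
        return self.lists[self.find(g)]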

To satisfy a specific space constraint, we would like to know the exact size of saved space when combining lists. The size reduction of combining two lists l1 and l2 can be computed as

∆(1,2)_size = |l1| + |l2| − |l1 ∪ l2| = |l1 ∩ l2|.

The larger the intersection of two lists is, the more benefit we can get from combining them.

B. Effects of Combining Lists on Query Performance

In this section we study how query performance will be affected by combining lists. For a similarity query with a string s, if the lists of the grams in G(s) are combined (possibly with lists of grams not in G(s)), then the performance of this query can be affected in the following ways:

• Different from the approach of discarding lists, the lower bound T in the T-occurrence problem remains the same, since an answer still needs to appear at least this number of times on the lists. Therefore, if a query was not in a panic case before, then it will not be in a panic case after combining inverted lists.

• The lists will become longer. As a consequence, it will take more time to traverse these lists to find candidates during list merging. In addition, more false positives may be produced to be post-processed.

1) Effects on List-Merging: As inverted lists get combined, some of them will become longer. Here we take one of the list-merging algorithms, DivideSkip, as an example. In Formula 4, as some |li|'s become larger, the time for merging the short lists and the time for verifying the candidates against the long lists could increase. In this sense it appears that combining lists can only increase the list-merging time in query answering. However, the following observation opens up opportunities for us to further decrease the list-merging time, given an index structure with combined lists.

We notice that a gram could appear in the query string s multiple times (with different positions), thus these grams share common lists. In the presence of combined lists, it becomes possible for even different grams in G(s) to share lists. This sharing suggests a way to improve the performance of existing list-merging algorithms for solving the T-occurrence problem [27], [20]. A simple way to use one of these algorithms is to pass it a list for each gram in G(s). Thus we pass |G(s)| lists to the algorithm to find string ids that appear at least T times on these (possibly shared) lists. We can improve the performance of the algorithm as follows. We first identify the shared lists for the grams in G(s). For each distinct list li, we also pass to the algorithm the number of grams sharing this list, denoted by wi. Correspondingly, the algorithm needs to consider these wi values when counting string occurrences. In particular, if a string id appears on the list li, the number of occurrences should increase by wi, instead of "1" in the traditional setting. In this way, we can reduce the number of lists passed to the algorithm, thus possibly even reducing its running time. The algorithms in [27] already consider different list weights, and the algorithms in [20] can be modified slightly to consider these weights.3

3 Interestingly, our experiments showed that, even for the case we do not combine lists, this optimization can already reduce the running time of existing list-merging algorithms by up to 20% for a data set with duplicate grams in a string.

2) Effects on Post-processing: After combining two lists l1 and l2, more candidates will be generated for post-processing. The problem becomes computing the number of candidates generated from the list-merging algorithm. [CHECK] We notice that although different list-merging algorithms can have different performances, the results of the T-occurrence problem are exactly the same, given the same index structure. Therefore we can use any list-merging algorithm to compute the number of candidates. Before combining any lists, the candidate set generated from a list-merging algorithm contains all correct answers and some false positives. We are particularly interested to know how many new false positives will be generated by combining two lists l1 and l2. The ISC algorithm described in Section IV-A.4 can be modified to adapt to this setting.

Recall that in the ISC algorithm, a ScanCount vector is maintained for a query Q to keep track of the number of grams Q shares with each string id in the data set. The strings whose corresponding values in the ScanCount vector are at least T will be candidate answers. By combining two lists l1 and l2, the lists of those grams that are mapped to l1 or l2 will be conceptually extended. Figure 8 describes the intuition of the algorithm. Every gram that was previously mapped to l1 or l2 will now be mapped to l1 ∪ l2. The extended part of l1 is ext(l1) = l2 \ l1. Let w(Q, l1) denote the number of times grams of Q reference l1. The ScanCount value of each string id in ext(l1) will be increased by w(Q, l1): for each such reference, all string ids in ext(l1) have their ScanCount value increased by one, so the total increment is w(Q, l1) (not w(Q, l2)). The same operation needs to be done for ext(l2) symmetrically. It is easy to see that the ScanCount values are monotonically increasing as lists are combined. The strings whose ScanCount values increase from below T to at least T become new false positives after l1 and l2 are combined.
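A minimal sketch of this modified ScanCount update (our own illustration; it assumes the ScanCount vector is an integer array indexed by string id, and counts every string whose value crosses T as a new false positive):

#include <algorithm>
#include <cstdint>
#include <iterator>
#include <vector>

// Count the strings that become new false positives when l1 and l2 are combined.
// Every id in ext(l1) = l2 \ l1 gains w(Q,l1) occurrences, and every id in
// ext(l2) = l1 \ l2 gains w(Q,l2); ids whose count crosses T are new false positives.
int newFalsePositives(std::vector<int>& C,              // ScanCount vector, indexed by string id
                      const std::vector<uint32_t>& l1,  // sorted id lists
                      const std::vector<uint32_t>& l2,
                      int wQl1, int wQl2, int T) {
    auto extend = [&](const std::vector<uint32_t>& from,
                      const std::vector<uint32_t>& minus, int w) {
        std::vector<uint32_t> ext;
        std::set_difference(from.begin(), from.end(), minus.begin(), minus.end(),
                            std::back_inserter(ext));
        int newFP = 0;
        for (uint32_t id : ext) {
            bool wasBelow = C[id] < T;
            C[id] += w;
            if (wasBelow && C[id] >= T) ++newFP;  // crossed the threshold
        }
        return newFP;
    };
    return extend(l2, l1, wQl1)    // ext(l1) gets +w(Q,l1)
         + extend(l1, l2, wQl2);   // ext(l2) gets +w(Q,l2)
}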

Fig. 8. ISC algorithm for computing the number of candidates for a query Q after combining lists l1 and l2. [Illustration: grams of G(Q) referencing l1 and l2; ScanCount values of string ids in ext(l1) increased by w(Q,l1), those in ext(l2) by w(Q,l2).]

Figure 9 shows an example, in which l1 = {0, 2, 8, 9} and l2 = {0, 2, 3, 5, 8}. Before combining l1 and l2, two grams of Q are mapped to l1 and three grams are mapped to l2. Therefore, w(Q, l1) = 2 and w(Q, l2) = 3. For every string id in ext(l1) = {3, 5}, the corresponding value in the ScanCount vector will be increased by w(Q, l1). Let C denote the ScanCount vector. C[3] will be increased from 6 to 8, while C[5] will be increased from 4 to 6. Given the merging threshold T = 6, the change in C[5] indicates that string 5 will become a new false positive. The same operation is carried out on ext(l2).

C. Algorithms for Choosing Lists to Combine

Based on the previous analysis, we present our algorithms for choosing inverted lists to combine.

1) Basic algorithm: The basic algorithm consists of two steps: discovering candidate gram pairs that are correlated, and selecting some of them to combine.

Fig. 9. Example of ISC for computing new false positives after combining lists l1 and l2. [Illustration: query grams G(Q) with w(Q,l1) = 2 and w(Q,l2) = 3, the ScanCount vector before and after the update, and the merging threshold T = 6 marking string 5 as a new false positive.]

Step 1: Discovering Candidate Gram Pairs. We are only interested in combining correlated lists. Jaccard similarity is used to measure the correlation of two lists. It is defined as:

Corr(l1, l2) = |l1 ∩ l2| / |l1 ∪ l2|.    (5)

Two lists are considered for combining only if their correlation is greater than a threshold. Clearly it is computationally prohibitive to consider all pairs of grams. Here we present two efficient methods for generating such pairs.

• Using Adjacent Grams: We only consider pairs of adjacent grams in the strings. If we use q-grams to construct the inverted lists, we can just consider those (q + 1)-grams. Each such gram corresponds to a pair of q-grams. For instance, if q = 3, then the 4-gram tion corresponds to the pair (tio, ion). For each such adjacent pair, we treat it as a candidate pair if the Jaccard similarity of their corresponding lists (Equation 5) is greater than a predefined threshold (a minimal sketch follows this list).

• Using Locality-Sensitive Hashing: The above approach cannot find strongly correlated grams that are not adjacent in strings. In the literature there are efficient techniques for finding strongly correlated pairs of lists. One of them is called Locality-Sensitive Hashing (LSH) [13]. Using a small number of so-called MinHash signatures for each list, we can use LSH to find those gram pairs whose lists satisfy the above correlation condition with a high probability.
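For the adjacent-gram method, the following sketch (our own illustration; the in-memory index layout is an assumption) generates candidate pairs from (q+1)-grams and filters them by the Jaccard correlation of Equation 5:

#include <algorithm>
#include <cstdint>
#include <iterator>
#include <set>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using InvertedIndex = std::unordered_map<std::string, std::vector<uint32_t>>;  // gram -> sorted ids

// Jaccard correlation of two sorted id lists (Equation 5).
double jaccard(const std::vector<uint32_t>& l1, const std::vector<uint32_t>& l2) {
    std::vector<uint32_t> inter;
    std::set_intersection(l1.begin(), l1.end(), l2.begin(), l2.end(),
                          std::back_inserter(inter));
    size_t uni = l1.size() + l2.size() - inter.size();
    return uni == 0 ? 0.0 : static_cast<double>(inter.size()) / uni;
}

// Candidate q-gram pairs from adjacent grams: every (q+1)-gram in a string
// corresponds to a pair of overlapping q-grams; keep a pair if the Jaccard
// correlation of their lists exceeds the threshold delta.
std::set<std::pair<std::string, std::string>>
adjacentCandidates(const std::vector<std::string>& strings,
                   const InvertedIndex& index, int q, double delta) {
    std::set<std::pair<std::string, std::string>> candidates;
    for (const auto& s : strings) {
        for (size_t i = 0; i + q < s.size(); ++i) {       // window of length q+1
            std::string g1 = s.substr(i, q);
            std::string g2 = s.substr(i + 1, q);
            auto it1 = index.find(g1), it2 = index.find(g2);
            if (g1 == g2 || it1 == index.end() || it2 == index.end()) continue;
            if (jaccard(it1->second, it2->second) > delta)
                candidates.insert(g1 < g2 ? std::make_pair(g1, g2)
                                          : std::make_pair(g2, g1));
        }
    }
    return candidates;
}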

Step 2: Selecting Candidate Pairs to Combine. The second step is selecting candidate pairs to combine. We iteratively pick gram pairs and combine their lists if their correlation satisfies the threshold. Notice that each time we process a new candidate gram pair, since the list of each of its grams could have been combined with other lists, we still need to verify their (possibly new) correlation before deciding whether we should combine them. After processing all these pairs, we check if the index size meets the given space constraint. If so, the process stops. Otherwise, we decrease the correlation threshold and repeat the process above, until the new index size meets the given space constraint.
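A possible sketch of this iterative step (our own illustration, reusing the jaccard helper from the previous sketch; it assumes each gram maps to a shared pointer to its current, possibly already combined, list, and it simply repoints every gram that shares a merged list):

#include <algorithm>
#include <cstdint>
#include <iterator>
#include <map>
#include <memory>
#include <set>
#include <string>
#include <utility>
#include <vector>

using List = std::vector<uint32_t>;                       // sorted string ids
using GramPair = std::pair<std::string, std::string>;

// Basic Step-2 loop: repeatedly walk the candidate pairs, re-check the current
// correlation of the (possibly already combined) lists, and merge them until
// the total list size meets the space constraint B, lowering delta if needed.
void combineBasic(std::map<std::string, std::shared_ptr<List>>& lists,
                  const std::set<GramPair>& candidates,
                  size_t B, double delta, double deltaStep = 0.05) {
    auto totalSize = [&] {
        std::set<const List*> seen;
        size_t n = 0;
        for (auto& [g, l] : lists)
            if (seen.insert(l.get()).second) n += l->size();
        return n;
    };
    while (totalSize() > B && delta > 0) {
        for (const auto& [g1, g2] : candidates) {
            auto l1 = lists[g1], l2 = lists[g2];
            if (l1 == l2 || jaccard(*l1, *l2) < delta) continue;  // verify current correlation
            auto merged = std::make_shared<List>();
            std::set_union(l1->begin(), l1->end(), l2->begin(), l2->end(),
                           std::back_inserter(*merged));
            for (auto& [g, l] : lists)            // repoint every gram sharing l1 or l2
                if (l == l1 || l == l2) l = merged;
        }
        delta -= deltaStep;                       // relax the correlation threshold
    }
}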


This basic algorithm blindly chooses candidate pairs to combine without considering the consequences for the overall query performance. Based on our previous discussions, we propose a new algorithm that wisely chooses lists to combine in the second step.

2) Cost-Based Algorithm: Figure 10 shows the cost-based algorithm, which takes the estimated cost of a query workload into consideration when choosing lists to combine. It iteratively selects pairs to combine, based on the space saving and the impact on the average query performance of a query workload Q. The algorithm keeps selecting pairs to combine until the total size of the inverted lists meets a given space constraint B. For each gram pair (gi, gj), we need to get their current corresponding lists, since their lists could have been combined with other lists (lines 3 and 4). We check whether these two lists refer to the same physical list (line 5), and also whether their correlation is above the threshold (line 6). Then we compute the size reduction (line 8) and estimate the average query time difference based on Formula 4 and the ISC algorithm (line 9), based on which we decide the next list pair to combine (lines 10 and 11).

Algorithm: CombineLists
Input:  Candidate gram pairs P = {(gi, gj)}
        Constraint B on the total list size
        Query workload Q = {Q1, . . . , Qm}
Output: Combined lists.
Method:
 1. WHILE ((expected total index size) > B) {
 2.   FOR (each gram pair (gi, gj) ∈ P) {
 3.     li = current list of gi
 4.     lj = current list of gj
 5.     if (li and lj are the same list as reference
 6.         or corr(li, lj) < δ)
 7.       remove (gi, gj) from P and continue
 8.     Compute size reduction ∆(li,lj)size if combining li, lj
 9.     Compute difference of average query time ∆(li,lj)time
        for queries in Q if combining li, lj
      }
10.   Use the ∆(li,lj)size's and ∆(li,lj)time's of the gram pairs
      to decide which pair to combine
11.   Combine the two lists li and lj based on the decision
12.   Remove the combined gram pair from P
    }

Fig. 10. CombineLists algorithm to select gram pairs to combine.

We can use optimization techniques similar to those described in Section IV to improve the performance of CombineLists.

VI. EXPERIMENTS

In this section, we report our experimental results of the proposed techniques on three real data sets.

• IMDB Actor Names: It consists of the actor names downloaded from the IMDB website (www.imdb.com). There were 1,199,299 names. The average string-length was 17 characters.

• WEB Corpus Word Grams: This data set (www.ldc.upenn.edu/Catalog, number LDC2006T13), contributed by Google Inc., contained word grams and their observed frequency counts on the Web. We randomly chose 2 million records with a size of 48.3MB. The number of words in a string varied from 3 to 5. The average string-length was 24.

• DBLP Paper Titles: It includes paper titles downloaded from the DBLP Bibliography site (www.informatik.uni-trier.de/~ley/db). It had 274,788 paper titles. The average string-length was 65.

For all experiments the gram length q was 3, and the inverted-list index was held in main memory. Also, for the cost-based DiscardLists and CombineLists approaches, by doing sampling we guaranteed that the index structures of the sample strings fit into memory. We used the DivideSkip algorithm described in [20] to solve the T-occurrence problem, due to its high efficiency. From each data set we used 1 million strings to construct the inverted-list index (unless specified otherwise). We tested query workloads using different distributions, e.g., a Zipfian distribution or a uniform distribution. To do so, we randomly selected 1,000 strings from each data set and generated a workload of 10,000 queries according to some distribution. We conducted experiments using edit distance, Jaccard similarity, and cosine similarity. We mainly focused on the results of edit distance (with a threshold of 2). We report additional results for the other functions in Section VII-B. All the algorithms were implemented using GNU C++ and run on a Dell PC with 2GB main memory and a 3.4GHz Dual Core CPU running the Ubuntu operating system.

A. Evaluating the Carryover-12 Compression Technique

We evaluated the performance of the Carryover-12 based compression technique discussed in Section III. We used a segment size of 128 4-byte integers, which was a good value for the performance and compression ratio, and examined the query performance with different cache sizes. Figure 11 shows that query performance benefits from using a larger cache. On the WebCorpus data set, when we used no cache the average query time was 64.4ms, which is more than 8 times the average query time with a cache of 5000 slots. Since the whole purpose of compressing inverted lists is to save space, it would be self-defeating to improve query performance by increasing the cache size indefinitely. Figures 11(a) and 11(b) show that 20,000 is a reasonable number of cache slots (approximately 10MB); therefore we use this number of slots for experiments with Carryover-12 throughout this section.

Fig. 11. Benefits of using cache when decoding compressed lists. [(a) WebCorpus, (b) IMDB; x-axis: cache size (number of slots), y-axis: average runtime (ms).]

B. Evaluating the DiscardLists Algorithm

In this section we evaluate the performance of the DiscardLists algorithm for choosing inverted lists to discard. In addition to the three basic methods to choose lists to discard (LongList, ShortList, RandomList), we also implemented the following cost-based methods:

• PanicCost: In each iteration we discard the list with the smallest ratio between the list size and the number of additional panic cases. Another similar approach, called PanicCost+, discards the list with the smallest number of additional panic cases, disregarding the list length.

• TimeCost: It is similar to PanicCost, except that we use the ratio between the list size and the total time effect of discarding a list (instead of the number of additional panics). Similarly, an approach called TimeCost+ discards the list with the smallest time effect.
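As an illustration of the greedy selection in TimeCost+ (a sketch under our own assumptions; estimateTimeIfDiscarded stands in for the sampling-based estimator of the average workload time, which is not reproduced here):

#include <functional>
#include <limits>
#include <vector>

// One greedy iteration of a TimeCost+-style selection: among the lists not yet
// discarded, pick the one whose removal yields the smallest estimated average
// query time on the workload (list sizes are ignored, as described above).
int pickListToDiscard(const std::vector<bool>& discarded,
                      const std::function<double(int)>& estimateTimeIfDiscarded) {
    int best = -1;
    double bestTime = std::numeric_limits<double>::max();
    for (int i = 0; i < static_cast<int>(discarded.size()); ++i) {
        if (discarded[i]) continue;
        double t = estimateTimeIfDiscarded(i);   // estimated avg. time after discarding list i
        if (t < bestTime) { bestTime = t; best = i; }
    }
    return best;
}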

The index-construction time mainly consisted of two major parts: selecting lists to discard and generating the final inverted-list structure. The time for generating samples was negligible. For the LongList, ShortList, and RandomList approaches, the time for selecting lists to discard was small, whereas in the cost-based approaches the list-selection time was prevalent. In general, increasing the size-reduction ratio also increased the list-selection time. For instance, for the IMDB dataset, at a 70% reduction ratio, the total index-construction time for the simple methods was about half a minute. The construction time for PanicCost and PanicCost+ was similar. The more complex TimeCost and TimeCost+ methods needed 108s and 353s, respectively.

Benefits of Tighter Bounds: We first examined the effects of using the dynamic programming algorithm for computing a tighter bound in the presence of hole lists, as described in Section IV-A.1. We used the DBLP data set, and ran a workload of 10,000 queries with a Zipfian distribution. We used the LongList method to reduce the index size. Figure 12(a) shows the average running time for both the naive method for computing a bound using Equation 2 (marked as "Naive" in the figure) and the dynamic programming algorithm for computing a bound (marked as "DP"). The x-axis is the total memory reduction ratio, where "0" means no size reduction, while the y-axis is the average query time. We see that the dynamic programming approach indeed computed a tighter bound, which notably reduced the average query time. This benefit is largely due to the decrease in the number of panics, as illustrated in Figure 12(b). In the interest of brevity we omitted the results for the other data sets, since they show a very similar trend.

Comparing Simple Methods with Cost-Based Methods: Next we compare the simple methods for choosing lists to discard with the cost-based ones. For each data set, we conducted experiments by indexing one million strings and running a workload of 10,000 queries with a Zipfian distribution.

Fig. 12. Benefits of using dynamic programming to tighten bounds. [(a) average runtime (ms), (b) number of panics (k); x-axis: index-size reduction.]

For those cost-based approaches, we used a sampling ratio of 0.1% for the data strings and a ratio of 25% for the queries.

We first considered the three simple methods, namely LongList, ShortList, and RandomList. Figs. 13(a) and 14(a) show how discarding lists affected the average query performance as we increased the index-size-reduction ratio. (Due to space limitation, we do not show the results of the DBLP data set, which were consistent with the results of the other two data sets.) We can see that the LongList and RandomList methods are equally promising, providing compression ratios of up to 50% with little penalty in the query execution time. For instance, in Fig. 13(a), when there was no index reduction, the original average query time was 5.8ms. When the reduction ratio was 50%, the running time increased to 18.8ms for the LongList method. For all the data sets, the ShortList method performed significantly worse than the other two. The main reason is that it caused a dramatic increase in the merge time and post-processing time. For higher reduction ratios, RandomList sometimes outperformed LongList because the former avoided more panics, which LongList aggravated by discarding the lists of popular grams.

Based on these results, we chose to omit ShortList and RandomList in the following experiments. Figures 13(b) and 14(b) show the benefits of employing the cost-based methods to select lists to discard. The most noteworthy is the TimeCost+ method, which consistently delivered the best query performance. As shown in Fig. 13(b), the method achieved a 70% reduction ratio while increasing the query processing time from the original 5.8ms to only 7.4ms. All the other methods increased the time to at least 96ms for that reduction ratio. Notice that TimeCost+ ignored the list size when selecting a list to discard. The good results may seem counter-intuitive. A deeper analysis suggests that there is really no obvious need to take the list size into account when choosing lists to discard, since the effect of a list on query performance may change significantly in the next iteration, and this change can be largely unrelated to its size. Thus our main objective should be minimizing the penalty in query performance, regardless of list sizes. We see that the TimeCost method behaved similarly to LongList. The reason is that the former preferred discarding long lists over short lists, because it tries to maximize the ratio of the list size and the estimated running time of the query workload. The PanicCost and PanicCost+ methods performed well on the DBLP and WebCorpus data sets, because at higher reduction ratios the cost for panics became prevalent.


Fig. 13. Reducing index size by discarding lists (IMDB Actors). [(a) simple methods, (b) cost-based methods, (c) improving query performance, (d) scalability; y-axis: average runtime (ms).]

Fig. 14. Reducing index size by discarding lists (WebCorpus word grams). [(a) simple methods, (b) cost-based methods, (c) improving query performance, (d) scalability; y-axis: average runtime (ms).]

The two panic-based methods performed poorly in Fig. 13(b) because they blindly minimized the number of panics while drastically increasing the post-processing time. TimeCost+ outperformed all the other methods because it can balance the merging time, post-processing time, and scan time.

Surprising Improvement on Performance: Figs. 13(c) and 14(c) show more details of the previous set of figures when the reduction ratio was smaller (less than 40%). A surprising finding is that, for low to moderate reduction ratios, discarding lists could even improve the query performance! Take Fig. 13(c) as an example. All the methods reduced the average query time from the original 5.8ms to 3.5ms for a 10% reduction ratio. The main reason for the performance improvement is that by discarding long lists we can help list-merging algorithms solve the T-occurrence problem more efficiently. We see that significantly reducing the total number of list elements to process can overcompensate for the decrease in the threshold.

Scalability: For each data set, we increased its number of strings, and used 50% as the index-size-reduction ratio. Figs. 13(d) and 14(d) show that the TimeCost+ method performed consistently well, even outperforming the corresponding uncompressed indexes (indicated by "Original"). In Figure 13(d), at 100,000 data strings the average query time increased from 0.61ms to 0.78ms for TimeCost+. As the data size increased, TimeCost+ began outperforming the uncompressed index. For example, at 1 million strings the average query time decreased from 5.7ms (original) to 4.6ms (TimeCost+). This benefit can be attributed to the skewed Zipfian query distribution this algorithm adapts to.

Notice that the TimeCost method behaves predictably, albeit badly, whereas the two panic-based methods can behave unpredictably because the merge operation becomes increasingly expensive with a growing data set. Therefore, we excluded the curves for PanicCost and PanicCost+ from Fig. 14(d) since they deliver no clear trend.

Summary: (1) Discarding lists can achieve a significant index-size reduction without significantly increasing the query running time. (2) For small reduction ratios, we can even improve query performance. (3) In most cases the TimeCost+ method can achieve a good query performance.

C. Evaluating the CombineLists Algorithm

We evaluated the performance of the CombineLists algorithm on the same three data sets. In step 1, we implemented three methods to generate candidate list pairs: (1) Adjacent: Consider (q + 1)-grams. (2) LSH: Use LSH to find strongly correlated pairs (with a high probability). We generated 100 MinHash signatures for each list. (3) Adjacent+LSH: We take the union of the candidate pairs generated by these two methods. In step 2, we implemented both the CombineBasic and the CombineCost algorithms for iteratively selecting list pairs to combine.

Benefits of Improved List-Merging Algorithms: We first evaluated the benefits of using the improved list-merging algorithms to solve the T-occurrence problem for queries on combined inverted lists, as described in Section V-B. As an example, we compared the DivideSkip algorithm in [20] and its improved version that considers duplicated inverted lists in a query. We used the Adjacent+LSH method to generate candidate list pairs, and the CombineBasic algorithm to select lists to combine. Figure 15 shows the average running time for the basic DivideSkip (marked as "Basic") and the improved DivideSkip algorithm (marked as "Improved"). We can see that when the reduction ratio increased, more lists were combined, and the improved algorithm did reduce the average query time.

Fig. 15. Reducing query time using the improved list-merging algorithm (DivideSkip). [(a) DBLP titles, (b) IMDB actors; x-axis: index-size reduction, y-axis: average runtime (ms).]

Generating Candidate List Pairs: Figures 16(a) and 17(a) compare the index-size reduction ratios of the different candidate-generation methods, over various correlation thresholds δ, on the two data sets. We used the CombineBasic method for combining lists. The reduction ratio was plotted on a logarithmic scale. We observed that on the IMDB and DBLP data sets, with a higher correlation threshold, LSH produced more candidate pairs than Adjacent, since the former can find correlated pairs of grams that are not adjacent. The results also show that a significant number of correlated pairs were indeed from adjacent grams. As we decreased the correlation threshold, LSH was less capable of discovering all qualified candidate pairs without using more MinHash signatures. The results of Adjacent+LSH were better than those of the two previous methods, since it considered more candidate list pairs.

Fig. 16. Reducing index size by combining lists (IMDB Actors). [(a) generating candidate pairs, (b) choosing lists to combine, (c) improving performance, (d) scalability.]

Choosing Lists to Combine: We compared the CombineBasic algorithm with the cost-based CombineCost algorithm for choosing lists to combine, and the results are shown in Figures 17(b) and 16(b). The average query time was plotted over different reduction ratios for both algorithms.

Fig. 17. Reducing index size by combining lists (WebCorpus word grams). [(a) generating candidate pairs, (b) choosing lists to combine, (c) improving performance, (d) scalability.]

We observe that on all three data sets, the query running time for both algorithms increased very slowly as we increased the index-size reduction ratio, until about 40% to 50%. That means this technique can reduce the index size without increasing the query time! As we further increased the index-size reduction, the query time started to increase. For the CombineCost algorithm, the time increased slowly, especially on the IMDB data set. The reason is that this cost-based algorithm avoided choosing bad lists to combine, while the CombineBasic algorithm blindly chose lists to combine. The difference between the two algorithms was larger on the IMDB data set than on the other two data sets, because the CombineBasic algorithm caused queries to spend a lot of post-processing time on this data set.

Figures 16(c) and 17(c) show that when the reduction ratio is less than 40%, the query time even decreased. This improvement is mainly due to the improved list-merging algorithms. Figures 16(d) and 17(d) show how the algorithms for combining lists affected query performance as we increased the data size, for a reduction ratio of 40%.

Summary: (1) Combining lists can also achieve a significant index-size reduction without increasing the query running time too much. (2) For small reduction ratios, combining lists can also improve query performance, attributed to the improved list-merging algorithms. (3) In most cases the cost-based CombineCost algorithm gives a better index structure than the CombineBasic algorithm.

D. Comparing Different Compression Techniques

We compared the cost-based DiscardLists and CombineLists techniques with Carryover-12 (see Section III) and VGRAM. Since Carryover-12 and VGRAM do not allow explicit control of the compression ratio, for each of them we first reduced the size of the inverted-list index to determine its compression ratio. We then compressed the inverted-list index using DiscardLists and CombineLists at the same compression ratio.

Fig. 18. Comparison of Carryover-12 with DiscardLists and CombineLists at the same reduction ratio. [(a) IMDB Actors, index-size reduction 0.58; (b) WebCorpus, index-size reduction 0.48; y-axis: average runtime (ms).]

Fig. 19. Comparison of VGRAM with DiscardLists and CombineLists at the same reduction ratio. [(a) IMDB Actors, index-size reduction 0.3; (b) WebCorpus, index-size reduction 0.27; y-axis: average runtime (ms).]

Comparison with Carryover-12: Figures 18(a) and 18(b) compare the performance of our techniques with the Carryover-12 compression. For Carryover-12, to achieve a good balance between the query performance and the index size, we used fixed-size segments of 128 4-byte integers and a synchronization point for each segment. The cache contained 20,000 segment slots (approximately 10MB). The compression ratio achieved by Carryover-12 was 58% for the IMDB dataset and 48% for the WebCorpus dataset. We see that the online decompression of Carryover-12 has a profound impact on performance. It increased the average running time from an original 5.85ms to 30.1ms for the IMDB dataset, and from an original 1.76ms to 7.32ms for the WebCorpus dataset. CombineLists performed significantly better, at 22.15ms for the IMDB dataset and 2.3ms for the WebCorpus dataset. DiscardLists could even slightly decrease the running time compared to the original index, to 5.81ms and 1.75ms for the IMDB and WebCorpus datasets, respectively.

Comparison with VGRAM: Figures 19(a) and 19(b) compare the performance of our techniques with VGRAM. We set its qmin parameter to 4. We did not take into account the memory requirement for the dictionary trie structures because it was negligible. The compression ratio was 30% for the IMDB dataset and 27% for the WebCorpus dataset. Interestingly, all methods could outperform the original, uncompressed index. As expected, VGRAM can considerably reduce the running time for both datasets. For the IMDB dataset it could reduce the time from an original 5.85ms to 4.02ms, and for the WebCorpus dataset from 1.76ms to 1.55ms. Surprisingly, the CombineLists algorithm could reduce the running time even more than VGRAM, to 3.34ms for the IMDB dataset and to 1.47ms for the WebCorpus dataset. DiscardLists performed competitively on the IMDB dataset at 3.93ms, and slightly faster than the original index (1.67ms) on the WebCorpus dataset.

Summary: (1) CombineLists and DiscardLists can significantly outperform Carryover-12 at the same memory-reduction ratio, because of the online decompression required by Carryover-12. (2) For small compression ratios CombineLists performs best, even outperforming VGRAM. (3) For large compression ratios DiscardLists delivers the best query performance. (4) While Carryover-12 can achieve reductions of up to 60% and VGRAM up to 30%, neither allows explicit control over the reduction ratio. DiscardLists and CombineLists offer this flexibility with good query performance.

E. Changing Query Workloads

Fig. 20. Performance of DiscardLists and CombineLists on changing workloads. [(a) DiscardLists on IMDB, (b) DiscardLists on WebCorpus, (c) CombineLists on IMDB, (d) CombineLists on WebCorpus; legends: U2SU, U2DU, U2SZ, U2DZ, Z2SU, Z2DU, Z2SZ, Z2DZ.]

The cost-based methods for discarding and combining lists assume a query workload. We experimentally evaluated their performance in the presence of changing query workloads. That is, we built an inverted-list index assuming one query workload and measured the performance of another query workload. We shall denote the set of queries used to build the index as the training workload and the set of queries whose time we measured as the testing workload. The testing workload can differ from the training workload in the distinct query strings and in the distribution of the distinct queries in the workload. To explain this further, let us examine the legends in Fig. 20. The abbreviation scheme reflects a) the distribution of the training workload, U for uniform and Z for Zipfian, b) whether the testing workload consisted of the same distinct queries (S) as the training workload or entirely different distinct queries (D), and c) the distribution of the testing workload (the last U or Z). For example, "U2SZ" means we used a uniformly distributed query workload as a training workload, and generated a testing workload using the same distinct queries as in the training workload, but in a Zipfian distribution. Consider another example, "U2DZ". This means our training workload was a uniformly distributed query workload and the testing workload was generated from entirely different distinct queries to form a Zipfian distribution. Our query workloads consisted of 10,000 queries generated from 1,000 distinct queries.

Discarding Lists: Figs. 20(a) and 20(b) depict the average query performance of various workloads on indexes compressed by discarding lists. Not surprisingly, as the reduction ratio increases, so does the variance in performance among the different workloads. For example, on the IMDB dataset at a 40% reduction ratio, Z2SZ took an average of 4.69ms per query versus 4.97ms for Z2DU, while at a 60% reduction ratio the former increased to 5.85ms and the latter to 23.6ms. Interestingly, the experiments show that a change in both the distribution and the queries does not always present the worst case for performance. Consider the IMDB dataset at a 60% reduction ratio, where one might suspect that U2DZ performs worse than U2DU; the former yielded an average of 7.38ms and the latter 8.56ms. Notice that for both the IMDB and the WebCorpus datasets the worst performance caused by changing the workloads is still competitive with a Carryover-12 compressed index. Take the WebCorpus dataset, where at a 50% reduction Z2DU presents the worst change, at 2.5ms, while Carryover-12 needs an average of 7.32ms at a 48% reduction ratio.

Combining Lists: Figs. 20(c) and 20(d) show the average query performance of various workloads on indexes compressed by combining lists. Again, the variance in performance increases as the compression ratio increases. For example, on the WebCorpus dataset, at a 40% reduction ratio U2SU needed 2.09ms per query and U2DZ 2.08ms, while at a 60% ratio the former increased to 14.82ms and the latter to 17.06ms. Overall, combining lists does not seem to be as sensitive to changes in the workload as discarding lists. This is because discarding lists can cause panic cases (requiring a scan) while combining lists cannot. Interestingly, on the WebCorpus dataset combining lists can outperform the Carryover-12 technique, but not on the IMDB dataset. For the WebCorpus dataset at a 50% reduction ratio, the worst change in workload (U2DU) amounted to 4.43ms per query, while at a 48% reduction ratio Carryover-12 required 7.32ms.

Summary: (1) Not surprisingly, the cost-based methods of combining and discarding lists are sensitive to how representative the chosen workload is. (2) The method of combining lists is less sensitive to changes in workloads than discarding lists, but it performs worse than discarding lists at high compression ratios. (3) Even for the worst changes in workloads that we tested, our techniques can mostly outperform Carryover-12 at similar compression ratios.

VII. DISCUSSION

A. Index Size Reduction with Filters

Various filtering techniques have been proposed to prune strings that cannot be similar enough to a query string [9], [6], [20]. In general, a filter partitions the strings into disjoint groups. There are different ways to adopt our size-reduction techniques in the presence of filters. Let P1, . . . , Pn be the groups of the strings generated by the filters. (1) Global reduction: We use the lists of the entire data set to decide which lists to discard or combine, then apply these decisions to the gram lists within each group Pi. (2) Local reduction: We first decide how much reduction to allocate to each group Pi in order to achieve a global reduction ratio. Within each group Pi, we decide the lists to discard or combine, and different groups have different decisions. Our cost-based size-reduction methods can benefit from this divide-and-conquer strategy in two ways: (a) Running the DiscardLists or CombineLists algorithm for each group can be faster than running it once on the entire data set. (b) The performance improvement can allow a higher sampling ratio, potentially enhancing the quality of the results. Notice that the query workload may not be evenly distributed among all the groups. For the local policy, the main problem becomes distributing a space constraint over all the groups such that the average query performance is maximized. The problem of deciding this size allocation needs future research.

B. Extension to Other Similarity Functions

Our discussion so far has mainly focused on the edit distance metric. We now generalize the results to the following commonly used similarity measures: Jaccard similarity and cosine similarity. To reduce the size of inverted lists based on those similarity functions, the main procedure of the DiscardLists and CombineLists algorithms remains the same. The only difference is that in DiscardLists, to compute the merging threshold T for a query after discarding some lists, we need to subtract the number of hole lists for the query from the formulas proposed in [20]. In addition, for the estimation of the post-processing time, we also need to replace the estimation of the edit-distance time with that of the Jaccard and cosine time, respectively. Figure 21 shows the average running time for the DBLP data using variants of the TimeCost+ algorithm for these two functions. The results on the other two data sets were similar and are not included here. We see that the average running time continuously decreased as the reduction ratio increased up to 40%. For example, at a 40% reduction ratio for the cosine function, the running time decreased from 1.7ms to 0.8ms.

The performance started degrading at a 50% reduction ratio, and the running time increased rapidly at ratios higher than 60%. For a 70% reduction ratio, the times for the Cosine and Jaccard functions increased to 150ms and 115ms for LongList. Also, for high reduction ratios the TimeCost and TimeCost+ methods became worse than the panic-based methods, due to the inaccuracy in estimating the merging time.


Fig. 21. Reducing index size based on the Jaccard and Cosine similarity functions (DBLP titles). [(a) Jaccard function, (b) Cosine function; x-axis: index-size reduction, y-axis: average runtime (ms).]

Note that the Cosine and Jaccard functions are expensive to compute; therefore the penalty (in terms of post-processing time) for inaccurately estimating the merging time can be much more severe than for edit distance.

C. Integrating Several Approaches

So far, we have studied traditional list-compression techniques, our new methods based on discarding lists and combining lists, and the VGRAM technique independently to reduce the index size. Since these methods are indeed orthogonal, we could even use any combination of them to further reduce the index size and/or improve query performance. As an example, we integrated CombineLists with Carryover-12.

We first compressed the index using the CombineLists approach with a reduction ratio α, and then applied Carryover-12 on the resulting index. We varied α from 0 (no reduction for CombineLists) to 60% in 10% increments. The results for the overall reduction ratio and the average query time are shown in the "CL+Carryover-12" curve in Figure 22. The leftmost point on the curve corresponds to the case where α = 0. For comparison purposes, we also plotted the results of using CombineLists alone, shown on the other curve. The results clearly show that using both methods we can achieve high reduction ratios with a better query performance than using CombineLists alone. Consider the first point, which only uses Carryover-12. It could achieve a 48% reduction with an average query time of 7.3ms. By first using CombineLists at a 30% reduction ratio (4th point on the curve) we could achieve a higher reduction ratio (61%) at a lower average query time (6.34ms).

Fig. 22. Reducing index size using CombineLists with Carryover-12. [Curves: CombineLists alone, CL+Carryover-12; x-axis: index-size reduction, y-axis: average runtime (ms).]

One way to integrate multiple methods is to distribute the global memory constraint among several methods. Notice that since Carryover-12 and VGRAM do not allow explicit control of the index size, it is not easy to use them to satisfy an arbitrary space constraint. Several open, challenging problems need more future research. First, similar to the space-allocation problem discussed in Section VII-A, we need to decide how to distribute the global memory constraint among different methods. Second, we need to decide in which order to use them. For example, if we use CombineLists first, then we never consider discarding merged lists in DiscardLists. Similarly, if we run DiscardLists first, then we never consider combining any discarded list in CombineLists.

VIII. CONCLUSIONS

In this paper, we studied how to reduce the size of inverted-list index structures of string collections to support approximate string queries. We studied how to adopt existing inverted-list compression techniques to achieve the goal, and proposed two novel methods for achieving the goal: one is based on discarding lists, and one is based on combining correlated lists. They are both orthogonal to existing compression techniques, exploit a unique property of our setting, and offer new opportunities for improving query performance. We studied technical challenges in each method, and proposed efficient, cost-based algorithms for solving related problems. Our extensive experiments on real data sets show that our approaches provide applications the flexibility in deciding the tradeoff between query performance and indexing size, and can outperform existing compression techniques. An interesting and surprising finding is that while we can reduce the index size significantly (up to a 60% reduction) with tolerable performance penalties, for 20-40% reductions we can even improve query performance compared to original indexes.

REFERENCES

[1] V. N. Anh and A. Moffat. Inverted index compression using word-aligned binary codes. Inf. Retr., 8(1):151–166, 2005.

[2] A. Arasu, V. Ganti, and R. Kaushik. Efficient exact set-similarity joins. In VLDB, pages 918–929, 2006.

[3] R. J. Bayardo, Y. Ma, and R. Srikant. Scaling up all pairs similarity search. In WWW, pages 131–140, 2007.

[4] S. Chaudhuri, K. Ganjam, V. Ganti, and R. Motwani. Robust and efficient fuzzy match for online data cleaning. In SIGMOD Conference, pages 313–324, 2003.

[5] S. Chaudhuri, V. Ganti, and L. Gravano. Selectivity estimation for string predicates: Overcoming the underestimation problem. In ICDE, pages 227–238, 2004.

[6] S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In ICDE, page 5, 2006.

[7] P. Fogla and W. Lee. q-gram matching using tree models. IEEE Trans. Knowl. Data Eng., 18(4):433–447, 2006.

[8] Z. Galil and G. F. Italiano. Data structures and algorithms for disjoint set union problems. ACM Comput. Surv., 23(3):319–344, 1991.

[9] L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava. Approximate string joins in a database (almost) for free. In VLDB, pages 491–500, 2001.

[10] M. Hadjieleftheriou, A. Chandel, N. Koudas, and D. Srivastava. Fast indexes and algorithms for set similarity selection queries. In ICDE, 2008.

[11] M. Hadjieleftheriou, X. Yu, N. Koudas, and D. Srivastava. Hashed samples: Selectivity estimators for set similarity selection queries. In VLDB, 2008.

[12] B. Hore, H. Hacigumus, B. R. Iyer, and S. Mehrotra. Indexing text data under space constraints. In CIKM, pages 198–207, 2004.

[13] P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In STOC Conference, 1998.

[14] H. V. Jagadish, R. T. Ng, and D. Srivastava. Substring selectivity estimation. In PODS, pages 249–260, 1999.

[15] L. Jin and C. Li. Selectivity estimation for fuzzy string predicates in large data sets. In VLDB, pages 397–408, 2005.

[16] M.-S. Kim, K.-Y. Whang, J.-G. Lee, and M.-J. Lee. n-Gram/2L: A space and time efficient two-level n-gram inverted index structure. In VLDB, pages 325–336, 2005.

[17] N. Koudas, S. Sarawagi, and D. Srivastava. Record linkage: similarity measures and algorithms. In SIGMOD Conference, pages 802–803, 2006.

[18] P. Krishnan, J. S. Vitter, and B. R. Iyer. Estimating alphanumeric selectivity in the presence of wildcards. In SIGMOD Conference, pages 282–293, 1996.

[19] H. Lee, R. T. Ng, and K. Shim. Extending Q-grams to estimate selectivity of string matching with low edit distance. In VLDB, pages 195–206, 2007.

[20] C. Li, J. Lu, and Y. Lu. Efficient merging and filtering algorithms for approximate string searches. In ICDE, 2008.

[21] C. Li, B. Wang, and X. Yang. VGRAM: Improving performance of approximate queries on string collections using variable-length grams. In VLDB, pages 303–314, 2007.

[22] A. Mazeika, M. H. Bohlen, N. Koudas, and D. Srivastava. Estimating the selectivity of approximate string queries. ACM Trans. Database Syst., 32(2):12, 2007.

[23] M. D. McIllroy. Development of a spelling list. IEEE Transactions on Communications, 30(1):91–99, 1998.

[24] A. Moffat and J. Zobel. Self-indexing inverted files for fast text retrieval. ACM Trans. Inf. Syst., 14(4):349–379, 1996.

[25] G. Navarro. A guided tour to approximate string matching. ACM Comput. Surv., 33(1), 2001.

[26] S. C. Sahinalp, M. Tasan, J. Macker, and Z. M. Ozsoyoglu. Distance based indexing for string proximity search. In ICDE, pages 125–, 2003.

[27] S. Sarawagi and A. Kirpal. Efficient set joins on similarity predicates. In SIGMOD Conference, pages 743–754, 2004.

[28] C. Xiao, W. Wang, and X. Lin. Ed-join: An efficient algorithm for similarity joins with edit distance constraints. In VLDB, 2008.

[29] C. Xiao, W. Wang, X. Lin, and J. X. Yu. Efficient similarity joins for near duplicate detection. In WWW, pages 131–140, 2008.

[30] X. Yang, B. Wang, and C. Li. Cost-based variable-length-gram selection for string collections to support approximate queries efficiently. In SIGMOD Conference, 2008.

[31] J. Ziv and A. Lempel. Compression of individual sequences via variable-rate coding. IEEE Transactions on Information Theory, 24(5):530–536, 1978.

[32] J. Zobel and A. Moffat. Inverted files for text search engines. ACM Comput. Surv., 38(2):6, 2006.