
Front. Comput. Sci.
DOI

RESEARCH ARTICLE

String Similarity Join With Different Similarity Thresholds Based On Novel Indexing Techniques

Chuitian Rong 1, Yasin N. Silva 2, Chunqing Li 1

1 School of Computer Science & Software Engineering, Tianjin Polytechnic University, Tianjin 300387, China
2 Arizona State University, Tempe, AZ 85281, USA

© Higher Education Press and Springer-Verlag Berlin Heidelberg 2012

Abstract  String similarity join is an essential operation of many applications that need to find all similar string pairs from two given collections. A quantitative way to determine whether two strings are similar is to compute their similarity based on a certain similarity function. The string pairs with similarity above a certain threshold are regarded as results. The current approaches to solve the similarity join problem use a unique threshold value. There are, however, several scenarios that require the support of multiple thresholds, for instance, when the dataset includes strings of various lengths. In this scenario, longer string pairs typically tolerate many more typos than shorter ones. Therefore, we propose a solution for string similarity joins that supports different similarity thresholds in a single operator. In order to support different thresholds, we devise two novel indexing techniques: partition-based indexing and similarity-aware indexing. To utilize the new indices and improve the join performance, we propose new filtering methods and index probing techniques. To the best of our knowledge, this is the first work that addresses this problem. Experimental results on real-world datasets show that our solution performs efficiently while providing a more flexible threshold specification.

Keywords  Similarity Join, Similarity Aware Index, Similarity Thresholds

    Received month dd, yyyy; accepted month dd, yyyy

E-mail: chuitian@tjpu.edu.cn

    1 Introduction

String is a fundamental data type and widely used in a variety of applications, such as recording product and customer names in marketing, storing publications in academic research, and representing the content of web sites. Frequently, different strings from different sources may refer to the same real-world entity due to various reasons.

In order to combine heterogeneous data from different sources and provide a unified view of entities, the string similarity join is proposed to find all pairs of strings between two string collections based on a string similarity function and a user-specified threshold. The existing similarity functions fall into two categories: set-based similarity functions (e.g., Jaccard [1]) and character-based similarity functions (e.g., Edit Distance). The threshold is a value in [0,1] or a suitable integer according to the similarity function. In general, the implementation of the similarity join operation is different based on the type of similarity function that is used. For the character-based similarity functions, the implementations consider each string as a sequence of characters and employ a tree-based index structure, such as the B+-tree [2, 3] and the Trie-tree [4]. Unfortunately, the Trie-tree based indexing technique constrains itself to in-memory join operations and is inefficient for long strings, while the B+-tree based indexing technique requires the whole index to be constructed in advance, and its performance suffers in the case of short strings. Thus, the tree-based indices are not suitable for processing similarity joins over very large string collections. For

Chuitian RONG et al.

Table 1  Data Sets

ID  String
R
r1  Probabilistic set similarity joins
r2  Efficient exact set-similarity joins
r3  Efficient parallel set similarity joins using MapReduce
S
s1  Top-k set similarity joins
s2  Efficient exact set similarity joins
s3  Efficient parallel set-similarity joins using MapReduce

the set-based similarity functions, the implementations usually divide each string into a set of tokens and extract its signatures, and then index the signatures using inverted indices [5, 6]. A pair of strings that share a certain number of signatures is regarded as a candidate pair. Our solution in this paper falls into this category.

To identify all similar string pairs in two given collections, the methods that employ inverted indices follow a filter-and-refine process. In the filter step, they generate a set of candidate pairs that share a common token. In the verification step, they verify the candidate pairs to generate the final result. However, if the strings contain popular tokens, the number of qualified candidate pairs will be very large. To address this problem, the prefix filtering [7] method has been proposed. According to this technique, all the strings in the collections are sorted based on a global ordering and the first T tokens are selected as their prefix. The value of T is determined by |s| (the number of words in s), the similarity function sim, and the user-specified similarity threshold θ. Using the Jaccard similarity function as an example, T = |s| − ⌈θ·|s|⌉ + 1. This means that for any other string r, the necessary condition of sim(s, r) ≥ θ is that the prefixes of s and r must have at least one token in common [6-8]. Let us consider Example 1 and apply the prefix filtering technique to generate its candidate pairs, based on the Jaccard similarity function.

Example 1  Table 1 lists two string collections of publication titles from two different data sources. We sorted the words of each string in R and S based on reverse alphabetical order and selected the first (|s| − ⌈θ·|s|⌉ + 1) words as their prefix. Then, we constructed two inverted indices for the prefix tokens of each string in S using θ = 0.8 and θ = 0.6, respectively. The prefix tokens of strings in S are given in Table 2. The inverted indices for prefix tokens of strings in S are shown in Figure 1.

During the similarity join process, we use the prefix tokens of every string in R to probe the corresponding inverted index and generate candidate pairs. When

Table 2  Prefix Tokens Using Different Thresholds

θ     ID  String
0.8   s1  Top-k similarity set joins
      s2  similarity set joins exact Efficient
      s3  using set-similarity MapReduce parallel joins Efficient
0.6   s1  Top-k similarity set joins
      s2  similarity set joins exact Efficient
      s3  using set-similarity MapReduce parallel joins Efficient

Table 3  The Variation of Similarity Value

Similarity Value      Length of ri  Length of si  Different Tokens
sim(r1, s1) = 0.6     4             4             2
sim(r2, s2) = 0.5     4             5             3
sim(r3, s3) = 0.625   7             6             3

θ = 0.8, we have two candidate pairs {<r2, s3>, <r3, s3>}. When θ = 0.6, we have seven candidate pairs {<r1, s1>, <r1, s3>, <r2, s1>, …, <r3, s3>, <r3, s2>} and two final results <r1, s1> and <r3, s3>. When θ = 0.5, we have an additional result pair <r2, s2>. Observe that all the pairs <ri, si> (i = 1, 2, 3) refer to the same publications, respectively. In fact, the values of sim(ri, si) (i = 1, 2, 3) are different as there are some different tokens between ri and si. In general, the same number of spelling differences will generate different similarity values depending on the length of the strings. This can be observed in Table 3. For instance, the string pairs <r2, s2> and <r3, s3> each have three different tokens, but the similarity value of the longer string pair (<r3, s3>) is greater than that of the shorter one (<r2, s2>). All the previously proposed methods use a predefined or unique similarity threshold. By doing this, they can lose some promising results. We propose a solution that supports the use of different thresholds for different strings. In this paper, we focus primarily on solutions to support different similarity thresholds, and leave the problem of assigning suitable thresholds to strings as future work.

The use of a single threshold, considered in previous methods, simplifies the design of efficient algorithms. The support of different similarity thresholds presents three challenges that need to be addressed: (1) as there is a variety of different thresholds, the widely used prefix filtering method cannot be applied for inverted index construction; (2) since the thresholds are not known before the join operation, we should devise new indexing mechanisms; (3) we need to explore new index probing techniques to improve the performance.

In summary, this paper makes the following contributions:


Fig. 1  Inverted Index For Prefix Tokens of Data Set S: (a) θ = 0.8, (b) θ = 0.6 (inverted lists for the prefix tokens Top-k, similarity, using, set-similarity, set, and MapReduce)

• We propose a solution for string similarity join with different similarity thresholds. This is the first work that explores similarity joins with diverse thresholds in one operation.

• We devise two novel indexing structures, the partition-based index and the similarity-aware index, to support similarity joins with different thresholds.

• We provide new index probing techniques and filtering mechanisms to improve the join performance.

The rest of the paper is organized as follows. The related work is given in Section 2. Section 3 presents the problem definitions and preliminaries. Section 4 describes the index structures. Section 5 presents how to perform similarity joins by employing the proposed indexing structures and new probing techniques. Experimental evaluation is given in Section 6. Section 7 concludes the paper.

    2 Related Work

String similarity join is a primitive operation in many applications such as merge-purge [9], record linkage [10], object matching [11], reference reconciliation [12], deduplication [13, 14] and approximate string join [15]. In order to avoid verifying every pair of strings in the dataset and improve performance, string similarity join typically follows a filter-and-refine process [16, 17]. In the filtering step, a signature assignment or blocking process is invoked to group the candidates using either an approximate or an exact approach, depending on whether some amount of error can be tolerated or not. In the past two decades, more than ten different algorithms have been proposed to solve this problem [18]. In [18], the authors evaluated existing algorithms under the same experimental framework and reported comprehensive findings. Since we aim to provide exact answers, we will focus on the exact approaches. Recent works that provide exact answers

are typically built on top of some traditional indexing methods, such as tree-based and inverted-index-based structures. In [19], a Trie-tree based approach was proposed for edit similarity search, where an in-memory Trie-tree is built to support edit similarity search by incrementally probing it. The edit similarity join method based on the Trie-tree was proposed in [4], in which sub-trie pruning techniques are applied. In [2], a B+-tree based method was proposed to support edit similarity queries. It transforms the strings into digits and indexes them in the B+-tree. In [3], the authors partitioned the string collection into a number of groups according to a set of reference strings and indexed all the strings into a B+-tree based on the distances to their reference strings. However, these algorithms are constrained to in-memory processing and are neither efficient nor scalable for processing large-scale datasets. In [20], the authors proposed the pivotal prefix to shorten the prefix length by a dynamic-programming algorithm. They also applied an alignment filter to prune candidate pairs.

The methods making use of the inverted index are based on the fact that similar strings share common parts, and consequently they transform the similarity constraints into set overlap constraints. Based on the property of set overlap [6], prefix filtering was proposed to prune false positives [6-8, 15]. In these methods, the partial result of the filtering step is a superset of the final result. The AllPairs method proposed in [7] builds the inverted index for prefix tokens, and each string pair in the same inverted list is considered as a candidate. This method can reduce the false positives significantly compared to the method that indexes all tokens of each string [5]. In order to prune false positives more aggressively, the PPJoin method uses the position information of the prefix tokens of the string. Based on PPJoin, PPJoin+ uses the position information of suffix tokens to prune false positives further [8]. In [21], the authors observed that prefix lengths have a significant effect on pruning false positives and on the join performance. They proposed the AdaptJoin method by utilizing different


prefix lengths. [22] proposed the MGJoin method that is based on multiple prefix filters, each of which is based on a different global ordering. [23] studied the problem with synonyms by utilizing a novel index that combines different filtering strategies. In [24], the authors proposed a prefix tree to support string similarity join and search on multi-attribute data. They proposed a cost model to guide the prefix tree construction process.

All the aforementioned works applied a predefined and unique threshold when performing joins. Furthermore, the index that is used to accelerate the join process is built using a specific distance threshold. If the strings have significantly different lengths, the same number of different tokens between pairs of strings will generate different effects on the similarity function value (see Table 3). So, applying a unique threshold may lose some results that are still considered very similar in practical scenarios. In this paper, we first propose a partition-based inverted index to support string similarity joins with different similarity thresholds. In order to improve the performance further, we devise a similarity-aware index and new probing techniques.

    3 Preliminary

In this section, we provide the problem definition followed by a description of similarity measures and prefix filtering.

    3.1 Problem Definition

In this paper, we consider a string as a set of tokens, each of which can be a word or an n-gram. For example, the token set of r2 in R (Table 1) is {Efficient, exact, set-similarity, joins}. The string similarity join is defined as follows.

Definition 1  String Similarity Join
Given two string collections R and S, a similarity function Sim, where each r ∈ R has its own join threshold r.θ, the string similarity join finds all the string pairs (r, s) such that r ∈ R, s ∈ S, and Sim(r, s) ≥ r.θ.

    3.2 Similarity Measures

A similarity function measures how similar two strings are. There are two main types of similarity functions for strings: set-based similarity functions and character-based similarity functions. In this paper, we

Table 4  Symbols and Definitions

Symbols        Definitions
R, S           collections of strings
| · |          the element number of a set
t              a token of s
Tr             set of tokens for string r
Vr             Tr transformed to the Vector Space Model
T_r^p          the first p tokens of string r
θ              pre-assigned threshold
sim(ri, sj)    similarity between ri and sj
O              global ordering
T_r^p(O)       T_r^p under O

Table 5  Similarity Functions

Similarity Function     Definition                          Prefix Length
sim_dice(ri, sj)        2|Tri ∩ Tsj| / (|Tri| + |Tsj|)      |Ts| − ⌈(θ/(2−θ))·|Ts|⌉ + 1
sim_jaccard(ri, sj)     |Tri ∩ Tsj| / |Tri ∪ Tsj|           |Ts| − ⌈θ·|Ts|⌉ + 1
sim_cosine(ri, sj)      (Vri · Vsj) / (|Vri|·|Vsj|)         |Ts| − ⌈θ²·|Ts|⌉ + 1

utilize three widely used set-based similarity functions, namely Dice [25], Jaccard [1], and Cosine [26], whose computation can be reduced to the set overlap problem [7]. They are based on the fact that similar strings share common components. Their definitions are summarized in Table 5, in which Tr denotes the token set of r, Vr denotes the vector transformed from Tr, and | · | denotes the size of a set. Unless otherwise specified, we use Jaccard as the default function, i.e., sim(r, s) = sim_jaccard(r, s). For example, sim_jaccard(r2, s2) = 3/6.
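The three functions in Table 5 operate on token sets and can be sketched in a few lines; the snippet below is a minimal illustration that assumes whitespace tokenization (the paper also allows n-grams) and binary token vectors for Cosine:

```python
import math

def tokens(s):
    # Token set of a string, assuming whitespace tokenization.
    return set(s.split())

def sim_jaccard(r, s):
    tr, ts = tokens(r), tokens(s)
    return len(tr & ts) / len(tr | ts)

def sim_dice(r, s):
    tr, ts = tokens(r), tokens(s)
    return 2 * len(tr & ts) / (len(tr) + len(ts))

def sim_cosine(r, s):
    # Cosine over binary token vectors reduces to |Tr ∩ Ts| / sqrt(|Tr|·|Ts|).
    tr, ts = tokens(r), tokens(s)
    return len(tr & ts) / math.sqrt(len(tr) * len(ts))

r2 = "Efficient exact set-similarity joins"
s2 = "Efficient exact set similarity joins"
print(sim_jaccard(r2, s2))  # 3 shared tokens out of 6 in the union -> 0.5
```

The result matches Table 3: sim(r2, s2) = 3/6 = 0.5, since `set-similarity` counts as a single token of r2.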

    3.3 Prefix Filtering

The prefix filtering technique is commonly used in the filtering step to generate candidate pairs that share common prefix tokens. In [6, 7], the methods sort the tokens of each string based on some global ordering O (e.g., an alphabetical order or a term frequency order), select a certain number of its first tokens as the prefix, and use the tokens in the prefix 1) as its signatures. It has been proven that a necessary condition for two strings r and s to be a candidate pair is that their prefixes T_r^p and T_s^p must have at least one token in common. The number of tokens in the prefix of each string can be computed using the formulas shown in Table 5. Essentially, the prefix length relies on the similarity function, the join threshold, and the length of the string.

By applying the prefix filtering technique, candidate pairs with no overlap in their corresponding prefixes can be safely pruned. In this manner, the number of candidate pairs can be significantly reduced, since popular words or frequent tokens can be placed at the end of the global ordering so that they are unlikely to be selected as prefix tokens.

1) When there is no ambiguity, we will simply refer to token prefixes as prefixes.
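Prefix selection itself is short to sketch. The snippet below assumes the Jaccard prefix-length formula from Table 5 and a case-insensitive reverse-alphabetical global ordering as in Example 1 (both assumptions):

```python
import math

def prefix_tokens(s, theta):
    # Sort the token set by the global ordering O (here: reverse
    # alphabetical, case-insensitive) and keep the first
    # |s| - ceil(theta * |s|) + 1 tokens as the prefix.
    toks = sorted(set(s.split()), key=str.lower, reverse=True)
    p = len(toks) - math.ceil(theta * len(toks)) + 1
    return toks[:p]

r3 = "Efficient parallel set similarity joins using MapReduce"
print(prefix_tokens(r3, 0.6))  # ['using', 'similarity', 'set']
```

For r3 with θ = 0.6 this yields the prefix {using, similarity, set}, matching the probing example in Section 3.4.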

    3.4 Inverted Index

The naive method to perform joins is to enumerate all pairs <r, s> ∈ R × S and verify whether they share common prefix tokens. This approach is rather expensive when processing large-scale datasets. In order to improve the performance, the inverted index is widely employed to find all pairs <r, s> ∈ R × S that share common prefix tokens. We first construct an inverted index for the prefix tokens of each string in one of the collections, e.g., S. By doing so, the strings that share the same prefix tokens are mapped into the same inverted list. Then, we can process each string r ∈ R to get its prefix tokens, and merge the corresponding inverted lists for the prefix tokens to get all candidate pairs. Take r3 as an example and assume r3.θ = 0.6; its prefix tokens are {using, similarity, set}. We merge the inverted lists for these three prefix tokens, as shown in Figure 1(b), and get three candidate pairs: {<r3, s1>, <r3, s2>, <r3, s3>}.
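This single-threshold candidate generation can be sketched as follows; whitespace tokenization and a case-insensitive reverse-alphabetical ordering are assumptions carried over from Example 1:

```python
import math
from collections import defaultdict

def prefix_tokens(s, theta):
    # First |s| - ceil(theta*|s|) + 1 tokens under the global ordering.
    toks = sorted(set(s.split()), key=str.lower, reverse=True)
    return toks[:len(toks) - math.ceil(theta * len(toks)) + 1]

def candidates(R, S, theta):
    # Index the prefix tokens of S, then probe with each r's prefix tokens.
    index = defaultdict(set)
    for sid, s in S.items():
        for t in prefix_tokens(s, theta):
            index[t].add(sid)
    cands = set()
    for rid, r in R.items():
        for t in prefix_tokens(r, theta):
            cands |= {(rid, sid) for sid in index[t]}
    return cands

S = {'s1': 'Top-k set similarity joins',
     's2': 'Efficient exact set similarity joins',
     's3': 'Efficient parallel set-similarity joins using MapReduce'}
R = {'r3': 'Efficient parallel set similarity joins using MapReduce'}
print(candidates(R, S, 0.6))  # the three candidate pairs of Figure 1(b)
```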

4 Index Technique to Support Different Similarity Thresholds

All the previous works on similarity joins use a predefined and unique threshold for all the objects in the input collections. When a unique threshold is used, it is easy to implement and optimize the similarity join algorithms. However, the application of a unique threshold may lose some promising results or be too rigid to express the nature of the similarities that users want to identify in the datasets. The prefix filtering technique is widely used in existing works due to its effective pruning power. Prefix filtering is based on a necessary condition for similar pairs of strings: they must share at least one prefix token when sorted by the same global order. The prefix tokens of two strings are determined by their lengths and the unique threshold. Supporting different thresholds in similarity joins has its own challenges. Particularly, when constructing the inverted index for the prefix tokens of each string s ∈ S, we cannot predict the threshold of r ∈ R that will be used to perform the join with s. The problem is that we cannot exactly identify the number of prefix tokens of s ∈ S. In this section, we propose index techniques that can support different similarity thresholds.

    4.1 Straightforward Approach

As stated previously, since we cannot predict the threshold of r ∈ R, we cannot identify the number of prefix tokens of s ∈ S and cannot construct the inverted index as in the case of single-threshold approaches. The straightforward approach is to map all tokens of s ∈ S to inverted lists. When performing joins, we process each string r ∈ R and get its prefix tokens. Then, we can get the candidate pairs by merging the corresponding inverted lists. Obviously, this approach will map unnecessary tokens into inverted lists and increase the index size. As a result, it incurs unnecessary overhead when probing the inverted lists, and thus the similarity join performance is degraded.
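A sketch of this baseline (helper names are hypothetical; whitespace tokenization assumed):

```python
from collections import defaultdict

def build_full_index(S):
    # Straightforward approach: the probing threshold is unknown in
    # advance, so EVERY token of every s in S is mapped to an inverted list.
    index = defaultdict(set)
    for sid, s in S.items():
        for t in set(s.split()):
            index[t].add(sid)
    return index

idx = build_full_index({'s1': 'Top-k set similarity joins',
                        's2': 'Efficient exact set similarity joins'})
# Even tokens that would never be selected as prefixes now carry inverted
# lists, which is exactly the wasted space and probing overhead described.
```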

    4.2 Partition-Based Inverted Index

The similarity join performance is mainly determined by the prefix filtering power and the efficiency of merging inverted lists. In order to decrease the unnecessary index probing cost, we propose a Partition-Based inverted Index, denoted PBI.

When using different thresholds during the join operation, different strings may have different thresholds. As the threshold is a value in [0,1], we choose some representative values within the interval and build incremental inverted indices with them. For example, if we choose 0.8, 0.6 and 0.4, we build the inverted index in the following way. First, we map the prefix tokens of each s ∈ S using θ = 0.8 into inverted lists as done in existing works (we denote this inverted index as I0.8). Then, we map the prefix tokens using θ = 0.6 into inverted index I0.6, excluding the tokens that have been mapped into I0.8. Similarly, we map the prefix tokens using θ = 0.4 into inverted index I0.4, excluding the tokens that have been mapped into I0.8 and I0.6. Figure 2 shows the partition-based inverted index for the collection S in Table 1.

The pseudo-code of the PBI construction algorithm is shown in Algorithm 1. The algorithm takes a string collection S that is sorted by string length, a global ordering O, and the number of representative thresholds N as input. Its output is a PBI: Iθ[0], Iθ[1], …, Iθ[N]. It first selects N


Fig. 2  Partition-Based Inverted Index (the I0.8, I0.6, and I0.4 partitions of the inverted lists for the collection S in Table 1)

Algorithm 1: PartitionBasedIndex(S, O, N)
Input:
    S: the collection of strings
    O: one global ordering
    N: the number of representative thresholds
Output:
    PBI: Iθ[0], Iθ[1], …, Iθ[N]
1   θ[N] ← select N representative thresholds from [0, 1];
2   foreach s ∈ S do
3       TokenList(s) ← Tokenize(s, O);
4       for i = 0 … N − 1 do
5           T_s^p(θ[i]) ← getPrefix(TokenList(s), θ[i]);
6           if i == 0 then
7               foreach t ∈ T_s^p(θ[i]) do
8                   Map t into Iθ[i][t];
9           else
10              foreach t ∈ T_s^p(θ[i]) − T_s^p(θ[i−1]) do
11                  Map t into Iθ[i][t];
12      foreach t ∈ Ts − T_s^p(θ[N−1]) do
13          Map t into Iθ[N][t];

representative thresholds that are used to construct the partition-based inverted index (Line 1). For each s ∈ S, it first generates a sorted token list using the function Tokenize, which tokenizes the string s into a token set and then sorts it by the global ordering O (Line 3). For the sorted token list, the algorithm gets different prefix tokens using the different thresholds in θ[N]. It maps each prefix token t ∈ T_s^p(θ[0]) into inverted list Iθ[0][t], which belongs to Iθ[0] (Lines 6-8). For the other thresholds in θ[N], it maps each prefix token in T_s^p(θ[i]) − T_s^p(θ[i−1]) into inverted list Iθ[i][t], which belongs to Iθ[i] (Lines 9-11). Finally, it maps the remaining tokens in Ts − T_s^p(θ[N−1]) into inverted index Iθ[N] (Lines 12-13).

The partition-based inverted index can decrease the probing cost significantly, as it partitions the inverted list of the same token into several smaller lists, as shown in Figure 2. For r ∈ R, assume r.θ = 0.8; when performing joins we only probe and merge the inverted lists in I0.8. If r.θ = 0.7, we probe and merge the inverted lists in I0.8 and I0.6.
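Algorithm 1's partitioning can be sketched as follows; whitespace tokenization, a case-insensitive reverse-alphabetical ordering, and the Jaccard prefix-length formula from Table 5 are assumptions:

```python
import math
from collections import defaultdict

def prefix_len(size, theta):
    # Jaccard prefix length from Table 5: |s| - ceil(theta*|s|) + 1.
    return size - math.ceil(theta * size) + 1

def build_pbi(S, thetas=(0.8, 0.6, 0.4)):
    # One inverted-index partition per representative threshold; a token is
    # mapped only into the partition where it first becomes a prefix token,
    # and the remaining tokens go into a final catch-all partition.
    pbi = {th: defaultdict(set) for th in thetas}   # thetas sorted descending
    rest = defaultdict(set)
    for sid, s in S.items():
        toks = sorted(set(s.split()), key=str.lower, reverse=True)
        prev = 0
        for th in thetas:
            p = prefix_len(len(toks), th)
            for t in toks[prev:p]:
                pbi[th][t].add(sid)
            prev = max(prev, p)
        for t in toks[prev:]:
            rest[t].add(sid)
    return pbi, rest
```

With this layout, probing for r.θ = 0.8 touches only the 0.8 partition, while r.θ = 0.7 merges the 0.8 and 0.6 partitions, mirroring the example above.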

Algorithm 2: SimilarityAwareIndex(S, O)
Input:
    S: the collection of strings
    O: one global ordering
Output:
    SAI: similarity aware index
1   foreach s ∈ S do
2       TokenList(s) ← Tokenize(s, O);
3       N ← TokenList.size;
4       for n = 1 … N do
5           TUB(n) ← (N − n + 1)/N;
6           t ← TokenList[n];
7           Map <t, TUB(n)> into SAI[t];

    4.3 Similarity Aware Index

The partition-based inverted index can decrease the unnecessary probing cost to some extent, but there are still ways to improve the performance. In order to do this, we propose a Similarity Aware Index, denoted SAI, that exploits the relationship between token positions and similarity thresholds.

Definition 2  Threshold Upper Bound
When a string r ∈ R (or s ∈ S) is tokenized into a token set Tr and sorted using the global ordering O, the tokens that are selected as prefix tokens are determined by the value of the threshold. The maximum threshold for a token to be selected as a prefix token is referred to as the threshold upper bound, denoted TUB.

Theorem 1  Let token t ∈ Tr be located at position n when Tr is sorted by the global ordering O. If t is a prefix token, then TUB(n) = (|Tr| − n + 1)/|Tr|.

Proof  For the token set Tr, the number of prefix tokens |T_r^p| is determined by the formula

    |T_r^p| = |Tr| − ⌈θ·|Tr|⌉ + 1.

Let |T_r^p| = n, that is to say, the last prefix token is located at position n. Then ⌈θ·|Tr|⌉ = |Tr| − n + 1, and by the definition of the ceiling function

    |Tr| − n < θ·|Tr| ≤ |Tr| − n + 1.

We can get θ ≤ (|Tr| − n + 1)/|Tr|. So,

    TUB(n) = (|Tr| − n + 1)/|Tr|.
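Theorem 1 can be checked numerically: the token at (1-indexed) position n is selected as a prefix token exactly when the threshold does not exceed TUB(n). The sketch below uses the Jaccard prefix-length formula from Table 5:

```python
import math

def tub(n, size):
    # Threshold upper bound of the token at 1-indexed position n
    # in a sorted token list of the given size: (|Tr| - n + 1) / |Tr|.
    return (size - n + 1) / size

def prefix_len(size, theta):
    return size - math.ceil(theta * size) + 1

# Position n lies inside the prefix iff theta <= TUB(n):
size = 6
for n in range(1, size + 1):
    for theta in (0.95, 0.8, 0.6, 0.4):
        assert (n <= prefix_len(size, theta)) == (theta <= tub(n, size))
print(tub(1, 6))  # the first token is a prefix token for any threshold: 1.0
```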


Fig. 3  Similarity Aware Inverted Index (inverted lists for the tokens similarity, set, joins, and Efficient, each entry paired with its TUB)

Based on the above analysis, we can conclude that the selection of a token t ∈ Tr as a prefix token depends on its position n and TUB(n) when Tr is sorted by a global ordering. So, we can map both the token t and TUB(n) into the inverted index when constructing the index for collection S. When probing the inverted index, it can support any threshold of r ∈ R. Unlike the partition-based inverted index, which sorts the inverted lists by the length of the strings, we sort the inverted lists by the value of TUB(n). Figure 3 shows part of the similarity-aware inverted index for the collection S in Table 1.

The pseudo-code of the similarity-aware index construction algorithm is shown in Algorithm 2. For each string s ∈ S, it generates a sorted token list by applying the Tokenize function. Then, for each token in TokenList, it computes TUB(n) based on the token's position n and the size of TokenList. Finally, it maps the pair <t, TUB(n)> into SAI[t]. Observe that each inverted list is sorted by the value of TUB.
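Algorithm 2 in Python form might look like this, under the same tokenization and ordering assumptions as before:

```python
from collections import defaultdict

def build_sai(S):
    # Similarity-aware index: every token of s is indexed together with its
    # TUB, and each inverted list is kept sorted by TUB (ascending).
    sai = defaultdict(list)
    for sid, s in S.items():
        toks = sorted(set(s.split()), key=str.lower, reverse=True)
        size = len(toks)
        for n, t in enumerate(toks, start=1):
            sai[t].append((sid, (size - n + 1) / size))  # TUB from Theorem 1
    for t in sai:
        sai[t].sort(key=lambda entry: entry[1])
    return sai

sai = build_sai({'s1': 'Top-k set similarity joins'})
print(sai['similarity'])  # [('s1', 0.75)]
```

Because each list is ordered by TUB, probing with a threshold r.θ can skip every entry whose TUB is below r.θ, e.g., via a binary search for the qualifying tail of the list.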

    5 String Similarity Join Processing

In this section, we describe how the similarity join operation can be performed using the proposed indexing mechanisms. We present the details about how to handle different thresholds in one join procedure.

In general, the similarity join methods employing inverted indexes follow a filter-and-refine framework and consist of three phases.

• Index Construction  In this phase, the strings in the collection S are sorted based on a global ordering, such as alphabetical order or token term frequency. Then, a number of tokens are extracted from each string according to predefined rules and mapped into inverted lists.

• Candidate Pairs Generation  The strings in the collection R are processed sequentially. For each string in R, the string is sorted and a number of tokens are extracted from it as done in the index construction step. The extracted tokens are considered as the string's signatures and are used to probe the corresponding inverted lists using filtering strategies. As a result, we can derive the candidate set for each string.

• Verification  We compute the similarity between the string and its candidates one by one to find the final results. The final result is composed of pairs with similarity not less than the threshold θ.
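The verification phase reduces to a per-pair similarity computation against each string's own threshold; a minimal sketch, where the `sim` function and the per-string threshold map (r.θ) are supplied by the caller:

```python
def verify(R, S, candidate_pairs, sim, thresholds):
    # Refine step: compute the real similarity of every candidate pair and
    # keep those meeting the probing string's own threshold r.theta.
    results = []
    for rid, sid in candidate_pairs:
        if sim(R[rid], S[sid]) >= thresholds[rid]:
            results.append((rid, sid))
    return results
```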

In order to support different similarity thresholds in string similarity joins, the straightforward approach is to extend the PPJoin and PPJoin+ [8] methods using the index described in Section 4.1. As this approach maps all tokens into inverted lists, the join procedure will probe the index with unnecessary overhead.

    5.1 Similarity Join on PBI

Using the partition-based index to support different similarity thresholds, the first problem is to select a number of representative values from [0,1]. The selected values are used to partition the threshold range into several intervals, and then partition the inverted lists into several small inverted lists. As an example, we can select the representative values 0.8, 0.6, and 0.4. The index for string collection S is constructed as indicated by Algorithm 1.

After the index construction step, each string r ∈ R is sorted based on the global ordering O, and then the prefix tokens T_r^p are extracted according to its length and its own similarity threshold r.θ, as described in Section 3. For each prefix token, the corresponding inverted lists with representative thresholds not smaller than r.θ, and the first inverted list with representative threshold equal to or smaller than r.θ, will be probed to generate candidates.

    The details of the similarity join using PBI are shown in Algorithm 3.

    When performing similarity join operations, each string in collection R will be tokenized using the same global ordering O that was used on string collection S (Line 2). Take the string r3 ∈ R and the selected representative values {0.8, 0.6, 0.4} for example. When r3.τ = 0.8, its prefix token set is T_pr3 = {using, similarity} (Line 3). So, the two inverted lists I[0.8][using] and I[0.8][similarity] that belong to I[0.8] will be probed and merged. The acquired candidates will be added to C for verification (Lines 4-7). Finally, each candidate pair in C will be verified to check that their similarity is not less than r3.τ using the selected

  • 8Chuitian RONG et al.

    Algorithm 3: SimJoinOnPBI(R, S)
    Input: R: the string collection R; S: the string collection S
    Output: <r, s>: similar pairs

    1  foreach r ∈ R do
    2      TokenList(r) ← Tokenize(r, O);
    3      T_pr ← getPrefixToken(TokenList(r), r.τ);
    4      foreach t ∈ T_pr do
    5          while k < N ∧ r.τ ≤ τ[k++] do
    6              results ← listProbing(PBI[k][t]);
    7              C.push(results);
    8  Verification(C);

    similarity function. When r3.τ = 0.7, T_pr3 = {using, similarity, set}. The inverted lists I[0.8][using] and I[0.8][similarity] that belong to I[0.8], and I[0.6][similarity] and I[0.6][set] that belong to I[0.6], will be probed and merged respectively, as shown in Figure 2.
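Algorithm 1 (the PBI construction) is only referenced above, so the sketch below fills in one plausible construction rule, which is our assumption: each string is indexed in the partition of the first representative value not larger than its own threshold. The join loop mirrors Algorithm 3 for Jaccard similarity; all names are ours, and strings are modeled as (sorted token list, threshold) pairs.

```python
import math
from collections import defaultdict

def prefix_tokens(toks, tau):
    """Jaccard prefix: the first |toks| - ceil(tau * |toks|) + 1 tokens."""
    return toks[:len(toks) - math.ceil(tau * len(toks)) + 1]

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def build_pbi(S, rep):
    """Hypothetical PBI construction: each string (token list, threshold)
    is indexed once, in the partition of the first representative value
    that is not larger than its own threshold."""
    pbi = [defaultdict(list) for _ in rep]
    for sid, (toks, s_tau) in enumerate(S):
        k = next((i for i, v in enumerate(rep) if v <= s_tau), len(rep) - 1)
        for t in prefix_tokens(toks, s_tau):
            pbi[k][t].append(sid)
    return pbi

def sim_join_on_pbi(R, S, rep):
    """Mirror of Algorithm 3: probe every partition whose representative
    value exceeds r.tau, plus the first one at or below it, then verify."""
    pbi = build_pbi(S, rep)
    out = []
    for r_toks, r_tau in R:
        cand = set()
        for t in prefix_tokens(r_toks, r_tau):
            for k, v in enumerate(rep):
                cand.update(pbi[k][t])
                if v <= r_tau:
                    break
        out.extend((r_toks, S[sid][0]) for sid in cand
                   if jaccard(r_toks, S[sid][0]) >= r_tau)
    return out
```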

    5.2 Similarity Join on SAI

    The inverted lists of PBI that were constructed for S are sorted in ascending order of the records' length. In order to avoid probing the inverted lists iteratively, the string collection R is also sorted by the same order before performing joins. SAI is different from PBI: each of its inverted lists is sorted in descending order of the TUB values. In order to improve the efficiency of similarity joins using SAI, the string collection R is sorted in descending order of the string thresholds. By doing that, we can utilize the TUB filter when probing the inverted lists.

    Definition 3 (TUB Filter) For a string r ∈ R with threshold r.τ and prefix token set T_pr, if ∃ s ∈ S such that sim(r, s) > r.τ, then ∃ t_s ∈ s and ∃ t ∈ T_pr such that t_s = t and TUB(t_s) > r.τ.

    The TUB filter is a necessary condition for two strings to be similar. It can be used to decrease the cost of probing the inverted lists. By applying this filter during the join operations, we can probe just the first part of the corresponding inverted lists to get the results. The pseudo-code of the similarity join on SAI is shown in Algorithm 4.

    Each string r ∈ R is tokenized using the same global ordering O that was applied to S. Its prefix tokens T_pr can be acquired based on the number of tokens and its own threshold r.τ (Lines 2-3). Then, the inverted lists of the tokens in T_pr will be probed and merged. During this processing, two filters are applied. The first is to probe

    Algorithm 4: SimJoinOnSAI(R, S)
    Input: R: the string collection R; S: the string collection S
    Output: <r, s>: similar pairs

    1  foreach r ∈ R do
    2      TokenList(r) ← Tokenize(r, O);
    3      T_pr ← getPrefixToken(TokenList(r), r.τ);
    4      foreach t ∈ T_pr do
    5          while i++ < I[t].size do
    6              if r.τ > I[t].tub then
    7                  break;
    8              if r.length > S[I[t].rid].length / r.τ ∨ r.length < S[I[t].rid].length × r.τ then
    9                  continue;
    10             C.push(S[I[t].rid]);
    11 Verification(C);

    Table 6 Data Sets Information

    Data Set   Records     Size     Distribution   Mean     Std
    CiteSeer   568,237     69.7M    Normal         0.8078   0.0420
                                    Poisson        0.6865   0.0481
                                    Uniform        0.8000   0.1156
    DBLP       1,021,062   218M     Normal         0.7774   0.0531
                                    Poisson        0.7227   0.0676
                                    Uniform        0.7923   0.1122

    the inverted list until r.τ becomes greater than the token's TUB. This filter ensures that only the first part of each inverted list is used during probing (Line 6). The second is the length filter: if two strings are similar, their lengths must satisfy a length constraint. If r is similar to a string s ∈ S, the length of r must be in (s.length·τ, s.length/τ) (Line 8). Only the strings that pass the two filters are considered as candidates (to be later verified).
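Putting the two filters together, the following is a minimal sketch of Algorithm 4 under our own data layout assumptions: each inverted list holds (TUB, record id) entries with the largest TUB values first, strings in R carry their own thresholds, and similarity is Jaccard. All names are hypothetical.

```python
import math

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def sim_join_on_sai(R, sai, S):
    """Sketch of Algorithm 4. `sai` maps each token to a list of
    (tub, sid) entries, largest TUB first; S is a list of token lists.
    Probing a list stops at the first entry whose TUB is not greater
    than r_tau (TUB filter); surviving entries must also pass the
    length filter s.length*tau < r.length < s.length/tau."""
    out = []
    for r_toks, r_tau in R:
        n = len(r_toks)
        prefix = r_toks[:n - math.ceil(r_tau * n) + 1]
        cand = set()
        for t in prefix:
            for tub, sid in sai.get(t, []):
                if r_tau > tub:
                    break          # TUB filter: remaining TUBs are smaller
                s_len = len(S[sid])
                if n >= s_len / r_tau or n <= s_len * r_tau:
                    continue       # length filter: r outside the window
                cand.add(sid)
        out.extend((r_toks, S[sid]) for sid in cand
                   if jaccard(r_toks, S[sid]) >= r_tau)
    return out
```

Because the lists are ordered by TUB, the `break` on the TUB filter is what confines probing to the head of each list.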

    6 Experimental Evaluation

    We conducted extensive experiments to evaluate the performance and scalability of our techniques to support similarity joins with different similarity thresholds.

    6.1 Experiments Setup

    We selected two publicly available real datasets of bibliography records from two different data sources for the experiments. They cover a wide range of data distributions and were widely used in previous studies. In order to evaluate the efficiency of our techniques to support similarity joins with different similarity

  • Front. Comput. Sci.9

    [Figure: record count (×10^4) against threshold values from 0.6 to 1.0 for the Normal, Poisson, and Uniform distributions. Panels: (a) DBLP; (b) CiteSeer.]

    Fig. 4 Threshold Distribution Statistic

    thresholds, we generated datasets with threshold values that were produced using three different distributions: Uniform, Poisson, and Normal.
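The per-string threshold assignment can be sketched as follows. This is an illustration only: the distribution parameters (mean 0.8, std 0.05, Poisson λ = 5 mapped into the range) and the clamping to [0.6, 1.0] are our assumptions, not the paper's exact generation procedure.

```python
import math
import random

def poisson_sample(rng, lam):
    """Knuth's algorithm for a Poisson-distributed integer."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def assign_thresholds(n, dist, lo=0.6, hi=1.0, seed=42):
    """Draw one similarity threshold per string from the named
    distribution and clamp it into [lo, hi]."""
    rng = random.Random(seed)
    taus = []
    for _ in range(n):
        if dist == "uniform":
            t = rng.uniform(lo, hi)
        elif dist == "normal":
            t = rng.gauss(0.8, 0.05)
        else:  # "poisson": map a small integer sample into the range
            t = lo + (hi - lo) * min(poisson_sample(rng, 5), 10) / 10
        taus.append(min(hi, max(lo, round(t, 2))))
    return taus
```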

    DBLP is a snapshot of the bibliography records downloaded from the DBLP website2). It contains 1,021,062 records, each of which is the concatenation of the author name(s) and the title of a publication. The minimum, maximum, and average lengths (number of tokens) of the records in this dataset are 2, 207, and 13, respectively.

    CiteSeer is also a snapshot of bibliography records, downloaded from the CiteSeer website3). It contains 568,237 records. Each record is a concatenation of the author names and the title of a publication. The minimum, maximum, and average lengths of the records in this dataset are 1, 84, and 7, respectively.

    Figure 4 and Table 6 show the distribution of the threshold values (τ), the number of records, and the size of the two datasets.

    All the experiments were carried out on a single machine with an AMD 15x4 cores 1GHz CPU and 60GB of main memory. The operating system is CentOS with GCC 4.3 installed. The algorithms were implemented in C++ and compiled using GCC with the -O3 flag.

    PPJoin+ [8] is the state-of-the-art method for set-based string similarity join. However, as it is based on prefix filtering, it requires a predefined and unique threshold before performing joins. Furthermore, the index that is used to accelerate the join operations is built for a specific threshold value and needs to be rebuilt for other threshold values. If multiple thresholds are supported, the index cannot be built as

    2) http://www.informatik.uni-trier.de/ley/db
    3) http://citeseerx.ist.psu.edu

    in [8], and thus the PPJoin+ approach cannot be used to perform joins with different similarity thresholds directly. [21] proposed a delta-based-index method, in which the authors grouped the strings based on their lengths and built delta indices for the groups. The delta indices can support similarity search with multiple thresholds. However, they cannot support similarity join with multiple thresholds in one join procedure.

    In this paper, we have extended PPJoin+ by constructing the index as discussed in Section 1 and modified the inverted list probing techniques. We denote the extended PPJoin+ method as ExtendedJoin. We have also constructed the delta index and implemented the AdaptJoin approach [21] to support similarity joins in one procedure. In order to verify the efficiency of our proposed methods, we devised two kinds of indices: the partition based inverted index and the similarity aware inverted index. We denote the join methods that are based on these two indices as PBIJoin and SAIJoin, respectively.

    6.2 Comparison on Different Indices

    In this section, we evaluate the efficiency of four join methods based on different indices. In these experiments, we used two datasets, CiteSeer and DBLP, each of which has three threshold distributions as shown in Figure 4. We conducted three experiments using the Jaccard similarity measure on the two datasets under different threshold distributions, including one RS-Join (CiteSeer ⋈ DBLP) and two Self-Joins (DBLP ⋈ DBLP and CiteSeer ⋈ CiteSeer). We plotted the running time cost of the four methods on each join type in Figure 5.

    From the three sub-figures in Figure 5, we can seethat SAIJoin outperforms PBIJoin followed by


    [Figure: time cost (sec) of ExtendJoin, AdaptJoin, PBIJoin, and SAIJoin under the Normal, Uniform, and Poisson threshold distributions. Panels: (a) CiteSeer ⋈ DBLP; (b) DBLP ⋈ DBLP; (c) CiteSeer ⋈ CiteSeer, all with Jaccard.]

    Fig. 5 Comparison On Different Indices

    AdaptJoin and ExtendedJoin, regardless of the join type and threshold distribution. This is due to the overhead of the indexing structures that are used to accelerate the join operations. ExtendedJoin is based on a straightforward indexing approach where all the tokens of each string are mapped into inverted lists. In order to utilize the length filter and improve the performance of PPJoin+, it requires sorting the dataset and the inverted lists by the record length. When applying a predefined and unique threshold, PPJoin+ can probe the inverted list sequentially, as similar strings must be located near each other in the inverted lists. However, when applying different thresholds, that list probing technique is not suitable. After sorting by the record length, the strings that are located near each other have similar lengths but possibly different thresholds. So, their length constraints to identify similar records can be very different. In order to get all possible candidates, each string r ∈ R that is joined with S must probe a large range of the inverted list rather than a sequential portion. This problem cannot be avoided in AdaptJoin and the delta index. So, AdaptJoin and ExtendedJoin cannot perform as well as the other two methods.

    From Figure 5, we can see that PBIJoin outperforms ExtendedJoin under different experiment conditions. This is due to the advantages of the partition-based inverted index that is applied in PBIJoin. Before constructing the partition-based inverted index, a number of values from [0,1] are selected as representative thresholds. These selected representative values are used to partition the inverted index used in ExtendedJoin into several small inverted indices, as shown in Figure 1. When performing joins, PBIJoin selects the related parts of the index and probes the related inverted lists according to r.τ and T_pr. By doing so, much of the unnecessary list probing cost can be avoided.

    SAIJoin has the best performance among the three methods. This is due to the similarity-aware inverted index and the TUB filter. In order to utilize the SAI index and TUB filter in an efficient way, the string collections are sorted by their thresholds, and the inverted lists are sorted by the tokens' TUB values. This arrangement of the string collections and sorting of the inverted lists bring the promising results to the head of the inverted lists. For each string r ∈ R, we can get its threshold r.τ and prefix token set T_pr. When performing joins, we just probe the lists I[t] of t ∈ T_pr and stop when r.τ > t.tub. This makes SAIJoin probe only a small part of the related lists and yields the best performance.

    6.3 Comparison on Different Similarity Measures

    In order to evaluate the scalability and robustness of our proposed methods, we conducted extensive experiments using another widely used similarity measure, Cosine. As in the previous experiments, we used two datasets, DBLP and CiteSeer, each with three different threshold distributions. We carried out three types of joins: one RS-Join and two Self-Joins. The experiment results are plotted in Figure 6.

    From the experiment results, we can observe that SAIJoin outperforms PBIJoin and ExtendedJoin. SAIJoin's execution time is about 50% of that of PBIJoin for most experiments. This is because ExtendedJoin and PBIJoin apply the framework and filters of PPJoin+, which is based on a predefined and unique threshold. The filters are not efficient when the data has different similarity thresholds, especially when using Cosine. Under the different similarity thresholds, the SAI index and TUB filter achieve high pruning power. So, SAIJoin performs well.


    [Figure: time cost (sec) of ExtendJoin, PBIJoin, and SAIJoin under the Normal, Uniform, and Poisson threshold distributions. Panels: (a) CiteSeer ⋈ DBLP; (b) DBLP ⋈ DBLP; (c) CiteSeer ⋈ CiteSeer, all with Cosine.]

    Fig. 6 Comparison Of Cosine Similarity Measure

    Observe also that the performance of SAIJoin is worse with Cosine than with Jaccard, as shown in Figure 5 and Figure 6. This can be explained by the definitions of the similarity measures shown in Table 5. For the same threshold, the number of prefix tokens of a string r is larger when using Cosine than when using Jaccard. Since the join method using Cosine must probe more inverted lists for each string, it consumes more time than when using Jaccard.
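Table 5 is not reproduced in this section, but the standard prefix lengths used with prefix filtering make the effect concrete: for a string of n tokens, a Jaccard threshold τ requires a prefix of n − ⌈τ·n⌉ + 1 tokens, while Cosine requires n − ⌈τ²·n⌉ + 1. Since τ² ≤ τ on [0,1], Cosine prefixes are never shorter. A quick check (helper names are ours):

```python
import math

def prefix_len_jaccard(n, tau):
    # standard prefix-filtering length for Jaccard threshold tau
    return n - math.ceil(tau * n) + 1

def prefix_len_cosine(n, tau):
    # standard prefix-filtering length for Cosine threshold tau
    return n - math.ceil(tau * tau * n) + 1
```

For the average DBLP record length of 13 tokens and τ = 0.8, Jaccard needs a prefix of 3 tokens while Cosine needs 5, so the Cosine join probes more inverted lists per string.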

    6.4 Comparison on Different Distributions of Threshold

    In this experiment, we verify the efficiency and robustness of our methods on different datasets with different thresholds. We conducted experiments on the two datasets and performed three different join operations using Jaccard. The results are shown in Figure 7. In this figure, R represents CiteSeer and S represents DBLP for simplicity, and × denotes the join operation.

    From Figure 7, we can observe that SAIJoin outperforms PBIJoin by a wide margin, followed by ExtendedJoin, regardless of the kind of distribution. Before performing joins, PBIJoin and ExtendedJoin sort the string collections by the string length. By doing that, the collections that are being joined can be scanned sequentially by applying length constraints. However, since the strings have different thresholds, the strings that are located as neighbors by the sorting step can have different length constraints. So, the join operation cannot be performed as in PPJoin using the length filter on the sorted datasets. PBIJoin and ExtendedJoin must probe a range of the inverted lists to get the exact results. In contrast, SAIJoin performs joins using the SAI index and applies the TUB filter to ensure that only a small section of the inverted lists is probed. So, it achieves high performance under different threshold distributions.

    All three join methods perform best on the dataset whose threshold distribution is Normal, followed by Uniform and Poisson. This phenomenon can be observed in Figure 5, Figure 6, and Figure 7, and can be explained using the features of the threshold distributions. For the CiteSeer dataset, when the threshold distribution is Normal, the average of the threshold values is 0.8078 and the std is 0.0420. That is, the threshold distribution is more compact than in the other two cases, and the average threshold value is higher than the ones of the other two distributions. Higher threshold values generate a smaller number of prefix tokens and less time probing the inverted lists. Consequently, the join operation takes less time with the Normal distribution than with the other two distributions. When the distribution is Uniform, the threshold values are uniformly distributed; the average threshold is 0.8000, while the std is 0.1156. That is, the threshold values vary significantly. For the Poisson distribution, the average threshold value is 0.6865 and the std is 0.0481. The threshold values are distributed more compactly, and most of them are small. Since small threshold values generate a large number of prefix tokens, the join operations must probe more inverted lists. Thus, all three join methods spend more time on the dataset with the Poisson distribution.

    7 Conclusions

    In this paper, we have studied the problem of similarityjoins with different similarity thresholds. We devisedtwo new indexing techniques that can support differentthresholds to process join operations. Although theproposed PBI index reduces the unnecessary probingcost in comparison with the straightforward approach,


    [Figure: time cost (sec) of ExtendJoin, PBIJoin, and SAIJoin for the joins R×S, R×R, and S×S. Panels: (a) Uniform; (b) Normal; (c) Poisson.]

    Fig. 7 Comparison On Different Threshold Distribution

    there is still much space for improvement. In order to improve the performance, we proposed the SAI index and a new index probing technique. The experimental results on real-world datasets show that our proposed solution can support different similarity thresholds efficiently.

    For our future work, we plan to address the problemof assigning suitable similarity thresholds to differentstrings.

    Acknowledgements This material is based upon work supported by the National Natural Science Foundation of China under grants No. 61402329 and No. 51378350.

    References

    1. Monge A, Elkan C. The field matching problem: algorithms and applications. In: SIGKDD. 1996, 267–270

    2. Zhang Z, Hadjieleftheriou M, Ooi B, Srivastava D. Bed-tree: an all-purpose index structure for string similarity search based on edit distance. In: SIGMOD. 2010, 915–926

    3. Lu W, Du X, Hadjieleftheriou M, Ooi B C. Efficiently supporting edit distance based string similarity search using B+-trees. TKDE, 2014, 26(12): 2983–2996

    4. Wang J, Feng J, Li G. Trie-join: efficient trie-based string similarity joins with edit-distance constraints. VLDB, 2010, 3(1-2): 1219–1230

    5. Sarawagi S, Kirpal A. Efficient set joins on similarity predicates. In: SIGMOD. 2004, 743–754

    6. Chaudhuri S, Ganti V, Kaushik R. A primitive operator for similarity joins in data cleaning. In: ICDE. 2006, 61–72

    7. Bayardo R, Ma Y, Srikant R. Scaling up all pairs similarity search. In: WWW. 2007, 131–140

    8. Xiao C, Wang W, Lin X, Yu J. Efficient similarity joins for near duplicate detection. In: WWW. 2008, 131–140

    9. Hernández M, Stolfo S. The merge/purge problem for large databases. In: SIGMOD. 1995, 127–138

    10. Winkler W. The state of record linkage and current research problems. In: Statistical Research Division. 1999

    11. Sivic J, Zisserman A. Video Google: a text retrieval approach to object matching in videos. In: Computer Vision. 2003, 1470–1477

    12. Dong X, Halevy A, Madhavan J. Reference reconciliation in complex information spaces. In: SIGMOD. 2005, 85–96

    13. Sarawagi S, Bhamidipaty A. Interactive deduplication using active learning. In: SIGKDD. 2002, 269–278

    14. Arasu A, Ré C, Suciu D. Large-scale deduplication with constraints using Dedupalog. In: ICDE. 2009, 952–963

    15. Gravano L, Ipeirotis P, Jagadish H, Koudas N, Muthukrishnan S, Srivastava D. Approximate string joins in a database (almost) for free. In: VLDB. 2001, 491–500

    16. Elmagarmid A, Ipeirotis P, Verykios V. Duplicate record detection: a survey. TKDE, 2007, 19(1): 1–16

    17. Naumann F, Herschel M. An Introduction to Duplicate Detection. Synthesis Lectures on Data Management, 2010, 2(1): 1–87

    18. Jiang Y, Li G, Feng J, Li W S. String similarity joins: an experimental evaluation. In: PVLDB. 2014, 625–636

    19. Chaudhuri S, Kaushik R. Extending autocompletion to tolerate errors. In: SIGMOD. 2009, 707–718

    20. Deng D, Li G, Feng J. A pivotal prefix based filtering algorithm for string similarity search. In: SIGMOD. 2014, 673–684

    21. Wang J, Li G, Feng J. Can we beat the prefix filtering? An adaptive framework for similarity join and search. In: SIGMOD. 2012, 85–96

    22. Rong C, Lu W, Wang X, Du X, Chen Y, Tung A K. Efficient and scalable processing of string similarity join. TKDE, 2013, 25(10): 2217–2230

    23. Lu J, Lin C, Wang W, Li C, Wang H. String similarity measures and joins with synonyms. In: SIGMOD. 2013, 373–384

    24. Li G, He J, Deng D, Li J. Efficient similarity join and search on multi-attribute data. In: SIGMOD. 2015, 1137–1151

    25. Salton G, McGill M. Introduction to Modern Information Retrieval. McGraw-Hill, Inc., 1986

    26. Witten I H, Moffat A, Bell T C. Managing Gigabytes: Compressing and Indexing Documents and Images, Second Edition. Morgan Kaufmann, 1999


    Chuitian Rong is an associate professor at Tianjin Polytechnic University. He received his Ph.D. degree from Renmin University of China in 2013. His research interests are database systems, information retrieval, and big data analysis.

    Yasin N. Silva is an Assistant Professor of Applied Computing in the School of Mathematical & Natural Sciences at Arizona State University. He received his Ph.D. (2010) and M.S. (2006) in Computer Science from Purdue University and his B.S. in Computer Engineering from the Pontificia Universidad Catolica, Peru (2000). Yasin's research areas deal with data management systems and privacy preservation in general. More specifically, he has been working on the areas of query processing and optimization, privacy assurance in database systems, Big Data management systems, scientific database systems, and the integration of new data processing technologies into the computing curricula.

    Chunqing Li is a professor at Tianjin Polytechnic University. His research interests are database systems and applications, big data analysis, and network management and applications.