

Optimizing Near-Synonym System

Siyuan Zhou and Zichang Feng
Carnegie Mellon University

Abstract

Phrasal near-synonym extraction is crucial to AI tasks such as natural language processing. Near-Synonym System (NeSS) is a corpus-based model for finding near-synonym phrases, but suffers from performance problems.

This report presents an optimized version of NeSS that builds an index on the suffix array to reduce the complexity's dependency on corpus size and uses an efficient approach to parallel execution to improve scalability. We applied several other techniques along with the indexed suffix array to achieve an approximately 20x-40x speedup. We further ran experiments to break down the speedup brought by each optimization approach.

1 Introduction

Synonymy has various degrees ranging from complete contextual substitutability to near-synonymy [4]. The word length of the synonymy can also range from single-word synonyms to multi-word synonyms or phrasal near-synonyms. The latter has to consider the semantics of the combination of multiple words, instead of solely the meaning of each word in the phrase. For example, "it is fair to say" is a phrasal near-synonym of the phrase "we all understand". However, the individual components of the two phrases are not directly related to each other. Phrasal near-synonym extraction is very important in natural language processing, information retrieval, text summarization and other AI tasks [6].

Near-Synonym System (NeSS) [6], the system we aim to optimize, is an unsupervised corpus-based model for finding phrasal synonyms and near synonyms based on a large corpus. It differs from other approaches in that it doesn't require parallel resources or use pre-determined sets of patterns. Instead of storing a mapping of near-synonyms in databases, given a query phrase, NeSS generates the near synonyms at runtime. NeSS selects near-synonymic candidates by identifying common surrounding context based on an extension of the Harris Distributional Hypothesis [7], which states that words that occur in the same contexts tend to have similar meanings.

To be more specific, NeSS tokenizes the corpus and constructs a suffix array in its initialization phase. Upon receiving a query phrase, it searches for all occurrences of the query phrase in the corpus using the suffix array and collects their surrounding contexts. It then finds all the near-synonym candidates by searching for those contexts in the corpus, again via the suffix array. The ranking of the candidates is based on the number of matching contexts between the candidates and the query phrase, and on how they are matched.

Since NeSS finds near-synonyms dynamically from the corpus, performance (in terms of the latency of a single query phrase) becomes a big challenge. NeSS needs to be a real-time on-line service, since the huge number of possible queries makes it impossible to preprocess off-line. We address three performance problems in the original NeSS:

1. Part of the code has low efficiency, and thus leads to long latency to process a user request. In the original system, it takes three to four minutes to pull the near synonyms of a single query phrase on a 16-core machine. Clearly, this latency isn't acceptable for a real-time on-line service.

2. The system doesn't scale well with the number of cores. To be more specific, the original system only achieves 1.1x - 1.5x speedups on selected query phrases when running on 16 cores compared to one core. To make matters worse, the original NeSS takes longer with 36 cores than with a single core on some of the phrases.

3. The complexity of the original system doesn't scale as the corpus size increases. However, a larger corpus leads to more accurate results for near-synonym searches. Thus the system will take much longer to pull a result if a user needs better accuracy.

We present an optimized version of the NeSS system that allows real-time interactive queries for near-synonym phrases. Our contributions can be summarized as follows. Firstly, we carefully optimized some of the implementation details to allow faster computation without losing the accuracy of the system. We built an index on top of the suffix array in the original system to allow O(L) time to search for all the occurrences of a substring, where L is the length of the query string. We modified the algorithm for fetching candidates to improve efficiency. Besides major modifications to the original design, we also made optimizations such as punctuation filtering. Secondly, we changed the way the system is parallelized to improve scalability on multi-core machines. Our optimized system gets a 6x speedup when running with 16 cores compared to one core.

Thirdly, we avoided splitting the suffix array into multiple parts, so that the system has an overall view of the corpus. The original system splits the suffix array into multiple parts and lets each thread hold one part. However, this leads to different scoring and ranking of the candidates. We believe that the results from a single, whole suffix array are the most accurate, and thus we keep only one copy of the suffix array globally.

Finally, we changed the complexity of the algorithm to be less dependent on the length of the corpus. We achieved this by building an index on the suffix array such that searching for a substring takes O(L) time instead of O(L + log(N)), where L is the length of the substring and N is the length of the corpus. Therefore, the system can achieve better accuracy by using a larger corpus as input without sacrificing too much performance.

In this report, we introduce background in Section 2. We describe our four optimizations in Section 3. The results and evaluation are presented in Section 4. We then discuss our results in Section 5 and summarize related work in Section 6. Finally, we conclude our work in Section 7.

2 Background

Near-Synonym System (NeSS) [6], the system we aim to optimize, is an unsupervised corpus-based model for finding phrasal synonyms and near synonyms based on a large corpus. NeSS selects near-synonymic candidates by identifying common surrounding context based on an extension of the Harris Distributional Hypothesis [7]. The idea of this hypothesis is that words that occur in the same contexts tend to have similar meanings. To identify the contexts in which the query phrase occurs, the NeSS system uses a suffix array [9] for look-up. In this section, we briefly describe NeSS and the suffix array.

2.1 Near-Synonym System

At its initialization, NeSS accepts documents as input. The documents are preprocessed and concatenated to form a large corpus. NeSS converts the words in the documents into tokens, assigns each word a unique word id, and keeps a dictionary for the mapping between words and ids. The ids identify the words and avoid string comparisons when searching for a word. Viewing the corpus as a long string, NeSS builds a suffix array of the corpus. The suffix array supports the substring searches used when finding near synonyms of the query phrase.
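The tokenization step above can be sketched as follows in Java. This is a minimal illustration with assumed names (Tokenizer, idOf); the actual NeSS code may differ.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of tokenizing a corpus into word ids, so that later
// suffix-array comparisons work on ints instead of strings.
public class Tokenizer {
    private final Map<String, Integer> wordToId = new HashMap<>();
    private final List<String> idToWord = new ArrayList<>();

    // Return the id for a word, assigning a fresh id on first sight.
    public int idOf(String word) {
        Integer id = wordToId.get(word);
        if (id == null) {
            id = idToWord.size();
            wordToId.put(word, id);
            idToWord.add(word);
        }
        return id;
    }

    // Convert whitespace-separated text into the tokenized corpus.
    public int[] tokenize(String text) {
        String[] words = text.split("\\s+");
        int[] corpus = new int[words.length];
        for (int i = 0; i < words.length; i++) corpus[i] = idOf(words[i]);
        return corpus;
    }

    public String wordOf(int id) { return idToWord.get(id); }

    public static void main(String[] args) {
        Tokenizer t = new Tokenizer();
        int[] corpus = t.tokenize("it is fair to say it is");
        // "it" reuses its id on the second occurrence
        System.out.println(corpus[0] == corpus[5]); // true
        System.out.println(t.wordOf(corpus[2]));    // fair
    }
}
```

Repeated words map to the same id, so equality of two corpus positions is a single int comparison.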

Given the query phrase, NeSS searches for its occurrences by finding the query phrase as a substring of the corpus viewed as a long string. Since the suffix array is sorted lexicographically, all occurrences are returned as a range of indices of the suffix array. The surrounding words can then be fetched based on the positions of the query phrase in the corpus. The surrounding words are referred to as contexts, and can be classified into left context, right context and cradle, which is the combination of the left and right contexts.

After the contexts are filtered, they are used to search for candidates. Candidates are the phrases that share the same left context, right context or cradle with the query phrase. Each matching context contributes a score to the candidate it matches. For left and right contexts, this is done by again searching for the contexts in the corpus and fetching the words next to them. Finding candidates between cradles takes more effort. First, the occurrences of the left context of the cradle are searched for using the suffix array. Next, for each occurrence and for each valid candidate phrase length, the words behind the occurrence of the left context are fetched. The right context of the cradle is then compared with the words behind the supposed candidate. If they match, the supposed candidate is a real candidate. Since there are plenty of contexts, finding candidates is one of the most time-consuming functions in the system, especially finding candidates between cradles.

The candidates are then ranked based on the aforementioned score. The top N candidates are returned to the user as the near-synonym phrases, where N is a parameter defined by the user. In addition, some of the top candidates are then directed to the KL-divergence computation. The KL-divergence gives a more reliable ranking of the candidates. One thing to note is that, since the computation of KL-divergence involves a lot of mathematical operations, it is also one of the hot spots in terms of runtime.

NeSS parallelizes on a multi-core machine by splitting the suffix array into multiple parts, letting each thread be responsible for searching substrings of its partial suffix array and for the subsequent operations associated with the matching substrings. For example, in the process of finding candidates from contexts, each thread finds contexts from its partial suffix array and then finds candidates matching the contexts it just found. NeSS parallelizes this way because suffix array search is one of the most time-consuming functions in the system. However, this leads to only partial results when scoring and ranking the candidates. One proof of this is that NeSS generates different results on the same corpus with different numbers of threads, and thus different numbers of suffix array splits.

Figure 1: Overview of the process of searching near synonyms [6]

2.2 Suffix Array

A suffix array [9] is a data structure that allows substring search in O(P + log(N)) time, where P is the length of the substring and N is the length of the whole string. Compared to a suffix tree, a suffix array consumes much less memory in practice. A suffix array is a sorted array of all suffixes of the original array. Searching for a substring can be done by performing a binary search on the suffix array.

In NeSS, the suffix array is built from the whole corpus. The process of finding the occurrences of query phrases or contexts can be converted to searching for a substring in a long string (the corpus). Since the words in the corpus are tokenized into ids, one word in the corpus is equivalent to one character in the suffix array. When NeSS searches for a phrase, it needs all occurrences of it, so the suffix array search in NeSS returns a range of suffix array indices. The indices of occurrences in the suffix array are contiguous because the suffix array is sorted lexicographically.
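The range lookup described above can be sketched in Java as follows. This is a simplified word-level sketch with our own names, not NeSS's implementation; a production system would use a faster suffix array construction than the comparison sort shown here.

```java
import java.util.Arrays;

// Sketch of word-level suffix array search: sa[k] holds the start position
// of the k-th smallest suffix of the tokenized corpus, and all occurrences
// of a phrase form one contiguous range [lo, hi) of suffix array indices.
public class SuffixArraySearch {

    // Build the suffix array with a plain comparison sort. This is fine
    // for a sketch, but far too slow for a multi-gigabyte corpus.
    static int[] buildSuffixArray(int[] corpus) {
        Integer[] idx = new Integer[corpus.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        Arrays.sort(idx, (a, b) -> compareSuffixes(corpus, a, b));
        int[] sa = new int[idx.length];
        for (int i = 0; i < idx.length; i++) sa[i] = idx[i];
        return sa;
    }

    static int compareSuffixes(int[] c, int a, int b) {
        while (a < c.length && b < c.length && c[a] == c[b]) { a++; b++; }
        if (a == c.length) return (b == c.length) ? 0 : -1;
        if (b == c.length) return 1;
        return Integer.compare(c[a], c[b]);
    }

    // Compare the suffix at position pos against the phrase, treating a
    // full phrase match as equality (prefix comparison).
    static int comparePrefix(int[] c, int pos, int[] phrase) {
        for (int i = 0; i < phrase.length; i++) {
            if (pos + i >= c.length) return -1; // suffix ran out: sorts before
            if (c[pos + i] != phrase[i]) return Integer.compare(c[pos + i], phrase[i]);
        }
        return 0;
    }

    // Two binary searches give the [lo, hi) range of suffixes starting
    // with the phrase; occurrence positions are sa[lo], ..., sa[hi-1].
    static int[] findRange(int[] corpus, int[] sa, int[] phrase) {
        return new int[]{bound(corpus, sa, phrase, true), bound(corpus, sa, phrase, false)};
    }

    static int bound(int[] corpus, int[] sa, int[] phrase, boolean lower) {
        int lo = 0, hi = sa.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            int cmp = comparePrefix(corpus, sa[mid], phrase);
            if (cmp < 0 || (cmp == 0 && !lower)) lo = mid + 1; else hi = mid;
        }
        return lo;
    }

    public static void main(String[] args) {
        int[] corpus = {0, 1, 0, 1, 2};                   // "a b a b c" as word ids
        int[] sa = buildSuffixArray(corpus);
        int[] r = findRange(corpus, sa, new int[]{0, 1}); // phrase "a b"
        System.out.println(r[1] - r[0]);                  // 2 occurrences
    }
}
```

Because the suffixes are sorted, every occurrence of the phrase lands in one contiguous interval, which is exactly the property the index in Section 3.1 exploits.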

3 Optimizations

3.1 Index on Suffix Array

Since searching and counting are the core operations in NeSS, and both of them rely on the functionality provided by the suffix array, the performance of the suffix array greatly affects the performance of the entire system. Therefore, we first focus on accelerating the search on the suffix array.

The method we use is to create a multi-level index on the suffix array. Its feasibility is based on the following observations:

Figure 2: The structure of an index

1. The only operation performed on the suffix array is to search for a phrase. There is no insertion or deletion at runtime. Thus the suffix array's structure won't change once it's created in the initialization phase.

2. Although the number of words in the corpus is very large and will increase quickly as the corpus grows, the size of the vocabulary is much smaller and normally stays within a fixed range.

3. Most queries, contexts and candidates contain no more than 3 words.

The first observation indicates that we can build the index on the suffix array in advance and carefully organize its structure specifically for the search operation. The second and third observations tell us that a node in the index won't contain too many keys and that the index only needs a few levels, which means we can use a reasonable amount of memory to store the whole data structure.

As Figure 2 shows, the index is organized as a multi-way tree. Each node in the tree contains an interval, and each edge is associated with a word. The construction of the tree is described in Algorithm 1.

Theorem 3.1.1. The time complexity of Algorithm 1 is O(L), where L is the length of the suffix array.

Proof. For each level of the tree, the algorithm needs to scan the whole suffix array and add at most O(L) edges, where each edge can be added in O(1) time by using a hashmap. Therefore it takes O(L) time to construct one level of the tree. Since the number of levels is fixed at 3 in our algorithm, the construction of the entire tree takes O(3 * L) = O(L) time.

Algorithm 1 Construct Indexed Tree
 1: Input: A suffix array S
 2: Output: An indexed tree T
 3: Let L be the length of S
 4: T ← CONSTRUCT(0, L, 0)
 5:
 6: function CONSTRUCT(start, end, depth)
 7:     Create a new node R
 8:     R.left ← start
 9:     R.right ← end
10:     if start = end or depth = 3 then
11:         return R
12:     end if
13:     pw ← NULL
14:     ps ← -1
15:     for i = start to end do
16:         p ← S[i]
17:         w ← p[depth]
18:         if w ≠ pw then
19:             if pw ≠ NULL then
20:                 C ← CONSTRUCT(ps, i-1, depth+1)
21:                 R.addChild(pw, C)
22:             end if
23:             pw ← w
24:             ps ← i
25:         end if
26:     end for
27:     C ← CONSTRUCT(ps, end, depth+1)
28:     R.addChild(pw, C)
29:     return R
30: end function

The worst-case space complexity of Algorithm 1 would be O(3 * L). The space complexity in the average case is hard to estimate, since it is highly related to the contents of the corpus. Based on our experience, the memory consumption is mostly smaller than the worst-case value and is acceptable on a common commodity machine.

The tree produced by Algorithm 1 has the following property:

Property: For a node u in the tree, let p(u) = w1 w2 ... wn be the path from the root node to u, and let (l, r) be the interval contained in u; then all the suffixes whose positions in the suffix array are in (l, r) have p(u) as their prefix.

Since a phrase p in the corpus must be a prefix of some suffixes, we can get all its occurrences in the corpus by finding all the suffixes starting with p. Given the above property, this task can easily be done using Algorithm 2.

Although for some long phrases Algorithm 2 still needs to perform a binary search on the suffix array to further narrow down the range, most phrases are short, and searching for them can be done using only the index. Therefore, we reach the following conclusion:


Algorithm 2 Search with Index
 1: Input: An index I, a suffix array S and a phrase p
 2: Output: An interval (s, e)
 3: Let L be the number of words in p
 4: Let R be the root node of I
 5: u ← R
 6: pos ← 0
 7: while pos < L and u is not a leaf node do
 8:     v ← u.getChild(p[pos])
 9:     if v ≠ NULL then
10:         u ← v
11:         pos ← pos + 1
12:     else
13:         return (-1, -2)
14:     end if
15: end while
16: if pos = L then
17:     return (u.left, u.right)
18: else
19:     return BINARYSEARCH(S, p, pos, u.left, u.right)
20: end if

Theorem 3.1.2. The time complexity of Algorithm 2 in searching a phrase p with L (L ≤ 3) words is O(L).

Proof. The while loop from line 7 to line 15 executes at most L times. Since the operation of finding a child node by a given word in line 8 can be done with a hashmap, each execution of the loop takes O(1) time. After the loop, either pos is equal to L and the algorithm can directly return, or the last node is a leaf node and the binary search runs in O(L) time. Therefore, the algorithm takes at most O(L + L) = O(L) time.

For a phrase that doesn't appear in the corpus, Algorithm 2 returns the pair (-1, -2), which indicates an empty interval. For a phrase that occurs at least once in the corpus, the algorithm returns an interval of the suffix array in which all the suffixes start with the given phrase. The system can then iterate over every position in the interval to do further computations for a phrase, such as finding contexts or extracting candidates.
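The index construction and lookup can be sketched together in Java as follows. The names (build, search) and the inclusive-interval convention are ours, and the binary-search fallback for phrases longer than the index depth is omitted; this is a sketch of the technique, not NeSS's code.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the multi-level index: each node stores the inclusive
// suffix-array interval [left, right] of suffixes sharing the word path
// from the root, with one hashmap edge per distinct next word.
public class IndexNode {
    final int left, right;
    final Map<Integer, IndexNode> children = new HashMap<>();

    IndexNode(int left, int right) { this.left = left; this.right = right; }

    // Algorithm 1 on inclusive intervals: group suffixes by the word at
    // the current depth and recurse, stopping after maxDepth levels.
    static IndexNode build(int[] corpus, int[] sa, int start, int end,
                           int depth, int maxDepth) {
        IndexNode r = new IndexNode(start, end);
        if (start > end || depth == maxDepth) return r;
        int pw = Integer.MIN_VALUE, ps = -1;
        for (int i = start; i <= end; i++) {
            if (sa[i] + depth >= corpus.length) continue; // suffix too short
            int w = corpus[sa[i] + depth];
            if (w != pw) {
                if (ps != -1) r.children.put(pw, build(corpus, sa, ps, i - 1, depth + 1, maxDepth));
                pw = w;
                ps = i;
            }
        }
        if (ps != -1) r.children.put(pw, build(corpus, sa, ps, end, depth + 1, maxDepth));
        return r;
    }

    // Algorithm 2 without the binary-search fallback: each step is one
    // hashmap lookup, so a phrase of L <= maxDepth words costs O(L).
    static int[] search(IndexNode root, int[] phrase) {
        IndexNode u = root;
        for (int pos = 0; pos < phrase.length; pos++) {
            IndexNode v = u.children.get(phrase[pos]);
            if (v == null) return new int[]{-1, -2}; // empty interval
            u = v;
        }
        return new int[]{u.left, u.right};
    }

    public static void main(String[] args) {
        int[] corpus = {0, 1, 0, 1, 2};  // "a b a b c" as word ids
        int[] sa = {0, 2, 1, 3, 4};      // its precomputed suffix array
        IndexNode root = build(corpus, sa, 0, sa.length - 1, 0, 3);
        int[] r = search(root, new int[]{0, 1});  // phrase "a b"
        System.out.println(r[1] - r[0] + 1);      // 2 occurrences
    }
}
```

The search never touches the corpus or the suffix array for short phrases, which is what removes the log(N) term from the lookup cost.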

3.2 Multi-threading

Since the searches of different phrases in the corpus are independent, NeSS uses multiple threads to perform the search in parallel to achieve better performance. As Figure 3 demonstrates, the original system splits the suffix array into several disjoint parts and assigns each part to one of the threads. Each thread then iterates over every context of the input query and uses the part of the suffix array assigned to it to calculate the frequency of the context and extract candidates. The results produced by a thread are added to a global hashmap, which is protected by a global lock from being accessed by multiple threads simultaneously. Although this method can improve the performance of the system to some degree, the resulting speedup and scalability are not good enough, for the following reasons:

1. The time complexity of searching C contexts with average length L in a suffix array of length N is O(C * log(N) * L) using a single thread in the original system. When increasing the number of threads to T, the time complexity is reduced to O(C * log(N/T) * L). However, the latter complexity is not much smaller than the former one, since the value of the log() function decreases very slowly with its argument.

2. There is additional overhead for the synchronization of the global hashmap. When the system uses more threads, the overhead also increases and restricts the parallelism.

In addition, the multi-threaded design in the original system won't bring any benefit once we build an index on the suffix array, since the time to search for a phrase will no longer be related to the size of the suffix array. To address the problems in the original design and take advantage of the index, we propose a new way to do the multi-threading.

Figure 3: The design of multi-threading in the original system

As Figure 4 shows, our method uses only one suffix array, which has the index built on it and is shared by all the threads. We split the contexts into several disjoint parts and assign each part to one of the threads. In this way, each thread is only responsible for its own contexts and uses the shared suffix array to do the searching. In addition, a thread first stores its results in its own hashmap and moves the data from the local hashmap into a global hashmap after finishing all its computations.


Figure 4: The design of multi-threading in the new system

When using T threads, our method reduces the time complexity of searching from O(C * L) to O(C * L / T), which is a much better speedup compared to the original method. Moreover, the use of per-thread hashmaps avoids most of the synchronization and is more cache friendly, since different cores can store this data structure in their own caches without interfering with each other. Finally, the cost of synchronization on the global hashmap can be reduced by using Java's ConcurrentHashMap, which provides fine-grained locking that enables insertions of keys located in different buckets to be performed in parallel.
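The revised scheme can be sketched in Java as follows. This is a toy sketch with hypothetical names: the per-context work (searching the shared suffix array and extracting candidates) is reduced to a counter update so the example stays self-contained, but the threading pattern — disjoint context slices, a private HashMap per thread, and a single merge into a ConcurrentHashMap at the end — is the one described above.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the new parallel scheme: contexts (not the suffix array) are
// partitioned across threads; each thread accumulates scores privately
// and merges once, instead of taking a global lock per update.
public class ParallelScoring {
    static Map<String, Integer> score(List<String> contexts, int numThreads)
            throws InterruptedException {
        ConcurrentHashMap<String, Integer> global = new ConcurrentHashMap<>();
        Thread[] threads = new Thread[numThreads];
        for (int t = 0; t < numThreads; t++) {
            final int tid = t;
            threads[t] = new Thread(() -> {
                Map<String, Integer> local = new HashMap<>();
                // Each thread handles a disjoint slice of the contexts.
                for (int i = tid; i < contexts.size(); i += numThreads) {
                    // Stand-in for "search the context, extract candidates":
                    // here we simply count the context itself once.
                    local.merge(contexts.get(i), 1, Integer::sum);
                }
                // One merge pass at the end; ConcurrentHashMap locks only
                // per-bin, so threads rarely contend.
                local.forEach((k, v) -> global.merge(k, v, Integer::sum));
            });
            threads[t].start();
        }
        for (Thread th : threads) th.join();
        return global;
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> contexts = List.of("a", "b", "a", "c", "a", "b");
        Map<String, Integer> scores = score(contexts, 4);
        System.out.println(scores.get("a")); // 3
    }
}
```

Deferring the merge means each thread synchronizes O(distinct keys) times rather than once per context hit.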

3.3 Candidate Search

As mentioned in Section 2.1, candidate search from contexts is one of the core and most time-consuming parts of the system. Among the three types of contexts (left contexts, right contexts and cradles, which are the combination of left and right contexts), candidate search from cradles takes most of the time. The reason for this is that the suffix array doesn't provide the functionality to search for two substrings a fixed distance apart in one search. Such cradle searches have to be done by searching for one side of the cradle and manually comparing the other side.

In the original system, the process of finding candidates from a cradle context can be described as follows. Denote a cradle context by L1 L2 L3 Q R1 R2 R3, where L1 to L3 are the left context of the cradle, R1 to R3 are the right context of the cradle and Q is the query phrase. For each cradle context found in the previous step, NeSS finds all the occurrences of L1 to L3. For each occurrence of L1 to L3, and for each valid candidate length, NeSS fetches the words with the length of the right context, beginning at L3 plus the candidate length. The fetched words are then compared with R1 to R3 to check whether this occurrence is a match of the whole cradle. In this example, if we write one of the occurrences of L1 to L3 as L1 L2 L3 W1 W2 W3 W4 W5, NeSS will first fetch W2 to W4 for a supposed candidate length of one. W2 to W4 will then be compared with R1 to R3. If W2 to W4 match R1 to R3, NeSS has found a match of the cradle context in the corpus, since both left and right contexts match. W1 will then be compared with Q to see if W1 is the query phrase. If not, W1 is regarded as a candidate and will be added to the candidate table. In the next iteration, W3 to W5 will be fetched in order to check supposed candidates with a length of two. This process keeps iterating until all valid candidate lengths are checked, before it moves on to the next occurrence of the left context of the cradle. An important detail of this process is that when W2 to W4 are fetched from the corpus for comparison with R1 to R3, a new Java array is allocated, and the contents are copied from the corpus array to the newly-allocated array.

We address three problems in the process of findingcandidates from cradle contexts in the original imple-mentation.

1. The loop over all valid candidate lengths leads to fetching the same words and comparing them with the right context several times. In the example, W4 will be fetched three times across finding candidates with lengths of one to three.

2. The allocation of a new array when the words behind the left context are to be returned is unnecessary. It involves an unnecessary call to allocate memory on the heap and subsequent checks on the results. Also, copying the content from the corpus array to the new array introduces additional overhead.

3. Since heap memory is managed by the JVM instead of the programmer, the allocation of new arrays leads to a massive amount of garbage. If the heap size isn't large enough, this causes frequent garbage collections, degrading performance.

To solve these problems, we revised the algorithm used to find candidates from cradle contexts. We describe the revised algorithm below. For a cradle L1 L2 L3 Q R1 R2 R3, we find all occurrences of L1 to L3 using the suffix array. For each occurrence, we directly fetch all words after L1 to L3 in the corpus, which are W1 to W5. Then we perform a substring search of R1 R2 R3 in W1 W2 W3 W4 W5. For each substring match of R1 R2 R3, the words before the match are compared with the query phrase and regarded as a candidate, since both left and right contexts match. When fetching W1 W2 W3 W4 W5 from the corpus, instead of allocating and copying the content, we directly pass the beginning and ending indices of the words.

Our algorithm differs from the original one in that we avoid fetching and comparing the same words from the corpus multiple times, thus reducing the operations needed. Also, we changed the way the supposed right context is fetched so that no unnecessary memory allocation or copying is needed. This reduces the overhead of allocating and copying the context, as well as the time spent on garbage collection.
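The in-place matching can be sketched in Java as follows. Variable names are ours; the example checks one occurrence of the left context against the right context directly over corpus indices, so no per-candidate-length array is ever allocated or copied.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the revised cradle search: the right context is matched
// against the corpus by index for each candidate length, and candidates
// are reported as (start, length) pairs into the corpus itself.
public class CradleSearch {
    // corpus: tokenized corpus; leftEnd: index just past an occurrence of
    // the left context; right: right-context word ids; maxCandLen: longest
    // candidate length considered. The candidate's words are
    // corpus[start .. start+length-1]; no copy is made.
    static List<int[]> candidates(int[] corpus, int leftEnd, int[] right, int maxCandLen) {
        List<int[]> found = new ArrayList<>();
        for (int len = 1; len <= maxCandLen; len++) {
            int rStart = leftEnd + len;                  // where the right context must sit
            if (rStart + right.length > corpus.length) break;
            boolean match = true;
            for (int j = 0; j < right.length; j++) {
                if (corpus[rStart + j] != right[j]) { match = false; break; }
            }
            if (match) found.add(new int[]{leftEnd, len});
        }
        return found;
    }

    public static void main(String[] args) {
        // Corpus laid out as L1 L2 L3 W1 R1 R2 R3, using word ids.
        int[] corpus = {10, 11, 12, 20, 1, 2, 3};
        List<int[]> c = candidates(corpus, 3, new int[]{1, 2, 3}, 3);
        System.out.println(c.size()); // one candidate: W1, length 1
    }
}
```

Returning index pairs instead of copied arrays is what eliminates both the allocation overhead and the garbage-collection pressure discussed above.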

3.4 Punctuation Filter

After collecting contexts and candidates, NeSS filters out punctuation. The punctuation filter in the original implementation uses a regular expression match to determine whether a token is punctuation. However, the original code didn't take advantage of the fact that the regular expression for punctuation stays the same across the different contexts and candidates to filter. The original implementation compiles the regular expression for each filtering operation. Thus the overhead of compiling the regular expression is incurred for each context and each candidate.

Our first optimization of the filter takes advantage of the unchanging regular expression and compiles a static pattern from it once, reusing it to filter out the punctuation.

Next, we further exploit the fact that the punctuation contexts and candidates are mostly one character long. We thus changed the code to eliminate the need for a regular expression match at all. We do this by first constructing an array of 255 boolean elements. Each boolean element stands for a character in ASCII code and records whether that character is punctuation. At runtime, to check whether a character is punctuation, NeSS accesses the corresponding element of the boolean array. In this way, a regular expression match is replaced with an array lookup, which is much cheaper than the original implementation.
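The lookup-table filter can be sketched in Java as follows. The exact punctuation set here is an assumption; NeSS's regular expression may cover a different set of characters.

```java
// Sketch of replacing a per-call regex with a precomputed lookup table:
// the table is filled once in a static initializer, and single-character
// tokens are then classified by one array access.
public class PunctFilter {
    private static final boolean[] IS_PUNCT = new boolean[256];
    static {
        // Illustrative punctuation set; the real filter's set may differ.
        for (char c : ",.;:!?\"'()[]{}-".toCharArray()) IS_PUNCT[c] = true;
    }

    // A token is filtered out if it is a single punctuation character.
    static boolean isPunctuation(String token) {
        return token.length() == 1
                && token.charAt(0) < 256
                && IS_PUNCT[token.charAt(0)];
    }

    public static void main(String[] args) {
        System.out.println(isPunctuation(","));   // true
        System.out.println(isPunctuation("say")); // false
    }
}
```

The table costs a few hundred bytes once, while the regex path paid a compilation plus a match per candidate.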

4 Results

4.1 Environment

We tested our optimized NeSS on Elastic Compute Cloud (EC2) in Amazon Web Services. We launched a c4.8xlarge instance, which has 36 virtual CPUs and 60GB of memory. We used a 2.2GB document from the very large English Gigaword Fifth Edition [11], an archive of newswire text data, as our corpus input.

4.2 Overall Performance

Figure 5: The performance comparison between the original system and our system with all the optimizations

We randomly selected six phrases from the test phrases in Gupta's paper [6]. We ran the optimized and the original systems with the query phrases several times to get an average runtime of pulling near synonyms. Figure 5 shows the runtime comparison between the original system and our optimized one. The blue bars are the average runtimes in seconds of the original NeSS with different query phrases, while the green bars represent the average runtimes of the optimized system. We marked the speedups, which are the runtimes of the original system divided by those of our optimized system. We also included error bars to represent the max and min values in our tests.

From Figure 5, we can observe a speedup ranging from 17x to 41x, depending on the query phrase. The average speedup across different query phrases is 30x. The runtimes across multiple runs of the same query phrase are quite stable, as shown by the error bars.

The speedup achieved by our optimizations changed the latency of searching a single query phrase from several minutes to seconds, which allows a user to interactively search near synonyms on-line.

4.3 Performance Impact of Each Optimization

In this section we analyze the performance impact of each single optimization we applied. We evaluate the change in performance by adding optimizations one at a time on top of the previously added ones, in the same order as in Section 3. For example, when evaluating the impact of optimizing the punctuation filter, we compare the final version with the version having the first three optimizations. The reason we evaluate in a cumulative way is that the impact of a later optimization is noticeable only after the previously dominant hotspot has been removed by earlier optimizations, so that the currently applied one is the hotspot. This process of evaluation follows the same path as our optimization process: once a hotspot is resolved by applying an improvement technique, we find the next hotspot to drive another optimization.

In the following figures, we refer to the index on the suffix array as Optimization 1, the improvement in multi-threading as Optimization 2, the modifications to candidate search as Optimization 3 and the changes to the punctuation filter as Optimization 4.

Figure 6: The performance comparison between the original system and our system with Optimization 1

The impact of building the index on the suffix array is shown in Figure 6. The notation of the figure is the same as in the previous section. The speedups vary from 3.5x to 6.5x, with an average of 4.65x. The speedup due to the index is highly dependent on the corpus size, since the index reduces the complexity of substring search from O(L + log(N)) to O(L). We expect to see a further speedup with a larger corpus.

Figure 7: The performance comparison between the system with Optimization 1 and the one with Optimizations 1 and 2

Figure 7 shows the performance impact of improving the multi-threading. As mentioned earlier, the comparison is between the version with the first two optimizations and the version with only the index optimization. The speedups vary from 2.8x to 5x, with an average of 3.3x. We ran both systems with the same number of cores, 36, against the same set of query phrases. The speedup shows that the improved system scales better than the original implementation.

Figure 8: The performance comparison between the system with Optimizations 1 and 2 and the one with Optimizations 1, 2, and 3

As shown by Figure 8, the speedup due to the modification of the candidate search is small. This is because candidate search dominates the runtime only in the shared context ranking phase, while most of the time overall is spent in the KL-Divergence computation. Users can choose to disable the KL-Divergence, in which case the speedup of this optimization is approximately 1.2x.

Figure 9: The performance comparison between the system with Optimizations 1, 2, and 3 and the one with Optimizations 1, 2, 3, and 4

The impact of changing the punctuation filter is shown in Figure 9. This optimization leads to an approximately 2x speedup on average.


4.4 Scalability

Figure 10: The scalability of our system with up to 36 cores

We evaluate the scalability of the optimized system at different core counts by comparing speedups relative to the single-core run, as shown in Figure 10. Our improved system scales near-linearly at core counts of 4 or less. However, we observe little improvement when we increase the number of cores above 16. We discuss the reasons in Section 5.2.

5 Discussions

5.1 Optimizations with Little Effect

The original system uses Java HashMap to keep all the contexts and candidates, as well as their scores. In our optimized version, we also use HashMaps to store each level of the suffix array index. We tried to replace the Java HashMap with the Trove hash map [1]. The idea of the Trove hash map is that it avoids allocating objects of primitive wrapper types (Java Integer, for example), and thus saves memory. It also claims to have better hash functions and thus better performance. However, after switching to the Trove hash map, we didn't observe a significant performance gain. The reason might be that the hash maps we build have integer keys, so hash function quality doesn't play a big role. Also, Java 7 improved the performance of HashMap.

We also tried to think like a compiler and remove some unnecessary code. For example, in the original NeSS, there was a while loop that only called a single function, and that function contained an if statement that was never true. We eliminated the while loop, but didn't see any performance gain. We suspect the Java compiler had already optimized away the code that would never be executed.

The third optimization that didn't work was parallelizing functions that take a very short time to execute. In such cases, the overhead of creating threads and context switching defeats the benefits of parallel execution.

Fine-grained locking in compute-intensive parts had little effect in our optimization. For example, when building the candidate map in the process of ranking candidates, the computation time is much longer than the time to put the result into a shared HashMap. In this case, changing the HashMap to a ConcurrentHashMap doesn't help much, since threads rarely block waiting to acquire the lock.
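As a hedged sketch of the pattern above (the class, field, and method names here are illustrative, not taken from the NeSS code base), heavy per-candidate computation followed by a cheap write into a shared map looks like this; since the threads spend almost all their time in `score()`, the choice of map synchronization hardly matters:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.IntStream;

public class CandidateScores {
    // Shared map of candidate id -> score. A ConcurrentHashMap only pays
    // off if threads actually contend on it, which is rare when the
    // scoring computation dominates the runtime.
    static final Map<Integer, Double> scores = new ConcurrentHashMap<>();

    // Stand-in for the expensive per-candidate computation.
    static double score(int candidate) {
        double s = 0;
        for (int i = 1; i <= 1000; i++) s += candidate / (double) i;
        return s;
    }

    public static void main(String[] args) {
        // Many long computations, each ending in one cheap map insertion.
        IntStream.range(0, 100).parallel()
                 .forEach(c -> scores.put(c, score(c)));
        System.out.println(scores.size());
    }
}
```

The insertion is a tiny fraction of each task, so replacing a coarse lock with the concurrent map moves very little of the total time.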

5.2 Scalability

Our improved NeSS scales well below 4 cores, but doesn't improve much when the core count increases above 16. We suspect there are three reasons:

1. The sequential part of the system limits the speedup from multi-threading. Multiple parts of the algorithm must run sequentially, such as finding contexts of the query phrase and ranking candidates. According to Amdahl's law [2], the speedup from parallel execution is limited by the sequential fraction of the algorithm.

2. Contention for locks on shared data structures limits the benefits of multi-threading. The data structure that stores all candidates and their associated matching contexts has two levels of HashMaps. The execution logic and the organization of the two-level HashMap conflict with each other, preventing fine-grained locking. Thus we see a bottleneck in ranking candidates as we increase the core count.

3. Memory bandwidth limits the maximum execution speed of the program. Since the system performs many memory accesses but little mathematical computation, we believe it becomes memory-bound when the number of cores is high enough to make the computation fast.
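The first point can be made concrete with Amdahl's law. The following sketch computes the bound directly (the 90% parallel fraction is a hypothetical value, not one measured from NeSS):

```java
public class Amdahl {
    // Amdahl's law: speedup = 1 / ((1 - p) + p / n),
    // where p is the parallelizable fraction and n is the core count.
    static double speedup(double p, int n) {
        return 1.0 / ((1.0 - p) + p / n);
    }

    public static void main(String[] args) {
        // Even with 90% of the work parallelized, 36 cores give only a
        // modest speedup, bounded above by 1/(1-p) = 10 as n grows.
        System.out.printf("4 cores: %.2fx%n", speedup(0.9, 4));   // ~3.08x
        System.out.printf("36 cores: %.2fx%n", speedup(0.9, 36)); // ~8.00x
    }
}
```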

6 Related Work

NeSS finds near-synonym phrases based on an unsupervised corpus-based model. Several papers address work related to finding near-synonyms and to our optimization techniques.

In Mitchell and Lapata's paper [10], a set of composition functions was proposed to combine the vectors of the words in a phrase into a single vector. Reddy et al. [13] argued that not all features are relevant to a phrase, and further presented ways to select the relevant senses of the words in a phrase. However, these papers decompose phrases into words and analyze the semantics at the word


level. They ignore the case where a phrase as a whole has a completely different meaning from the meanings of its individual component words.

Some approaches find synonyms using parallel resources. Methods based on a monolingual text corpus, like Discovery of Inference Rules from Text (DIRT) [8], spot paraphrases that share the same interpretation in a foreign language. However, this may also find phrases that are not related to the original phrase. Ganitkevitch et al. [5] used monolingual distributional similarity to rerank the extracted paraphrases. They built a web service that responds to paraphrase queries by looking up foreign-language interpretations in a database. However, this approach needs parallel resources and can only search for phrases that are present in the database.

In addition, Pasca [12] introduced an unsupervised method to retrieve near-synonyms from arbitrary web text using linguistically motivated text anchors identified in the context of documents. The quality of the paraphrases can be further improved by a filtering mechanism using a set of categorized names from online documents. However, this method requires document-dependent linguistic patterns to be defined. The documents also need language-specific resources such as part-of-speech taggers.

The Near-Synonym System (NeSS) [6], which we aim to optimize, differs from previous paraphrasing systems in that it doesn't need parallel resources like PPDB [5] or predefined patterns like the method introduced by Pasca [12]. The NeSS algorithm selects near-synonym candidates by identifying common surrounding contexts, based on an extension of Harris's Distributional Hypothesis [7], which states that words that occur in the same contexts tend to have similar meanings. To identify the contexts in which the query phrase occurs, NeSS uses a suffix array for lookup.

A suffix array [9] is an array of the suffixes of a string, sorted so that searching for a substring of the original string takes O(P + log(N)) time, where P is the length of the query string and N is the size of the suffix array. In the context of NeSS, the minimum token is a word and the string is the whole corpus. As the corpus grows, the runtime of context and candidate searches also increases.
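A word-level suffix array lookup in the spirit described above can be sketched as follows (a simplified illustration, not the NeSS implementation; each binary-search step compares up to P query words, giving O(P log N) here rather than the O(P + log N) of the refined algorithm):

```java
import java.util.Arrays;

public class WordSuffixArray {
    final String[] tokens;
    final Integer[] sa; // suffix start positions, sorted lexicographically

    WordSuffixArray(String[] tokens) {
        this.tokens = tokens;
        sa = new Integer[tokens.length];
        for (int i = 0; i < tokens.length; i++) sa[i] = i;
        // Naive O(N^2 log N) construction, kept simple for illustration.
        Arrays.sort(sa, (a, b) -> compareSuffixes(a, b));
    }

    int compareSuffixes(int a, int b) {
        while (a < tokens.length && b < tokens.length) {
            int c = tokens[a].compareTo(tokens[b]);
            if (c != 0) return c;
            a++; b++;
        }
        return Integer.compare(tokens.length - a, tokens.length - b);
    }

    // Compare the suffix starting at pos against the query phrase only.
    int compareToQuery(int pos, String[] query) {
        for (int i = 0; i < query.length; i++) {
            if (pos + i >= tokens.length) return -1;
            int c = tokens[pos + i].compareTo(query[i]);
            if (c != 0) return c;
        }
        return 0; // the query is a prefix of this suffix: a match
    }

    // Binary search for any occurrence of the query phrase.
    boolean contains(String[] query) {
        int lo = 0, hi = sa.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int c = compareToQuery(sa[mid], query);
            if (c == 0) return true;
            if (c < 0) lo = mid + 1; else hi = mid - 1;
        }
        return false;
    }
}
```

A lookup such as `new WordSuffixArray("it is fair to say".split(" ")).contains(new String[]{"fair", "to"})` finds the phrase because "fair to say" is one of the sorted suffixes.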

Another approach to substring search is the suffix tree [14]. Intuitively, a suffix tree is a trie of the suffixes: the path to each node of the tree is a prefix of a suffix, and if the node is a leaf, the path is a suffix. If the lookup table in each node is maintained with hash maps, searching for a phrase takes O(P) time, where P is the number of words in the query. Although the suffix tree provides faster lookup, it consumes much more memory than a suffix array. Also, depending on the implementation, the data structure may not exploit caching as well as a suffix array.
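A word-level trie over suffixes, with hash-map children as described above, might look like the sketch below (illustrative only; the per-node hash maps are exactly what makes this structure memory-hungry in practice):

```java
import java.util.HashMap;
import java.util.Map;

public class WordTrie {
    // Each node keeps its children in a hash map, so descending one
    // level is expected O(1) and a P-word lookup is expected O(P).
    final Map<String, WordTrie> children = new HashMap<>();

    void insertSuffix(String[] tokens, int start) {
        WordTrie node = this;
        for (int i = start; i < tokens.length; i++)
            node = node.children.computeIfAbsent(tokens[i], k -> new WordTrie());
    }

    // Insert every suffix of the corpus, as a suffix tree would.
    static WordTrie fromCorpus(String[] tokens) {
        WordTrie root = new WordTrie();
        for (int i = 0; i < tokens.length; i++) root.insertSuffix(tokens, i);
        return root;
    }

    boolean contains(String[] phrase) {
        WordTrie node = this;
        for (String w : phrase) {
            node = node.children.get(w);
            if (node == null) return false;
        }
        return true;
    }
}
```

(A real suffix tree compresses chains of single-child nodes; this uncompressed trie keeps one node per word and so uses even more memory, but it shows why lookup is O(P).)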

Hashing is another way to perform quick searches. Locality-sensitive hashing [3] guarantees that the probability that two strings hash to the same bucket is proportional to their similarity. If applied to NeSS, the problem of finding a substring could be converted to finding a similar string. Although the algorithm provides constant-time string lookup on average, defining similarity is not trivial, and locality-sensitive hashing loses some accuracy.

Our indexed suffix array differs from a normal suffix array in that the index looks up the first three words of the query phrase in O(3), i.e., constant, time. Since nearly all suffix array lookups involve no more than three words, the indexed suffix array achieves constant-time substring lookup in practice. It also differs from the suffix tree in that it doesn't consume as much memory.
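A minimal sketch of the indexing idea follows (our own simplified reconstruction, indexing only the first word rather than the three words NeSS indexes). Because the suffix array is sorted, all suffixes sharing a first word occupy a contiguous range, which can be precomputed once and then fetched in constant time:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class IndexedSuffixArray {
    final String[] tokens;
    final Integer[] sa; // suffix start positions, sorted lexicographically
    // First word of a suffix -> [lo, hi) range in sa. A full index would
    // nest this map three levels deep, one level per indexed word.
    final Map<String, int[]> index = new HashMap<>();

    IndexedSuffixArray(String[] tokens) {
        this.tokens = tokens;
        sa = new Integer[tokens.length];
        for (int i = 0; i < tokens.length; i++) sa[i] = i;
        // Naive construction by full-suffix comparison, for clarity only.
        Arrays.sort(sa, (a, b) -> suffix(a).compareTo(suffix(b)));
        // One pass over the sorted array records each word's range.
        for (int r = 0; r < sa.length; r++) {
            String w = tokens[sa[r]];
            int[] range = index.computeIfAbsent(w, k -> new int[]{r, r});
            range[1] = r + 1;
        }
    }

    String suffix(int i) {
        return String.join(" ", Arrays.copyOfRange(tokens, i, tokens.length));
    }

    // Constant-time lookup of how many suffixes begin with the given word.
    int countStartingWith(String word) {
        int[] r = index.get(word);
        return r == null ? 0 : r[1] - r[0];
    }
}
```

Once the range for a prefix is known, only the words of the query beyond the indexed depth need a binary search inside that range, which is what removes the log(N) dependence on corpus size for short queries.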

7 Conclusions

In this report, we introduced the Near-Synonym System (NeSS) and addressed some of its performance problems: long latency for user requests, unsatisfactory scalability, and a complexity highly dependent on the size of the corpus. We presented an optimized version of NeSS that solves these problems by building an index on the suffix array, changing the approach to parallelism, improving the efficiency of candidate search, and optimizing the punctuation filter. Experiments showed a speedup of approximately 20x-40x over the original implementation. Our optimized NeSS demonstrated near-linear scalability with 8 cores or fewer.

References

[1] Trove high performance collections for Java. http://trove.starlight-systems.com/overview.

[2] AMDAHL, G. M. Validity of the single processor approach to achieving large scale computing capabilities. In Proceedings of the April 18-20, 1967, Spring Joint Computer Conference (1967), ACM, pp. 483–485.

[3] CHARIKAR, M. S. Similarity estimation techniques from rounding algorithms. In Proceedings of the Thirty-Fourth Annual ACM Symposium on Theory of Computing (2002), ACM, pp. 380–388.

[4] CURRAN, J. R. From distributional to semantic similarity.

[5] GANITKEVITCH, J., VAN DURME, B., AND CALLISON-BURCH, C. PPDB: The paraphrase database. In HLT-NAACL (2013), pp. 758–764.

[6] GUPTA, D., CARBONELL, J., GERSHMAN, A., KLEIN, S., AND MILLER, D. Unsupervised phrasal near-synonym generation from text corpora.

[7] HARRIS, Z. S. Distributional structure. Springer, 1970.

[8] LIN, D., AND PANTEL, P. Discovery of inference rules fromtext, Dec. 5 2006. US Patent 7,146,308.

[9] MANBER, U., AND MYERS, G. Suffix arrays: a new method for on-line string searches. SIAM Journal on Computing 22, 5 (1993), 935–948.


[10] MITCHELL, J., AND LAPATA, M. Vector-based models of semantic composition. In ACL (2008), pp. 236–244.

[11] PARKER, R., GRAFF, D., KONG, J., CHEN, K., AND MAEDA, K. English Gigaword fifth edition. Linguistic Data Consortium, LDC2011T07 (June 2011).

[12] PASCA, M. Mining paraphrases from self-anchored web sentence fragments. In Knowledge Discovery in Databases: PKDD 2005. Springer, 2005, pp. 193–204.

[13] REDDY, S., KLAPAFTIS, I. P., MCCARTHY, D., AND MANANDHAR, S. Dynamic and static prototype vectors for semantic composition. In IJCNLP (2011), pp. 705–713.

[14] WEINER, P. Linear pattern matching algorithms. In IEEE Conference Record of the 14th Annual Symposium on Switching and Automata Theory (1973), IEEE, pp. 1–11.
