
Interface for Finding Close Matches from Translation Memory


Description

The objective is to design a system which, given a sentence, retrieves the sentences closest to it so that the translations of the retrieved sentences can be used to translate the original sentence itself. For this, a search engine is first built to retrieve the top relevant sentences, and meaning-equivalent sentences are then filtered from these results. We built our own search engine, which uses a Patricia tree structure for indexing, the BM25 ranking algorithm in conjunction with proximity scoring, synonym search and query expansion. The top retrieved sentences are then re-ranked so that meaning-equivalent sentences (http://dl.acm.org/citation.cfm?id=1118916) receive higher scores. Alternative rankings such as normalized Levenshtein distance and sequence matching using the gestalt approach are also provided. A translation data set of over 0.4 million sentences is used.


Group Members

Nipun Edara 10010119

Priyatham Bollimpalli 10010148

Gunamgari Sharath Reddy 10010174

Pasumarthi Venkata Sai Dileep 10010180


1. IR Search Engine

To retrieve the top similar sentences for translation, an Information Retrieval engine over all the sentences is the necessary first step.

Reasons for our own Search Engine

The simplest approach is to use an existing Python module such as Whoosh, but it is undesirable for the following reasons.

Difficulty in customizing the ranking function. Simple ranking based on BM25 alone may not give optimal results, since phrasal searches and proximity measures are not considered in BM25, which is essentially a TF-IDF based ranking scheme. That functionality is required here, because the best sentences for translation (the top results) should have the maximum number of matching words occurring close together as a group. Whoosh does offer a proximity option, but it is strict, whereas the IR system needs to be flexible to give good results.

Flexibility in index size. Whoosh builds an index that is larger than a conventional index, since it assumes the user needs all of its features. In contrast, building our own index reduced the size by 50%. Furthermore, Whoosh tries to load the entire index into main memory depending on the query, as a result of which spurious memory errors occur for large queries. Our system runs separately on a server and is easily convertible to a distributed system if required. We have also tried splitting the index and loading the parts on distributed systems, which is extremely fast.

Flexibility in the query model. Whoosh has a strict query model in which there is only a single AND/OR factor between terms. Instead, we use our own model, which takes the fastest and most effective approach: the top results are first retrieved using only BM25 and are then further filtered by our own algorithm based on POS tagging and proximity. It is also easier to incorporate our own query expansion method in our own search engine than in Whoosh. As a result, our search engine is observed to give more results, and more accurate ones, than Whoosh.

Preprocessing Stage: Indexing

Preprocessing is done to compute some overall parameters of the dataset, such as average document length and term frequencies, and to build the index. For the index, a Patricia inverted index tree is used. This data structure was found to give the best trade-off between index size and retrieval speed. The image below gives an abstract view of the data structure built on a few sample words.
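To make the structure concrete, the following is a minimal sketch of a Patricia-style (radix) trie used as a positional inverted index; the class and method names are illustrative and not taken from the project's code.

    class PatriciaNode:
        def __init__(self):
            self.children = {}   # edge label (string) -> PatriciaNode
            self.postings = {}   # doc_id -> list of positions (populated where a term ends)

    class PatriciaIndex:
        """Minimal Patricia (radix) trie storing positional postings per term."""

        def __init__(self):
            self.root = PatriciaNode()

        def insert(self, term, doc_id, position):
            node = self.root
            while term:
                for label, child in list(node.children.items()):
                    # Longest common prefix between the remaining term and this edge label
                    common = 0
                    while common < min(len(label), len(term)) and label[common] == term[common]:
                        common += 1
                    if common == 0:
                        continue
                    if common < len(label):
                        # Split the edge so that the shared prefix becomes its own node
                        mid = PatriciaNode()
                        mid.children[label[common:]] = child
                        del node.children[label]
                        node.children[label[:common]] = mid
                        child = mid
                    node, term = child, term[common:]
                    break
                else:
                    # No edge shares a prefix with the term: add a fresh edge for the remainder
                    leaf = PatriciaNode()
                    node.children[term] = leaf
                    node, term = leaf, ""
            node.postings.setdefault(doc_id, []).append(position)

        def lookup(self, term):
            node = self.root
            while term:
                for label, child in node.children.items():
                    if term.startswith(label):
                        node, term = child, term[len(label):]
                        break
                else:
                    return {}
            return node.postings

    index = PatriciaIndex()
    for pos, word in enumerate("the cat sat on the mat".split()):
        index.insert(word, doc_id=1, position=pos)
    print(index.lookup("the"))   # {1: [0, 4]}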


There are two approaches to indexing the data.

Conventional Indexing

The conventional approach is to store the document ids and term frequencies of each term at the leaf (end of the term) of the Patricia tree. This is used directly by the BM25 ranking algorithm, which is given below.
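For reference, the standard Okapi BM25 scoring function referred to here is:

score(D, Q) = sum over query terms qi of IDF(qi) * f(qi, D) * (k1 + 1) / (f(qi, D) + k1 * (1 - b + b * |D| / avgdl))

IDF(qi) = log((N - n(qi) + 0.5) / (n(qi) + 0.5))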

For a document, its score with respect to a given query is obtained as given above. f(qi, D) is the term frequency of qi in the document, |D| is the length of the document in words, and avgdl is the average document length in the text collection from which documents are drawn. k1 and b are free parameters. For the IDF term, N is the total number of documents in the collection, and n(qi) is the number of documents containing qi.

As seen above, BM25 requires three values for the calculation and normalization of the rank: the length of each document, the number of documents containing each term, and the average document length. These are also calculated during pre-processing and dumped to a file. The commonly used optimal values are taken for the constants, i.e. k1 = 1.2 and b = 0.75. To retrieve sentences from their document ids quickly, a dictionary is constructed during the pre-processing stage. All the pre-processed data is loaded only once in the server, for any number of users and any number of queries.
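A minimal sketch of the scoring step, assuming the pre-computed statistics described above (per-document lengths, the average document length, the collection size N, and per-term postings) are available in memory; all names are illustrative:

    import math

    K1, B = 1.2, 0.75   # the constant values mentioned above

    def bm25_score(query_terms, doc_id, postings, doc_len, avgdl, N):
        """postings[term] maps doc_id -> term frequency, doc_len[doc_id] is the
        document length in words, N is the total number of documents."""
        score = 0.0
        for term in query_terms:
            docs = postings.get(term, {})
            tf = docs.get(doc_id, 0)
            if tf == 0:
                continue
            n_q = len(docs)   # number of documents containing the term
            idf = math.log((N - n_q + 0.5) / (n_q + 0.5))
            norm = tf * (K1 + 1) / (tf + K1 * (1 - B + B * doc_len[doc_id] / avgdl))
            score += idf * norm
        return score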


Additional Indexing

All the steps of the conventional indexing above are followed. If only the term frequency of a term in a document is stored, the positional information within the document is lost, but this is required for proximity search. Hence, in addition to the total frequency of a term in a document, its positions in that document are also stored at the leaf (end of the term) of the Patricia tree. Using this information we calculate a proximity score, which in combination with the BM25 score gives the total ranking score. For the proximity score, the paper "An Exploration of Proximity Measures in Information Retrieval" by Tao Tao (Microsoft Corporation, Redmond) and ChengXiang Zhai (Department of Computer Science, University of Illinois) is used. Of the proximity measures proposed in the paper, the average pair distance gives the best results, since sentence lengths are generally small. It is defined as the average, over all pairs of distinct matched query terms, of the minimum distance between their occurrences in the document.

For example, for a document with terms d = t1, t2, t1, t3, t5, t4, t2, t3, t4 in that order and the query Q = {t1, t4, t5}, the average pair distance is (1 + 2 + 3)/3 = 2. This distance is combined with BM25 as given in the paper, by first normalizing it and then adding it to the BM25 score.
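The example can be reproduced with a short sketch of the average pair distance computation (a simplified version, assuming positions start at 1):

    from itertools import combinations

    def average_pair_distance(query_terms, doc_terms):
        """Average, over all pairs of distinct matched query terms, of the minimum
        distance between their positions in the document."""
        positions = {}
        for pos, term in enumerate(doc_terms, start=1):
            if term in query_terms:
                positions.setdefault(term, []).append(pos)
        matched = list(positions)
        if len(matched) < 2:
            return 0.0
        distances = [min(abs(p1 - p2)
                         for p1 in positions[t1]
                         for p2 in positions[t2])
                     for t1, t2 in combinations(matched, 2)]
        return sum(distances) / len(distances)

    d = ['t1', 't2', 't1', 't3', 't5', 't4', 't2', 't3', 't4']
    print(average_pair_distance({'t1', 't4', 't5'}, d))   # 2.0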

Following the paper, the proximity distance δ(Q, D) calculated above is normalized as π(Q, D) = log(α + exp(-δ(Q, D))), where α is a free parameter, and the total ranking score used for retrieval is R(Q, D) = BM25(Q, D) + π(Q, D).

Query Expansion

Every query, as well as every sentence in the documents during indexing, is subjected to the following steps.

1. Converting to lower case.
2. Tokenization and normalization. For example, "it's" is converted to "it is".
3. Removing punctuation.
4. Stemming. The Porter stemmer is used.
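A minimal sketch of this preprocessing pipeline, assuming NLTK's PorterStemmer (the contraction table shown is illustrative; the report does not list the actual normalization rules):

    import string
    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    CONTRACTIONS = {"it's": "it is"}   # illustrative normalization table

    def preprocess(text):
        # 1. Convert to lower case
        text = text.lower()
        # 2. Tokenize and normalize (expand contractions such as "it's" -> "it is")
        for contraction, expansion in CONTRACTIONS.items():
            text = text.replace(contraction, expansion)
        tokens = text.split()
        # 3. Remove punctuation
        tokens = [t.strip(string.punctuation) for t in tokens if t.strip(string.punctuation)]
        # 4. Stem with the Porter stemmer
        return [stemmer.stem(t) for t in tokens]

    print(preprocess("It's been a great pleasure."))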


Query expansion is optional, since the user may want translations only for the exact words in the query. If the option is selected, the following algorithm is used for query expansion. The WordNet module in Python is used.

For each word, at most one extra word is added if needed. It is appended at the end, since the original query should not be disturbed: its order is important for phrasal and proximity searching. To find the word closest to a given word, the average path similarity between all the synsets of the two words is used.

For each word in the query, its synsets are considered. For the best synset that is synonymous with the word, its lemma words are taken as candidates.

Now, among these lemma words, the one closest to the original word is appended at the end. If no other synonym is available, or no word is close enough to the given word, nothing is appended. We illustrate the algorithm with the word 'automobile'. The synsets of automobile are ('car.n.01') and ('automobile.v.01'). The lemma words of car.n.01 are ['car', 'auto', 'machine', 'motorcar']. The similarities of these words with automobile are 0.145, 0.5, 0.118855042017 and 0.5, so either 'auto' or 'motorcar' is appended at the end. The reason we append only one word per given word is to avoid an overly long query, from which irrelevant results may creep in; it would also add too much latency, which is undesirable.
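A sketch approximating this expansion step with NLTK's WordNet interface (function names are illustrative, and the exact similarity threshold and tie-breaking used in the project are not specified in the report):

    from nltk.corpus import wordnet as wn

    def best_synonym(word):
        """Return the lemma word closest to `word` by average path similarity, or None."""
        word_synsets = wn.synsets(word)
        best, best_score = None, 0.0
        for syn in word_synsets:
            for lemma in syn.lemma_names():
                if lemma.lower() == word.lower():
                    continue
                # Average path similarity between the synsets of the candidate and the original word
                scores = [s1.path_similarity(s2)
                          for s1 in word_synsets
                          for s2 in wn.synsets(lemma)]
                scores = [s for s in scores if s is not None]
                if scores and sum(scores) / len(scores) > best_score:
                    best, best_score = lemma.replace('_', ' '), sum(scores) / len(scores)
        return best

    def expand_query(tokens):
        # Append at most one synonym per query word at the end, leaving the original order intact
        extras = [s for s in (best_synonym(t) for t in tokens) if s]
        return tokens + extras

    print(expand_query(['automobile']))   # e.g. ['automobile', 'auto']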

Architectural Overview

Web2py is used to serve the clients' queries. It is an open-source, full-stack Python web framework for scalable, secure and portable web applications.

Initially, the query server is loaded with the index and is ready to serve clients.

Multiple users can connect to the web2py application through a browser. For each query a user submits, web2py sends the query to the query server.

Any query given by a client is sent to the query server. The query server creates a thread for the user and sends the query, together with a pointer to the index, to the ranking server.

The ranking server computes the ranking scores of the documents and retrieves the top 50, which are further processed; the final unique results are returned to the query server, which passes them to the web2py server, where they are formatted for display. The entire flow chart is given below.


This design makes the system scalable. If there are many users, the number of ranking servers can be increased. If the load on the web2py server becomes too high, multiple query servers can be deployed and requests can be scheduled so that the load is distributed equally.
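As a rough illustration of the query-server/ranking-server split (the report does not specify the actual communication mechanism, so the use of Python's xmlrpc modules, the addresses and the rank method below are all assumptions):

    from xmlrpc.server import SimpleXMLRPCServer
    from xmlrpc.client import ServerProxy

    RANKING_SERVER_URL = "http://localhost:9000"   # hypothetical ranking server address

    def handle_query(query):
        """Query server: forward the query to the ranking server and return its results."""
        ranking_server = ServerProxy(RANKING_SERVER_URL)
        return ranking_server.rank(query)          # `rank` is a hypothetical remote method

    def run_query_server(host="localhost", port=8000):
        server = SimpleXMLRPCServer((host, port), allow_none=True)
        server.register_function(handle_query, "handle_query")
        server.serve_forever()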

The next sections discuss the various matching metrics used to further rank the retrieved sentences for translation.

2. Normalized Edit Distance

Initially we used the plain Levenshtein edit distance to order the results. Later, we normalized it as suggested in the presentation:

NED(S1, S2) = ED(S1, S2)/max_len(S1, S2)
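A minimal sketch of this measure (whether the project computes it over characters or over word tokens is not stated, so this version works on any two sequences):

    def edit_distance(s1, s2):
        """Standard Levenshtein distance via dynamic programming."""
        m, n = len(s1), len(s2)
        dp = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            dp[i][0] = i
        for j in range(n + 1):
            dp[0][j] = j
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = 0 if s1[i - 1] == s2[j - 1] else 1
                dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                               dp[i][j - 1] + 1,        # insertion
                               dp[i - 1][j - 1] + cost) # substitution
        return dp[m][n]

    def normalized_edit_distance(s1, s2):
        # NED(S1, S2) = ED(S1, S2) / max_len(S1, S2)
        if not s1 and not s2:
            return 0.0
        return edit_distance(s1, s2) / max(len(s1), len(s2))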

There are, however, some issues with this normalization. One is that it does not satisfy the triangle inequality. In general, proper normalization of the edit distance is still an open research problem; quite a few papers try to address it using the string lengths and the edit distance paths. A few of them are given below.

Computation of Normalized Edit Distance and Applications by Andrés Marzal and Enrique Vidal

An Efficient Uniform-Cost Normalized Edit Distance Algorithm by Abdullah N. Arslan and Omer Egecioglu

Some thoughts about normalizing an edit distance by Xavier Dupré

A Normalized Levenshtein Distance Metric by Li Yujian and Liu Bo


Problem with Edit Distance

Example-based Machine Translation (EBMT) has several problems: performance degrades when input sentences are long and when the style of the input differs from that of the examples in the corpus. To address this we use "meaning-equivalent sentences". A meaning-equivalent sentence shares the main meaning with an input sentence despite lacking some unimportant information. The closest match is based on content words, modality, and tense.

Two main problems in applying EBMT are

1. As sentences become longer, the number of retrieved similar sentences decreases sharply. This often results in no output when translating long sentences.

2. The other problem arises due to the differences in style between input sentences and the example corpus.

A meaning-equivalent sentence is a sentence that has the main meaning of an input sentence despite lacking some unimportant information. Matching of meaning-equivalent sentences is based on content words, which provides robustness against long inputs and against differences in style between the input and the example corpus.

3. Meaning Equivalent Sentence (own algorithm)

A meaning-equivalent sentence is a sentence that shares the main meaning with the input sentence despite lacking some unimportant information, and that does not contain information additional to that in the input sentence.

    #   Input Sentence                                Unimportant?
    1   Would you take a picture of me?               Yes
    2   Would you take a picture of this painting?    No

Basic Idea of Matching

The matching of meaning-equivalent sentences depends on content words and basically does not depend on functional words. Independence from functional words brings robustness to differences in style.


Features:

Content Words: Words categorized as noun, pronoun, adjective, adverb, or verb are recognized as content words. Interrogatives are also included.

Functional Words: Words such as particles, auxiliary verbs, conjunctions, and interjections are recognized as functional words.

Matching and Ranking:

The ranking for matching is based on the following scores.

# of identical content words
# of synonymous words
# of common functional words
# of different functional words
# of different content words

Given a query and a sentence, the matching score is calculated as follows.

Get content words of the query (A)
Get functional words of the query (B)
Get synonyms of the content words of the query (C)
Get content words of the sentence (D)
Get functional words of the sentence (E)

E1) Identical content words = Number of matching words in A and D

E2) Identical synonymous words = Number of matching words between C and D

E3) Identical functional words = Number of matching words between B and E

D1) Different content words = #(A) + #(D) - 2*( E1 )

D2) Different functional words = #(B) + #(E) - 2*( E3 )

Weights are assigned to the above quantities and a total score is calculated. E1 is given the highest weight, followed by E2 and E3. Negative weights are given to D1 and D2, since they must be penalized. The highest-scoring sentences are returned by the system.
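A minimal sketch of this scoring function; the actual weight values are not given in the report (only their relative order), so the ones below are illustrative:

    # Illustrative weights: identical content > synonymous > functional; penalties for differences
    W_E1, W_E2, W_E3 = 3.0, 2.0, 1.0
    W_D1, W_D2 = -1.0, -0.5

    def matching_score(query_content, query_functional, query_synonyms,
                       sent_content, sent_functional):
        """Score a candidate sentence against the query; all arguments are sets of words."""
        e1 = len(query_content & sent_content)        # identical content words (E1)
        e2 = len(query_synonyms & sent_content)       # synonymous content words (E2)
        e3 = len(query_functional & sent_functional)  # identical functional words (E3)
        d1 = len(query_content) + len(sent_content) - 2 * e1        # different content words (D1)
        d2 = len(query_functional) + len(sent_functional) - 2 * e3  # different functional words (D2)
        return W_E1 * e1 + W_E2 * e2 + W_E3 * e3 + W_D1 * d1 + W_D2 * d2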


4. get_close_matches (Python difflib)

It uses SequenceMatcher to return a list of the best "good enough" matches.

SequenceMatcher :

The basic algorithm predates, and is a little fancier than, an algorithm published in the late 1980’s by Ratcliff and Obershelp under the hyperbolic name “gestalt pattern matching.” The idea is to find the longest contiguous matching subsequence that contains no “junk” elements (the Ratcliff and Obershelp algorithm doesn’t address junk). The same idea is then applied recursively to the pieces of the sequences to the left and to the right of the matching subsequence. This does not yield minimal edit sequences, but does tend to yield matches that “look right” to people.

Time-complexity: cubic for worst case.
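A short usage sketch (the candidate sentences here are made up for illustration; the real system runs the query against the retrieved corpus sentences):

    import difflib

    candidates = ["it has been a great pleasure",
                  "it is a great pleasure to be here",
                  "think about the public health problem"]

    # Return up to n candidates whose similarity ratio to the query is at least `cutoff`
    matches = difflib.get_close_matches("it's been a great pleasure", candidates, n=3, cutoff=0.5)
    print(matches)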

PATTERN MATCHING: THE GESTALT APPROACH

Gestalt is a word that describes how people can recognize a pattern as a functional unit that has properties not derivable by summation of its parts. For example, a person can recognize a picture in a connect-the-dots puzzle before finishing or even beginning it. This process of filling in the missing parts by comparing what is known to previous observations is called gestalt.

The Ratcliff/Obershelp pattern-matching algorithm uses this same process to decide how similar two one-dimensional patterns are. Since text strings are one dimensional, this algorithm returns a value that you can use as a confidence factor, or percentage, showing how alike any two strings are.

The algorithm works by examining two strings passed to it and locating the largest group of characters in common. The algorithm uses this group of characters as an anchor between the two strings. The algorithm then places any group of characters found to the left or the right of this anchor on a stack for further examination. This procedure is repeated for all substrings on the stack until there is nothing left to examine. The algorithm calculates the score returned as twice the number of characters found in common divided by the total number of characters in the two strings; the score is returned as an integer, reflecting a percentage match.

Example: suppose you want to compare the similarity between the word 'Pennsylvania' and the mangled spelling 'Pencilvaneya'. The largest common group of characters that the algorithm finds is 'lvan'. The two sub-groups remaining to the left are 'Pennsy' and 'Penci', and to the right are 'ia' and 'eya'. The algorithm places both of these string sections on the stack to be examined and advances the current score to eight, twice the number of characters found in common. The substrings 'ia' and 'eya' are next to come off the stack and are examined. The algorithm finds one character in common: 'a'. The score is advanced to ten. The substrings to the left, 'i' and 'ey', are placed on the stack, but are immediately removed and found to contain no character in common. Next, the algorithm pulls 'Pennsy' and 'Penci' off the stack. The largest common substring found is 'Pen'. The algorithm advances the score by 6, so that it is now 16. There is nothing to the left of 'Pen', but to the right are the substrings 'nsy' and 'ci', which are pushed onto the stack. When the algorithm pulls 'nsy' and 'ci' off next, it finds no characters in common. The stack is now empty and the algorithm is ready to return the similarity value found: a score of 16 out of a total of 24, which means the two strings are about 67 percent alike.
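The same figure can be reproduced with Python's SequenceMatcher:

    from difflib import SequenceMatcher

    sm = SequenceMatcher(None, "Pennsylvania", "Pencilvaneya")
    print(sm.get_matching_blocks())   # the blocks for 'Pen', 'lvan' and the final 'a'
    print(round(sm.ratio(), 2))       # 2 * 8 / 24 = 0.67, i.e. about 67 percent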

5. Results

The following are the results of some of the queries. For each query there are three result sets, depending on the matching scoring function used; this can be seen in the dropdown box in the screenshots. Edit Distance means the normalized Levenshtein edit distance. Matching Metric means our own implementation of the algorithm for meaning-equivalent sentences. Diff Matcher means the get_close_matches function from Python's difflib module, which is based on the gestalt approach.

Query-1: ladies and gentlemen we are going to sleep


Query-2: it's been a great pleasure


Query-3: think about the public health problem


Query-4: democracy and the rule of law

Query-5: Approval of Parliament


Query-6: automobiles are important today.

We show two results, one with synonym matching and one without; both use the Diff Matcher. Note that with synonym matching, results containing 'car' are also retrieved.


6. Conclusion

We can see that the results are as close as possible to the original query. Furthermore, our own metric gives results that are as good as, and sometimes visibly better than, those of get_close_matches and the normalized edit distance, since it considers the same factors as the other metrics and in addition considers the meaning of the sentences.

Quantitatively comparing our metric with these other metrics, and improving it further, is promising future work.