
Normalization of Vietnamese tweets on Twitter

Vu H. Nguyen1, Hien T. Nguyen1, and Vaclav Snasel2

1 Faculty of Information Technology, Ton Duc Thang University, Ho Chi Minh City, Vietnam
{nguyenhongvu, hien}@tdt.edu.vn
2 Faculty of Electrical Engineering and Computer Science, VSB-Technical University of Ostrava, Czech Republic
[email protected]

Abstract. We study the task of noisy text normalization, focusing on Vietnamese tweets. This task aims to improve the performance of applications that mine or analyze the semantics of social media content, as well as other social network analysis applications. Since tweets on Twitter are noisy, irregular, short, and contain acronyms and spelling errors, processing them is more challenging than processing news or other formal texts. In this paper, we propose a method that normalizes Vietnamese tweets by detecting non-standard words and spelling errors and correcting them. The method combines a language model with dictionaries and Vietnamese vocabulary structures. We build a dataset of 1,360 Vietnamese tweets to evaluate the proposed method. Experimental results show that our method achieves encouraging performance, with an F1-score of 89%.

Keywords: Normalization of noisy texts, spelling error detection and correction, Twitter.

1 Introduction

Nowadays, popular online social networks (OSNs) such as Twitter and Facebook have become some of the most effective channels for users to communicate and share information. Due to the huge number of online social network users, an enormous amount of content is created every day. According to Twitter's statistics in 2011, the number of tweets sent per day reached 140 million1. Unlike news or authored textual web content, the content on OSNs is short, noisy, irregular, and temporally dynamic. In the case of Twitter, a posting is limited to 140 characters. Therefore, users tend to use acronyms, non-standard words, or social tokens. Moreover, they tend to compose tweets and comments quickly, which causes spelling mistakes and typos.

In this paper, we propose a method to normalize Vietnamese tweets by detecting non-standard words and spelling errors and correcting them. The method helps improve the performance of applications that mine or analyze the semantics of social media content, as well as other social network analysis applications. Many methods have been proposed for text normalization. Most of them normalize texts written in English [1, 2, 7, 9, 10, 19, 24, 25, 27], and some target other languages such as Chinese [16, 26, 28] or Arabic [15, 23]. We found several methods for Vietnamese spell checking in the literature; however, those methods did not take postings on online social networks into account. The work presented in this paper aims to fill that gap.

1 https://blog.twitter.com/2011/numbers

This paper presents the first attempt to build a system for normalizing Vietnamese tweets. We propose a method consisting of three steps: (i) preprocessing tweets, (ii) detecting non-standard words and words misspelled by typing, and (iii) normalizing and correcting those errors. For example, the tweet "Tooi đi hocj" can be normalized as "Tôi đi học" (I go to school), and "Banj teen gif?" can be normalized as "Bạn tên gì?" (What is your name?). Other examples of Twitter messages and their corresponding normalized forms are shown in Table 1. We also propose a method to improve the word-similarity measure based on the Dice coefficient [5]; with this improvement, the similarity coefficient increases significantly.

Our contributions in this paper are three-fold: (1) we propose a method to detect and normalize Vietnamese tweets based on dictionaries and Vietnamese vocabulary structures combined with a language model, (2) we improve the Dice coefficient for measuring the similarity of two words, and (3) we build a dataset of 1,360 Vietnamese tweets for evaluating methods that normalize short texts on OSNs. The rest of this paper is organized as follows: Section 2 presents related work, Section 3 presents our proposed method, Section 4 presents experiments and results, and finally, we draw conclusions in Section 5.

Table 1. Tweets with spelling errors and their normalized forms.

Spelling error tweets → Normalized tweets

Trời ddang mưa → Trời đang mưa (It is raining)

Hôm nay, siinh viên DDaijjhọc Tôn DDuwcss Thắng được nghỉ học → Hôm nay, sinh viên Đại học Tôn Đức Thắng được nghỉ học (Today, students of Ton Duc Thang University were allowed to be absent.)

ngày maii là Têt rồi → ngày mai là Tết rồi (tomorrow is our traditional Tet holiday)

2 Related work

Nowadays, there is much research on spelling error detection and normalization; typically, each study targets a specific language. For English, most earlier work on automatic error correction addressed spelling errors and built models of correct usage on native English data [9, 2, 1]. In [25], to normalize non-standard words, the authors developed a taxonomy of non-standard words and then investigated the application of several general techniques, including n-gram language models, decision trees, and weighted finite-state transducers, to the entire range of non-standard word types. In [10] and [7], discriminative models were used for error detection and word-level error correction, respectively. In [24], a graph-based text normalization method was proposed that utilizes both contextual and grammatical features of text. A log-linear model was used in [27] to characterize the relationship between standard and non-standard tokens. In [19], two character-level methods were proposed for the abbreviation modeling aspect of the noisy channel model: a statistical classifier using language-based features to decide whether a character is likely to be removed from a word, and a character-level machine translation model. For Chinese, the majority of studies used language models [26, 28, 16], while [18] used an unsupervised model and discriminative reranking. For Arabic, recent research used supervised learning [15] and character-based language models [23]. For Vietnamese, there have been several studies involving word, phrase, and sentence analysis, ambiguity handling, and dictionary construction (VLSP2 [11] [8]), and, most recently, studies using n-gram language models ([21] [17]).

In the field of social networks, only a few studies handle spelling errors. For example, [12] and [13] detected and handled errors based on morphophonemic similarity. In [3], non-standard words in online social networks were detected and handled using similarity coefficients such as Dice, Jaccard, and Ochiai. [14] used random walks on a contextual-similarity bipartite graph, constructed from n-gram sequences over a large unlabeled text corpus, to normalize social text. In [25], a novel method was proposed for normalizing and morphologically analyzing Japanese noisy text by generating both character-level and word-level normalization candidates and using discriminative methods to formulate a cost function. An approach to normalizing Malay Twitter messages based on corpus-driven analysis was proposed in [22]. Most recently, [4] proposed a modular approach for lexical normalization applied to Spanish tweets; the proposed system includes modules that detect and generate correction candidates for each out-of-vocabulary word and rank the candidates to select the best one.

In this paper, we propose a mechanism to detect and normalize spelling errors in Vietnamese tweets based on dictionaries and Vietnamese vocabulary structures combined with a language model. Tweets with spelling errors are detected based on the vocabulary structures; after the detection phase, the system first normalizes the erroneous tweets based on the structure of vowels and consonants, and then uses the language model to calculate the degree of similarity. For the similarity calculation, we have studied and proposed an improvement based on the Dice coefficient [5]. The language model used here is a 3-gram model generated with SRILM3. The sentence with the highest degree of similarity is selected as the final output.

2 http://vlsp.vietlp.org:8080/demo/

3 Proposed method

3.1 The theoretical background

Currently, there are several points of view on what constitutes a Vietnamese word. To meet the goal of automatic error detection, we adopt the view in the thesis of Dinh Dien [6]: a Vietnamese word is composed of Vietnamese morphemes. Following the syllable dictionary of Hoang Phe [20], we split a word into two basic parts, the consonant and the syllable; words without a consonant may exist.

– Consonant: Vietnamese has 26 consonants: "b", "ch", "c", "d", "đ", "gi", "gh", "g", "h", "kh", "k", "l", "m", "ngh", "ng", "nh", "n", "ph", "q", "r", "s", "th", "tr", "t", "v", "x", and 8 tail consonants: "c", "ch", "n", "nh", "ng", "m", "p", "t".

– Single vowel: Vietnamese has 12 single vowels: "a", "ă", "â", "e", "ê", "i", "o", "ô", "ơ", "u", "ư", "y".

– Syllable: the combination of vowels and a final consonant. According to the syllable dictionary of Hoang Phe, Vietnamese has a total of 158 syllables. Also according to this dictionary, vowels do not occur consecutively more than once, except in the syllables "ooc" and "oong".

3.2 Preprocessing

The original text of a tweet can contain various noisy elements such as emotion symbols, hashtag symbols, URLs, @username mentions, etc. These noisy symbols can affect the performance of the system; therefore, we clean them up.

Clean up repeated characters: many tweets contain repeated characters (e.g. "Anh yêuuuuuuuu emmmm nhiềuuuuuuuuuu lắmmmmmmmmmm", which after cleanup becomes "Anh yêu em nhiều lắm" (I love you so much)). For Vietnamese, we can clean such tweets based on Vietnamese vowels and consonants: normally, Vietnamese vowels do not appear more than twice in a row, and Vietnamese consonants appear only once.
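
The cleanup rule above can be sketched as follows. This is a minimal illustration with a name of our own choosing; it collapses every repeated run to a single character, whereas a full implementation would also preserve the legitimate doubled vowels (such as the "oo" in "oong") by checking them against the syllable dictionary.

```python
import re

def collapse_repeats(text):
    # Collapse every run of a repeated character to a single occurrence.
    # Note: the rule in the text allows vowels to appear up to twice (e.g.
    # the "oo" in "oong"); a full implementation would validate such doubles
    # against the syllable dictionary instead of always collapsing.
    return re.sub(r"(.)\1+", r"\1", text)

print(collapse_repeats("Anh yêuuuuuuuu emmmm nhiềuuuuuuuuuu lắmmmmmmmmmm"))
# → Anh yêu em nhiều lắm
```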

3.3 Spelling error detection

To perform spelling error detection, we synthesized and built a dictionary of all Vietnamese words, containing more than 7,300 words. A word is identified as an error if it does not appear in the dictionary. After a word is identified as an error, it is analyzed to classify the error and process it.
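
The dictionary lookup described above can be sketched as follows; the word set here is a tiny illustrative sample, not the actual 7,300-word dictionary, and the function name is ours.

```python
# Tiny illustrative sample; the real system uses a 7,300-word Vietnamese dictionary.
DICTIONARY = {"trời", "đang", "mưa", "tôi", "đi", "học"}

def detect_errors(tokens):
    # A token is flagged as a spelling error if it does not appear in the dictionary.
    return [t for t in tokens if t.lower() not in DICTIONARY]

print(detect_errors(["Trời", "ddang", "mưa"]))
# → ['ddang']
```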

Vietnamese normally has two kinds of errors: the first is caused by typing mistakes and the second is misspelling.

3 http://www.speech.sri.com/projects/srilm/

3.3.1 Typing error

To compose Vietnamese text, there are two popular input methods: Telex and VNI. Each input method has its own letter combinations for forming Vietnamese vowels and marks. Compared with Latin characters, Vietnamese has some extra vowels: â, ă, ê, ô, ơ; one extra consonant: đ; and 5 types of mark: acute accent ("á"), grave accent ("à"), hook accent ("ả"), tilde ("ã"), and heavy accent ("ạ"). The combinations of vowels and marks form the identity of the Vietnamese language.

Example:

– With Telex typing, the character combinations forming Vietnamese vowels are: aa: â, aw: ă, ee: ê, oo: ô, ow: ơ, uw: ư, and one consonant dd: đ. For marks: s: acute accent, f: grave accent, r: hook accent, x: tilde, j: heavy accent.

– Similarly, with VNI typing: a6: â, a8: ă, e6: ê, o6: ô, o7: ơ, u7: ư, d9: đ. For marks: 1: acute accent, 2: grave accent, 3: hook accent, 4: tilde, 5: heavy accent.

Because tweets are very short and typed quickly, these input methods often introduce errors. For example:

– The word "Nguyễn" can be mistyped as "nguyeenx", "nguyênx", or "nguyeenxx" with Telex typing, and as "nguye6n4", "nguyên4", or "nguye6n44" with VNI typing.

– The word "người" can be mistyped as "ngươif", "ngươfi", "nguowfi", "nguowif", "nguofwi", "nguofiw", "nguoifw", "nguoiwf", "nguowff" with Telex typing, and as "nguwowi2", "ngươ2i", "nguo72i", "nguo7i2", "nguo27i", "nguo2i7", "nguoi27", "nguoi72" with VNI typing.

To handle this issue, we propose a set of syllable rules that map each syllable–mark combination to its error forms. For example, the syllable "an", when combined with marks, gives five Vietnamese syllables: "àn", "án", "ản", "ãn", "ạn". The system then builds the set of rules mapping each correct syllable to its error forms, as follows:
– "án": "asn", "ans", "a1n", "an1"
– "àn": "afn", "anf", "a2n", "an2"
– "ản": "arn", "anr", "a3n", "an3"
– "ãn": "axn", "anx", "a4n", "an4"
– "ạn": "ajn", "anj", "a5n", "an5"
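
The rule set above can be sketched as a lookup table mapping each Telex/VNI error form back to its correct marked syllable. Only the "an" rules listed in the text are included, and the names are ours; a full system would have one such table for every syllable/mark combination.

```python
# Error form -> correct syllable, taken from the rules for "an" above.
RULES = {
    "asn": "án", "ans": "án", "a1n": "án", "an1": "án",
    "afn": "àn", "anf": "àn", "a2n": "àn", "an2": "àn",
    "arn": "ản", "anr": "ản", "a3n": "ản", "an3": "ản",
    "axn": "ãn", "anx": "ãn", "a4n": "ãn", "an4": "ãn",
    "ajn": "ạn", "anj": "ạn", "a5n": "ạn", "an5": "ạn",
}

def normalize_syllable(syllable):
    # Return the corrected syllable, or the input unchanged if no rule matches.
    return RULES.get(syllable, syllable)

print(normalize_syllable("ans"), normalize_syllable("an5"))
# → án ạn
```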

3.3.2 Misspelling

This kind of error is common in Vietnamese and usually stems from mistakes in pronunciation. Some examples:
– Wrong mark: "quyển sách" (book) to "quyễn sách"
– Initial consonant error: "bóng chuyền" (volleyball) to "bóng truyền"
– Final consonant error: "bài hát" (song) to "bài hác"
– Regional error: "tìm kiếm" (find) to "tìm kím"

3.4 Normalization

For each detected spelling error, the system first uses the vocabulary structures and the set of syllable rules to normalize it; the result is then input to the next phase, which measures its similarity to words in the dictionary to find the word with the highest similarity. If the resulting word still does not exist in the dictionary, the system uses the n-gram model to normalize the erroneous word.

3.4.1 Similarity of two words

In this paper, to measure the similarity of two words, we use the Dice coefficient [5] with our improvement. To apply it, we split each word into character bigrams. For example, for the two words "nguyen" and "nguyn", the bigrams are: bigram_nguyn = {ng, gu, uy, yn} and bigram_nguyen = {ng, gu, uy, ye, en}.

Dice Coefficient:

The Dice coefficient is a statistic for comparing the similarity of two samples, developed by Lee Raymond Dice [5]. The Dice coefficient of two words wi and wj over their bigrams is calculated by Equation (1):

Dice(wi, wj) = 2 × |bigram_wi ∩ bigram_wj| / (|bigram_wi| + |bigram_wj|)    (1)

Where:

– |bigram_wi| and |bigram_wj|: the number of bigrams of wi and wj

– |bigram_wi ∩ bigram_wj|: the number of bigrams appearing in both wi and wj.

If two words are identical, the Dice coefficient is 1. The higher the Dice coefficient, the more similar the two words, and vice versa.
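
Equation (1) can be implemented directly; a short sketch (the function names are ours) reproducing the worked example from the text:

```python
def bigrams(word):
    # Character bigrams, e.g. "nguyen" -> {"ng", "gu", "uy", "ye", "en"}.
    return {word[i:i + 2] for i in range(len(word) - 1)}

def dice(wi, wj):
    # Equation (1): twice the shared bigram count over the total bigram count.
    bi, bj = bigrams(wi), bigrams(wj)
    return 2 * len(bi & bj) / (len(bi) + len(bj))

# The worked example from the text: the two bigram sets share {ng, gu, uy}.
print(round(dice("nguyen", "nguyn"), 3))
# → 0.667
```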

Proposed method to improve Dice Coefficient:

Observing the experimental data, we found that the Dice coefficient is less accurate when the misspelling occurs near the end of a word: an error in a character close to the last one loses at least the similarity of the two final bigrams. In particular, for words of 3 characters, the degree of similarity can be 0. For example: Dice("rất", "rát") = 0 and Dice("gân", "gần") = 0.

From this observation, we propose an improvement to the Dice coefficient. We form one additional bigram by pairing the first character of a word with its last character. If this pair differs between the two words, the system uses the coefficient in Equation (1); otherwise, it uses Equation (2) below:

iDice(wi, wj) = 2 × (|bigram_wi ∩ bigram_wj| + 1) / (|bigram_wi| + |bigram_wj| + 2)    (2)

Let fbigram_w be this additional bigram of w, i.e. the pair of the first and last characters of w. The improved Dice coefficient is expressed in Equation (3):

fDice(wi, wj) = Dice(wi, wj) if fbigram_wi is different from fbigram_wj; iDice(wi, wj) otherwise    (3)

To illustrate the improved Dice coefficient, take the two words "nguyen" and "nguyn" from the previous section, for which |bigram_wi ∩ bigram_wj| = 3. Combining the first and last characters of each word yields the same pair "nn", so the improved coefficient gives fDice("nguyen", "nguyn") = 0.727, whereas the original coefficient gives Dice("nguyen", "nguyn") = 0.667.

Table 2 shows the similarity of several word pairs measured with the original Dice coefficient and with the improved method. With the improved method, the similarities increase noticeably.

Table 2. Word-pair similarities under the original Dice coefficient and the improved Dice coefficient.

Error Word Correct word Dice fDice

rat rất 0 0.333

rat rác 0 0

Nguễn Nguyễn 0.667 0.727

Nguễn Nguy 0.571 0.571

Tượg Tượng 0.571 0.667

Tượg Tương 0.286 0.444

3.4.2 Similarity of two sentences

Suppose we need to measure the similarity of two sentences S1 = w1, w2, w3, ..., wn and S2 = w'1, w'2, w'3, ..., w'n. We compare each pair of corresponding words using the improved Dice coefficient, and then compute the similarity of the two sentences by Equation (4):

Sim(S1, S2) = ( Σ_{i=1..n} fDice(wi, w'i) ) / n    (4)

Where:

– wi and w'i: corresponding words of S1 and S2.
– n: the number of words.

If two sentences are identical, their degree of similarity (Sim) is 1. The higher the Sim value, the more similar the two sentences, and vice versa.
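
Equation (4) can be sketched as follows, assuming the two sentences are aligned word by word; fDice is reproduced here so the sketch is self-contained, and the names are ours.

```python
def bigrams(word):
    return {word[i:i + 2] for i in range(len(word) - 1)}

def f_dice(wi, wj):
    # Improved Dice coefficient from the previous section (Equations (1)-(3)).
    bi, bj = bigrams(wi), bigrams(wj)
    if (wi[0], wi[-1]) == (wj[0], wj[-1]):
        return 2 * (len(bi & bj) + 1) / (len(bi) + len(bj) + 2)
    return 2 * len(bi & bj) / (len(bi) + len(bj))

def sentence_sim(s1, s2):
    # Equation (4): average word-level fDice over aligned word pairs.
    pairs = list(zip(s1.split(), s2.split()))
    return sum(f_dice(a, b) for a, b in pairs) / len(pairs)

print(round(sentence_sim("trời ddang mưa", "trời đang mưa"), 3))
# → 0.857
```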

4 Experiments

4.1 The 3-gram language model and its training data

To handle misspellings and spelling errors that cannot be normalized by the Vietnamese vocabulary structures and the set of syllable rules, we use a 3-gram language model. The model was built with SRILM from a large corpus collected from online newspapers (www.vnexpress.net, http://nld.com.vn/, http://dantri.com.vn/, ...). The data covers many fields such as current events, world news, law, education, science, business, sports, and entertainment, with a total of over 429,310 articles, amounting to about 1,045 MB. The 3-gram model built by SRILM is about 1,460 MB. To ensure the accuracy of the results, we keep only the trigrams whose frequency of occurrence is greater than 5; the resulting 3-gram model is about 81 MB.

4.2 Experiment results

To test our system, we use a data set randomly collected from Vietnamese tweets, consisting of 1,360 distinct tweets.

To compare the impact of the training data on the language model, we ran the test twice with language models built from two input data sets: the first contains 130 MB randomly sampled from the 1,045 MB of data mentioned above, and the second contains the entire 1,045 MB. The 3-gram model (frequency greater than 5) from the first set is about 8 MB. In both cases, we use the improved Dice coefficient to measure the similarity of two sentences (trigrams). The results are shown in Table 3: the 3-gram model built from the second set achieves higher accuracy than the one built from the first set.

Table 3. Results using the improved Dice coefficient with the two data sets for the 3-gram model.

Data set Total error Detected error Correct fixed Wrong fixed Precision

1 1,360 1,342 1,072 270 79.88%

2 1,360 1,342 1,207 135 89.94%

To compare the improved Dice coefficient with the original one, we ran the test with the 3-gram model built from the entire data set (1,045 MB), using both Dice and fDice to measure the similarity of two sentences. We use three metrics, Precision, Recall, and balanced F-measure, to evaluate our system.

– Precision (P): the number of correctly fixed errors divided by the total number of detected errors.
– Recall (R): the number of correctly fixed errors divided by the total number of errors.
– Balanced F-measure: F1 = 2 × P × R / (P + R)

Combining these with the results above, the full results are shown in Table 4.

Table 4. Results using fDice and Dice with the 3-gram model built from the entire data set.

Method Precision Recall F-Measure

Dice 84.8% 83.68% 84.23%

fDice 89.94% 88.75% 89.34%

The results in Table 4 show that our improved Dice coefficient achieves higher performance than the original Dice coefficient.

5 Conclusion

In this paper, we present the first attempt to normalize Vietnamese tweets on Twitter. Our proposed method combines a language model with dictionaries and Vietnamese vocabulary structures. We also extended the original Dice coefficient to improve the performance of the similarity measure between two words. To evaluate the proposed method, we built a dataset of 1,360 Vietnamese tweets. The experimental results show that our method achieves relatively high performance, with precision approaching 90%, recall over 88.7%, and F-measure over 89%. Moreover, our improved word-similarity measure based on the Dice coefficient outperforms the original Dice coefficient. We plan to collect larger datasets and to build and test language models with not only 3-grams but also 2-grams and 4-grams, so that we can compare several datasets and models.

References

1. Banko, M., Brill, E.: Scaling to very very large corpora for natural language disambiguation. In: Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics. pp. 26–33 (2001)

2. Carlson, A., Fette, I.: Memory-based context-sensitive spelling correction at web scale. In: Proceedings of the Sixth International Conference on Machine Learning and Applications. pp. 166–171 (2007)

3. Choi, Kim, et al.: A method for normalizing non-standard words in online social network services: A case study on Twitter. In: Context-Aware Systems and Applications, Second International Conference, ICCASA 2013. pp. 359–368 (2014)

4. Cotelo, J.M., et al.: A modular approach for lexical normalization applied to Spanish tweets. Expert Systems with Applications 42.10, 4743–4754 (2015)

5. Dice, L.R.: Measures of the amount of ecologic association between species. Ecology 26.3, 297–302 (1945)

6. Dien, D.: Building an English–Vietnamese Bilingual Corpus. Ph.D. thesis, University of Social Sciences and Humanity of HCM City, Vietnam (2005)

7. Duan, H., et al.: A discriminative model for query spelling correction with latent structural SVM. In: Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. pp. 1511–1521 (2012)

8. Duy, N.T.N., et al.: An approach in Vietnamese spell checking. In Vietnamese. Bachelor's thesis, University of Science, Ho Chi Minh City (2004)

9. Golding, A.R., Roth, D.: A winnow-based approach to context-sensitive spelling correction. Machine Learning 34.1-3, 107–130 (1999)

10. Habash, N., Roth, R.M.: Using deep morphology to improve automatic error detection in Arabic handwriting recognition. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Volume 1. pp. 875–884 (2011)

11. Hai, N.D., et al.: Syntactic parsing of Vietnamese sentences and its application in spell checking. In Vietnamese. Bachelor's thesis, University of Science, Ho Chi Minh City (1999)

12. Han, B., Baldwin, T.: Lexical normalisation of short text messages: Makn sens a #twitter. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Volume 1. pp. 368–378 (2011)

13. Han, B., et al.: Lexical normalization for social media text. ACM Transactions on Intelligent Systems and Technology 4.1, 621–633 (2013)

14. Hassan, H., Menezes, A.: Social text normalization using contextual graph random walks. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics. pp. 1577–1586. Association for Computational Linguistics (2013)

15. Hassan, Y., et al.: Arabic spelling correction using supervised learning. In: Proceedings of the EMNLP 2014 Workshop on Arabic Natural Language Processing. pp. 121–126. Association for Computational Linguistics (2014)

16. Huang, Q., et al.: Chinese spelling check system based on tri-gram model. In: Proceedings of the Third CIPS-SIGHAN Joint Conference on Chinese Language Processing. pp. 173–178 (2014)

17. Huong, N.T.X., et al.: Using large n-gram for Vietnamese spell checking. In: Proceedings of the Sixth International Conference KSE 2014. pp. 617–627. Springer International Publishing (2015)

18. Li, C., Liu, Y.: Improving text normalization via unsupervised model and discriminative reranking. In: Proceedings of the ACL 2014 Student Research Workshop. pp. 86–93. Association for Computational Linguistics (2014)

19. Pennell, D.L., Liu, Y.: Normalization of informal text. Computer Speech and Language 28.1, 256–277 (2014)

20. Phe, H.: Syllable Dictionary. Dictionary Center, Hanoi Encyclopedia Publishers (2011)

21. Quang, N.: Language model and word segmentation in Vietnamese spell checking. In Vietnamese. Bachelor's thesis, University of Engineering and Technology, Hanoi National University (2012)

22. Saloot, M.A., et al.: An architecture for Malay tweet normalization. Information Processing & Management 50.5, 621–633 (2014)

23. Shaalan, K.F., et al.: Arabic word generation and modelling for spell checking. In: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12). European Language Resources Association (2012)

24. Sonmez, C., Ozgur, A.: A graph-based approach for contextual text normalization. In: Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 313–324. Association for Computational Linguistics (2014)

25. Sproat, R., et al.: Normalization of non-standard words. Computer Speech and Language 15.3, 287–333 (2001)

26. Wu, S.-H., et al.: Reducing the false alarm rate of Chinese character error detection and correction. In: Proceedings of the CIPS-SIGHAN Joint Conference on Chinese Language Processing (CLP 2010). pp. 54–61 (2010)

27. Yang, Y., Eisenstein, J.: A log-linear model for unsupervised text normalization. In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. pp. 61–72. Association for Computational Linguistics (2013)

28. Yeh, J.-F., et al.: Chinese word spelling correction based on n-gram ranked inverted index list. In: Proceedings of the Seventh SIGHAN Workshop on Chinese Language Processing (SIGHAN-7). pp. 43–48 (2013)