Language Identification of Search Engine Queries
Hakan Ceylan, Department of Computer Science, University of North Texas, Denton, TX 76203, [email protected]
Yookyung Kim, Yahoo! Inc., 2821 Mission College Blvd., Santa Clara, CA, [email protected]
ACL 2009
Outline
• Introduction
• Data Generation
• Language Identification
• Conclusions and Future Work
Introduction(1)
• Language identification: decide in which language a given text is written
• The problem is heavily studied
• It is of critical importance to search engines for queries
• Challenge: lack of any standard or publicly available data set
Introduction(2)
• There are cases where a correct identification of the language is not necessary.
  Example: the query "homo sapiens", entered by a user from Spain.
• Add a non-linguistic feature to the system
Introduction(3)
Data Generation(1)
• Data set: constructed from queries with clicked URLs
  Source: Yahoo! Search Engine, for each language
  Time: a three-month period
Data Generation(2)
• Preprocess:
  Remove any numbers, special characters, and extra spaces.
  Lowercase all the letters of the queries.
  Calculate the frequencies of the URLs for each query.
• A web page is 474 words long on average
• Identify the language of each web page using one of the existing methods.
Data Generation(3)
• Use Table 1 (T1) and Table 2 (T2) to store the above information
  T1: [ q, u, f_u ]
  T2: [ u, l ]
  q: query
  u: a unique URL
  l: language identified for u
  f_u: the frequency of u
• Combine T1 and T2 into T3
  T3: [ q, l, f_l, c_{u,l} ]
  l: a language
  f_l: the count of clicks for l
  c_{u,l}: the count of unique URLs in language l
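The join of T1 and T2 into T3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `build_t3` and the tuple layouts are hypothetical, and URLs with no identified language are simply skipped, which the slides do not specify.

```python
from collections import defaultdict


def build_t3(t1, t2):
    """Combine T1 rows (q, u, f_u) and T2 rows (u, l) into T3 rows (q, l, f_l, c_ul).

    Hypothetical helper: the tuple layouts mirror the slide, nothing more.
    """
    url_lang = dict(t2)            # u -> l
    clicks = defaultdict(int)      # (q, l) -> f_l, total clicks in language l
    urls = defaultdict(set)        # (q, l) -> set of unique URLs in language l
    for q, u, f_u in t1:
        lang = url_lang.get(u)
        if lang is None:
            continue               # assumption: drop URLs without a language label
        clicks[(q, lang)] += f_u
        urls[(q, lang)].add(u)
    return sorted((q, l, clicks[(q, l)], len(urls[(q, l)])) for (q, l) in clicks)
```

Each query then has one T3 row per language, so a query whose clicks fall into several languages shows up as several rows.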
Data Generation(4)
• The data is noisy.
  1. A query may map to more than one language.
     Solution: assign a weight w_{q,l} to each query for each language and set a threshold parameter W;
     if w_{q,l} < W, remove the query.
  2. Navigational queries
     Example: ACL 2009
Data Generation(5)
     Solution: set two threshold parameters F and U; if F_q > F or U_q < U, remove the query.
• Algorithm
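The two noise filters can be sketched as a single function over the T3 rows of one query. Several things here are assumptions rather than the authors' definitions: the weight w_{q,l} is taken to be the share of the query's clicks that fall in language l, F_q is taken to be the query's total click count, and U_q its total number of unique URLs. The removal condition is applied as the slide states it, and the name `keep_query` is hypothetical.

```python
def keep_query(rows, W=1.0, F=50, U=5):
    """Return the language label for a query, or None if the query is filtered out.

    rows: T3 rows for ONE query, each (l, f_l, c_ul).
    Assumed definitions (not confirmed by the slides):
      w_{q,l} = f_l / F_q   share of clicks in language l
      F_q     = total clicks for the query
      U_q     = total unique URLs for the query
    """
    F_q = sum(f_l for _, f_l, _ in rows)
    U_q = sum(c_ul for _, _, c_ul in rows)
    if F_q == 0:
        return None
    lang, f_best, _ = max(rows, key=lambda r: r[1])
    w = f_best / F_q
    if w < W:                  # filter 1: query maps to more than one language
        return None
    if F_q > F or U_q < U:     # filter 2: condition as written on the slide
        return None
    return lang
```

The defaults W = 1, F = 50, U = 5 are the parameter values reported in the slides; with W = 1, a single click in a second language is enough to drop the query under the assumed weight.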
Data Generation(6)
• How to tune the parameters: depends on the size of the data set (Silverstein et al., 1999)
  W = 1, F = 50, U = 5
• How many queries are filtered out: 5% to 10% of the queries
• Pick 500 queries randomly and have them annotated by humans
  Category-1: the query does not contain any foreign terms.
  Category-2: there exist some foreign terms, but the query would still be expected to bring web pages in the same language.
  Category-3: the query belongs to other languages, or all the terms are foreign to the annotator.
Data Generation(7)
• How much of this multilinguality does parameter selection eliminate?
  Result: Category-1: 47.6%, Category-1+2: 60.2%
Language Identification(1)
• Implement three models, each using a different existing feature
  1. Statistical model
  2. Knowledge based model
  3. Morphological model
• EuroParl corpora
• Combine all three models in a machine learning framework using a novel approach
• Add a non-linguistic feature
Language Identification(2)
• Test set: 3500 human-annotated queries
Statistical model
• Character-based n-gram features (n = 1 to 7)
• Vocabulary from the training corpus (EuroParl)
• Generate a probability distribution from these counts
• This can be done with the SRILM Toolkit, using Kneser-Ney discounting and interpolation
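As a rough illustration of the statistical model, the sketch below counts character n-grams (n = 1 to 7) per language and picks the language with the highest log-probability for a query. It is not the SRILM setup from the slides: add-one smoothing is swapped in for Kneser-Ney discounting to keep the example short, and the names `char_ngrams`, `CharNgramModel`, and `identify` are hypothetical.

```python
import math
from collections import Counter


def char_ngrams(text, max_n=7):
    """Yield all character n-grams of length 1..max_n from a space-padded, lowercased text."""
    padded = f" {text.lower()} "
    for n in range(1, max_n + 1):
        for i in range(len(padded) - n + 1):
            yield padded[i:i + n]


class CharNgramModel:
    """Bag-of-n-grams model with add-one smoothing (a stand-in for Kneser-Ney)."""

    def __init__(self, corpus_lines, max_n=7):
        self.max_n = max_n
        self.counts = Counter()
        for line in corpus_lines:
            self.counts.update(char_ngrams(line, max_n))
        self.total = sum(self.counts.values())
        self.vocab = len(self.counts) + 1  # one extra slot for unseen n-grams

    def log_prob(self, query):
        denom = self.total + self.vocab
        return sum(math.log((self.counts[g] + 1) / denom)
                   for g in char_ngrams(query, self.max_n))


def identify(query, models):
    """Return the language whose model gives the query the highest log-probability."""
    return max(models, key=lambda lang: models[lang].log_prob(query))
```

In practice each model would be trained on the EuroParl text for its language; the tiny corpora one might pass in for a quick check only show the mechanics.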
Knowledge based model
• Word-based n-gram features (n = 1)
• Vocabulary from the training corpus (EuroParl)
• Generate a probability distribution from these counts
Morphological model
• Gather the affix information from corpora in an unsupervised way (Hammarström, 2006)
• Give a score for each affix
Language Identification(3)
• Performance
Decision tree classification
• Each model can complement the other in certain cases
• Training data: automatically annotated data set
• Feature: confidence score
• Use the kurtosis measure
Decision tree classification
• An example: for the query "the sovereign individual", the statistical model identifies it as English.
  k = 7.6 ≥ (4.47 + 1.96) = 6.43, so this query's confidence score is "en-HIGH"
• Implement the DT classifier with the Weka Machine Learning Toolkit (Witten and Frank, 2005)
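The confidence feature can be sketched as follows. This is a guess at the mechanics, not the authors' code: it assumes the kurtosis is computed over a model's per-language scores, that 4.47 and 1.96 in the example are the mean and standard deviation of kurtosis values, and that scores below the mean minus one standard deviation get a LOW label with everything in between labelled AVERAGE. Only the HIGH label appears in the slides; the function names are hypothetical.

```python
def kurtosis(values):
    """Fourth standardized moment (non-excess kurtosis) of a list of numbers."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    if m2 == 0:
        raise ValueError("kurtosis is undefined when all values are equal")
    return m4 / (m2 ** 2)


def confidence_label(lang, k, mu, sigma):
    """Map a kurtosis value k to a discrete confidence feature such as 'en-HIGH'.

    Assumption: thresholds are mu + sigma and mu - sigma; LOW and AVERAGE
    are illustrative labels that the slides do not spell out.
    """
    if k >= mu + sigma:
        level = "HIGH"
    elif k <= mu - sigma:
        level = "LOW"
    else:
        level = "AVERAGE"
    return f"{lang}-{level}"
```

With the numbers from the example, `confidence_label("en", 7.6, 4.47, 1.96)` gives "en-HIGH", since 7.6 exceeds 6.43. A peaked score distribution (one language far ahead of the rest) has high kurtosis, which is why it can serve as a confidence signal.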
Decision tree classification
• Outperforms all the individual models for each size on average
Decision tree classification
M_{l_i,l_j}: language l_i misclassified by the system as l_j
Non-linguistic feature
• The non-linguistic feature is the language information of the country the query comes from
• It helps the search engine in guessing the language
  Example: the query "how to tape for plantar fasciits" (labelled as Category-2) is classified as a Portuguese query
Non-linguistic feature
• Increase test set size to 430 queries
Conclusions
• A completely automated method to generate a reliable data set
• Built a decision tree classifier that improves the results on average
• Built a second classifier that takes into account the geographical information of the users
Future Work
• To improve the accuracy of data generation
• More careful examination of parameter values
• To extend the number of languages in the data set
• Consider other alternatives to the decision tree framework