Language Identification of Search Engine Queries Hakan Ceylan Yookyung Kim Department of Computer Science Yahoo! Inc. University of North Texas 2821 Mission College Blvd. Denton,TX,76203 Santa Clara,CA,95054 [email protected] [email protected] ACL 2009

Language Identification of Search Engine Queries


Page 1: Language Identification of Search Engine Queries

Language Identification of Search Engine Queries

Hakan Ceylan, Department of Computer Science, University of North Texas, Denton, TX 76203, [email protected]
Yookyung Kim, Yahoo! Inc., 2821 Mission College Blvd., Santa Clara, CA 95054, [email protected]

ACL 2009

Page 2: Language Identification of Search Engine Queries

Outline

• Introduction
• Data Generation
• Language Identification
• Conclusions and Future Work

Page 3: Language Identification of Search Engine Queries

Introduction(1)

• Decide in which language a given text is written

• It is heavily studied
• It is of critical importance to search engines for queries
• Challenge: lack of any standard or publicly available data set

Page 4: Language Identification of Search Engine Queries

Introduction(2)

• A case where correct identification of the language is not necessary:
  example: the query "homo sapiens", entered by a user from Spain
• Add a non-linguistic feature to the system

Page 5: Language Identification of Search Engine Queries

Introduction(3)

Page 6: Language Identification of Search Engine Queries

Data Generation(1)

• Data set: constructed from the queries with clicked URLs
  From: Yahoo! Search Engine, for each language
  Time: a three-month period

Page 7: Language Identification of Search Engine Queries

Data Generation(2)

• Preprocess:
  remove any numbers, special characters, and extra spaces
  lowercase all the letters of the queries
  calculate the frequencies of the URLs for each query
• A web page is 474 words long on average
• Identify the language of each web page using one of the existing methods
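
A minimal sketch of the query preprocessing step above, written in Python for illustration. The function name `normalize_query` is hypothetical, and "special characters" is assumed here to mean anything that is not a Unicode letter or whitespace.

```python
import re

def normalize_query(query):
    """Lowercase a query and drop digits, special characters, and extra spaces."""
    lowered = query.lower()
    # Assumption: punctuation, digits, and underscores all count as noise.
    cleaned = re.sub(r"[^\w\s]|[\d_]", " ", lowered)
    # Splitting and re-joining collapses any run of whitespace to one space.
    return " ".join(cleaned.split())
```

Letters outside ASCII are kept, since they carry much of the signal that separates one language from another.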

Page 8: Language Identification of Search Engine Queries

Data Generation(3)

• Use Table 1 (T1) and Table 2 (T2) to store the above information
  T1: [ q, u, f_u ]
  T2: [ u, l ]
    q: query
    u: a unique URL
    l: language identified for u
    f_u: the frequency of u
• Combine T1 and T2 into T3
  T3: [ q, l, f_l, c_{u,l} ]
    l: a language
    f_l: the count of clicks for l
    c_{u,l}: the count of unique URLs in language l
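
The combination of T1 and T2 into T3 can be sketched as a join on the URL followed by a per-(query, language) aggregation. This is an illustrative Python sketch: the function name `build_t3` and the choice to skip URLs with no identified language are assumptions, not details given in the slides.

```python
from collections import defaultdict

def build_t3(t1, t2):
    """Join T1 rows (q, u, f_u) with T2 rows (u, l) into T3 rows (q, l, f_l, c_ul)."""
    url_lang = dict(t2)          # u -> l
    clicks = defaultdict(int)    # (q, l) -> f_l, total clicks in language l
    urls = defaultdict(set)      # (q, l) -> unique URLs clicked in language l
    for q, u, f_u in t1:
        lang = url_lang.get(u)
        if lang is None:
            continue             # assumption: URLs without a language are skipped
        clicks[(q, lang)] += f_u
        urls[(q, lang)].add(u)
    return sorted((q, l, clicks[(q, l)], len(urls[(q, l)])) for (q, l) in clicks)
```

For a query clicked three times on one Spanish URL, twice on another, and once on an English URL, this yields one row per language: `f_l = 5, c_{u,l} = 2` for Spanish and `f_l = 1, c_{u,l} = 1` for English.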

Page 9: Language Identification of Search Engine Queries

Data Generation(4)

• The data is noisy.
  1. A query may map to more than one language.
     Solution: assign a weight w_{q,l} to each query for each language and set a threshold parameter W; if w_{q,l} < W, remove the query.
  2. Navigational queries.
     Example: ACL 2009

Page 10: Language Identification of Search Engine Queries

Data Generation(5)

  Solution: set two threshold parameters F and U; if F_q > F or U_q < U, remove the query.
• Algorithm
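
A hedged Python sketch of the noise filtering described on these slides, operating on T3 rows. The slides do not define w_{q,l}, F_q, or U_q, so the definitions below are assumptions: w_{q,l} is taken to be the share of a query's clicks that land on language l, F_q the query's total click count, and U_q its total number of unique clicked URLs. The removal conditions follow the slide text as written, and the default thresholds are the values the deck reports (W = 1, F = 50, U = 5).

```python
def filter_queries(t3, W=1.0, F=50, U=5):
    """Return {query: language} for the queries that survive both filters."""
    by_query = {}
    for q, lang, f_l, c_ul in t3:
        by_query.setdefault(q, []).append((lang, f_l, c_ul))

    kept = {}
    for q, rows in by_query.items():
        F_q = sum(f for _, f, _ in rows)   # assumed: total clicks for q
        U_q = sum(c for _, _, c in rows)   # assumed: total unique URLs for q
        if F_q == 0:
            continue
        best_lang, best_clicks, _ = max(rows, key=lambda r: r[1])
        w = best_clicks / F_q              # assumed definition of w_{q,l}
        if w < W:
            continue                       # query maps to more than one language
        if F_q > F or U_q < U:
            continue                       # threshold rule as stated on the slide
        kept[q] = best_lang
    return kept
```

With W = 1, a query is kept only when every click falls on pages of a single language.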

Page 11: Language Identification of Search Engine Queries

Data Generation(6)

• How to tune the parameters: dependent on the size of the data set (Silverstein et al., 1999)
  W = 1, F = 50, U = 5

• How many queries will be filtered: 5%~10% of the queries

• Pick 500 queries randomly and have them annotated by humans

  Category-1: the query does not contain any foreign terms.
  Category-2: the query contains some foreign terms but would still be expected to bring web pages in the same language.
  Category-3: the query belongs to another language, or all the terms are foreign to the annotator.

Page 12: Language Identification of Search Engine Queries

Data Generation(7)

• How much multilinguality does this parameter selection eliminate?
  Result: Category-1: 47.6%, Category-1+2: 60.2%

Page 13: Language Identification of Search Engine Queries

Language Identification(1)

• Implement three models, each using a different existing feature
  1. statistical model
  2. knowledge-based model
  3. morphological model
• EuroParl corpora
• Combine all three models in a machine learning framework using a novel approach
• Add a non-linguistic feature

Page 14: Language Identification of Search Engine Queries

Language Identification(2)

• Test set: 3500 human-annotated queries

Page 15: Language Identification of Search Engine Queries

Statistical model

• Character-based n-gram features (n = 1 to 7)
• Vocabulary from the training corpus (EuroParl)
• Generate a probability distribution from these counts
• This can be done with the SRILM Toolkit, using Kneser-Ney discounting and interpolation
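
To illustrate the idea of a character n-gram language identifier, here is a small self-contained Python sketch. It is not the SRILM setup from the slide: it uses a single fixed order (n = 3 by default) and add-one smoothing in place of interpolated Kneser-Ney, and the class and function names are hypothetical.

```python
import math
from collections import Counter

class CharNgramModel:
    """Fixed-order character n-gram model with add-one smoothing."""

    def __init__(self, n=3):
        self.n = n
        self.ngrams = Counter()     # counts of full n-grams
        self.contexts = Counter()   # counts of the (n-1)-character prefixes
        self.alphabet = set()

    def _pad(self, text):
        # Pad with spaces so that word-initial and word-final grams are counted.
        return " " * (self.n - 1) + text.lower() + " "

    def train(self, texts):
        for text in texts:
            padded = self._pad(text)
            self.alphabet.update(padded)
            for i in range(len(padded) - self.n + 1):
                gram = padded[i:i + self.n]
                self.ngrams[gram] += 1
                self.contexts[gram[:-1]] += 1

    def log_prob(self, text):
        padded = self._pad(text)
        vocab = len(self.alphabet) + 1   # one extra slot for unseen characters
        total = 0.0
        for i in range(len(padded) - self.n + 1):
            gram = padded[i:i + self.n]
            total += math.log((self.ngrams[gram] + 1) / (self.contexts[gram[:-1]] + vocab))
        return total

def identify(text, models):
    """Pick the language whose model gives the text the highest log-probability."""
    return max(models, key=lambda lang: models[lang].log_prob(text))
```

In use, one model is trained per language on that language's corpus, and `identify` is called with a dictionary mapping language codes to trained models.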

Page 16: Language Identification of Search Engine Queries

Knowledge based model

• Word-based n-gram features (n = 1)
• Vocabulary from the training corpus (EuroParl)
• Generate a probability distribution from these counts
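
The word-level counterpart can be sketched in a few lines of Python. Again this is only an illustration under assumptions: whitespace tokenization, add-one smoothing, and the hypothetical names `train_unigram` and `unigram_log_prob` are not taken from the slides.

```python
import math
from collections import Counter

def train_unigram(sentences):
    """Count words in a corpus; returns (word counts, total token count)."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    return counts, sum(counts.values())

def unigram_log_prob(query, model):
    """Add-one smoothed log-probability of a query under a word unigram model."""
    counts, total = model
    vocab = len(counts) + 1   # one extra slot for out-of-vocabulary words
    return sum(math.log((counts[w] + 1) / (total + vocab)) for w in query.lower().split())
```

A query is then assigned to the language whose vocabulary gives it the highest score, which works well when its words are in the dictionary and poorly for rare or misspelled terms.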

Page 17: Language Identification of Search Engine Queries

Morphological model

• Gather the affix information from corpora in an unsupervised fashion (Hammarström, 2006)

• Give a score for each affix

Page 18: Language Identification of Search Engine Queries

Language Identification(3)

• Performance

Page 19: Language Identification of Search Engine Queries

Decision tree classification

• Each model can complement the other in certain cases

• Training data: the automatically annotated data set
• Feature: confidence score
• Use the kurtosis measure

Page 20: Language Identification of Search Engine Queries

Decision tree classification

• An example: for the query "the sovereign individual", the statistical model identifies it as English with k = 7.6 ≥ 4.47 + 1.96 = 6.43, so this query's confidence score is "en-HIGH"
• Implement the DT classifier with the Weka Machine Learning Toolkit (Witten and Frank, 2005)

Page 21: Language Identification of Search Engine Queries

Decision tree classification

• Outperforms all the individual models for each size on average

Page 22: Language Identification of Search Engine Queries

Decision tree classification

Mli,lj : language li misclassified by the system as lj

Page 23: Language Identification of Search Engine Queries

non-linguistic feature

• The non-linguistic feature is the language information of the country

• It helps the search engine in guessing the language

  example: the query "how to tape for plantar fasciits" (labelled as Category-2) is classified as a Portuguese query

Page 24: Language Identification of Search Engine Queries

non-linguistic feature

• Increase test set size to 430 queries

Page 25: Language Identification of Search Engine Queries

Conclusions

• A completely automated method to generate a reliable data set

• Built a decision tree classifier that improves the results on average

• Built a second classifier that takes into account the geographical information of the users

Page 26: Language Identification of Search Engine Queries

Future Work

• Improve the accuracy of data generation
• Examine the parameter values more carefully
• Extend the number of languages in the data set
• Consider other alternatives to the decision tree framework