Search Engines that Learn from Implicit Feedback

Search-engine logs provide a wealth of information that machine-learning techniques can harness to improve search quality. With proper interpretations that avoid inherent biases, a search engine can use training data extracted from the logs to automatically tailor ranking functions to a particular user group or collection.

Thorsten Joachims and Filip Radlinski, Cornell University

Each time a user formulates a query or clicks on a search result, easily observable feedback is provided to the search engine. Unlike surveys or other types of explicit feedback, this implicit feedback is essentially free, reflects the search engine's natural use, and is specific to a particular user and collection. A smart search engine could use this implicit feedback to learn personalized ranking functions—for example, recognizing that the query "SVM" from users at computer science departments most likely refers to the machine-learning method, but for other users typically refers to ServiceMaster's ticker symbol. With a growing and heterogeneous user population, such personalization is crucial for helping search advance beyond a one-ranking-fits-all approach.1

Similarly, a search engine could use implicit feedback to adapt to a specific document collection. In this way, an off-the-shelf search-engine product could learn about the specific collection it's deployed on—for example, learning that employees who search their company intranet for "travel reimbursement" are looking for an expense-report form even if the form doesn't contain the word "reimbursement." Sequences of query reformulations could provide the feedback for this learning task. Specifically, if a significant fraction of employees searching for "travel reimbursement" reformulate the query, and eventually click on the expense-report form, the search engine could learn to include the form in the results for the initial query.2

Most large Internet search engines now record queries and clicks. But while it seems intuitive that implicit feedback can provide the information for personalization and domain adaptation, it isn't clear how a search engine can operationalize this information. Clearly, implicit feedback is noisy and biased, making simple learning strategies doomed to failure.

We show how, through proper interpretation and experiment design, implicit feedback can provide cheap and accurate training data in the form of pairwise preferences. We provide a machine-learning algorithm that can use these preferences, and demonstrate how to integrate everything in an operational search engine that learns.

INTERPRETING IMPLICIT FEEDBACK

Consider the example search shown in Figure 1. The user issued the query "Jaguar" and received a ranked list of documents in return. What does the user clicking on the first, third, and fifth link tell us about the user's preferences, the individual documents in the ranking, and the query's overall success?

User behavior

To answer these questions, we need to understand how users interact with a search engine (click) and how this relates to their preferences. For example, how significant is it that the user clicked on the top-ranked document? Does this tell us that it was relevant to the user's query?

Viewing results. Unfortunately, a click doesn't necessarily indicate that a result was relevant, since the way search engines present results heavily biases a user's behavior. To understand this bias, we must understand the user's decision process.

First, which results did the user look at before clicking? Figure 2 shows the percentage of queries for which users look at the search result at a particular rank before making the first click. We collected the data with an eye-tracking device in a controlled user study in which we could tell whether and when a user read a particular result.3 The graph shows that for all results below the third rank, users didn't even look at the result for more than half of the queries. So, on many queries even an excellent result at position 5 would go unnoticed (and hence unclicked). Overall, most users tend to evaluate only a few results before clicking, and they are much more likely to observe higher-ranked results.3

Influence of rank. Even once a result is read, its rank still influences the user's decision to click on it. The blue bars in Figure 3 show the percentage of queries for which the user clicked on a result at a particular rank. Not surprisingly, users most frequently clicked the top-ranked result (about 40 percent of the time), with the frequency of clicks decreasing along with the rank. To some extent, the eye-tracking results explain this, since users can't click on results they haven't viewed. Furthermore, the top-ranked result could simply be the most relevant result for most queries. Unfortunately, this isn't the full story.

The red bars in Figure 3 show the frequency of clicks after—unknown to the user—we swapped the top two results. In this swapped condition, the second result gained the top position in the presented ranking and received vastly more clicks than the first result demoted to second rank. It appears that the top position lends credibility to a result, strongly influencing user behavior beyond the information contained in the abstract. Note that the eye-tracking data in Figure 2 shows that ranks one and two are viewed almost equally often, so not having viewed the second-ranked result can't explain this effect.

Presentation bias. More generally, we found that the way the search engine presents results to the user has a strong influence on how users act.4 We call the combination of these factors the presentation bias. If users are so heavily biased, how could we possibly derive useful preference information from their actions? The following two strategies can help extract meaningful feedback from clicks in spite of the presentation bias and aid in reliably inferring preferences.

Absolute versus relative feedback

What can we reliably infer from the user's clicks? Due to the presentation bias, it's not safe to conclude that a click indicates relevance of the clicked result. In fact, we find that users click frequently even if we severely degrade the quality of the search results (for example, by presenting the top 10 results in reverse order).3

Relative preference. More informative than what users clicked on is what they didn't click on. For instance, in the example in Figure 1, the user decided not to click on the second result.

Figure 1. Ranking for "Jaguar" query. The results a user clicks on are marked with an asterisk.

*1. The Belize Zoo; http://belizezoo.org
2. Jaguar–The British Metal Band; http://jaguar-online.com
*3. Save the Jaguar; http://savethejaguar.com
4. Jaguar UK–Jaguar Cars; http://jaguar.co.uk
*5. Jaguar–Wikipedia; http://en.wikipedia.org/wiki/Jaguar
6. Schrödinger (Jaguar quantum chemistry package); http://www.schrodinger.com
7. Apple–Mac OS X Leopard; http://apple.com/macosx

Figure 2. Rank and viewership. Percentage of queries where a user viewed the search result presented at a particular rank. (Axes: result rank 1-10 versus percentage of queries, 0-100.)

Figure 3. Swapped results. Percentage of queries where a user clicked the result presented at a given rank, both in the normal and swapped conditions. (Axes: result rank 1-10 versus percentage of queries, 0-100.)


Since we found in our eye-tracking study that users tend to read the results from top to bottom, we can assume that the user saw the second result when clicking on result three. This means the user decided between clicking on the second and third results, and we can interpret the click as a relative preference ("the user prefers the third result over the second result for this query").

This preference opposes the presentation bias of clicking on higher-ranked links, indicating that the user made a deliberate choice not to click on the higher-ranked result. We can extract similar relative preferences from the user's decision to click on the fifth result in our example, namely that users prefer the fifth result over the second and fourth results.

The general insight here is that we must evaluate user actions in comparison to the alternatives that the user observed before making a decision (for example, the top k results when clicking on result k), and relative to external biases (for example, only counting preferences that go against the presented ordering). This naturally leads to feedback in the form of pairwise relative preferences like "A is better than B." In contrast, if we took a click to be an absolute statement like "A is good," we'd face the difficult task of explicitly correcting for the biases.
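To make the extraction strategy concrete, here is a minimal sketch of turning one result page and its clicks into such pairwise preferences. It is our own illustration of the "clicked beats skipped-above" idea described here, not code from the study, and the function and variable names are ours.

```python
def pairwise_preferences(ranking, clicked_ranks):
    """Turn clicks on one result page into 'clicked > skipped-above' pairs.

    ranking: document ids in presented order (rank 1 first).
    clicked_ranks: 1-based ranks the user clicked on, e.g. [1, 3, 5].
    Returns (preferred_doc, less_preferred_doc) pairs, counting only
    preferences that go against the presented ordering.
    """
    clicked = set(clicked_ranks)
    prefs = []
    for c in sorted(clicked):
        # Every result presented above the click was seen but skipped.
        for skipped in range(1, c):
            if skipped not in clicked:
                prefs.append((ranking[c - 1], ranking[skipped - 1]))
    return prefs

# For the "Jaguar" ranking in Figure 1 (clicks on ranks 1, 3, and 5) this
# yields: result 3 > result 2, result 5 > result 2, and result 5 > result 4.
```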

Pairwise preferences. In a controlled user study, we found that pairwise relative preferences extracted from clicks are quite accurate.3 About 80 percent of the pairwise preferences agreed with expert-labeled data, where we asked human judges to rank the results a query returned by relevance to the query.

This is particularly promising, since two human judges only agree with each other about 86 percent of the time. It means that the preferences extracted from clicks aren't much less accurate than manually labeled data. However, we can collect them at essentially no cost and in much larger quantities. Furthermore, the preferences from clicks directly reflect the actual users' preferences, instead of the judges' guesses at the users' preferences.

Interactive experimentation

So far, we've assumed that the search engine passively collects feedback from log files. By passively, we mean that the search engine selects the results to present to users without regard for the training data that it might collect from user clicks. However, the search engine has complete control over which results to present and how to present them, so it can run interactive experiments. In contrast to passively observed data, active experiments can detect causal relationships, eliminate the effect of presentation bias, and optimize the value of the implicit feedback that's collected.4,5

Paired blind experiment. To illustrate the power of online experiments, consider the simple problem of learning whether a particular user prefers Google or Yahoo! rankings. We formulate this inference problem as a paired blind experiment.6 Whenever the user types in a query, we retrieve the rankings for both Google and Yahoo!. Instead of showing either of these rankings separately, we combine the rankings into a single ranking that we then present to the user.

Specifically, the Google and Yahoo! rankings are interleaved into a combined ranking so a user reading from top to bottom will have seen the same number of top links from Google and Yahoo! (plus or minus one) at any point in time. Figure 4 illustrates this interleaving technique. It's easy to show that such an interleaved ranking always exists, even when the two rankings share results.6 Finally, we make sure that the user can't tell which results came from Google and which from Yahoo! and that we present both with equally informative abstracts.

Figure 4. Blind test for user preference. (a) Original rankings, and (b) the interleaved ranking presented to the user. Clicks (marked with *) in the interleaved ranking provide unbiased feedback on which ranking the user prefers.

(a) Ranking A (hidden from user)
*1. Kernel Machines; http://kernel-machines.org
*2. SVM-Light Support Vector Machine; http://svmlight.joachims.org
*3. Support Vector Machines – The Book; http://support-vector.net
4. svm: SVM Lovers; http://groups.yahoo.com/group/svm
5. LSU School of Veterinary Medicine; http://www.vetmed.lsu.edu

(a) Ranking B (hidden from user)
*1. Kernel Machines; http://kernel-machines.org
2. ServiceMaster; http://servicemaster.com
3. School of Volunteer Management; http://svm.net.au
4. Homepage des SV Union Meppen; http://sv-union-meppen.de
5. SVM-Light Support Vector Machine; http://svmlight.joachims.org

(b) Interleaved ranking of A and B (presented to the user)
*1. Kernel Machines; http://kernel-machines.org
2. ServiceMaster; http://servicemaster.com
*3. SVM-Light Support Vector Machine; http://svmlight.joachims.org
4. School of Volunteer Management; http://svm.net.au
*5. Support Vector Machines – The Book; http://support-vector.net
6. Homepage des SV Union Meppen; http://sv-union-meppen.de
7. svm: SVM Lovers; http://groups.yahoo.com/group/svm
8. LSU School of Veterinary Medicine; http://www.vetmed.lsu.edu
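The construction sketched in Figure 4 can be written down in a few lines. The sketch below is our paraphrase of the balancing idea (the same number of top links from each ranking, plus or minus one), with a random choice of which ranking contributes first; it is not the production code used in the experiments.

```python
import random

def interleave(ranking_a, ranking_b):
    """Combine two rankings so that any prefix of the result contains the
    same number of top results from A and from B, plus or minus one."""
    a_first = random.random() < 0.5      # coin flip: which ranking contributes first
    combined, ka, kb = [], 0, 0
    while ka < len(ranking_a) and kb < len(ranking_b):
        take_a = ka < kb or (ka == kb and a_first)
        doc = ranking_a[ka] if take_a else ranking_b[kb]
        if take_a:
            ka += 1
        else:
            kb += 1
        if doc not in combined:          # the two rankings may share results
            combined.append(doc)
    return combined
```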

We can now observe how the user clicks on the combined ranking. Due to the interleaving, a user with no preference for Yahoo! or Google will have equal probability of clicking on a link that came from the top of the Yahoo! ranking or the top of the Google ranking. This holds independently of whatever influence the presented rank has on the user's clicking behavior or however many results the user might consider. If we see that the user clicks significantly more frequently on results from one of the search engines, we can conclude that the user prefers the results from that search engine in this direct comparison.

For the example in Figure 4, we can assume that the user viewed results 1 to 5, since there's a click on result 5. This means the user saw the top three results from ranking A as well as from ranking B. The user decided not to click on two of the results from B, but did click on all results from A, which indicates that the user prefers A.
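Deciding the winner from the clicks can likewise be sketched. Assuming, as in the text, that everything above the lowest click was viewed, the function below credits each click to the ranking(s) whose top results placed the clicked document in the viewed prefix. This is our simplified reading of the evaluation, not the exact statistical test used in the study.

```python
def top_depth(ranking, viewed):
    """How many of the ranking's top results appear in the viewed prefix."""
    depth = 0
    for doc in ranking:
        if doc not in viewed:
            break
        depth += 1
    return depth

def interleaving_winner(ranking_a, ranking_b, combined, clicked_docs):
    """Return 'A', 'B', or 'tie' based on clicks in the combined ranking."""
    if not clicked_docs:
        return None                       # abandonment: no evidence either way
    lowest = max(combined.index(doc) for doc in clicked_docs)
    viewed = set(combined[: lowest + 1])  # assume results above the lowest click were seen
    k_a = top_depth(ranking_a, viewed)
    k_b = top_depth(ranking_b, viewed)
    clicks_a = len(set(clicked_docs) & set(ranking_a[:k_a]))
    clicks_b = len(set(clicked_docs) & set(ranking_b[:k_b]))
    if clicks_a == clicks_b:
        return "tie"
    return "A" if clicks_a > clicks_b else "B"

# In the Figure 4 example (clicks on interleaved positions 1, 3, and 5), three
# of the clicked documents come from the top of A and only one from the top
# of B, so the comparison favors A.
```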

Optimizing data quality. We can also use interactive experiments to improve the value of the feedback that's received. The eye-tracking study showed that we can expect clicks only for the top few results, and that the search engine will probably receive almost no feedback about any result ranked below 100. But since the search engine controls the ranking that's presented, it could mix things up.

While not presenting the current "best-guess" ranking for the query might reduce retrieval quality in the short run, improved training data through exploration can make the search engine learn faster and improve more in the long run. An active learning algorithm for this problem maintains a model of uncertainty about the relevance of each document for a query, which then allows the search engine to determine which documents would benefit most from receiving feedback.5 We can then insert these selected documents into the top of the presented ranking, where they will likely receive feedback.

Beyond counting clicks

So far, the only implicit feedback we've considered is whether the user clicked on each search result. But there's much beyond clicks that a search engine can easily observe, as Diane Kelly and Jaime Teevan note in their review of existing studies.7 Search engines can use reading times, for example, to further differentiate clicks.8 A click on a result that's shortly followed by another click probably means that the user quickly realized that the first result wasn't relevant.

Abandonment. An interesting indicator of user dissatisfaction is "abandonment," describing the user's decision to not click on any of the results. Abandonment is always a possible action for users, and thus is in line with our relative feedback model. We just need to include "reformulate query" (and "give up") as possible alternatives for user actions. Again, the decision to not click on any results indicates that abandoning the results was the most promising option. Noticing which queries users abandoned is particularly informative when the user immediately follows the abandoned query with another one.

Query chains. In one of our studies of Web-search behavior, we found that on average users issued 2.2 queries per search session.3 Such a sequence of queries, which we call a query chain, often involves users adding or removing query terms, or reformulating the query as a whole. Query chains are a good resource for learning how users formulate their information need, since later queries often resolve ambiguities of earlier queries in the chain.9

For example, in one of our studies we frequently observed that users searching the Cornell Library Web pages ran the query "oed" followed by the query "Oxford English Dictionary." After running the second query, the users often clicked on a particular result that the first query didn't retrieve.2

The later actions in this query chain (the reformulation followed by the click) could explain what the user initially intended with the query "oed." In particular, we can infer the preference that, given the query "oed," the user would have liked to see the clicked result returned for the query "Oxford English Dictionary." If we frequently see the same query chain (or more precisely, the resulting preference statement), a search engine can learn to associate pages with queries even if they don't contain any of the query words.
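A sketch of how such cross-query preferences could be generated from a logged query chain follows. It is our own simplified illustration (in a real system one would restrict the comparison to results the user actually saw), and the names are hypothetical.

```python
def query_chain_preferences(chain, results_by_query, clicked_doc):
    """Infer that the document clicked after the final query in a chain
    should also have been returned, and preferred, for the earlier queries.

    chain: queries in the order issued, e.g. ["oed", "Oxford English Dictionary"]
    results_by_query: maps each query to the ranking it returned
    clicked_doc: the document clicked after the final query
    Returns (query, preferred_doc, less_preferred_doc) triples.
    """
    prefs = []
    for earlier_query in chain[:-1]:
        for shown in results_by_query[earlier_query]:
            if shown != clicked_doc:
                prefs.append((earlier_query, clicked_doc, shown))
    return prefs
```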

LEARNING FROM PAIRWISE PREFERENCES

User actions are best interpreted as a choice among available and observed options, leading to relative preference statements like "for query q, user u prefers da over db." Most machine-learning algorithms, however, expect absolute training data—for example, "da is relevant," "da isn't relevant," or "da scores 4 on a 5-point relevance scale."

Ranking SVM

How can we use pairwise preferences as training data in machine-learning algorithms to learn an improved ranking? One option is to translate the learning problem into a binary classification problem.10 Each pairwise preference would create two examples for a binary classification problem, namely a positive example (q, u, da, db) and a negative example (q, u, db, da). However, assembling the binary predictions of the learned rule at query time is an NP-hard problem, meaning it's likely too slow to use in a large-scale search engine.

Utility function. We'll therefore follow a different route and use a method that requires only a single sort operation when ranking results for a new query. Instead of learning a pairwise classifier, we directly learn a function h(q, u, d) that assigns a real-valued utility score to each document d for a given query q and user u. Once the algorithm learns a particular function h, for any new query q′ the search engine simply sorts the documents by decreasing utility.

We address the problem of learning a utility function h from a given set of pairwise preference statements in the context of support vector machines (SVMs).11 Our ranking SVM6 extends ordinal regression SVMs12 to multiquery utility functions. The basic idea is that whenever we have a preference statement "for query q, user u prefers da over db," we interpret this as a statement about the respective utility values—namely that for user u and query q the utility of da is higher than the utility of db. Formally, we can interpret this as a constraint on the utility function h(q, u, d) that we want to learn: h(q, u, da) > h(q, u, db).

Given a space of utility functions H, each pairwise preference statement potentially narrows down the subset of utility functions consistent with the user preferences. In particular, if our utility function is linear in the parameters w for a given feature vector φ(q, u, d) describing the match between q, u, and d, we can write h(q, u, d) = w · φ(q, u, d).

Finding the function h (that is, the parameter vector w) that's consistent with all training preferences P = {(q1, u1, d1a, d1b), …, (qn, un, dna, dnb)} is simply a matter of solving a system of linear constraints. However, it's probably too much to ask for a perfectly consistent utility function. Due to the noise inherent in click data, the linear system is probably inconsistent.

Training method. Ranking SVMs aim to find a parameter vector w that fulfills most preferences (has low training error), while regularizing the solution with the squared norm of the weight vector to avoid overfitting. Specifically, a ranking SVM computes the solution to the following convex quadratic optimization problem:

minimize:   V(w, ξ) = (1/2) w · w + (C/n) Σi ξi

subject to: w · φ(q1, u1, d1a) ≥ w · φ(q1, u1, d1b) + 1 − ξ1
            ...
            w · φ(qn, un, dna) ≥ w · φ(qn, un, dnb) + 1 − ξn
            ∀i: ξi ≥ 0

We can solve this type of optimization problem in time linear in n for a fixed precision (http://svmlight.joachims.org/svm_perf.html), meaning it's practical to solve on a desktop computer given millions of preference statements. Note that each unsatisfied constraint incurs a penalty ξi, so the term Σi ξi in the objective is an upper bound on the number of violated training preferences. The parameter C controls overfitting as in a binary classification SVM.
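Although the experiments used the authors' SVM-light software, the same optimization can be approximated with standard tools: substituting the difference vectors φ(qi, ui, dia) − φ(qi, ui, dib) turns each ranking constraint into a linear classification constraint. The sketch below uses scikit-learn, is equivalent only up to how C is scaled by n, and is meant as a starting point for experimentation rather than a reference implementation.

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_ranking_svm(preference_pairs, C=0.01):
    """Learn a weight vector w from pairwise preferences.

    preference_pairs: list of (phi_a, phi_b) feature vectors, where the user
    preferred document a over document b for the same query and user.
    """
    diffs = np.array([np.asarray(a) - np.asarray(b) for a, b in preference_pairs])
    X = np.vstack([diffs, -diffs])                # mirror each pair to get two classes
    y = np.hstack([np.ones(len(diffs)), -np.ones(len(diffs))])
    model = LinearSVC(loss="hinge", fit_intercept=False, C=C)
    model.fit(X, y)
    return model.coef_.ravel()                    # the learned parameter vector w

def rank_candidates(candidate_features, w):
    """Rank candidates by decreasing learned utility h(q, u, d) = w . phi(q, u, d)."""
    return sorted(candidate_features, key=lambda phi: -np.dot(w, phi))
```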

A LEARNING METASEARCH ENGINE

To see if the implicit feedback interpretations and ranking SVM could actually learn an improved ranking function, we implemented a metasearch engine called Striver.6 When a user issued a query, Striver forwarded the query to Google, MSN Search (now called Windows Live), Excite, AltaVista, and HotBot. We analyzed the results these search engines returned and extracted the top 100 documents. The union of all these results composed the candidate set K. Striver then ranked the documents in K according to its learned utility function h* and presented them to the user. For each document, the system displayed the title of the page along with its URL and recorded user clicks on the results.

Striver experiment

To learn a retrieval function using a ranking SVM, it was necessary to design a suitable feature mapping φ(q, u, d) that described the match between a query q and a document d for user u. Figure 5 shows the features used in the experiment.

Figure 5. Striver metasearch engine features. Striver examined rank in other search engines, query/content matches, and popularity attributes.

1. Rank in other search engines (38 features total):
   rank_X: 100 minus rank in X ∈ {Google, MSN Search, AltaVista, HotBot, Excite}, divided by 100 (minimum 0)
   top1_X: ranked #1 in X ∈ {Google, MSN Search, AltaVista, HotBot, Excite} (binary {0, 1})
   top10_X: ranked in top 10 in X ∈ {Google, MSN Search, AltaVista, HotBot, Excite} (binary {0, 1})
   top50_X: ranked in top 50 in X ∈ {Google, MSN Search, AltaVista, HotBot, Excite} (binary {0, 1})
   top1count_X: ranked #1 in X of the five search engines
   top10count_X: ranked in top 10 in X of the five search engines
   top50count_X: ranked in top 50 in X of the five search engines

2. Query/content match (three features total):
   query_url_cosine: cosine between URL words and query (range [0, 1])
   query_abstract_cosine: cosine between title words and query (range [0, 1])
   domain_name_in_query: query contains domain name from URL (binary {0, 1})

3. Popularity attributes (~20,000 features total):
   url_length: length of URL in characters divided by 30
   country_X: country code X of URL (binary attribute {0, 1} for each country code)
   domain_X: domain X of URL (binary attribute {0, 1} for each domain name)
   abstract_contains_home: word "home" appears in URL or title (binary attribute {0, 1})
   url_contains_tilde: URL contains "~" (binary attribute {0, 1})
   url_X: URL X as an atom (binary attribute {0, 1})
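To give a flavor of what computing such a feature mapping looks like, here is a toy version covering a handful of the feature types in Figure 5. The helper functions and field names (doc["title"], doc["url"]) are our own assumptions, and a real implementation would map the named features onto fixed vector positions.

```python
import math
from collections import Counter

def cosine(text_a, text_b):
    """Cosine similarity between two bags of words (a stand-in for the
    query/abstract and query/URL cosine features)."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def phi(query, doc, engine_ranks):
    """Toy feature map phi(q, u, d).

    engine_ranks: maps an engine name to the document's rank in that engine,
    or None if the engine did not return it, e.g. {"google": 3, "excite": None}.
    """
    features = {}
    for engine, rank in engine_ranks.items():
        features["rank_" + engine] = max(0, 100 - rank) / 100 if rank else 0.0
        features["top10_" + engine] = 1.0 if rank and rank <= 10 else 0.0
    features["query_abstract_cosine"] = cosine(query, doc["title"])
    features["query_url_cosine"] = cosine(query, doc["url"].replace("/", " ").replace(".", " "))
    features["url_length"] = len(doc["url"]) / 30
    return features
```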

To see whether the learned retrieval function improved retrieval, we made the Striver search engine available to about 20 researchers and students in the University of Dortmund's artificial intelligence unit, headed by Katharina Morik. We asked group members to use Striver just like any other Web search engine.

After the system collected 260 training queries with at least one click, we extracted pairwise preferences and trained the ranking SVM on these queries. Striver then used the learned function for ranking the candidate set K. During an approximately two-week evaluation, we compared the learned retrieval function against Google, MSN Search, and a nonlearning metasearch engine using the interleaving experiment. In all three comparisons, the users significantly preferred the learned ranking over the three baseline rankings,6 showing that the learned function improved retrieval.

Learned weights. But what does the learned function look like? Since the ranking SVM learns a linear function, we can analyze the function by studying the learned weights. Table 1 displays the features that received the most positive and most negative weights. Roughly speaking, a high positive (or negative) weight indicates that documents with these features should be higher (or lower) in the ranking.

Table 1. Features with largest and smallest weights as learned by the ranking SVM.

Weight    Feature
 0.60     query_abstract_cosine
 0.48     top10_google
 0.24     query_url_cosine
 0.24     top1count_1
 0.24     top10_msnsearch
 0.22     host_citeseer
 0.21     domain_nec
 0.19     top10_count_3
 0.17     top1_google
 0.17     country_de
  ...
-0.13     domain_tu-bs
-0.15     country
-0.16     top50count_4
-0.17     url_length
-0.32     top10count_0
-0.38     top1count_0

The weights in Table 1 reflect the group of users in an interesting and plausible way. Since many queries were for scientific material, it was natural that URLs from the domain "citeseer" (and the alias "nec") would receive positive weight. Note also the high positive weight for ".de," the country-code domain name for Germany. The most influential weights are for the cosine match between query and abstract, whether the URL is in the top 10 from Google, and for the cosine match between query and the words in the URL. A document receives large negative weights if no search engine ranks it number 1, if it's not in the top 10 of any search engine (note that the second implies the first), and if the URL is long. Overall, these weights nicely matched our intuition about these German computer scientists' preferences.
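Producing a table like Table 1 from a learned linear model is a single sort. A minimal sketch, with our own helper name, assuming the weight vector and a parallel list of feature names are available:

```python
def extreme_weights(w, feature_names, k=10):
    """Return the k most positive and k most negative (weight, feature) pairs."""
    ranked = sorted(zip(w, feature_names), reverse=True)
    return ranked[:k], ranked[-k:]
```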

Osmot experiment

To show that in addition to personalizing to a group of users, implicit feedback also can adapt the retrieval function to a particular document collection, we implemented Osmot (www.cs.cornell.edu/~filip/osmot), an engine that searches the Cornell University Library Web pages. This time we used subjects who were unaware that the search engine was learning from their actions. The search engine was installed on the Cornell Library homepage, and we recorded queries and clicks over several months.2

We then trained a ranking SVM on the inferred preferences from within individual queries and across query chains. Again, the search engine's performance improved significantly with learning, showing that the search engine could adapt its ranking function to this collection.

Most interestingly, the preferences from query chains allowed the search engine to learn associations between queries and particular documents, even if the documents didn't contain the query words. Many of the learned associations reflected common acronyms and misspellings unique to this document corpus and user population. For example, the search engine learned to associate the acronym "oed" with the gateway to dictionaries and encyclopedias, and the misspelled name "lexus" with the Lexis-Nexis library resource. These findings demonstrated the usefulness of pairwise preferences derived from query reformulations.


Taken together, these two experiments show how using implicit feedback and machine learning can produce highly specialized search engines. While biases make implicit feedback data difficult to interpret, techniques are available for avoiding these biases, and the resulting pairwise preference statements can be used for effective learning. However, much remains to be done, ranging from addressing privacy issues and the effect of new forms of spam, to the design of interactive experiments and active learning methods.

Acknowledgments

This work was supported by NSF Career Award No. 0237381, a Microsoft PhD student fellowship, and a gift from Google.

References

1. J. Teevan, S.T. Dumais, and E. Horvitz, "Characterizing the Value of Personalizing Search," to be published in Proc. ACM SIGIR Conf. Research and Development in Information Retrieval (SIGIR 07), ACM Press, 2007.

2. F. Radlinski and T. Joachims, "Query Chains: Learning to Rank from Implicit Feedback," Proc. ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD 05), ACM Press, 2005, pp. 239-248.

3. T. Joachims et al., "Evaluating the Accuracy of Implicit Feedback from Clicks and Query Reformulations in Web Search," ACM Trans. Information Systems, vol. 25, no. 2, article 7, 2007.

4. F. Radlinski and T. Joachims, "Minimally Invasive Randomization for Collecting Unbiased Preferences from Clickthrough Logs," Proc. Nat'l Conf. Am. Assoc. for Artificial Intelligence (AAAI 06), AAAI, 2006, pp. 1406-1412.

5. F. Radlinski and T. Joachims, "Active Exploration for Learning Rankings from Clickthrough Data," to be published in Proc. ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD 07), ACM Press, 2007.

6. T. Joachims, "Optimizing Search Engines Using Clickthrough Data," Proc. ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD 02), ACM Press, 2002, pp. 132-142.

7. D. Kelly and J. Teevan, "Implicit Feedback for Inferring User Preference: A Bibliography," ACM SIGIR Forum, vol. 37, no. 2, 2003, pp. 18-28.

8. E. Agichtein, E. Brill, and S. Dumais, "Improving Web Search Ranking by Incorporating User Behavior," Proc. ACM SIGIR Conf. Research and Development in Information Retrieval (SIGIR 06), ACM Press, 2006, pp. 19-26.

9. G. Furnas, "Experience with an Adaptive Indexing Scheme," Proc. ACM SIGCHI Conf. Human Factors in Computing Systems (CHI 85), ACM Press, 1985, pp. 131-135.

10. W.W. Cohen, R.E. Schapire, and Y. Singer, "Learning to Order Things," J. Artificial Intelligence Research, vol. 10, AI Access Foundation, Jan.-June 1999, pp. 243-270.

11. V. Vapnik, Statistical Learning Theory, John Wiley & Sons, 1998.

12. R. Herbrich, T. Graepel, and K. Obermayer, "Large-Margin Rank Boundaries for Ordinal Regression," Advances in Large-Margin Classifiers, P. Bartlett et al., eds., MIT Press, 2000, pp. 115-132.

Thorsten Joachims is an associate professor in the Department of Computer Science at Cornell University. His research interests focus on machine learning, especially with applications in information access. He received a PhD in computer science from the University of Dortmund, Germany. Contact him at [email protected].

Filip Radlinski is a PhD student in the Department of Computer Science at Cornell University. His research interests include machine learning and information retrieval with a particular focus on implicitly collected user data. He is a Fulbright scholar from Australia and the recipient of a Microsoft Research Fellowship. Contact him at [email protected].
