Learning To Rank: From Pairwise Approach to Listwise Approach
Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li
Hasan Hüseyin Topcu
Learning To Rank
Outline
- Related Work
- Learning System
- Learning to Rank
- Pairwise vs. Listwise Approach
- Experiments
- Conclusion
Related Work
- Pairwise Approach: the learning task is formalized as classification of object pairs into two categories (correctly ranked and incorrectly ranked)
- Classification methods:
- Ranking SVM (Herbrich et al., 1999); Joachims (2002) applied Ranking SVM to Information Retrieval
- RankBoost (Freund et al., 1998)
- RankNet (Burges et al., 2005)
Learning System
- Training Data, Data Preprocessing, ...
- How are objects identified?
- How are instances modeled?
- SVM, ANN, Boosting
- Evaluate with test data
Adapted from Pattern Classification (Duda, Hart, Stork)
Learning to Rank
- A number of queries are provided
- Each query is associated with a perfect ranking list of documents (ground truth)
- A ranking function is created from the training data such that the model can precisely predict the ranking lists
- Learning tries to optimize a loss function. Note that a loss function for ranking is slightly different, in the sense that it makes use of sorting.
Training Process
Data Labeling
- Explicit human judgment (Perfect, Excellent, Good, Fair, Bad)
- Implicit relevance judgment: derived from click data (search log data)
- Ordered pairs between documents (A > B)
- List of judgments (scores)
Features
Pairwise Approach
- Training data instances are document pairs
Pairwise Approach
- Collects document pairs from the ranking list and assigns a label to each pair
- The label is +1 if score of A > score of B, and -1 if score of A < score of B
- Formalizes the problem of learning to rank as binary classification
- Examples: Ranking SVM, RankBoost and RankNet
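As an illustration, pair generation from one query's labeled list might look like the following sketch (toy data; the feature representation is hypothetical):

```python
from itertools import combinations

def make_pairs(features, scores):
    """Turn one ranked list into binary classification instances.

    For each document pair (j, k) with different relevance scores,
    emit the two feature vectors with label +1 if document j is more
    relevant than document k, and -1 otherwise.
    """
    pairs = []
    for j, k in combinations(range(len(features)), 2):
        if scores[j] == scores[k]:
            continue  # ties carry no preference information
        label = 1 if scores[j] > scores[k] else -1
        pairs.append((features[j], features[k], label))
    return pairs

# Toy query with three documents A, B, C and relevance scores 3 > 2 > 1
print(make_pairs(["xA", "xB", "xC"], [3, 2, 1]))
# Three pairs, all labeled +1: (A,B), (A,C), (B,C)
```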
Pairwise Approach Drawbacks
- The objective of learning is formalized as minimizing errors in the classification of document pairs, rather than minimizing errors in the ranking of documents.
- The training process is computationally costly, as the number of document pairs is very large.
Pairwise Approach Drawbacks
- Document pairs across different grades (labels) are treated equally (Ex. 1)
- The number of generated document pairs varies largely from query to query, which results in a model biased toward queries with more document pairs (Ex. 2)
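Both drawbacks follow from the quadratic growth in pairs: a query with n retrieved documents yields up to n(n-1)/2 pairs, so list sizes that differ by 10x differ by roughly 100x in pair counts. A quick illustration with hypothetical list sizes:

```python
def pair_count(n_docs):
    # maximum number of document pairs generated from one query's list
    return n_docs * (n_docs - 1) // 2

# Hypothetical queries with very different result-list sizes
for n in (10, 100, 1000):
    print(n, pair_count(n))
# 10 -> 45, 100 -> 4950, 1000 -> 499500
```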
Listwise Approach
- Training data instances are document lists
- The objective of learning is formalized as minimization of the total losses with respect to the training data
- The listwise loss function uses probability models: Permutation Probability and Top One Probability
Learning to Rank: From Pairwise Approach to Listwise Approach (excerpt)

We describe the listwise approach in detail. In the following descriptions, we use a superscript to indicate the index of queries and a subscript to indicate the index of documents for a specific query.

In training, a set of queries Q = {q^(1), q^(2), ..., q^(m)} is given. Each query q^(i) is associated with a list of documents d^(i) = (d^(i)_1, d^(i)_2, ..., d^(i)_{n^(i)}), where d^(i)_j denotes the j-th document and n^(i) denotes the size of d^(i). Furthermore, each list of documents d^(i) is associated with a list of judgments (scores) y^(i) = (y^(i)_1, y^(i)_2, ..., y^(i)_{n^(i)}), where y^(i)_j denotes the judgment on document d^(i)_j with respect to query q^(i). The judgment y^(i)_j represents the relevance degree of d^(i)_j to q^(i), and can be a score explicitly or implicitly given by humans. For example, y^(i)_j can be the number of clicks on d^(i)_j when d^(i)_j is retrieved and returned for query q^(i) at a search engine (Joachims, 2002). The assumption is that the higher the click-through rate observed for d^(i)_j and q^(i), the stronger the relevance between them.

A feature vector x^(i)_j is created from each query-document pair (q^(i), d^(i)_j), i = 1, 2, ..., m; j = 1, 2, ..., n^(i). Each list of features x^(i) = (x^(i)_1, ..., x^(i)_{n^(i)}) and the corresponding list of scores y^(i) = (y^(i)_1, ..., y^(i)_{n^(i)}) then form an 'instance'. The training set can be denoted as T = {(x^(i), y^(i))}_{i=1}^{m}.

We then create a ranking function f; for each feature vector x^(i)_j (corresponding to document d^(i)_j) it outputs a score f(x^(i)_j). For the list of feature vectors x^(i) we obtain a list of scores z^(i) = (f(x^(i)_1), ..., f(x^(i)_{n^(i)})). The objective of learning is formalized as minimization of the total losses with respect to the training data:

    \sum_{i=1}^{m} L(y^{(i)}, z^{(i)})    (1)

where L is a listwise loss function.

In ranking, when a new query q^(i') and its associated documents d^(i') are given, we construct feature vectors x^(i') from them and use the trained ranking function to assign scores to the documents d^(i'). Finally, we rank the documents d^(i') in descending order of the scores. We call the learning problem described above the listwise approach to learning to rank.

By contrast, in the pairwise approach, a new training data set T' is created from T, in which each feature vector pair x^(i)_j and x^(i)_k, j != k, forms a new instance, and +1 is assigned to the pair if y^(i)_j is larger than y^(i)_k, and -1 otherwise. It turns out that T' is a data set for binary classification, and a classification model such as SVM can be created. As explained in Section 1, although the pairwise approach has advantages, it also suffers from drawbacks. The listwise approach can naturally deal with these problems, as will be made clearer in Section 6.

4. Probability Models

We propose using probability models to calculate the listwise loss function in Eq. (1). Specifically, we map a list of scores to a probability distribution using one of the probability models described in this section, and then take any metric between two probability distributions as the loss function. The two models are referred to as permutation probability and top one probability.

4.1. Permutation Probability

Suppose that the set of objects to be ranked is identified by the numbers 1, 2, ..., n. A permutation \pi on the objects is defined as a bijection from {1, 2, ..., n} to itself. We write the permutation as \pi = \langle \pi(1), \pi(2), ..., \pi(n) \rangle. Here, \pi(j) denotes the object at position j in the permutation. The set of all possible permutations of n objects is denoted \Omega_n.

Suppose that there is a ranking function which assigns scores to the n objects. We use s to denote the list of scores s = (s_1, s_2, ..., s_n), where s_j is the score of the j-th object. Hereafter we sometimes use the ranking function and the list of scores it gives interchangeably.

We assume that there is uncertainty in the prediction of ranking lists (permutations) using the ranking function. In other words, any permutation is assumed to be possible, but different permutations may have different likelihoods calculated based on the ranking function. We define the permutation probability so that it has desirable properties for representing the likelihood of a permutation (ranking list), given the ranking function.

Definition 1. Suppose that \pi is a permutation on the n objects, and \phi(\cdot) is an increasing and strictly positive function. Then the probability of permutation \pi given the list of scores s is defined as

    P_s(\pi) = \prod_{j=1}^{n} \frac{\phi(s_{\pi(j)})}{\sum_{k=j}^{n} \phi(s_{\pi(k)})}

where s_{\pi(j)} is the score of the object at position j of permutation \pi.

Let us consider an example with three objects {1, 2, 3} having scores s = (s_1, s_2, s_3). The probabilities of permutations \pi = \langle 1, 2, 3 \rangle and \pi' = \langle 3, 2, 1 \rangle are calculated as follows:

    P_s(\pi) = \frac{\phi(s_1)}{\phi(s_1) + \phi(s_2) + \phi(s_3)} \cdot \frac{\phi(s_2)}{\phi(s_2) + \phi(s_3)} \cdot \frac{\phi(s_3)}{\phi(s_3)}

    P_s(\pi') = \frac{\phi(s_3)}{\phi(s_1) + \phi(s_2) + \phi(s_3)} \cdot \frac{\phi(s_2)}{\phi(s_1) + \phi(s_2)} \cdot \frac{\phi(s_1)}{\phi(s_1)}
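A minimal sketch of the permutation probability in Definition 1, taking phi = exp (a common choice; any increasing, strictly positive function works):

```python
import math
from itertools import permutations

def permutation_probability(scores, perm, phi=math.exp):
    """P_s(pi) per Definition 1: the product over positions j of
    phi(s_pi(j)) / sum over k >= j of phi(s_pi(k))."""
    p = 1.0
    for j in range(len(perm)):
        denom = sum(phi(scores[o]) for o in perm[j:])
        p *= phi(scores[perm[j]]) / denom
    return p

s = [3.0, 2.0, 1.0]                                # s1 > s2 > s3
p_123 = permutation_probability(s, [0, 1, 2])
p_321 = permutation_probability(s, [2, 1, 0])
print(p_123 > p_321)                               # True: sorted order is most likely

# The probabilities over all n! permutations sum to 1
total = sum(permutation_probability(s, list(p)) for p in permutations(range(3)))
print(round(total, 9))                             # 1.0
```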
Permutation Probability
- Objects: {A, B, C}; permutations: ABC, ACB, BAC, BCA, CAB, CBA
- Suppose a ranking function assigns scores sA, sB and sC to the objects
- Permutation probability: the likelihood of a permutation
- P(ABC) > P(CBA) if sA > sB > sC
Top One Probability
- Objects: {A, B, C}; permutations: ABC, ACB, BAC, BCA, CAB, CBA
- Suppose a ranking function assigns scores sA, sB and sC to the objects
- The top one probability of an object is the probability of it being ranked on the top, given the scores of all the objects:
- P(A) = P(ABC) + P(ACB)
- P(B) = P(BAC) + P(BCA)
- P(C) = P(CBA) + P(CAB)
- Notice that to calculate the n top one probabilities this way, we still need to calculate n! permutation probabilities.
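A sketch of the enumeration above, with phi = exp. (The paper further shows that the top one probability equals phi(s_j) / sum_k phi(s_k), so the n! enumeration is not needed in practice; for phi = exp this is exactly a softmax over the scores.)

```python
import math
from itertools import permutations

def perm_prob(scores, perm, phi=math.exp):
    # permutation probability per Definition 1
    p = 1.0
    for j in range(len(perm)):
        p *= phi(scores[perm[j]]) / sum(phi(scores[o]) for o in perm[j:])
    return p

def top_one_by_enumeration(scores):
    # P(j) = sum of probabilities of all n! permutations ranking j first
    n = len(scores)
    return [sum(perm_prob(scores, p) for p in permutations(range(n)) if p[0] == j)
            for j in range(n)]

def top_one_closed_form(scores):
    # phi(s_j) / sum_k phi(s_k); with phi = exp this is a softmax
    z = sum(math.exp(s) for s in scores)
    return [math.exp(s) / z for s in scores]

s = [3.0, 2.0, 1.0]
print(top_one_by_enumeration(s))
print(top_one_closed_form(s))   # the two agree
```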
Listwise Loss Function
- With the use of top one probability, given two lists of scores, we can use any metric to represent the distance between the two score lists.
- For example, when we use Cross Entropy as the metric, the listwise loss function becomes the cross entropy between the top one probability distributions of the two score lists.
- Example: ground truth ABCD vs. ranking outputs ACBD or ABDC
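With phi = exp, the top one probabilities form a softmax, and the loss for one query is the cross entropy -sum_j P_y(j) log P_z(j). A sketch (the ground-truth scores for A, B, C, D below are hypothetical) showing that swapping B and C, near the top, costs more than swapping C and D:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]

def listwise_loss(y_scores, z_scores):
    # cross entropy between the two top-one (softmax) distributions
    p = softmax(y_scores)
    q = softmax(z_scores)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

# Hypothetical ground-truth scores for documents A, B, C, D
y = [4.0, 3.0, 2.0, 1.0]           # ideal order ABCD
z_acbd = [4.0, 2.0, 3.0, 1.0]      # B and C swapped (near the top)
z_abdc = [4.0, 3.0, 1.0, 2.0]      # C and D swapped (near the bottom)

print(listwise_loss(y, z_acbd) > listwise_loss(y, z_abdc))  # True
```

So the listwise loss penalizes mistakes near the top of the list more heavily, which a flat pairwise error count does not.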
ListNet
- Learning method: ListNet
- Optimizes the listwise loss function based on top one probability, with a neural network model and gradient descent as the optimization algorithm.
- A linear network model is used for simplicity: y = w^T x + b
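A minimal sketch of one ListNet-style update with a linear scorer and the cross-entropy listwise loss (toy data; the bias term and real features are omitted for brevity):

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]

def listnet_step(X, y, w, lr=0.1):
    """One gradient-descent step on the listwise loss for a single query,
    using a linear scorer z_j = w . x_j.

    For cross entropy between two softmax distributions, the gradient of
    the loss w.r.t. the scores is q - p, so grad_w = sum_j (q_j - p_j) x_j.
    """
    z = [sum(wi * xi for wi, xi in zip(w, x)) for x in X]
    p, q = softmax(y), softmax(z)
    loss = -sum(pi * math.log(qi) for pi, qi in zip(p, q))
    grad = [sum((q[j] - p[j]) * X[j][k] for j in range(len(X)))
            for k in range(len(w))]
    return [wi - lr * g for wi, g in zip(w, grad)], loss

# Toy query: 3 documents with 2 hypothetical features each
X = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
y = [2.0, 1.0, 0.0]   # ground-truth judgment scores
w = [0.0, 0.0]
losses = []
for _ in range(100):
    w, loss = listnet_step(X, y, w)
    losses.append(loss)
print(losses[0] > losses[-1])   # True: the listwise loss decreases
```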
ListNet
Ranking Accuracy
- ListNet vs. RankNet, Ranking SVM, RankBoost
- 3 datasets: TREC 2003, OHSUMED and CSearch
- TREC 2003: relevance judgments (Relevant and Irrelevant), 20 features extracted
- OHSUMED: relevance judgments (Definitely Relevant, Positively Relevant and Irrelevant), 30 features
- CSearch: relevance judgments from 4 ('Perfect Match') to 0 ('Bad Match'), 600 features
- Evaluation measures: Normalized Discounted Cumulative Gain (NDCG) and Mean Average Precision (MAP)
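NDCG is not spelled out in the slides; a common formulation (gain 2^rel - 1 with a log2 position discount; the exact variant used in the paper's evaluation may differ) can be sketched as:

```python
import math

def dcg_at_n(rels, n):
    # DCG@n with gain 2^rel - 1 and log2(position + 1) discount
    return sum((2 ** r - 1) / math.log2(i + 2)
               for i, r in enumerate(rels[:n]))

def ndcg_at_n(rels, n):
    # normalize by the DCG of the ideal (descending) ordering
    ideal = dcg_at_n(sorted(rels, reverse=True), n)
    return dcg_at_n(rels, n) / ideal if ideal > 0 else 0.0

# Hypothetical ranked list with graded relevance labels (2 = best)
ranked = [2, 0, 1, 0, 2]
print(round(ndcg_at_n(ranked, 5), 3))
```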
Experiments
- NDCG@n on TREC 2003
Experiments
- NDCG@n on OHSUMED
Experiments
- NDCG@n on CSearch
Conclusion
- Discussed:
- Learning to Rank
- The pairwise approach and its drawbacks
- The listwise approach, which outperforms the existing pairwise approaches
- Evaluation of the paper:
- A linear neural network model is used. What about a non-linear model?
- The listwise loss function is the key issue (probability models).
References
- Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li. 2007. Learning to rank: from pairwise approach to listwise approach. In Proceedings of the 24th International Conference on Machine Learning (ICML '07), Zoubin Ghahramani (Ed.). ACM, New York, NY, USA, 129-136. DOI=10.1145/1273496.1273513 http://doi.acm.org/10.1145/1273496.1273513
- Hang Li. A Short Introduction to Learning to Rank. IEICE Transactions 94-D(10): 1854-1862 (2011)
- Learning to Rank. Hang Li. Microsoft Research Asia. ACL-IJCNLP 2009 Tutorial. Aug. 2, 2009. Singapore.