
Fast and Accurate Ranking Regression

İlkay Yıldız, Dept. of ECE, Northeastern Univ., [email protected]

Jennifer Dy, Dept. of ECE, Northeastern Univ., [email protected]

Deniz Erdoğmuş, Dept. of ECE, Northeastern Univ., [email protected]

Jayashree Kalpathy-Cramer, Dept. of Radiology, MGH/Harvard Medical School, [email protected]

Susan Ostmo, Dept. of Ophthalmology, Casey Eye Inst., OHSU, [email protected]

J. Peter Campbell, Dept. of Ophthalmology, Casey Eye Inst., OHSU, [email protected]

Michael F. Chiang, Dept. of Ophthalmology, Casey Eye Inst., OHSU, [email protected]

Stratis Ioannidis, Dept. of ECE, Northeastern Univ., [email protected]

Abstract

We consider a ranking regression problem in which we use a dataset of ranked choices to learn Plackett-Luce scores as functions of sample features. We solve the maximum likelihood estimation problem by using the Alternating Directions Method of Multipliers (ADMM), effectively separating the learning of scores and model parameters. This separation allows us to express scores as the stationary distribution of a continuous-time Markov Chain. Using this equivalence, we propose two spectral algorithms for ranking regression that learn model parameters up to 579 times faster than Newton's method.

1 Introduction

Learning from ranked choices has a long history in domains such as econometrics (McFadden, 1973; Ryzin and Mahajan, 1999), transportation (McFadden, 2000), psychometrics (Thurstone, 1927; Bradley and Terry, 1952), and sports (Elo, 1978), to name a few. The Plackett-Luce choice model (Plackett, 1975) is a popular parametric model used for inference in this setting: each sample is parametrized by a score, and the probability that a sample is ranked higher than a set of alternatives is proportional to this score.

Plackett-Luce scores are traditionally learned from ranking observations via Maximum Likelihood Estimation (MLE) (Dykstra, 1960; Hunter, 2004; Hajek et al., 2014; Negahban et al., 2018); under a reparametrization, the negative log-likelihood is convex and Plackett-Luce scores can be estimated via, e.g., Newton's method (Nocedal and Wright, 2006). Nevertheless, for large datasets, Newton's method can be prohibitively slow (Hunter, 2004). Recently, Maystre and Grossglauser (2015) proposed a highly efficient iterative spectral method, termed Iterative Luce Spectral Ranking (ILSR), that estimates Plackett-Luce scores significantly faster than state-of-the-art methods. ILSR relies on the fact that ML estimates of Plackett-Luce scores constitute the stationary distribution of a Markov chain with transition rates defined by ranking observations.

Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS) 2020, Palermo, Italy. PMLR: Volume 108. Copyright 2020 by the author(s).

The above approaches learn Plackett-Luce scores in the absence of sample features, which precludes rank predictions on samples outside the training set. A natural variant of the above setting is ranking regression, whereby Plackett-Luce scores are parametrized functions of sample features. This problem has received considerable attention in the literature, via both shallow (Joachims, 2002; Pahikkala et al., 2009; Tian et al., 2019) and deep models (Burges et al., 2005; Chang et al., 2016; Dubey et al., 2016; Han, 2018; Yıldız et al., 2019). Nevertheless, virtually all existing work on ranking regression relies on classic optimization methods for parameter inference. To the best of our knowledge, the opportunity to accelerate learning in ranking regression via spectral methods has not yet been explored.

We make the following contributions.

• We solve the ranking regression problem by using the Alternating Directions Method of Multipliers (ADMM) (Boyd et al., 2011) to perform MLE, effectively separating the learning of scores and model parameters. This separation allows us to express scores as the stationary distribution of a modified Markov Chain, and to devise spectral algorithms for ranking regression akin to ILSR.

• In particular, we propose two iterative algorithms, PLADMM and PLADMM-log, that jointly estimate model parameters and Plackett-Luce scores via a spectral method. Though the problems solved are non-convex, we establish conditions that yield convergence guarantees, as well as initializations tailored to the Plackett-Luce objective.

• Our algorithms yield significant performance dividends in terms of both speed and accuracy on synthetic and real-life datasets. PLADMM and PLADMM-log are up to 579 times faster than traditional optimization methods regressing Plackett-Luce scores from features, including Newton's method. Furthermore, for large datasets, PLADMM and PLADMM-log outperform feature-less methods, including ILSR, by 13% in maximal choice prediction accuracy and 9% in ranking prediction Kendall-Tau correlation.

From a technical standpoint, we show that the Plackett-Luce negative log-likelihood augmented with a proximal penalty has stationary points that satisfy the balance equations of a Markov chain (c.f. Thm. 4.2). In turn, ADMM allows us to reduce ranking regression to a regularized MLE with precisely such a penalty. The remainder of this paper is organized as follows. We review related literature in Sec. 2. We formulate our problem in Sec. 3 and summarize ILSR. We describe our main contributions and proposed algorithms in Sec. 4. We present our experiments in Sec. 5 and conclude with future work in Sec. 6.

2 Related Work

The problem of rank aggregation (Dwork et al., 2001), in which a total ordering of samples is regressed from ranking observations, is classic; the literature on the subject is vast (see, e.g., the surveys by Fligner and Verducci (1993), Cattelan (2012), and Marden (2014)). Probabilistic inference in this setting typically assumes (a) that a "true" total ordering of samples exists, and (b) that ranking observations exhibit a stochastic transitivity property (Agarwal, 2016): a sample is more likely to be ranked higher than another when this event is consistent with the underlying total ordering.

The noisy permutation model is a non-parametric model for this setting: pairwise comparisons consistent with the underlying total ordering are observed under i.i.d. Bernoulli noise. Maximum likelihood estimation (MLE) is NP-hard in this setting. A polytime algorithm by Braverman and Mossel (2008) recovers the underlying ordering in Θ(n log n) comparisons, w.h.p.; this is tightened by several recent works (Wauthier et al., 2013; Mao et al., 2017, 2018). The Mallows model (Mallows, 1957) assumes that the probability of a ranking observation is a decreasing function of its distance from the underlying total ordering, under appropriate notions of distance (e.g., Kendall-Tau); MLE can be approached, e.g., via EM (Lu and Boutilier, 2011).

Shah et al. (2016a) learn the full matrix of pairwise comparison probabilities via a minimax optimal estimator requiring Θ(log² n) comparisons. Rajkumar and Agarwal (2016) learn the matrix via matrix completion, requiring Θ(nr log n) comparisons, where r ≪ n is the rank. Ammar and Shah (2011) assume that comparisons are sampled from an unknown distribution over total orderings and propose an entropy maximization algorithm requiring Θ(n²) comparisons.

We focus on parametric models, as they are more natural in the context of regressing rankings from sample features. In both Plackett-Luce (Plackett, 1975) and Thurstone (Thurstone, 1927), each sample is parametrized by a score. In the Thurstone model, observations result from comparing scores after the addition of Gaussian noise. Vojnovic and Yun (2016) and Shah et al. (2016b) estimate Thurstone scores via MLE and provide sample complexity bounds that are inversely proportional to the smallest non-zero eigenvalue of the Laplacian of a graph modeling comparisons. In Plackett-Luce, the probability that a sample is chosen over a set of alternatives is proportional to its score. Hunter (2004) proposes a Minorization-Maximization (MM) approach to estimate Plackett-Luce scores via MLE, earlier used by Dykstra (1960) on pairwise comparisons (i.e., in the Bradley-Terry (BT) setting). Hajek et al. (2014) provide an upper bound on the error in estimating the Plackett-Luce scores via MLE and show that the latter is minimax-optimal. Negahban et al. (2018) propose a latent factor model, estimating parameters via a convex relaxation of the corresponding rank penalty and providing sample complexity guarantees. Assuming score priors, Guiver and Snelson (2009), Caron and Doucet (2012), and Azari et al. (2012) estimate Plackett-Luce scores via Bayesian inference.

Our focus on Plackett-Luce is due to the recent emergence of spectral algorithms for inference in this setting. Negahban et al. (2012) propose the Rank Centrality (RC) algorithm for the BT setting and derive a minimax error bound. Chen and Suh (2015) propose a spectral MLE algorithm extending RC with an additional stage that cyclically performs MLE for each score. Soufiani et al. (2013) and Jang et al. (2017) extend RC to rankings of two or more samples by breaking rankings into independent comparisons. Improved bounds, applying also to broader noise settings, are provided by Rajkumar and Agarwal (2014). Khetan and Oh (2016) generalize the work by Soufiani et al. (2013) by breaking rankings into independent shorter rankings and building a hierarchy of tractable and consistent estimators. Blanchet et al. (2016) model sequential choices by state transitions in a Markov chain (MC), where transitions are functions of choice probabilities.

Bridging the above approaches with MLE, Maystre and Grossglauser (2015) show that the MLE of Plackett-Luce scores can be expressed as the stationary distribution of an MC. Their proposed Iterative Luce Spectral Ranking (ILSR) algorithm estimates the Plackett-Luce scores faster than traditional optimization methods, such as, e.g., Hunter's (Hunter, 2004) and Newton's method, and more accurately than prior spectral rank aggregation methods. Ragain and Ugander (2016) show that a spectral approach applies even after relaxing the assumption that the relative order of any two samples is independent of the alternatives. Agarwal et al. (2018) propose another spectral method, called accelerated spectral ranking, which departs from the exact equivalence between MLE and the MC approximation and demonstrates faster convergence than ILSR.

We depart from all aforementioned methods by regressing ranked choices from sample features. Closer to our work, RankSVM (Joachims, 2002) learns a target ranking from features via a linear Support Vector Machine (SVM), with constraints imposed by all possible comparisons. Pahikkala et al. (2009) propose a regularized least-squares based algorithm for learning to rank from comparisons. Several works learn comparisons from features via MLE over logistic BT models (Guo et al., 2018; Tian et al., 2019); deeper models have also been considered (Burges et al., 2005; Chang et al., 2016; Dubey et al., 2016; Han, 2018; Yıldız et al., 2019). Niranjan and Rajkumar (2017) assume that features are low-dimensional and use matrix completion to recover the BT scores. Saha and Rajkumar (2018) propose a least-squares based algorithm called f-BTL to regress the BT scores; we adjust this to initialize our algorithm. To the best of our knowledge, we are the first to use a spectral method akin to ILSR to (a) regress Plackett-Luce scores from features, and (b) establish a significant speedup over prior art.

3 Problem Formulation

Plackett-Luce Model. We consider a dataset of n samples indexed by i ∈ N ≡ {1, . . . , n}. Every sample i ∈ N has a corresponding p-dimensional feature vector x_i ∈ R^p. There exists an underlying total ordering of these n samples. A labeler of this dataset acts as a (possibly noisy) oracle revealing this total ordering: when presented with a query A ⊆ N, i.e., a set of alternative samples, the noisy labeler chooses the maximal sample in A w.r.t. the underlying total ordering.

Formally, our "labeled" dataset D = {(c_ℓ, A_ℓ) | ℓ ∈ M = {1, ..., M}} consists of M observations (c_ℓ, A_ℓ), ℓ ∈ M, where A_ℓ ⊆ N is the ℓ-th query submitted to the labeler and c_ℓ ∈ A_ℓ is her respective ℓ-th maximal choice (i.e., the label). We tackle the problem of regressing such choices c_ℓ from the features x_i of the samples i ∈ A_ℓ. To do so, we assume that choices are governed by the Plackett-Luce model (Plackett, 1975). The model asserts that every sample i ∈ N is associated with a non-negative deterministic score π_i ∈ R_+. Given scores π = [π_i]_{i∈N} ∈ R^n_+, (a) the observations (c_ℓ, A_ℓ), ℓ ∈ M, are independent, and (b) given query A_ℓ,

P(c_ℓ | A_ℓ, π) = π_{c_ℓ} / ∑_{j∈A_ℓ} π_j = π_ℓ / ∑_{j∈A_ℓ} π_j.  (1)

Abusing notation, we write the score of the chosen sample as π_ℓ ≡ π_{c_ℓ}. Note that P(c_ℓ | A_ℓ, π) = P(c_ℓ | A_ℓ, sπ) for all s > 0; thus, w.l.o.g., we may additionally assume (or enforce via rescaling) that Plackett-Luce scores satisfy 1ᵀπ = 1.

Plackett-Luce also applies to ranking data. In the ranking setting, when presented with a query A_ℓ ⊆ N, the labeler ranks the samples in A_ℓ into an ordered sequence α_ℓ1 ≻ α_ℓ2 ≻ · · · ≻ α_ℓ|A_ℓ|. Under the Plackett-Luce model, this ranking is expressed as |A_ℓ| − 1 maximal choice queries: α_ℓ1 over A_ℓ, α_ℓ2 over A_ℓ \ {α_ℓ1}, etc., so that:

P(α_ℓ1 ≻ α_ℓ2 ≻ · · · ≻ α_ℓ|A_ℓ| | A_ℓ, π) = ∏_{t=1}^{|A_ℓ|−1} ( π_{α_ℓt} / ∑_{s=t}^{|A_ℓ|} π_{α_ℓs} ).  (2)

The product form of (2) implies that rankings of a query A_ℓ can be converted to |A_ℓ| − 1 maximal-choice observations, each governed by (1), that have the same joint probability: the ranking (α_ℓ1 ≻ α_ℓ2 ≻ · · · ≻ α_ℓ|A_ℓ|) can be seen as the outcome of α_ℓ1 being chosen as the top within the query set A_ℓ, α_ℓ2 being the top among A_ℓ \ {α_ℓ1}, etc. Keeping this reduction from ranking to maximal-choice datasets in mind, we focus on the latter in our exposition below.
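To make this reduction concrete, here is a minimal Python sketch; the function name and the list-of-indices input format are our own illustrative choices, not part of the released code:

```python
def ranking_to_choices(ranking):
    """Break a ranked query (Eq. (2)) into |A_l| - 1 maximal-choice
    observations (winner, alternatives), each governed by Eq. (1)."""
    return [(ranking[t], set(ranking[t:])) for t in range(len(ranking) - 1)]

# Example: the ranking 2 > 0 > 1 over the query {0, 1, 2} yields
# [(2, {0, 1, 2}), (0, {0, 1})].
```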

Parameter Inference and Regression. Given observations D, Maximum Likelihood Estimation (MLE) of the Plackett-Luce scores π ∈ R^n_+ amounts to minimizing the negative log-likelihood:

L(D | π) ≡ ∑_{ℓ=1}^{M} ( log ∑_{j∈A_ℓ} π_j − log π_ℓ ).  (3)
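As a concrete reference, the following Python sketch evaluates (3), assuming observations are stored as (winner, query set) pairs and π is a NumPy array of positive scores; this is an illustrative snippet, not the authors' released implementation:

```python
import numpy as np

def neg_log_likelihood(pi, observations):
    """Plackett-Luce negative log-likelihood, Eq. (3)."""
    return sum(np.log(sum(pi[j] for j in A)) - np.log(pi[c])
               for c, A in observations)
```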

To regress scores π from sample features x_i, i ∈ N, we consider two cases:

Affine Case. We assume that there exist β ∈ R^p and b ∈ R such that π = π_AFF(β, b; X) ≡ Xβ + b1. Then, MLE of the parameters (β, b) ∈ R^{p+1} amounts to solving:

min_{(β,b): π_AFF(β,b;X) ≥ 0}  L(D | π_AFF(β, b; X)),  (4)

where L is given by (3) and X = [x_1, ..., x_n]ᵀ ∈ R^{n×p}. We note that Problem (4) is not convex, as the objective is not convex in (β, b) ∈ R^{p+1} (c.f. Appendix A).

Logistic Case. In the logistic case, we assume that there exists β ∈ R^p s.t. π = π_LOG(β; X) ≡ [e^{βᵀx_i}]_{i∈N}. As π_LOG(β; X) ≥ 0 by definition, MLE corresponds to:

min_{β ∈ R^p}  L(D | π_LOG(β; X)).  (5)


The objective of (5) is convex in β (c.f. Appendix A), so an optimal solution can be found via, e.g., Newton's method (Nocedal and Wright, 2006).
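A corresponding sketch of the objective of (5) under the logistic parametrization, using the same hypothetical (winner, query set) data format as above:

```python
import numpy as np

def neg_log_likelihood_logistic(beta, X, observations):
    """Objective of (5): Eq. (3) evaluated at pi_i = exp(beta^T x_i)."""
    theta = X @ beta  # theta_i = beta^T x_i, so log pi_i = theta_i
    return sum(np.log(np.exp(theta[list(A)]).sum()) - theta[c]
               for c, A in observations)
```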

Plackett-Luce Without Features and a Spectral Method. We wish to construct highly efficient algorithms for solving regression problems (4) and (5). To do so, we first briefly review the state of the art for learning the scores π in the absence of features. In this case, MLE amounts to:

min_{π ∈ R^n_+}  L(D | π).  (6)

As is the case for (5), reparametrizing the scores as π_i = e^{θ_i}, i ∈ N, makes the negative log-likelihood L convex in θ = [θ_i]_{i∈N}, which in turn enables computing the Plackett-Luce scores via Newton's method. Nevertheless, Newton's method can be prohibitively slow for large n and M (Hunter, 2004). Recently, Maystre and Grossglauser (2015) proposed a novel spectral algorithm that is significantly faster than Newton's method. Their algorithm relies on the following theorem which, for completeness, we re-prove in Appendix B:

Theorem 3.1 (Maystre and Grossglauser (2015)). An optimal solution π ∈ R^n_+ to (6) satisfies:

∑_{j≠i} π_j λ_{ji}(π) = ∑_{j≠i} π_i λ_{ij}(π),  for all i ∈ N,  (7)

where, for all i, j ∈ N with i ≠ j,

λ_{ji}(π) = ∑_{ℓ ∈ W_i ∩ L_j} ( ∑_{t ∈ A_ℓ} π_t )^{−1} ≥ 0,  (8)

for W_i = {ℓ | i ∈ A_ℓ, c_ℓ = i} the observations where sample i ∈ N is chosen, and L_i = {ℓ | i ∈ A_ℓ, c_ℓ ≠ i} the observations where sample i ∈ N is not chosen.

Eq. (7) are the balance equations of a continuous-time Markov Chain (MC) with transition rates:

Λ(π) = [λ_{ji}(π)]_{i,j∈N},  (9)

where the λ_{ji}(π) are given by Eq. (8). Hence, π is the stationary distribution of the MC defined by transition rates Λ(π) (Gallager, 2013). Let ssd(Λ) be the stationary distribution of an MC with transition rates Λ. When the matrix Λ is fixed (i.e., the transition rates are known), the vector ssd(Λ) is a solution to the linear system defined by the balance equations (7) and 1ᵀπ = 1, as it is a distribution.¹ If (9) is irreducible, this linear system has a unique solution π > 0 (Gallager, 2013).

¹In practice, ssd(Λ) can be computed by uniformizing Λ, i.e., increasing self-transition rates until all states have the same outgoing rate, and then finding the leading left eigenvector via, e.g., the power method (Lei et al., 2016).

However, the transition matrix Λ = Λ(π) in Theorem 3.1 is itself a function of π, and is therefore a priori unknown. Maystre and Grossglauser (2015) find π through an iterative algorithm. Starting from the uniform distribution π⁰ = (1/n)1, they compute:

π^{l+1} = ssd(Λ(π^l)),  for l = 0, 1, 2, . . . ,  (10)

where Λ(·) is given by (8), (9). Maystre and Grossglauser (2015) refer to Eq. (10) as the Iterative Luce Spectral Ranking (ILSR) algorithm. They also establish that (10) converges to an optimal solution of (6) under mild assumptions. Most importantly, as mentioned above, ILSR significantly outperforms state-of-the-art MLE algorithms in computational efficiency.
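The following Python sketch illustrates one ILSR iteration: build the rate matrix Λ(π) from Eq. (8) and compute its stationary distribution via uniformization and power iteration, as described in the footnote above. The function names, the (winner, query set) observation format, and the tolerances are our own illustrative choices, not the authors' released code:

```python
import numpy as np

def stationary_distribution(rates, tol=1e-10, max_iter=10000):
    """ssd(.) of a continuous-time MC; rates[j, i] is the rate from j to i."""
    n = rates.shape[0]
    Q = rates - np.diag(rates.sum(axis=1))   # generator: rows sum to zero
    theta = np.max(-np.diag(Q)) + 1e-12
    P = np.eye(n) + Q / theta                # uniformized transition matrix
    pi = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        new_pi = pi @ P                      # power iteration on the left
        new_pi /= new_pi.sum()
        if np.abs(new_pi - pi).sum() < tol:
            break
        pi = new_pi
    return new_pi

def ilsr(n, observations, tol=1e-8, max_iter=100):
    """Iterative Luce Spectral Ranking, Eq. (10)."""
    pi = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        rates = np.zeros((n, n))
        for c, A in observations:            # Eq. (8): losers flow to the winner
            denom = sum(pi[t] for t in A)
            for j in A:
                if j != c:
                    rates[j, c] += 1.0 / denom
        new_pi = stationary_distribution(rates)
        if np.abs(new_pi - pi).sum() < tol:
            return new_pi
        pi = new_pi
    return pi
```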

4 Plackett-Luce ADMM (PLADMM) Algorithm

Given ILSR's significant computational benefits, we wish to develop analogues in the regression setting. In contrast to the feature-less setting, it is not a priori evident how to solve Problems (4) and (5) via a spectral approach. Taking the affine case as an example, and momentarily ignoring issues of non-convexity, the stationary points of the Lagrangian of the optimization problem (4) cannot be expressed via the balance equations of an MC. Our main contribution is to circumvent this problem by using the Alternating Directions Method of Multipliers (ADMM) (Boyd et al., 2011). Intuitively, ADMM allows us to decouple the optimization of scores π from model parameters β and b, encapsulating them in a quadratic penalty: the latter becomes amenable to a spectral approach after a series of manipulations that we outline below (see Thm. 4.2). We focus here on the affine case, extending our method to the logistic case in Appendix E.

An ADMM Approach. We rewrite Problem (4) as:

Minimize  L(D | π)  (11a)
subject to:  π = Xβ + b1,  π ≥ 0.  (11b)

To simplify our notation, we introduce β̃ = (β, b) ∈ R^{p+1} and X̃ = [X | 1] ∈ R^{n×(p+1)}, so that π = X̃β̃. ADMM solves (11) by minimizing the following augmented Lagrangian:

L_ρ(β̃, π, y) = L(D | π) + yᵀ(X̃β̃ − π) + (ρ/2)‖X̃β̃ − π‖²₂,  (12)

where y ∈ R^n is a dual variable corresponding to the equality constraints in Eq. (11) and ρ > 0 is a penalty parameter. ADMM alternates between optimizing β̃ and π, thereby decoupling these two variables. Using a rescaled dual variable u = (1/ρ)y ∈ R^n for convenience, applying ADMM to problem (11) yields the following iterative algorithm (see Appendix C for a detailed derivation):

β̃^{k+1} = argmin_{β̃ ∈ R^{p+1}}  ‖X̃β̃ − π^k + u^k‖²₂,  (13a)


π^{k+1} = argmin_{π ∈ R^n_+}  ( L(D | π) + (ρ/2)‖X̃β̃^{k+1} − π + u^k‖²₂ ),  (13b)

u^{k+1} = u^k + X̃β̃^{k+1} − π^{k+1}.  (13c)

This has the following immediate computational advantages. First, step (13a) is a quadratic minimization and admits a closed-form solution. Crucially, step (13b) is amenable to a spectral approach, though the corresponding MC is not as apparent as in ILSR; we outline its construction below.
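For concreteness, a minimal sketch of the two closed-form steps (the β̃-update of (13a), whose solution is derived as Eq. (28a) in Appendix C, and the scaled dual update (13c)); the function and variable names are ours:

```python
import numpy as np

def beta_update(X_tilde, pi, u):
    """Step (13a): least-squares fit of X_tilde @ beta to (pi - u), cf. Eq. (28a)."""
    return np.linalg.solve(X_tilde.T @ X_tilde, X_tilde.T @ (pi - u))

def dual_update(u, X_tilde, beta, pi):
    """Step (13c): scaled dual ascent on the constraint pi = X_tilde @ beta."""
    return u + X_tilde @ beta - pi
```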

An MC for Step (13b). We first establish the following auxiliary lemma, proved in Appendix D.1.

Lemma 4.1. Given β̃^{k+1} ∈ R^{p+1} and u^k ∈ R^n, let π ∈ R^n_+ be such that:

∇_π L_ρ(β̃^{k+1}, π, u^k) = 0.  (14)

For σ = [σ_i]_{i∈N} ≡ ρ(π − X̃β̃^{k+1} − u^k), and [λ_{ij}(π)]_{i,j∈N} given by (8), (14) is equivalent to:

∑_{j≠i} π_j λ_{ji}(π) − ∑_{j≠i} π_i λ_{ij}(π) = π_i σ_i,  (15)

for all i ∈ N.

Although Eq. (15) looks similar to Eq. (7), it is not evident that it corresponds to the balance equations of an MC since, in general, σ ≠ 0. Nevertheless, we prove that this is indeed the case:

Theorem 4.2. Eq. (15) are the balance equations of a continuous-time MC with transition rates:

μ_{ji}(π) = λ_{ji}(π) + 2π_i σ_i σ_j / ( ∑_{t∈N_−} π_t σ_t − ∑_{t∈N_+} π_t σ_t )   if j ∈ N_+ and i ∈ N_−,
μ_{ji}(π) = λ_{ji}(π)   otherwise,  (16)

where σ = [σ_i]_{i∈N} ≡ ρ(π − X̃β̃^{k+1} − u^k), the [λ_{ij}(π)]_{i,j∈N} are given by (8), and (N_+, N_−) is a partition of N such that σ_i ≥ 0 for all i ∈ N_+ and σ_i < 0 for all i ∈ N_−.

The proof is in App. D.2. By Lemma 4.1 and Theorem 4.2, we conclude that a stationary π ∈ R^n_+ satisfying (14) is also the stationary distribution of the continuous-time MC with transition rates:

M(π) = [μ_{ji}(π)]_{i,j∈N},  (17)

where the μ_{ji}(π) are given by Eq. (16). Motivated by these observations, and mirroring ILSR (Eq. (10)), we compute a solution to (13b) via:

π^{l+1} = ssd(M(π^l)),  for l = 0, 1, 2, . . . ,  (18)

where M(·) is given by Eq. (17). We refer to this procedure as ILSRX ("ILSR with features").
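A sketch of the transition rates M(π) of Eq. (16), reusing the hypothetical observation format from the ILSR sketch in Sec. 3; the zero-denominator guard is our own safeguard for the degenerate case σ = 0, not part of the theorem:

```python
import numpy as np

def ilsrx_rates(pi, observations, X_tilde, beta, u, rho):
    """Transition rates M(pi) = [mu_ji(pi)] from Theorem 4.2 / Eq. (16)."""
    n = len(pi)
    rates = np.zeros((n, n))
    for c, A in observations:                 # base rates lambda_ji(pi), Eq. (8)
        denom = sum(pi[t] for t in A)
        for j in A:
            if j != c:
                rates[j, c] += 1.0 / denom
    sigma = rho * (pi - X_tilde @ beta - u)
    pos = np.where(sigma >= 0)[0]             # N_+
    neg = np.where(sigma < 0)[0]              # N_-
    denom = (pi[neg] * sigma[neg]).sum() - (pi[pos] * sigma[pos]).sum()
    if denom != 0:                            # extra rates from N_+ to N_-
        for j in pos:
            for i in neg:
                rates[j, i] += 2 * pi[i] * sigma[i] * sigma[j] / denom
    return rates
```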

Overall Algorithm. Putting everything together, our Plackett-Luce ADMM (PLADMM) algorithm solving Eq. (11) is summarized in Algorithm 1. We iteratively update β̃, π, and u via Eq. (13) until convergence, with β̃ updated via Eq. (13a) and π updated via ILSRX (Eq. (18)). At iteration k, we initialize ILSRX with π^{k−1}. We note that, as Problem (11) is non-convex, selecting a good initialization point is important in practice. We discuss initialization, additional computational issues, and theoretical guarantees below.

Algorithm 1 PLADMM
1: procedure ADMM(X̃, D = {(c_ℓ, A_ℓ) | ℓ ∈ M}, ρ)
2:   Initialize β̃ via Eq. (20); π ← X̃β̃; u ← 0
3:   repeat
4:     π ← ILSRX(ρ, π, X̃, β̃, u)
5:     u ← u + X̃β̃ − π
6:     β̃ ← (X̃ᵀX̃)⁻¹X̃ᵀ(π − u)
7:   until convergence
8:   return β̃, π
9: end procedure

1: procedure ILSRX(ρ, π, X̃, β̃, u)
2:   repeat
3:     σ ← ρ(π − X̃β̃ − u)
4:     Calculate M(π) = [μ_{ji}(π)]_{i,j∈N} via Eq. (16)
5:     π ← ssd(M(π))
6:   until convergence
7:   return π
8: end procedure
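Putting the pieces together, a minimal Python sketch of Algorithm 1, assuming the hypothetical helpers `ilsrx_rates` and `stationary_distribution` sketched earlier and an initial β̃⁰ obtained from Eq. (20) (e.g., via the initialization sketch given below); this is an illustration, not the authors' released implementation:

```python
import numpy as np

def pladmm(X, observations, beta0, rho=1.0, n_outer=50, n_ilsrx=20):
    """Algorithm 1 (PLADMM): alternate the ILSRX pi-update, the dual update,
    and the closed-form beta-update."""
    n = X.shape[0]
    X_tilde = np.hstack([X, np.ones((n, 1))])                 # augmented features [X | 1]
    pinv = np.linalg.solve(X_tilde.T @ X_tilde, X_tilde.T)    # precomputed for (13a)
    beta, u = np.asarray(beta0, dtype=float), np.zeros(n)
    pi = X_tilde @ beta
    for _ in range(n_outer):
        for _ in range(n_ilsrx):                              # pi-update via ILSRX, Eq. (18)
            pi = stationary_distribution(
                ilsrx_rates(pi, observations, X_tilde, beta, u, rho))
        u = u + X_tilde @ beta - pi                           # dual update, Eq. (13c)
        beta = pinv @ (pi - u)                                # beta-update, Eq. (13a)
    return beta, pi
```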

Initialization. We initialize β̃ so that X̃β̃ is a good approximation of the Plackett-Luce scores. We use a technique akin to Saha and Rajkumar (2018), applied to our affine setting. Given a distribution over queries A ⊆ N, let P_ij = E_A[c = i | {i, j} ⊆ A] be the probability that i is chosen given a query A that contains both i and j. By (1), for i, j ∈ N, P_ij/P_ji = π_i/π_j = (x_iᵀβ + b)/(x_jᵀβ + b), or:

δ_ij(β̃) ≡ (P_ij x_j − P_ji x_i)ᵀβ + (P_ij − P_ji)b = 0.  (19)

Motivated by (19), we estimate P_ij empirically from D and obtain our initialization β̃⁰ = (β⁰, b⁰) ∈ R^{p+1} by solving (19) in the least-squares sense; that is,

β̃⁰ = argmin_{β̃ ∈ R^{p+1}: X̃β̃ ≥ 0, 1ᵀX̃β̃ = 1}  ∑_{i,j} δ_ij(β̃)².  (20)

Note that this is a convex quadratic program. Finally, we also set the initial dual variable to u⁰ = 0.
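A sketch of the initialization (20), using SciPy's general-purpose SLSQP solver in place of a dedicated QP solver; here P is the matrix of empirical pairwise choice probabilities estimated from the training data, and all names are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import minimize

def init_beta(X_tilde, P):
    """Solve Eq. (20): least-squares fit of the residuals delta_ij (Eq. (19))
    subject to X_tilde @ beta >= 0 and sum(X_tilde @ beta) = 1."""
    n, d = X_tilde.shape

    def objective(beta):
        s = X_tilde @ beta                   # candidate scores
        pairs = [(i, j) for i in range(n) for j in range(n)
                 if i != j and (P[i, j] > 0 or P[j, i] > 0)]
        return sum((P[i, j] * s[j] - P[j, i] * s[i]) ** 2 for i, j in pairs)

    constraints = [{'type': 'ineq', 'fun': lambda b: X_tilde @ b},
                   {'type': 'eq', 'fun': lambda b: (X_tilde @ b).sum() - 1.0}]
    res = minimize(objective, np.full(d, 1.0 / d), method='SLSQP',
                   constraints=constraints)
    return res.x
```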

Computational Complexity. Each iteration of PLADMM involves the three steps in Eq. (13). One iteration of ILSRX is O(∑_{ℓ∈D} |A_ℓ| + n²): the two terms correspond, respectively, to constructing the transition matrix M(π) via Eq. (17) and to finding the stationary distribution π via, e.g., a power method (Lei et al., 2016). The updates of u and β̃ are both O(n(p+1)), as they are matrix-vector multiplications, since the matrix (X̃ᵀX̃)⁻¹X̃ᵀ can be precomputed.

Theoretical Guarantees. In general, condition (14) is not sufficient for optimality w.r.t. step (13b). To address this, we require the following technical assumption:

Assumption 4.1. For {π^k}_{k∈N} given by (13b), there exists an ε > 0 such that π^k_i > ε for all i ∈ N and k ∈ N.


Table 1: No. of samples (n), no. of parameters (p), no. of observations (M), query size (|A_ℓ|), no. of cross-validation folds (nfold), and type of observations for the real datasets.

Spec. | ROP    | FAC    | Pairwise Sushi | Triplet Sushi
------|--------|--------|----------------|--------------
n     | 100    | 1000   | 100            | 100
p     | 143    | 50     | 18             | 18
M     | 29705  | 728    | 450            | 1200
|A_ℓ| | 2      | 2      | 2              | 3
nfold | 10     | 10     | 3              | 10
Type  | Choice | Choice | Choice         | Ranking

Under this assumption, we show that stationarity implies optimality w.r.t. (13b) for large enough ρ:

Theorem 4.3. Under Assumption 4.1, for ρ ≥ (2/ε²) max_i ∑_{ℓ: i∈A_ℓ} 1/|A_ℓ|², a π > 0 satisfying condition (14) is a minimizer of (13b).

The proof is in Appendix D.3. Moreover, although problem (11) is non-convex, we establish the following convergence guarantee for the ADMM steps (13). We provide the proof in Appendix D.4.

Theorem 4.4. Suppose that there exists κ > 0 such that X̃ᵀX̃ ⪰ κI and that the sequence {(π^k, u^k, β̃^{k+1})}_{k∈N} generated by (13) is bounded. Then, under Assumption 4.1, for ρ > 2 max_i |W_i| / ε², where W_i is defined in Theorem 3.1, the sequence {(π^k, u^k, β̃^{k+1})}_{k∈N} generated by (13) converges to a point that satisfies the Karush-Kuhn-Tucker (KKT) conditions of (11).

5 Experiments

Experiment Setup. We evaluate PLADMM and PLADMM-log (the spectral algorithm for the logistic case, described in Appendix E) on synthetic and real-life datasets, summarized in Table 1. Additional details on our datasets are in Appendix F.1. We perform 10-fold cross validation (CV) for each dataset, except Pairwise Sushi, for which we use 3 folds. For synthetic datasets, we also repeat experiments over 5 random generations. We partition each dataset into training and test sets in two ways. In observation CV, we partition the dataset w.r.t. the observations M, using 90% of the M observations for training and the remaining 10% for testing. In sample CV, we partition the samples N, using 90% of the n samples for training and the remaining 10% for testing. When partitioning w.r.t. samples, observations containing samples from both training and test partitions are discarded. As the Pairwise Sushi dataset contains few observations (c.f. Table 1), we perform 3-fold cross validation in this case.

We implement² seven inference algorithms, described in detail in Appendix F.2. Four are feature methods, i.e., algorithms that regress Plackett-Luce scores from features: PLADMM, described in Algorithm 1; PLADMM-log, described in Appendix E; sequential least-squares quadratic programming (SLSQP), which solves (4); and Newton on β, which solves the convex problem (5) via Newton's method. The remaining three are featureless methods, i.e., algorithms that learn the Plackett-Luce scores from the choice observations alone: Iterative Luce Spectral Ranking (ILSR), described by Eq. (10); the Minorization-Maximization (MM) algorithm (Hunter, 2004); and Newton on θ, which solves Eq. (6) via the reparametrization π_i = e^{θ_i}, i ∈ N, using Newton's method on θ = [θ_i]_{i∈N}.

²Our code is publicly available at https://github.com/neu-spiral/FastAndAccurateRankingRegression

Performance Metrics. We run each algorithm until convergence (see App. F.2 for criteria). We measure the elapsed time, including time spent in initialization, in seconds (Time), and the number of iterations (Iter.). We measure the prediction performance by top-1 accuracy (Top-1 Acc.) and Kendall-Tau correlation (KT) on the test set; formulas are provided in App. F.3. For synthetic datasets, we also measure the quality of convergence by the norm of the difference between the estimated and true Plackett-Luce scores (Δπ); lower values indicate better estimation. We report averages and standard deviations over folds.
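For reference, minimal sketches of the two prediction metrics on held-out choices (the exact formulas used in the paper are in App. F.3; `scores` are estimated Plackett-Luce scores and the observation format is the hypothetical one used above):

```python
from scipy.stats import kendalltau

def top1_accuracy(scores, test_observations):
    """Fraction of test queries whose highest-scoring sample is the observed choice."""
    hits = sum(max(A, key=lambda i: scores[i]) == c for c, A in test_observations)
    return hits / len(test_observations)

def kendall_tau(estimated_scores, true_scores):
    """Kendall-Tau correlation between the two score-induced rankings."""
    return kendalltau(estimated_scores, true_scores)[0]
```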

Execution Environment. For Tables 2 - 4, we measure timing on an Intel Xeon CPU E5-2680v2 2.8GHz with 128GB RAM. Particularly for experiments on larger synthetic datasets (c.f. Figures 1 - 2), we use an Intel Xeon CPU E5-2680v4 2.4GHz with 500GB RAM.

Sample CV. We begin with the experiments on sample CV, in which we partition the samples N into training and test sets. Table 2 shows the evaluations of all algorithms trained on a synthetic dataset with n = 1000 samples, p = 100 features, M = 1000 observations, and query size |A_ℓ| = 2. PLADMM and PLADMM-log converge 4−27 times faster than the other feature methods, i.e., Newton on β and SLSQP. Recall that in sample CV partitioning, training observations contain only training samples: test samples do not participate in any of the training observations. Thus, the featureless methods ILSR, MM, and Newton on θ are no better than random predictors, with 0.5 Top-1 Acc. and 0.0 KT. By regressing the Plackett-Luce scores from features, PLADMM and PLADMM-log significantly outperform the predictions of ILSR, MM, and Newton on θ, by 16%−33% Top-1 Acc. and 16%−30% KT.

Real datasets. We observe an equally significant speed gain on real datasets; Table 3 shows the evaluations on four real datasets partitioned w.r.t. sample CV. PLADMM and PLADMM-log are 3−18 times faster than Newton on β and SLSQP. This speed gain is fundamentally due to the smaller per-iteration complexity of PLADMM and PLADMM-log (c.f. Section 4).


Partitioning | Method | Time (s) ↓ | Iter. ↓ | Δπ ↓ | Top-1 Acc. ↑ | KT ↑
-------------|--------|------------|---------|------|--------------|-----
Sample CV | PLADMM | 0.237 ± 0.006 | 4 ± 0 | 0.717 ± 0.207 | 0.831 ± 0.119 | 0.609 ± 0.247
Sample CV | PLADMM-log | 1.428 ± 2.595 | 49 ± 79 | 0.845 ± 0.204 | 0.668 ± 0.159 | 0.335 ± 0.318
Sample CV | ILSR (no X) | 0.045 ± 0.002 | 2 ± 0 | 0.718 ± 0.207 | 0.5 ± 0.0 | −1.0 ± 0.0
Sample CV | MM (no X) | 9.728 ± 0.487 | 500 ± 0 | 1.2 ± 0.1 | 0.5 ± 0.0 | 0.0 ± 0.0
Sample CV | Newton on θ (no X) | 4.537 ± 0.729 | 14 ± 3 | 1.236 ± 0.132 | 0.5 ± 0.0 | −0.08 ± 0.272
Sample CV | Newton on β | 6.406 ± 2.104 | 14 ± 5 | 0.808 ± 0.462 | 0.844 ± 0.148 | 0.688 ± 0.296
Sample CV | SLSQP | 43.908 ± 24.469 | 229 ± 132 | 0.718 ± 0.206 | 0.796 ± 0.106 | 0.592 ± 0.211
Observation CV | PLADMM | 0.48 ± 0.24 | 4 ± 0 | 0.717 ± 0.207 | 0.837 ± 0.037 | 0.569 ± 0.072
Observation CV | PLADMM-log | 1.58 ± 2.027 | 29 ± 14 | 0.883 ± 0.208 | 0.699 ± 0.066 | 0.398 ± 0.132
Observation CV | ILSR (no X) | 0.098 ± 0.056 | 2 ± 0 | 0.718 ± 0.208 | 0.708 ± 0.045 | 0.389 ± 0.088
Observation CV | MM (no X) | 11.302 ± 0.515 | 500 ± 0 | 0.864 ± 0.19 | 0.685 ± 0.037 | 0.354 ± 0.074
Observation CV | Newton on θ (no X) | 8.218 ± 1.782 | 14 ± 3 | 1.244 ± 0.121 | 0.506 ± 0.029 | 0.01 ± 0.05
Observation CV | Newton on β | 7.696 ± 2.35 | 14 ± 4 | 0.804 ± 0.463 | 0.871 ± 0.087 | 0.742 ± 0.173
Observation CV | SLSQP | 47.824 ± 28.585 | 219 ± 138 | 0.718 ± 0.206 | 0.819 ± 0.035 | 0.637 ± 0.07

Table 2: Evaluations on a synthetic dataset with n = 1000, p = 100, and M = 1000, partitioned w.r.t. sample CV and observation CV. Time (s) and Iter. are training metrics; Δπ, Top-1 Acc., and KT are measured on the test set. We report the convergence time (Time), number of iterations until convergence (Iter.), norm error in estimating the true Plackett-Luce scores (Δπ), top-1 accuracy on the test set (Top-1 Acc.), and Kendall-Tau correlation on the test set (KT). ILSR, MM, and Newton on θ do not use the features X. Newton on β and SLSQP regress π from X.

[Figure 1: plots of convergence time (Time) and top-1 test accuracy (Top-1 Acc.), (a) vs. p and (b) vs. n, for PLADMM, PLADMM-log, ILSR, and Newton on β.]

Figure 1: Convergence time (Time) and top-1 test accuracy (Top-1 Acc.) vs. n and p for PLADMM, PLADMM-log, ILSR, and Newton on β. Evaluations are on synthetic datasets containing M = 250 observations partitioned w.r.t. observation CV. The number of samples varies over n ∈ {50, 100, 1000, 10000, 100000} when the number of features is p = 100, and the number of features varies over p ∈ {10, 100, 1000, 10000} when the number of samples is n = 1000.

Dataset | Method | Time (s) ↓ | Iter. ↓ | Top-1 Acc. ↑ | KT ↑
--------|--------|------------|---------|--------------|-----
FAC | PLADMM | 0.301 ± 0.048 | 4 ± 0 | 0.654 ± 0.237 | 0.307 ± 0.473
FAC | PLADMM-log | 0.298 ± 0.466 | 10 ± 15 | 0.685 ± 0.237 | 0.369 ± 0.474
FAC | ILSR (no X) | 0.059 ± 0.016 | 2 ± 0 | 0.5 ± 0.0 | −1.0 ± 0.0
FAC | MM (no X) | 5.905 ± 0.282 | 500 ± 0 | 0.5 ± 0.0 | 0.0 ± 0.0
FAC | Newton on θ (no X) | 7.604 ± 0.805 | 18 ± 2 | 0.5 ± 0.0 | −0.4 ± 0.49
FAC | Newton on β | 0.859 ± 0.077 | 6 ± 1 | 0.67 ± 0.17 | 0.34 ± 0.339
FAC | SLSQP | 14.332 ± 5.684 | 178 ± 67 | 0.675 ± 0.147 | 0.349 ± 0.293
ROP | PLADMM | 1.708 ± 0.166 | 4 ± 0 | 0.783 ± 0.03 | 0.565 ± 0.06
ROP | PLADMM-log | 0.325 ± 0.028 | 1 ± 0 | 0.724 ± 0.105 | 0.448 ± 0.209
ROP | ILSR (no X) | 0.649 ± 0.053 | 2 ± 0 | 0.5 ± 0.0 | −1.0 ± 0.0
ROP | MM (no X) | 0.001 ± 0.001 | 1 ± 0 | 0.5 ± 0.0 | −1.0 ± 0.0
ROP | Newton on θ (no X) | 68.924 ± 5.521 | 8 ± 0 | 0.497 ± 0.012 | −0.988 ± 0.036
ROP | Newton on β | 47.563 ± 8.342 | 2 ± 1 | 0.552 ± 0.048 | 0.103 ± 0.096
ROP | SLSQP | 4.823 ± 4.914 | 2 ± 1 | 0.769 ± 0.052 | 0.538 ± 0.104
Pairwise Sushi | PLADMM | 0.046 ± 0.01 | 4 ± 0 | 0.451 ± 0.082 | −0.09 ± 0.177
Pairwise Sushi | PLADMM-log | 0.141 ± 0.025 | 27 ± 13 | 0.532 ± 0.076 | 0.064 ± 0.152
Pairwise Sushi | ILSR (no X) | 0.014 ± 0.006 | 2 ± 0 | 0.5 ± 0.0 | −1.0 ± 0.0
Pairwise Sushi | MM (no X) | 1.513 ± 0.587 | 352 ± 183 | 0.5 ± 0.0 | −1.0 ± 0.0
Pairwise Sushi | Newton on θ (no X) | 1.282 ± 0.924 | 18 ± 9 | 0.5 ± 0.0 | −0.666 ± 0.472
Pairwise Sushi | Newton on β | 0.21 ± 0.115 | 4 ± 2 | 0.665 ± 0.035 | 0.33 ± 0.069
Pairwise Sushi | SLSQP | 4.619 ± 6.321 | 168 ± 235 | 0.624 ± 0.065 | 0.248 ± 0.13
Triplet Sushi | PLADMM | 0.091 ± 0.02 | 4 ± 0 | 0.358 ± 0.805 | −0.333 ± 0.924
Triplet Sushi | PLADMM-log | 0.556 ± 0.276 | 40 ± 25 | 0.393 ± 0.826 | 0.096 ± 1.069
Triplet Sushi | ILSR (no X) | 0.033 ± 0.012 | 2 ± 0 | 0.334 ± 0.0 | −0.047 ± 1.089
Triplet Sushi | MM (no X) | 1.824 ± 0.701 | 267 ± 151 | 0.334 ± 0.0 | 0.0 ± 0.0
Triplet Sushi | Newton on θ (no X) | 2.728 ± 1.475 | 13 ± 3 | 0.334 ± 0.0 | −0.047 ± 1.089
Triplet Sushi | Newton on β | 1.966 ± 3.158 | 10 ± 19 | 0.322 ± 0.802 | −0.261 ± 0.956
Triplet Sushi | SLSQP | 1.656 ± 1.793 | 20 ± 30 | 0.608 ± 0.826 | 0.358 ± 0.928

Table 3: Evaluations on real datasets partitioned w.r.t. sample CV (c.f. Sec. 5). Time (s) and Iter. are training metrics; Top-1 Acc. and KT are measured on the test set. We report the convergence time (Time), number of iterations until convergence (Iter.), top-1 accuracy on the test set (Top-1 Acc.), and Kendall-Tau correlation on the test set (KT). ILSR, MM, and Newton on θ do not use the features X. Newton on β and SLSQP regress π from X.


For instance, compared to Newton on β, PLADMM-log requires about 2 times more iterations, but still converges 3 times faster than Newton on β on FAC. Moreover, while significantly decreasing the convergence time, PLADMM or PLADMM-log consistently attain prediction performance similar to Newton on β and SLSQP, except for Sushi, on which they perform slightly worse (by 13%−20% Top-1 Acc.), though the convergence-time dividends are striking in comparison (78%−95%).

In line with the prediction performance on synthetic datasets, the featureless methods ILSR, MM, and Newton on θ can only attain 0.5 Top-1 Acc. and 0.0 KT. By regressing the Plackett-Luce scores from features, PLADMM and PLADMM-log significantly outperform the predictions of ILSR, MM, and Newton on θ, by 3%−31% Top-1 Acc. and 5%−78% KT.

Observation CV. In observation CV, a sample can appear in both training and test observations. Hence, the featureless methods ILSR, MM, and Newton on θ should fare better than in sample CV. Nonetheless, in Table 2, as n = 1000 is larger than p = 100, there are more scores to learn than parameters. As a result, feature methods have an advantage over featureless methods in prediction. In particular, PLADMM and PLADMM-log outperform the predictions of ILSR, MM, and Newton on θ under observation CV in Table 2, by 13% Top-1 Acc. and 9% KT. The relative performance of feature vs. featureless methods is governed by the relationship among n, p, and M. We therefore explore the effect of n and p below; the effect of M is discussed in Appendix F.4. We do not include Newton on θ and MM in this analysis, as they are too slow.

Impact of p. To assess the impact of the number of parameters, we fix n = 1000, M = 250, |A_ℓ| = 2 and generate synthetic datasets with p ∈ {10, 100, 1000, 10000}. Fig. 1a shows the Time and Top-1 Acc. of PLADMM, PLADMM-log, ILSR, and Newton on β. As M = 250 observations are not enough to learn n = 1000 scores, PLADMM leads to significantly better Top-1 Acc. compared to ILSR. When p = 10, PLADMM and PLADMM-log outperform ILSR by 18%−28% Top-1 Acc. Moreover, PLADMM and PLADMM-log are consistently faster than Newton on β for all p > 100. In particular, for p = 10000, PLADMM and PLADMM-log converge 42-579 times faster than Newton on β. Interestingly, the convergence time of PLADMM-log can even decrease with increasing p, because the number of iterations until convergence decreases. While significantly decreasing the convergence time, PLADMM consistently attains better Top-1 Acc. than Newton on β, by up to 8% for p = 100.

Impact of n. To assess the impact of the number of samples, we fix p = 100, M = 250, |A_ℓ| = 2 and generate synthetic datasets with n ∈ {50, 100, 1000, 10000, 100000}; Fig. 1b shows evaluations on the resulting datasets. For n > p = 100, i.e., when there are more scores to learn than parameters, PLADMM leads to significantly better Top-1 Acc. compared to ILSR. In particular, for n = 100000, PLADMM outperforms ILSR by 25% Top-1 Acc. This confirms that, especially when the number of observations M is not sufficient to learn n scores, exploiting the features associated with the samples is crucial in attaining good prediction performance. As expected, the convergence time of Newton on β is not significantly affected by n. Despite this, PLADMM and PLADMM-log are faster than Newton on β for all n < 1000. In particular, for n = 50, PLADMM and PLADMM-log converge 2-75 times faster than Newton on β. While decreasing the convergence time, PLADMM consistently attains better Top-1 Acc. than Newton on β, by up to 19% for n = 100000.

Real datasets. We include the evaluations on real datasets partitioned w.r.t. observation CV in the Appendix (c.f. Table 4). Performance agrees with the observations above regarding the dependence on n and p. For datasets where n > M > p, e.g., FAC, PLADMM and PLADMM-log significantly outperform the predictions of ILSR, by 10% Top-1 Acc. and 25% KT. For datasets where M is much larger than n (c.f. Table 1), feature methods lead to prediction performance similar to each other and slightly lower than that of ILSR, MM, and Newton on θ. Overall, PLADMM and PLADMM-log consistently converge faster than Newton on β and SLSQP, by 3−27 times across all real datasets.

6 Conclusions

We solve the maximum likelihood estimation problem for the Plackett-Luce scores via ADMM. We show that the scores are equivalent to the stationary distribution of a Markov Chain and propose spectral algorithms, PLADMM and PLADMM-log. We model the Plackett-Luce scores as affine and logistic functions of features. Extending these to more complex models, particularly to deep neural networks, is an interesting open problem. Our approach has the potential to train a neural network over a linear penalty w.r.t. scores, where the latter are regressed efficiently via a spectral method over the quadratic pairwise ranking data. This can lead to significant improvements in training time, making an epoch linear rather than quadratic in sample size.

Acknowledgments

We are supported by NIH (R01EY019474), NSF (SCH-1622542 at MGH; SCH-1622536 at Northeastern; SCH-1622679 at OHSU), and by unrestricted departmental funding from Research to Prevent Blindness (OHSU).


Bibliography

Agarwal, A., Patil, P., and Agarwal, S. (2018). Accelerated spectral ranking. In International Conference on Machine Learning, pages 70–79.

Agarwal, S. (2016). On ranking and choice models. In IJCAI, pages 4050–4053.

Ammar, A. and Shah, D. (2011). Ranking: Compare, don't score. In 2011 49th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 776–783. IEEE.

Ataer-Cansızoğlu, E. (2015). Retinal image analytics: A complete framework from segmentation to diagnosis. Northeastern University.

Azari, H., Parks, D., and Xia, L. (2012). Random utility theory for social choice. In Advances in Neural Information Processing Systems, pages 126–134.

Blanchet, J., Gallego, G., and Goyal, V. (2016). A Markov chain approximation to choice modeling. Operations Research, 64(4):886–905.

Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122.

Boyd, S. and Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press.

Bradley, R. A. and Terry, M. E. (1952). Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39(3/4):324–345.

Braverman, M. and Mossel, E. (2008). Noisy sorting without resampling. In Proceedings of the 19th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 268–276. Society for Industrial and Applied Mathematics.

Burges, C., Shaked, T., Renshaw, E., Lazier, A., Deeds, M., Hamilton, N., and Hullender, G. (2005). Learning to rank using gradient descent. In Proceedings of the International Conference on Machine Learning, pages 89–96. ACM.

Caron, F. and Doucet, A. (2012). Efficient Bayesian inference for generalized Bradley–Terry models. Journal of Computational and Graphical Statistics, 21(1):174–196.

Cattelan, M. (2012). Models for paired comparison data: A review with emphasis on dependent data. Statistical Science, pages 412–433.

Chang, H., Yu, F., Wang, J., Ashley, D., and Finkelstein, A. (2016). Automatic triage for a photo series. ACM Transactions on Graphics (TOG), 35(4):148.

Chartrand, R. and Wohlberg, B. (2013). A nonconvex ADMM algorithm for group sparsity with sparse groups. In IEEE International Conference on Acoustics, Speech and Signal Processing, pages 6009–6013.

Chen, Y. and Suh, C. (2015). Spectral MLE: Top-k rank aggregation from pairwise comparisons. In International Conference on Machine Learning, pages 371–380.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255.

Dubey, A., Naik, N., Parikh, D., Raskar, R., and Hidalgo, C. A. (2016). Deep learning the city: Quantifying urban perception at a global scale. In European Conference on Computer Vision, pages 196–212. Springer.

Dwork, C., Kumar, R., Naor, M., and Sivakumar, D. (2001). Rank aggregation methods for the web. In Proceedings of the 10th International Conference on World Wide Web, pages 613–622. ACM.

Dykstra, O. (1960). Rank analysis of incomplete block designs: A method of paired comparisons employing unequal repetitions on pairs. Biometrics, 16(2):176–188.

Elo, A. E. (1978). The Rating of Chessplayers, Past and Present. Arco Pub.

Fligner, M. A. and Verducci, J. S. (1993). Probability Models and Statistical Analyses for Ranking Data, volume 80. Springer.

Gallager, R. G. (2013). Stochastic Processes: Theory for Applications. Cambridge University Press.

Guiver, J. and Snelson, E. (2009). Bayesian inference for Plackett-Luce ranking models. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 377–384. ACM.

Guo, K., Han, D., and Wu, T.-T. (2017). Convergence of alternating direction method for minimizing sum of two nonconvex functions with linear constraints. International Journal of Computer Mathematics, 94(8):1653–1669.

Guo, Y., Tian, P., Kalpathy-Cramer, J., Ostmo, S., Campbell, J. P., Chiang, M. F., Erdogmus, D., Dy, J. G., and Ioannidis, S. (2018). Experimental design under the Bradley-Terry model. In IJCAI, pages 2198–2204.

Hajek, B., Oh, S., and Xu, J. (2014). Minimax-optimal inference from partial rankings. In Advances in Neural Information Processing Systems, pages 1475–1483.

Han, B. (2018). Dateline: Deep Plackett-Luce model with uncertainty measurements. arXiv preprint arXiv:1812.05877.


Hong, M. (2018). A distributed, asynchronous, and incremental algorithm for nonconvex optimization: An ADMM approach. IEEE Transactions on Control of Network Systems, 5(3):935–945.

Horn, R. A. and Johnson, C. R. (2012). Matrix Analysis. Cambridge University Press.

Hunter, D. R. (2004). MM algorithms for generalized Bradley-Terry models. The Annals of Statistics, 32(1):384–406.

Jang, M., Kim, S., Suh, C., and Oh, S. (2017). Optimal sample complexity of m-wise data for top-k ranking. In Advances in Neural Information Processing Systems, pages 1686–1696.

Joachims, T. (2002). Optimizing search engines using clickthrough data. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 133–142. ACM.

Jolliffe, I. T. (1986). Principal component analysis and factor analysis. In Principal Component Analysis. Springer.

Kamishima, T., Hamasaki, M., and Akaho, S. (2009). A simple transfer learning method and its application to personalization in collaborative tagging. In ICDM.

Kendall, M. G. (1938). A new measure of rank correlation. Biometrika, 30(1/2):81–93.

Khetan, A. and Oh, S. (2016). Computational and statistical tradeoffs in learning to rank. In Advances in Neural Information Processing Systems, pages 739–747.

Lei, Q., Zhong, K., and Dhillon, I. S. (2016). Coordinate-wise power method. In Advances in Neural Information Processing Systems, pages 2064–2072.

Lu, T. and Boutilier, C. (2011). Learning Mallows models with pairwise preferences. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 145–152.

Mallows, C. L. (1957). Non-null ranking models. I. Biometrika, 44(1/2):114–130.

Mao, C., Pananjady, A., and Wainwright, M. J. (2018). Breaking the 1/√n barrier: Faster rates for permutation-based models in polynomial time. In Conference On Learning Theory, pages 2037–2042.

Mao, C., Weed, J., and Rigollet, P. (2017). Minimax rates and efficient algorithms for noisy sorting. arXiv preprint arXiv:1710.10388.

Marden, J. I. (2014). Analyzing and Modeling Rank Data. Chapman and Hall/CRC.

Maystre, L. and Grossglauser, M. (2015). Fast and accurate inference of Plackett-Luce models. In Advances in Neural Information Processing Systems, pages 172–180.

McFadden, D. (1973). Conditional logit analysis of qualitative choice behavior.

McFadden, D. (2000). Disaggregate behavioral travel demand's RUM side. Travel Behaviour Research, pages 17–63.

Negahban, S., Oh, S., and Shah, D. (2012). Iterative ranking from pair-wise comparisons. In Advances in Neural Information Processing Systems, pages 2474–2482.

Negahban, S., Oh, S., Thekumparampil, K. K., and Xu, J. (2018). Learning from comparisons and choices. The Journal of Machine Learning Research, 19(1):1478–1572.

Niranjan, U. and Rajkumar, A. (2017). Inductive pairwise ranking: Going beyond the n log(n) barrier. In Thirty-First AAAI Conference on Artificial Intelligence.

Nocedal, J. and Wright, S. (2006). Numerical Optimization. Springer Science & Business Media.

Pahikkala, T., Tsivtsivadze, E., Airola, A., Järvinen, J., and Boberg, J. (2009). An efficient algorithm for learning to rank from preference graphs. Machine Learning, 75(1):129–165.

Plackett, R. L. (1975). The analysis of permutations. Applied Statistics, pages 193–202.

Ragain, S. and Ugander, J. (2016). Pairwise choice Markov chains. In Advances in Neural Information Processing Systems, pages 3198–3206.

Rajkumar, A. and Agarwal, S. (2014). A statistical convergence perspective of algorithms for rank aggregation from pairwise data. In International Conference on Machine Learning, pages 118–126.

Rajkumar, A. and Agarwal, S. (2016). When can we rank well from comparisons of O(n log(n)) non-actively chosen pairs? In Conference on Learning Theory, pages 1376–1401.

Ryzin, G. v. and Mahajan, S. (1999). On the relationship between inventory costs and variety benefits in retail assortments. Management Science, 45(11):1496–1509.

Saha, A. and Rajkumar, A. (2018). Ranking with features: Algorithm and a graph theoretic analysis. arXiv preprint arXiv:1808.03857.

Shah, N., Balakrishnan, S., Guntuboyina, A., and Wainwright, M. (2016a). Stochastically transitive models for pairwise comparisons: Statistical and computational issues. In International Conference on Machine Learning, pages 11–20.

Shah, N. B., Balakrishnan, S., Bradley, J., Parekh, A., Ramchandran, K., and Wainwright, M. J. (2016b). Estimation from pairwise comparisons: Sharp minimax bounds with topology dependence. The Journal of Machine Learning Research, 17(1):2049–2095.

Soufiani, H. A., Chen, W., Parkes, D. C., and Xia, L. (2013). Generalized method-of-moments for rank aggregation. In Advances in Neural Information Processing Systems, pages 2706–2714.

Sun, W.-T., Chao, T.-H., Kuo, Y.-H., and Hsu, W. H. (2017). Photo filter recommendation by category-aware aesthetic learning. IEEE Transactions on Multimedia, 19(8):1870–1880.

Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. (2015). Going deeper with convolutions. In Computer Vision and Pattern Recognition, 2015.

Thurstone, L. L. (1927). The method of paired comparisons for social values. The Journal of Abnormal and Social Psychology, 21(4):384.

Tian, P., Guo, Y., Kalpathy-Cramer, J., Ostmo, S., Campbell, J. P., Chiang, M. F., Dy, J., Erdogmus, D., and Ioannidis, S. (2019). A severity score for retinopathy of prematurity. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1809–1819. ACM.

Vojnovic, M. and Yun, S. (2016). Parameter estimation for generalized Thurstone choice models. In International Conference on Machine Learning, pages 498–506.

Wang, Y., Yin, W., and Zeng, J. (2019). Global convergence of ADMM in nonconvex nonsmooth optimization. Journal of Scientific Computing, 78(1):29–63.

Wauthier, F., Jordan, M., and Jojic, N. (2013). Efficient ranking from pairwise comparisons. In International Conference on Machine Learning, pages 109–117.

Yıldız, İ., Tian, P., Dy, J., Erdoğmuş, D., Brown, J., Kalpathy-Cramer, J., Ostmo, S., Campbell, J. P., Chiang, M. F., and Ioannidis, S. (2019). Classification and comparison via neural networks. Neural Networks.

Zeng, J., Ouyang, S., Lau, T. T.-K., Lin, S., and Yao, Y. (2018). Global convergence in deep learning with variable splitting via the Kurdyka-Łojasiewicz property. arXiv preprint arXiv:1803.00225.


A On the Convexity of the PL Negative Log-Likelihood

The Hessian of Eq. (3), given by Eq. (48), is not in general positive semidefinite (PSD) (Hunter, 2004). A simple counterexample is as follows: consider n = 2 samples and a single observation, i.e., M = 1. The Hessian in this case has a negative eigenvalue for all π_1, π_2 > 0. Thus, Problem (4) with objective (3) is in general non-convex in (β, b) ∈ R^{p+1}. On the other hand, (3) under the parametrization π_i = e^{θ_i}, i ∈ N, is convex as a consequence of the well-known convexity of the log of a sum of exponentials (see Boyd and Vandenberghe (2004)). The convexity of Problem (5) w.r.t. β follows from this observation and the fact that the composition of a convex function with an affine map is convex.
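A quick numerical check of the counterexample (a sketch; we evaluate the Hessian of L(D | π) = log(π_1 + π_2) − log π_1 for a single pairwise observation in which sample 1 is chosen over sample 2):

```python
import numpy as np

pi1, pi2 = 1.0, 1.0
S = pi1 + pi2
# Hessian of log(pi1 + pi2) - log(pi1) w.r.t. (pi1, pi2).
H = np.array([[-1 / S**2 + 1 / pi1**2, -1 / S**2],
              [-1 / S**2,              -1 / S**2]])
print(np.linalg.eigvalsh(H))   # one eigenvalue is negative, so L is not convex in pi
```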

B Proof of Theorem 3.1 (Maystre and Grossglauser, 2015)

We start by showing that ∂L(D|π)/∂π_i = 0, i ∈ N, is the optimality condition for minimizing Eq. (3). Consider the reparametrization π_i = e^{θ_i}, i ∈ N. Eq. (3) under this reparametrization is given by:

$$L(D \,|\, \theta) = \sum_{\ell=1}^{M} \Big[ \log \sum_{j \in A_\ell} e^{\theta_j} - \theta_{c_\ell} \Big], \qquad (21)$$

which is convex w.r.t. θ = [θ_i]_{i∈N}, i.e., even though Eq. (3) is not convex w.r.t. π, it is convex under the reparametrization π_i = e^{θ_i}, i ∈ N. This implies that ∂L(D|θ)/∂θ_i = 0, i ∈ N, is the optimality condition for minimizing Eq. (21) w.r.t. θ. By the chain rule, this condition can be written in terms of π_i = e^{θ_i}, i ∈ N, as:

$$\frac{\partial L(D \,|\, \theta)}{\partial \theta_i} = \frac{\partial L(D \,|\, \pi)}{\partial \pi_i}\, e^{\theta_i} = 0 \quad \forall i \in N. \qquad (22)$$

Note that e^{θ_i} > 0, i ∈ N. Then, ∂L(D|θ)/∂θ_i = 0 is equivalent to ∂L(D|π)/∂π_i = 0, i ∈ N, i.e., π_i = e^{θ_i}, i ∈ N, satisfies Eq. (22) if and only if θ_i, i ∈ N, is the minimizer of Eq. (21). Hence, the stationarity condition ∂L(D|π)/∂π_i = 0, i ∈ N, is also the optimality condition for problem (6).

The optimality condition is given explicitly by:

$$\frac{\partial L}{\partial \pi_i} = \sum_{\ell \in W_i} \Big( \frac{1}{\sum_{t \in A_\ell} \pi_t} - \frac{1}{\pi_i} \Big) + \sum_{\ell \in L_i} \frac{1}{\sum_{t \in A_\ell} \pi_t} = 0 \quad \forall i \in N, \qquad (23)$$

where W_i = {ℓ | i ∈ A_ℓ, c_ℓ = i} is the set of observations where sample i ∈ N is chosen and L_i = {ℓ | i ∈ A_ℓ, c_ℓ ≠ i} is the set of observations where sample i ∈ N is not chosen. Multiplying both sides of Eq. (23) by π_i, i ∈ N, we have:

$$\sum_{\ell \in L_i} \frac{\pi_i}{\sum_{t \in A_\ell} \pi_t} - \sum_{\ell \in W_i} \frac{\sum_{j \neq i \in A_\ell} \pi_j}{\sum_{t \in A_\ell} \pi_t} = 0, \qquad (24)$$

for all i ∈ N. Note that $\sum_{\ell \in W_i} \sum_{j \neq i \in A_\ell} \cdot = \sum_{j \neq i} \sum_{\ell \in W_i \cap L_j} \cdot$ and $\sum_{\ell \in L_i} \cdot = \sum_{j \neq i} \sum_{\ell \in W_j \cap L_i} \cdot$. Accordingly, we rewrite Eq. (24) as:

$$\sum_{j \neq i} \sum_{\ell \in W_j \cap L_i} \frac{\pi_i}{\sum_{t \in A_\ell} \pi_t} = \sum_{j \neq i} \sum_{\ell \in W_i \cap L_j} \frac{\pi_j}{\sum_{t \in A_\ell} \pi_t} \quad \forall i \in N. \qquad (25)$$

Then, an optimal solution π ∈ R^n_+ to Eq. (6) satisfies:

$$\sum_{j \neq i} \pi_j \lambda_{ji}(\pi) = \sum_{j \neq i} \pi_i \lambda_{ij}(\pi) \quad \forall i \in N, \qquad (26)$$

where λ_{ji}(π), i, j ∈ N, i ≠ j, are given by Eq. (8).
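To make the role of Eq. (26) concrete, the sketch below builds the implied Markov chain for a toy dataset and iterates to its stationary distribution, in the spirit of ILSR. The rate convention (each observation contributes a rate of 1/Σ_{t∈A_ℓ} π_t from every losing alternative toward the chosen sample) and the dataset format are assumptions made here for illustration; they are consistent with Eqs. (25)-(26), but this is not the paper's implementation.

    import numpy as np

    # Each observation is assumed to be (winner, alternatives).
    def ilsr(n, observations, n_iter=100):
        pi = np.ones(n) / n
        for _ in range(n_iter):
            rates = np.zeros((n, n))                # rates[j, i]: transition rate j -> i
            for winner, alts in observations:
                inv = 1.0 / pi[list(alts)].sum()
                for j in alts:
                    if j != winner:
                        rates[j, winner] += inv     # loser j flows toward the winner
            Q = rates.copy()
            np.fill_diagonal(Q, -Q.sum(axis=1))     # generator matrix: rows sum to zero
            _, _, vt = np.linalg.svd(Q.T)           # stationary pi spans the null space of Q^T
            pi = np.abs(vt[-1])
            pi /= pi.sum()
        return pi

    # Toy example: 3 samples, pairwise comparisons in which every sample wins at least once.
    obs = [(0, (0, 1)), (1, (1, 2)), (2, (2, 0)), (0, (0, 2))]
    print(ilsr(3, obs))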

C Alternating Directions Method of Multipliers

We employ the Alternating Directions Method of Multipliers (ADMM) to solve the problem in Eq. (11) (Boyd et al., 2011). ADMM is a primal-dual algorithm designed for problems with decoupled objectives, i.e., objectives that can be written as a sum of functions where each function depends on only one of the optimized variables. In our case, we solve Eq. (11) for β and π, and the objective L(D|π) is a function of π only.

ADMM solves a constrained optimization problem by minimizing the augmented Lagrangian rather than the standard Lagrangian. The augmented Lagrangian differs from the standard Lagrangian by an additional quadratic penalty on the equality constraint. This additional penalty is shown to greatly improve the convergence properties of the algorithm (Boyd et al., 2011). The augmented Lagrangian of Eq. (11) is:

$$L_\rho(\beta, \pi, y) = L(D \,|\, \pi) + y^T (X\beta - \pi) + \frac{\rho}{2} \| X\beta - \pi \|_2^2, \qquad (27)$$

where ρ > 0 is the penalty parameter, y ∈ R^n is the dual variable, and, with a slight abuse of notation, β = (β, b) ∈ R^{p+1} denotes the parameters augmented with the bias and X = [X | 1] ∈ R^{n×(p+1)} denotes the feature matrix augmented with an all-ones column, so that π = Xβ.

ADMM alternates between optimizing the primal variables β and π, and the dual variable y. Applying ADMM to problem (11) yields the following iterative algorithm:

$$\beta^{k+1} = \arg\min_{\beta \in \mathbb{R}^{p+1}} \; {y^k}^T (X\beta - \pi^k) + \frac{\rho}{2} \| X\beta - \pi^k \|_2^2 = (X^T X)^{-1} X^T \Big( \pi^k - \tfrac{1}{\rho} y^k \Big), \qquad (28a)$$

$$\pi^{k+1} = \arg\min_{\pi \in \mathbb{R}^n_+} \; \Big( L(D \,|\, \pi) + {y^k}^T (X\beta^{k+1} - \pi) + \frac{\rho}{2} \| X\beta^{k+1} - \pi \|_2^2 \Big), \qquad (28b)$$

$$y^{k+1} = y^k + \rho (X\beta^{k+1} - \pi^{k+1}). \qquad (28c)$$

For convenience in calculations, the augmented Lagrangian in (27) can be written in a different form, by introducing a scaled dual variable u = (1/ρ) y and combining the linear and quadratic terms. By doing so, Eq. (27) is equivalent to the final form of the augmented Lagrangian:

$$L_\rho(\beta, \pi, u) = L(D \,|\, \pi) + \frac{\rho}{2} \| X\beta - \pi + u \|_2^2 - \frac{\rho}{2} \| u \|_2^2. \qquad (29)$$

Having formed the final augmented Lagrangian in Eq. (29), applying ADMM to problem (11) yields the iterative steps:

$$\beta^{k+1} = \arg\min_{\beta \in \mathbb{R}^{p+1}} \; \| X\beta - \pi^k + u^k \|_2^2 = (X^T X)^{-1} X^T (\pi^k - u^k), \qquad (30a)$$

$$\pi^{k+1} = \arg\min_{\pi \in \mathbb{R}^n_+} \; \Big( L(D \,|\, \pi) + \frac{\rho}{2} \| X\beta^{k+1} - \pi + u^k \|_2^2 \Big), \qquad (30b)$$

$$u^{k+1} = u^k + X\beta^{k+1} - \pi^{k+1}. \qquad (30c)$$

For convex problems, ADMM has well-established convergence properties. If the objective is closed, proper, and convex, and the standard Lagrangian of the problem has a saddle point, then the ADMM iterations are guaranteed to converge to a point where (a) the equality constraint is satisfied, and (b) the objective and the dual variable attain optimal values. Moreover, in many applications, ADMM has been shown to converge to modest accuracy within a few tens of iterations (Boyd et al., 2011). For nonconvex problems, there are few convergence analyses for ADMM, and these focus on restricted classes of problems (Guo et al., 2017). In general, ADMM is not guaranteed to converge for non-convex problems, and even if it does, it may not converge to the optimal point of the problem. Nevertheless, ADMM is extensively used to solve nonconvex problems similar to the one we study (Chartrand and Wohlberg, 2013; Guo et al., 2017; Hong, 2018; Wang et al., 2019).
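The following is a generic numpy/scipy sketch of the scaled-form iterations (30a)-(30c). The negative log-likelihood nll is assumed to be supplied by the caller, and the π-step uses an off-the-shelf solver purely for illustration; PLADMM instead replaces that step with the spectral (stationary-distribution) update.

    import numpy as np
    from scipy.optimize import minimize

    def admm(X, nll, rho=1.0, n_iter=50):
        n, d = X.shape
        pi = np.ones(n) / n
        u = np.zeros(n)
        beta = np.zeros(d)
        for _ in range(n_iter):
            # (30a): closed-form least-squares update of beta.
            beta, *_ = np.linalg.lstsq(X, pi - u, rcond=None)
            # (30b): minimize L(D | pi) + (rho/2) ||X beta - pi + u||^2 over pi > 0.
            obj = lambda p: nll(p) + 0.5 * rho * np.sum((X @ beta - p + u) ** 2)
            pi = minimize(obj, pi, bounds=[(1e-6, None)] * n).x
            # (30c): dual update.
            u = u + X @ beta - pi
        return beta, pi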

D Proofs

D.1 Proof of Lemma 4.1

At the k-th iteration of ADMM, the gradient of the augmented Lagrangian in (29) w.r.t. π is:

$$\nabla_{\pi} L_\rho(\beta^{k+1}, \pi, u^k) = \nabla_{\pi} L + \rho (\pi - X\beta^{k+1} - u^k). \qquad (31)$$

To simplify the rest of the calculations, we introduce σ = ρ(π − Xβ^{k+1} − u^k) ∈ R^n. Then, the stationarity condition ∂L_ρ(β^{k+1}, π, u^k)/∂π_i = 0, i ∈ N, is equivalent to:

$$\frac{\partial L_\rho(\beta^{k+1}, \pi, u^k)}{\partial \pi_i} = \frac{\partial L}{\partial \pi_i} + \sigma_i = 0 \quad \forall i \in N. \qquad (32)$$

Substituting ∂L/∂π_i from Eq. (23) into Eq. (32), we have:

$$\frac{\partial L_\rho(\beta^{k+1}, \pi, u^k)}{\partial \pi_i} = \sum_{\ell \in W_i} \Big( -\frac{1}{\pi_i} + \frac{1}{\sum_{t \in A_\ell} \pi_t} \Big) + \sum_{\ell \in L_i} \frac{1}{\sum_{t \in A_\ell} \pi_t} + \sigma_i = 0, \qquad (33)$$

for all i ∈ N. Multiplying both sides of Eq. (33) by −π_i, i ∈ N, we have:

$$\sum_{\ell \in W_i} \frac{\sum_{j \neq i \in A_\ell} \pi_j}{\sum_{t \in A_\ell} \pi_t} - \sum_{\ell \in L_i} \frac{\pi_i}{\sum_{t \in A_\ell} \pi_t} - \pi_i \sigma_i = 0 \quad \forall i \in N. \qquad (34)$$

Note that $\sum_{\ell \in W_i} \sum_{j \neq i \in A_\ell} \cdot = \sum_{j \neq i} \sum_{\ell \in W_i \cap L_j} \cdot$ and $\sum_{\ell \in L_i} \cdot = \sum_{j \neq i} \sum_{\ell \in W_j \cap L_i} \cdot$. Accordingly, we rewrite Eq. (34) as:

$$\sum_{j \neq i} \sum_{\ell \in W_i \cap L_j} \frac{\pi_j}{\sum_{t \in A_\ell} \pi_t} - \sum_{j \neq i} \sum_{\ell \in W_j \cap L_i} \frac{\pi_i}{\sum_{t \in A_\ell} \pi_t} - \pi_i \sigma_i = 0 \quad \forall i \in N. \qquad (35)$$

Then, the stationarity condition ∂L_ρ(β^{k+1}, π, u^k)/∂π_i = 0, i ∈ N, is equivalent to:

$$\sum_{j \neq i} \pi_j \lambda_{ji}(\pi) - \sum_{j \neq i} \pi_i \lambda_{ij}(\pi) = \pi_i \sigma_i \quad \forall i \in N, \qquad (36)$$

where λ_{ji}(π), i, j ∈ N, i ≠ j, are given by Eq. (8).

D.2 Proof of Theorem 4.2

Summing Eq. (15) over i ∈ N, we get:

$$\sum_i \sum_j \big( \pi_j \lambda_{ji}(\pi) - \pi_i \lambda_{ij}(\pi) \big)\, \mathbb{1}_{j \neq i} = \sum_i \pi_i \sigma_i = 0. \qquad (37)$$


Since the Plackett-Luce scores are non-negative, i.e., π_i ≥ 0, i ∈ N, Eq. (37) implies that σ ≡ [σ_i]_{i∈N} contains both positive and negative elements. Let (N+, N−) be a partition of N such that σ_i ≥ 0 for all i ∈ N+ and σ_i < 0 for all i ∈ N−. Then, for i ∈ N+ in Eq. (15), we have:

$$\sum_{j \neq i} \pi_j \lambda_{ji}(\pi) = \pi_i \Big( \sum_{j \neq i} \lambda_{ij}(\pi) + \sigma_i \Big), \quad \forall i \in N_+, \qquad (38)$$

where λ_{ij}(π) + σ_i ≥ 0 for i ∈ N+ and j ∈ N. Eq. (38) shows that, from each state i ∈ N+ into the states in N−, there is a total of σ_i "additional outgoing rate" compared to Eq. (7). At the same time, for i ∈ N− in Eq. (15), we have:

$$\sum_{j \in N_+} \pi_j \lambda_{ji}(\pi) + \sum_{j \in N_-,\, j \neq i} \pi_j \lambda_{ji}(\pi) = \pi_i \sum_{j \neq i} \lambda_{ij}(\pi) + \pi_i \sigma_i, \quad \forall i \in N_-. \qquad (39)$$

Since π_i σ_i < 0 for i ∈ N−, we distribute these terms into the first sum on the left-hand side. Then, Eq. (39) is equivalent to:

$$\sum_{j \in N_+} \pi_j \Big( \lambda_{ji}(\pi) - \frac{\pi_i \sigma_i c_j}{\pi_j} \Big) + \sum_{j \in N_-,\, j \neq i} \pi_j \lambda_{ji}(\pi) = \pi_i \sum_{j \neq i} \lambda_{ij}(\pi), \quad \forall i \in N_-, \qquad (40)$$

where $\sum_{j \in N_+} c_j = 1$.

To determine the {c_j}_{j∈N+}, recall from Eq. (38) that from each state j ∈ N+ into the states i ∈ N− there is a total of σ_j additional outgoing rate. Then, Eq. (40) implies that $\sum_{i \in N_-} -\frac{\pi_i \sigma_i c_j}{\pi_j} = \sigma_j$, i.e., $c_j = \frac{-\pi_j \sigma_j}{\sum_{i \in N_-} \pi_i \sigma_i}$, j ∈ N+. Using $-\sum_{i \in N_-} \pi_i \sigma_i = \sum_{i \in N_+} \pi_i \sigma_i$ from Eq. (37), we confirm that $\sum_{j \in N_+} c_j = \frac{-\sum_{j \in N_+} \pi_j \sigma_j}{\sum_{i \in N_-} \pi_i \sigma_i} = 1$, and rewrite {c_j}_{j∈N+} as:

$$c_j = \frac{-2 \pi_j \sigma_j}{\sum_{i \in N_-} \pi_i \sigma_i - \sum_{i \in N_+} \pi_i \sigma_i}, \quad \forall j \in N_+. \qquad (41)$$

Finally, substituting {c_j}_{j∈N+} into Eq. (40), we have:

$$\sum_{j \in N_+} \pi_j \Big( \lambda_{ji}(\pi) + \frac{2 \pi_i \sigma_i \sigma_j}{\sum_{t \in N_-} \pi_t \sigma_t - \sum_{t \in N_+} \pi_t \sigma_t} \Big) + \sum_{j \in N_-,\, j \neq i} \pi_j \lambda_{ji}(\pi) = \pi_i \sum_{j \neq i} \lambda_{ij}(\pi), \quad \forall i \in N_-, \qquad (42)$$

where $\lambda_{ji}(\pi) + \frac{2 \pi_i \sigma_i \sigma_j}{\sum_{t \in N_-} \pi_t \sigma_t - \sum_{t \in N_+} \pi_t \sigma_t} \geq 0$ for j ∈ N+ and i ∈ N−.

Eq. (15), partitioned as Eq. (38) and Eq. (42), constitutes the balance equations of a continuous-time MC with transition rates given by:

$$\mu_{ji}(\pi) = \begin{cases} \lambda_{ji}(\pi) + \dfrac{2 \pi_i \sigma_i \sigma_j}{\sum_{t \in N_-} \pi_t \sigma_t - \sum_{t \in N_+} \pi_t \sigma_t} & \text{if } j \in N_+ \text{ and } i \in N_-, \\ \lambda_{ji}(\pi) & \text{otherwise.} \end{cases} \qquad (43)$$

Hence, π is the stationary distribution of this MC (Gallager, 2013).
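This construction can be checked numerically. The sketch below draws a random rate matrix and scores, defines σ so that Eq. (15) holds by construction, forms the modified rates of Eq. (43), and verifies that π lies in the left null space of the resulting generator; the random instance is of course an illustration, not part of the proof.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 5
    lam = rng.uniform(0.1, 1.0, size=(n, n))   # lam[j, i]: rate from j to i
    np.fill_diagonal(lam, 0.0)
    pi = rng.uniform(0.2, 1.0, size=n)
    pi /= pi.sum()

    # sigma_i = (sum_j pi_j lam_ji - pi_i sum_j lam_ij) / pi_i satisfies Eq. (15).
    sigma = (pi @ lam - pi * lam.sum(axis=1)) / pi

    pos, neg = sigma >= 0, sigma < 0           # the partition (N+, N-)
    denom = pi[neg] @ sigma[neg] - pi[pos] @ sigma[pos]

    mu = lam.copy()
    for j in np.where(pos)[0]:                 # extra rate only from N+ into N- (Eq. (43))
        for i in np.where(neg)[0]:
            mu[j, i] += 2.0 * pi[i] * sigma[i] * sigma[j] / denom

    Q = mu.copy()
    np.fill_diagonal(Q, -Q.sum(axis=1))        # generator matrix
    print(np.allclose(pi @ Q, 0.0))            # True: pi is the stationary distribution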

D.3 Proof of Theorem 4.3

We use the following definition.

Definition D.1 (Diagonal dominance). A matrix H is diagonally dominant if $|H_{ii}| \geq \sum_{j \neq i} |H_{ij}|$, i ∈ N, i.e., for every row, the magnitude of the diagonal element is at least the sum of the magnitudes of all off-diagonal elements (Horn and Johnson, 2012).

Eq. (33) is equivalent to:

$$\frac{\partial L_\rho(\beta^{k+1}, \pi, u^k)}{\partial \pi_i} = -\sum_{\ell \in W_i} \frac{1}{\pi_i} + \sum_{\ell \,|\, i \in A_\ell} \frac{1}{\sum_{t \in A_\ell} \pi_t} + \rho \big( \pi - X\beta^{k+1} - u^k \big)_i, \qquad (44)$$

for all i ∈ N. At the k-th iteration of (13), let ∇²L_ρ(β^{k+1}, π, u^k) be the Hessian of the augmented Lagrangian w.r.t. π. Differentiating Eq. (44) w.r.t. π_j, ∇²L_ρ(β^{k+1}, π, u^k) has the following form:

$$\nabla^2_{ij} L_\rho(\beta^{k+1}, \pi, u^k) = \begin{cases} \displaystyle \sum_{\ell \in W_i} \frac{1}{\pi_i^2} - \sum_{\ell \,|\, i \in A_\ell} \frac{1}{(\sum_{t \in A_\ell} \pi_t)^2} + \rho, & i = j, \\[2mm] \displaystyle -\sum_{\ell \,|\, i, j \in A_\ell} \frac{1}{(\sum_{t \in A_\ell} \pi_t)^2}, & i \neq j. \end{cases} \qquad (45)$$

Consider $\rho \geq \frac{2}{\varepsilon^2} \max_i \sum_{\ell \,|\, i \in A_\ell} \frac{1}{|A_\ell|^2}$. By Assumption 4.1, we have:

$$\rho \geq \frac{2}{\varepsilon^2} \sum_{\ell \,|\, i \in A_\ell} \frac{1}{|A_\ell|^2} \quad \forall i \in N,$$

$$\Leftrightarrow \quad \rho \geq \sum_{\ell \,|\, i \in A_\ell} \frac{2}{(\sum_{t \in A_\ell} \pi_t)^2} \quad \forall i \in N,$$

$$\Leftrightarrow \quad \rho + \sum_{\ell \in W_i} \frac{1}{\pi_i^2} > \sum_{\ell \,|\, i \in A_\ell} \frac{2}{(\sum_{t \in A_\ell} \pi_t)^2} \quad \forall i \in N, \qquad (46a)$$

$$\Leftrightarrow \quad \rho + \sum_{\ell \in W_i} \frac{1}{\pi_i^2} > \sum_{\ell \,|\, i \in A_\ell} \frac{1}{(\sum_{t \in A_\ell} \pi_t)^2} + \sum_{j \neq i} \sum_{\ell \,|\, i, j \in A_\ell} \frac{1}{(\sum_{t \in A_\ell} \pi_t)^2} \quad \forall i \in N. \qquad (46b)$$

Eq. (46a) implies that all diagonal elements of ∇²L_ρ(β^{k+1}, π, u^k) are positive. Also, by Eq. (46b), ∇²L_ρ(β^{k+1}, π, u^k) is diagonally dominant (c.f. Definition D.1). Thus, ∇²L_ρ(β^{k+1}, π, u^k) is positive definite (Horn and Johnson, 2012), i.e., L_ρ(β^{k+1}, π, u^k) is convex w.r.t. π. As a result, under Assumption 4.1, for $\rho \geq \frac{2}{\varepsilon^2} \max_i \sum_{\ell \,|\, i \in A_\ell} \frac{1}{|A_\ell|^2}$, a stationary π > 0 satisfying condition (14) is also a minimizer of step (13b).
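A small numerical illustration of this argument follows (a sketch with an assumed toy dataset, taking ε = min_i π_i): once ρ reaches the stated threshold, the Hessian in Eq. (45) has positive diagonal, is diagonally dominant, and indeed has only positive eigenvalues.

    import numpy as np

    def hessian(pi, observations, rho):
        n = len(pi)
        H = np.zeros((n, n))
        for winner, alts in observations:
            s = pi[list(alts)].sum()
            H[winner, winner] += 1.0 / pi[winner] ** 2   # sum over l in W_i of 1/pi_i^2
            for i in alts:
                for j in alts:
                    H[i, j] -= 1.0 / s ** 2              # -1/(sum_t pi_t)^2 terms of Eq. (45)
        return H + rho * np.eye(n)

    obs = [(0, (0, 1)), (1, (1, 2)), (2, (2, 0)), (0, (0, 2))]   # assumed toy data
    pi = np.array([0.5, 0.3, 0.2])
    eps = pi.min()
    counts = np.zeros(3)
    for _, alts in obs:
        for i in alts:
            counts[i] += 1.0 / len(alts) ** 2
    rho = 2.0 / eps ** 2 * counts.max()                          # threshold of Theorem 4.3
    print(np.all(np.linalg.eigvalsh(hessian(pi, obs, rho)) > 0)) # True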

D.4 Proof of Theorem 4.4

We make use of the following lemmas.

Lemma D.1 (Zeng et al. (2018)). Logarithms and polynomials are Kurdyka–Łojasiewicz (KL) functions. Moreover, sums, products, compositions, and quotients (with denominator bounded away from 0) of KL functions are also KL.

Lemma D.2 (Guo et al. (2017)). Consider the optimization problem:

$$\underset{\beta,\, \pi}{\text{minimize}} \;\; g(\pi) \quad \text{subject to} \quad X\beta = \pi, \qquad (47)$$

and solve Eq. (47) via the Alternating Directions Method of Multipliers (ADMM) (Boyd et al., 2011). Let {(π^k, u^k, β^k)}_{k∈N} be the sequence generated by the ADMM algorithm, and let ρ be the penalty parameter of ADMM. Suppose that there exists κ > 0 such that X^T X ⪰ κI, and that the sequence {(π^k, u^k, β^k)}_{k∈N} is bounded.

If there exist solutions for the minimization steps of ADMM w.r.t. both π and β, g(π) is a continuously differentiable function with an L-Lipschitz continuous gradient at π^k, k ∈ N, where L > 0, and the augmented Lagrangian of Eq. (47) is a KL function, then, for ρ > 2L, {(π^k, u^k, β^k)}_{k∈N} converges to a point that satisfies the Karush-Kuhn-Tucker (KKT) conditions of Eq. (47).

To begin with, there exist solutions for the minimization steps in (13): the β update has the closed-form solution given by Eq. (13a), and the π update admits a minimizer for large enough ρ by Theorem 4.3.

By Assumption 4.1, ∇_π L given by Eq. (23) exists, i.e., L is continuously differentiable at π^k, k ∈ N, generated by (13b). Let ∇²L be the Hessian of L. Differentiating Eq. (23) w.r.t. π_j, ∇²L has the following form:

$$\nabla^2_{ij} L = \begin{cases} \displaystyle \sum_{\ell \in W_i} \frac{1}{\pi_i^2} - \sum_{\ell \,|\, i \in A_\ell} \frac{1}{(\sum_{t \in A_\ell} \pi_t)^2}, & i = j, \\[2mm] \displaystyle -\sum_{\ell \,|\, i, j \in A_\ell} \frac{1}{(\sum_{t \in A_\ell} \pi_t)^2}, & i \neq j. \end{cases} \qquad (48)$$

Consider $L = \frac{\max_i |W_i|}{\varepsilon^2}$, where W_i is the set of observations where sample i ∈ N is chosen. By Assumption 4.1, we have:

$$L = \frac{\max_i |W_i|}{\varepsilon^2} \geq \sum_{\ell \in W_i} \frac{1}{\pi_i^2} \quad \forall i \in N,$$

$$\Leftrightarrow \quad L - \sum_{\ell \in W_i} \frac{1}{\pi_i^2} + \sum_{\ell \,|\, i \in A_\ell} \frac{1}{(\sum_{t \in A_\ell} \pi_t)^2} \geq \sum_{\ell \,|\, i \in A_\ell} \frac{1}{(\sum_{t \in A_\ell} \pi_t)^2} \quad \forall i \in N,$$

$$\Leftrightarrow \quad L - \sum_{\ell \in W_i} \frac{1}{\pi_i^2} + \sum_{\ell \,|\, i \in A_\ell} \frac{1}{(\sum_{t \in A_\ell} \pi_t)^2} \geq \sum_{j \neq i} \sum_{\ell \,|\, i, j \in A_\ell} \frac{1}{(\sum_{t \in A_\ell} \pi_t)^2} \quad \forall i \in N. \qquad (49)$$

Now, consider the matrix L I_{n×n} − ∇²L. By Eq. (49), L I_{n×n} − ∇²L is diagonally dominant (c.f. Definition D.1) and all of its diagonal elements are positive, i.e., ∇²L is upper bounded by L I_{n×n}. Thus, the objective function of Eq. (11), i.e., L, has an L-Lipschitz continuous gradient at π^k, k ∈ N, where L = max_i |W_i| / ε² > 0.

Moreover, the augmented Lagrangian given by Eq. (29) is a sum of three functions: the logarithm of a ratio of two polynomials whose denominator is bounded away from 0 for all π^k, k ∈ N, by Assumption 4.1, and two other polynomial functions. By Lemma D.1, these three functions and their sum are KL on the set {π^k | π^k_i > ε, i ∈ N, k ∈ N}. As a result, the augmented Lagrangian of Eq. (11) is a KL function. Putting it all together, by Lemma D.2, for ρ > 2 max_i |W_i| / ε², the sequence {(π^k, u^k, β^{k+1})}_{k∈N} generated by (13) converges to a point that satisfies the KKT conditions (Nocedal and Wright, 2006) of Problem (11).

E Extension to the Logistic Case

We describe here how to apply our approach to regress model parameters in the logistic case. Recall that Problem (5) is, in this case, convex, and can thus be solved by Newton's method. Nevertheless, we would like to accelerate its computation via a spectral method akin to ILSR. Following the steps we took in the affine case, we re-write (5) as:

$$\text{Minimize} \quad L(D \,|\, \pi) \qquad (50a)$$
$$\text{subject to:} \quad \log \pi = X\beta, \;\; \pi \geq 0, \qquad (50b)$$

where log π = [log π_i]_{i∈N} is the vector in R^n generated by applying log to π element-wise. The augmented Lagrangian corresponding to Eq. (50) is:

$$L_\rho(\beta, \pi, u) = L(D \,|\, \pi) + \frac{\rho}{2} \| X\beta - \log\pi + u \|_2^2 - \frac{\rho}{2} \| u \|_2^2, \qquad (51)$$


Algorithm 2 PLADMM-log

1: procedure ADMM(X, D = {(c_ℓ, A_ℓ) | ℓ ∈ M}, ρ)
2:   Initialize β via Eq. (55); π ← [e^{x_i^T β}]_{i∈N}; u ← 0
3:   repeat
4:     π ← ILSRX(ρ, π, X, β, u)
5:     u ← u + Xβ − log π
6:     β ← (X^T X)^{−1} X^T (log π − u)
7:   until convergence
8:   return β, π
9: end procedure

1: procedure ILSRX(ρ, π, X, β, u)
2:   repeat
3:     σ_i ← ρ(log π_i − x_i^T β − u_i)/π_i, i ∈ N
4:     Calculate M(π) = [µ_{ji}(π)]_{i,j∈N} via Eq. (16)
5:     π ← ssd(M(π))
6:   until convergence
7:   return π
8: end procedure
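A compact numpy sketch of Algorithm 2's outer loop is given below. The spectral π-update (procedure ILSRX) is abstracted behind a caller-supplied ilsrx function, assumed to return the stationary distribution of the Markov chain with rates from Eq. (16) and σ from Eq. (53); only the outer ADMM structure is shown.

    import numpy as np

    def pladmm_log(X, ilsrx, beta0, rho=1.0, n_iter=50):
        beta = beta0.copy()
        pi = np.exp(X @ beta)                 # line 2: pi_i = exp(x_i^T beta)
        u = np.zeros(X.shape[0])
        for _ in range(n_iter):
            pi = ilsrx(rho, pi, X, beta, u)   # line 4: spectral pi-update (assumed supplied)
            u = u + X @ beta - np.log(pi)     # line 5: dual update
            beta, *_ = np.linalg.lstsq(X, np.log(pi) - u, rcond=None)  # line 6
        return beta, pi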

Figure 2: Convergence time (Conv. Time) and top-1 test accuracy (Top-1 Acc.) of PLADMM, PLADMM-log, ILSR, and Newton on β evaluated on synthetic datasets vs. the number of observations M ∈ {10, 100, 1000, 10000, 100000}. Observations are partitioned w.r.t. observation CV (c.f. Sec. 5), where the number of samples is n = 1000, the number of features is p = 100, and the query size is |A_ℓ| = 2.

and applying ADMM to problem (50) yields:

$$\beta^{k+1} = \arg\min_{\beta \in \mathbb{R}^p} \; L_\rho(\beta, \pi^k, u^k) = (X^T X)^{-1} X^T (\log \pi^k - u^k), \qquad (52a)$$

$$\pi^{k+1} = \arg\min_{\pi \in \mathbb{R}^n_+} \; L(D \,|\, \pi) + \frac{\rho}{2} \| X\beta^{k+1} - \log\pi + u^k \|_2^2, \qquad (52b)$$

$$u^{k+1} = u^k + X\beta^{k+1} - \log\pi^{k+1}. \qquad (52c)$$

Mutatis mutandis, following the same manipulations as in Lemma 4.1, a stationary point of the objective in each step (52b) can be cast as the stationary distribution of the continuous-time MC with transition rates µ_{ji}(π), i, j ∈ N, given by Eq. (16), the only difference being that the vector σ = [σ_i]_{i∈N} is now given by:

$$\sigma_i = \frac{\rho (\log \pi_i - x_i^T \beta - u_i)}{\pi_i}, \quad i \in N. \qquad (53)$$

Having adjusted the transition matrix M(π) in this way, π can again be obtained by repeated iterations of (18).

The resulting algorithm, which we refer to as Plackett-Luce ADMM-log (PLADMM-log), is summarized in Algorithm 2; it is almost identical to Algorithm 1, using log π instead of π, defining σ via (53), and having a different initialization. We discuss the latter below.

Initialization. Similar to the initialization of PLADMM (c.f. Eq. (20)), we initialize β so that the initial scores obey the Plackett-Luce model, mirroring the approach of Saha and Rajkumar (2018). Defining P_ij, i, j ∈ N, the same way, and using the logistic parametrization in Sec. 3, we have that:

$$\frac{P_{ij}}{P_{ji}} = \frac{\pi_i}{\pi_j} = e^{\beta^T (x_i - x_j)}. \qquad (54)$$

Accordingly, we initialize β as:

$$\beta^0 = \arg\min_{\beta \in \mathbb{R}^p} \sum_{(i,j) \in D} \Big( \beta^T (x_i - x_j) - \log \frac{P_{ij}}{P_{ji}} \Big)^2, \qquad (55)$$

where P_ij, i, j ∈ N, are again empirical estimates obtained from dataset D. Given β^0, we generate the initial Plackett-Luce scores via the logistic parametrization π^0 = [e^{x_i^T β^0}]_{i∈N}. Finally, we initialize the dual variable as u^0 = 0.
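A sketch of the least-squares initialization in Eq. (55) is shown below, under an assumed data layout in which wins[i, j] counts how often sample i was chosen over sample j, so that P_ij/P_ji is estimated by wins[i, j]/wins[j, i]; pairs observed in only one direction are skipped to keep the empirical log-odds finite.

    import numpy as np

    def init_beta_log(X, wins):
        rows, targets = [], []
        n = X.shape[0]
        for i in range(n):
            for j in range(i + 1, n):
                if wins[i, j] > 0 and wins[j, i] > 0:
                    rows.append(X[i] - X[j])                          # beta^T (x_i - x_j)
                    targets.append(np.log(wins[i, j] / wins[j, i]))   # log(P_ij / P_ji)
        A, b = np.asarray(rows), np.asarray(targets)
        beta0, *_ = np.linalg.lstsq(A, b, rcond=None)                 # least squares of Eq. (55)
        return beta0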

F Experiments

F.1 Datasets

Synthetic Datasets. We generate the feature vectors x_i ∈ R^p, i ∈ N, from N(0, σ_x² I_{p×p}) and a common parameter vector β ∈ R^p from N(0, σ_β² I_{p×p}). Then, we generate the Plackett-Luce scores via the logistic parametrization π = [e^{x_i^T β}]_{i∈N}. We normalize the resulting scores so that 1^T π = 1. We set σ_x² = σ_β² = 0.8 in all experiments. Given π, we generate each observation in D as follows: we first select |A_ℓ| = 2 samples out of the n samples uniformly at random. Then, we generate the choice c_ℓ, ℓ ∈ M, from the Plackett-Luce model given by Eq. (1).
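The following snippet sketches this generative process (the sizes and seed are placeholders); it reproduces the logistic parametrization and the pairwise Plackett-Luce sampling described above.

    import numpy as np

    rng = np.random.default_rng(0)
    n, p, M, sigma2 = 1000, 100, 10000, 0.8

    X = rng.normal(0.0, np.sqrt(sigma2), size=(n, p))    # features ~ N(0, sigma_x^2 I)
    beta = rng.normal(0.0, np.sqrt(sigma2), size=p)      # parameters ~ N(0, sigma_beta^2 I)
    pi = np.exp(X @ beta)
    pi /= pi.sum()                                       # logistic scores, normalized

    observations = []
    for _ in range(M):
        A = rng.choice(n, size=2, replace=False)         # |A_l| = 2 alternatives
        probs = pi[A] / pi[A].sum()                      # Plackett-Luce choice probabilities (Eq. (1))
        c = rng.choice(A, p=probs)
        observations.append((c, tuple(A)))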

Filter Aesthetic Comparison (FAC). The Filter Aesthetic Comparison (FAC) dataset (Sun et al., 2017) contains 1280 unfiltered images pertaining to 8 different categories. Twenty-two different image filters are applied to each image. Labelers are provided with two filtered images and are asked to identify which image has better quality. We select n = 1000 images within one category, as only filtered image pairs within the same category are compared. The resulting dataset contains M = 728 pairwise comparisons. Moreover, for each image, we extract features via a state-of-the-art convolutional neural network architecture, namely GoogLeNet (Szegedy et al., 2015), with weights pre-trained on the ImageNet dataset (Deng et al., 2009). We select p = 50 of these features by Principal Component Analysis (Jolliffe, 1986).

Retinopathy of Prematurity (ROP). The Retinopathy of Prematurity (ROP) dataset contains n = 100 retina images with p = 143 features (Ataer-Cansızoğlu, 2015). Experts are provided with two images and are asked to choose the image with higher severity of the ROP disease. Five experts independently label 5941 image pairs; the resulting dataset contains M = 29705 pairwise comparisons. Note that some pairs are labelled more than once by different experts.

SUSHI. The SUSHI Preference dataset (Kamishima et al., 2009) contains n = 100 sushi ingredients with p = 18 features. Each of the 5000 customers independently ranks 10 ingredients according to her preferences. We select the rankings provided by 10 customers, where an ingredient is ranked higher if it precedes the other ingredients in a customer's ranked list. We generate two datasets: triplet Sushi, containing M = 1200 rankings of |A_ℓ| = 3 ingredients, and pairwise Sushi, containing M = 450 pairwise comparisons.

F.2 Algorithms

We implement four algorithms that regress Plackett-Luce scores from features, which we refer to as feature methods.

PLADMM. PLADMM solves the problem in Eq. (11) and is summarized in Algorithm 1. We compute the stationary distribution at each iteration of ILSRX (c.f. Eq. (18)) using the power method (Lei et al., 2016). As the stopping criterion, we use ‖π^k − π^{k−1}‖₂ < rtol ‖π^k‖₂ and ‖Xβ^k − Xβ^{k−1}‖₂ < rtol ‖Xβ^k‖₂. We set the relative tolerance rtol = 10^{−4} for all experiments. We use the same relative tolerance for the stopping criterion of the power method. We set ρ = 1 in our experiments, which is a standard choice in the ADMM literature (Boyd et al., 2011). In our experiments, we consistently observe that Eq. (37) is satisfied; we therefore use c_j = −π_j σ_j / Σ_{i∈N−} π_i σ_i, j ∈ N+, instead of Eq. (41) to calculate the transition rates (16).

PLADMM-log. PLADMM-log solves the problem in Eq. (50) and is summarized in Algorithm 2. As the stopping criterion, we use ‖π^k − π^{k−1}‖₂ < rtol ‖π^k‖₂ and ‖e^{Xβ^k} − e^{Xβ^{k−1}}‖₂ < rtol ‖e^{Xβ^k}‖₂, where exponentiation is applied element-wise.

SLSQP. SLSQP solves the problem in Eq. (4) via the sequential least-squares quadratic programming (SLSQP) algorithm (Nocedal and Wright, 2006). We initialize SLSQP the same way as PLADMM (c.f. Algorithm 1). As the stopping criterion, we use ‖π^k − π^{k−1}‖₂ < rtol ‖π^k‖₂, where π^k = Xβ^k + b^k 1, k ∈ N. Each iteration of SLSQP is O(Σ_{ℓ∈D} (|A_ℓ|(p+1)) + (p+1)²) for constructing the gradient of Eq. (3) w.r.t. β and updating β, respectively.

Newton on β. Newton on β solves the convex problem in Eq. (5) via Newton's method (Nocedal and Wright, 2006). We initialize Newton on β the same way as PLADMM-log (c.f. Algorithm 2). As the stopping criterion, we use ‖π^k − π^{k−1}‖₂ < rtol ‖π^k‖₂, where π^k = [e^{x_i^T β^k}]_{i∈N}, k ∈ N. Each iteration of Newton on β is O(Σ_{ℓ∈D} (|A_ℓ| p²) + p²) for constructing the Hessian of Eq. (3) w.r.t. β and updating β, respectively.

We implement three algorithms that learn the Plackett-Luce scores from the choice observations alone, which we refer to as featureless methods.

ILSR. The Iterative Luce Spectral Ranking (ILSR) algorithm solves the problem in Eq. (6) and is described by the iterations in Eq. (10). We initialize ILSR with π^0 = (1/n) 1. We compute the stationary distribution at each iteration of ILSR using the power method. As the stopping criterion, we use ‖π^k − π^{k−1}‖₂ < rtol ‖π^k‖₂. Each iteration of ILSR is O(Σ_{ℓ∈D} (|A_ℓ|) + n²) for constructing the transition matrix Λ(π) (c.f. Eq. (9)) and finding the stationary distribution π, respectively.
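The power-method step referred to above can be sketched as follows: uniformization converts the rate matrix into a stochastic matrix whose leading left eigenvector is the stationary distribution. This is a generic sketch, not the implementation of Lei et al. (2016); the rate-matrix convention (rates[j, i] is the rate from j to i) is an assumption.

    import numpy as np

    def stationary_distribution(rates, rtol=1e-4, max_iter=10000):
        n = rates.shape[0]
        Q = rates.astype(float).copy()
        np.fill_diagonal(Q, 0.0)
        np.fill_diagonal(Q, -Q.sum(axis=1))        # generator: rows sum to zero
        lam = np.abs(np.diag(Q)).max() + 1e-12
        P = np.eye(n) + Q / lam                    # uniformized stochastic matrix
        pi = np.ones(n) / n
        for _ in range(max_iter):
            new = pi @ P                           # one power-method (left) iteration
            new /= new.sum()
            if np.linalg.norm(new - pi) < rtol * np.linalg.norm(new):
                return new
            pi = new
        return pi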

MM. The Minorization-Maximization (MM) algorithm (Hunter, 2004) solves the problem in Eq. (6). We initialize MM with π^0 = (1/n) 1. As the stopping criterion, we use ‖π^k − π^{k−1}‖₂ < rtol ‖π^k‖₂. Each iteration of MM is O(Σ_{ℓ∈D} (|A_ℓ|)).

Newton on θ. The Newton on θ algorithm solves the problem in Eq. (6) by reparametrizing the scores as π_i = e^{θ_i}, i ∈ N. It solves the resulting convex problem by Newton's method (Nocedal and Wright, 2006). We initialize Newton on θ with θ^0 = [θ_i^0]_{i∈N} = 0. As the stopping criterion, we use ‖π^k − π^{k−1}‖₂ < rtol ‖π^k‖₂, where π_i^k = e^{θ_i^k}, i ∈ N, k ∈ N. Each iteration of Newton on θ is O(Σ_{ℓ∈D} (|A_ℓ|²) + n²) for constructing the Hessian of Eq. (3) w.r.t. θ and updating θ, respectively.

F.3 Top-1 Accuracy and Kendall-Tau Correlation

We measure prediction performance by top-1 accuracy (Top-1 Acc.) and Kendall-Tau correlation (KT) on the test set. Let the test set be D_choice = {(c_ℓ, A_ℓ) | ℓ ∈ {1, ..., M_test}} for the choice setting and D_rank = {(α_ℓ, A_ℓ) | ℓ ∈ {1, ..., M_test}} for the ranking setting, where α_ℓ = α_{ℓ1} ≻ α_{ℓ2} ≻ ... ≻ α_{ℓ|A_ℓ|} is an ordered sequence of the samples in A_ℓ. For both settings, given A_ℓ, we predict the ℓ-th choice as ĉ_ℓ = arg max_{i∈A_ℓ} π_i. We calculate the top-1 accuracy (Top-1 Acc.) for the choice setting as:

$$\text{Top-1 Acc.} = \frac{\sum_{\ell=1}^{M_{\text{test}}} \mathbb{1}(\hat{c}_\ell = c_\ell)}{M_{\text{test}}} \in [0, 1], \qquad (56)$$

and for the ranking setting as:

$$\text{Top-1 Acc.} = \frac{\sum_{\ell=1}^{M_{\text{test}}} \mathbb{1}(\hat{c}_\ell = \alpha_{\ell 1})}{M_{\text{test}}} \in [0, 1]. \qquad (57)$$

For the ranking setting, given A_ℓ, we also predict the ranking as α̂_ℓ = arg sort [π_i]_{i∈A_ℓ}, i.e., the sequence of the samples in A_ℓ ordered w.r.t. their scores. We calculate the Kendall-Tau correlation (KT) (Kendall, 1938) as a measure of the correlation between each true ranking α_ℓ and predicted ranking α̂_ℓ, ℓ ∈ {1, ..., M_test}. For observation ℓ, let $T_\ell = \sum_{t=1}^{|A_\ell|} \sum_{s=1}^{|A_\ell|} \mathbb{1}(\alpha_{\ell t} \succ \alpha_{\ell s} \wedge \alpha_{\ell t} \,\hat\succ\, \alpha_{\ell s})$ be the number of correctly predicted ranking positions, and $F_\ell = \sum_{t=1}^{|A_\ell|} \sum_{s=1}^{|A_\ell|} \mathbb{1}(\alpha_{\ell t} \succ \alpha_{\ell s} \wedge \alpha_{\ell s} \,\hat\succ\, \alpha_{\ell t})$ be the number of incorrectly predicted ranking positions, where ≻̂ denotes precedence in the predicted ranking α̂_ℓ. Then, KT is computed by:

$$\text{KT} = \frac{\sum_{\ell=1}^{M_{\text{test}}} (T_\ell - F_\ell) / \binom{|A_\ell|}{2}}{M_{\text{test}}} \in [-1, 1], \qquad (58)$$

where $\binom{|A_\ell|}{2}$ is the number of sample pairs in a query of size |A_ℓ|.
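Both metrics can be computed directly from Eqs. (56)-(58); the sketch below assumes each test observation is stored as (true_ranking, A_ℓ), with the true ranking listing the samples of A_ℓ from most to least preferred.

    import numpy as np
    from math import comb

    def top1_and_kt(pi, test_set):
        top1, kt = 0.0, 0.0
        for true_rank, alts in test_set:
            pred_rank = sorted(alts, key=lambda i: -pi[i])    # argsort by score
            top1 += float(pred_rank[0] == true_rank[0])       # Eq. (57) summand
            pos_t = {s: r for r, s in enumerate(true_rank)}   # smaller rank = preferred
            pos_p = {s: r for r, s in enumerate(pred_rank)}
            T = F = 0
            for a in alts:
                for b in alts:
                    if pos_t[a] < pos_t[b]:                   # a precedes b in the true ranking
                        if pos_p[a] < pos_p[b]:
                            T += 1                            # concordant pair
                        else:
                            F += 1                            # discordant pair
            kt += (T - F) / comb(len(alts), 2)                # Eq. (58) summand
        m = len(test_set)
        return top1 / m, kt / m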

F.4 Impact of Number of Observations

Fig. 2 shows the convergence time (Time) and top-1 test accuracy (Top-1 Acc.) of PLADMM, PLADMM-log, ILSR, and Newton on β when trained on synthetic datasets with the number of observations M ∈ {10, 100, 1000, 10000, 100000}. Observations are partitioned w.r.t. observation CV (c.f. Sec. 5), where the number of samples is n = 1000, the number of parameters is p = 100, and the size of each query is |A_ℓ| = 2. As n > p, PLADMM benefits from being able to regress the n scores from the smaller number of p parameters and attains significantly better Top-1 Acc. than ILSR in Fig. 2. Especially when M is not enough to learn n = 1000 scores but is enough to learn p = 100 parameters, PLADMM gains the most performance advantage over ILSR, up to 13% in Top-1 Acc. Moreover, PLADMM and PLADMM-log are consistently faster than Newton on β for all numbers of observations M > 100. In particular, for M = 100000, PLADMM and PLADMM-log converge 4-60 times faster than Newton on β.


Method | Time (s) ↓ | Iter. ↓ | Top-1 Acc. ↑ | KT ↑
(Time and Iter. are training metrics; Top-1 Acc. and KT are performance metrics on the test set. The rows below are grouped by dataset.)

FAC

PLADMM 0.352± 0.044 4± 0 0.68± 0.048 0.35± 0.089

PLADMM-log 0.17± 0.033 4± 0 0.691± 0.054 0.378± 0.11

ILSR (no X) 0.066± 0.012 2± 0 0.591± 0.067 −0.13± 0.164

MM (no X) 10.7± 0.501 500± 0 0.544± 0.046 0.046± 0.087

Newton on θ (no X) 9.152± 1.284 17± 3 0.5± 0.0 0.0± 0.0

Newton on β 1.531± 0.169 6± 1 0.701± 0.04 0.398± 0.08

SLSQP 22.73± 19.151 160± 135 0.689± 0.063 0.375± 0.125

ROP

PLADMM 1.953± 0.217 4± 0 0.896± 0.005 0.791± 0.009

PLADMM-log 0.359± 0.027 1± 0 0.904± 0.005 0.807± 0.01

ILSR (no X) 0.716± 0.058 2± 0 0.891± 0.005 0.781± 0.009

MM (no X) 356.497± 29.11 500± 0 0.905± 0.004 0.81± 0.008

Newton on θ (no X) 85.42± 6.849 9± 0 0.906± 0.004 0.811± 0.008

Newton on β 55.718± 6.293 2± 0 0.904± 0.005 0.808± 0.009

SLSQP 9.595± 7.136 2± 1 0.683± 0.049 0.366± 0.098

Pairwise Sushi

PLADMM 0.061± 0.002 4± 0 0.669± 0.034 0.338± 0.068

PLADMM-log 0.764± 1.192 58± 30 0.634± 0.075 0.267± 0.15

ILSR (no X) 0.027± 0.003 2± 0 0.763± 0.039 0.521± 0.084

MM (no X) 5.191± 0.345 490± 31 0.773± 0.048 0.543± 0.094

Newton on θ (no X) 2.342± 0.689 18± 5 0.735± 0.095 0.465± 0.185

Newton on β 0.176± 0.17 2± 2 0.685± 0.044 0.369± 0.087

SLSQP 16.198± 8.728 245± 134 0.64± 0.06 0.28± 0.119

Triplet Sushi

PLADMM 0.127± 0.007 4± 0 0.569± 0.035 0.218± 0.045

PLADMM-log 0.804± 0.349 36± 18 0.487± 0.034 0.19± 0.072

ILSR (no X) 0.054± 0.003 2± 0 0.678± 0.036 0.454± 0.06

MM (no X) 15.349± 0.617 500± 0 0.715± 0.035 0.522± 0.059

Newton on θ (no X) 5.122± 0.34 14± 1 0.73± 0.036 0.496± 0.089

Newton on β 1.12± 0.659 3± 2 0.605± 0.058 0.285± 0.062

SLSQP 21.738± 39.761 107± 197 0.521± 0.043 0.191± 0.059

Table 4: Evaluations on real datasets partitioned w.r.t. observation CV (c.f. Sec. 5). We report the convergence time in seconds (Time), the number of iterations until convergence (Iter.), the top-1 accuracy on the test set (Top-1 Acc.), and the Kendall-Tau correlation on the test set (KT). ILSR, MM, and Newton on θ learn the Plackett-Luce scores π from the choice observations alone and do not use the features X. Newton on β and sequential least-squares quadratic programming (SLSQP) regress π from X (c.f. Sec. F.2).