
Learning Theory and Algorithms for Auctioning and Adaptation Problems

by

Andrés Muñoz Medina

A dissertation submitted in partial fulfillment

of the requirements for the degree of

Doctor of Philosophy

Department of Mathematics

New York University

September 2015

Advisor: Mehryar Mohri


© Andrés Muñoz Medina. All Rights Reserved, 2015.


Dedication

To William Shaw, the limit does exist.


Acknowledgements

First and foremost, I want to thank my advisor Mehryar Mohri, who I recall once said to me that he considers each student an investment. To him I want to say: "thank you for taking a risk with me". I want to thank him for always being honest with me, for respecting my opinion as that of a colleague and, most importantly, for teaching me that research is much more than just proving theorems. Being his student was challenging, yet it is thanks to him that today I can call myself a researcher.

I want to thank my thesis committee, not only for their comments and suggestions on my thesis but for their role in my professional development: Corinna Cortes, a reader and collaborator on the subject of domain adaptation, who taught me a wide variety of tricks of the trade in the experimental evaluation of algorithms. In fact, several results in this dissertation would not have been possible without her detailed suggestions and explanations. Yishay Mansour and Claudio Gentile, who indirectly inspired my research in auctioning through their extensive results in this area and who, in spite of the distance, were willing to be part of this committee and provided me with insightful comments on my dissertation. Finally, I want to thank Esteban Tabak for being there for me and advising me at difficult times during my Ph.D., and Robert Kohn for preparing me for my qualifying exams.

Making the transition from theoretical mathematics to computer science was a major challenge for me: abstract topology does not prepare you for doing large-scale machine learning experiments. It is only due to the patience and teachings of my hosts and friends at Google, Afshin Rostamizadeh, Umar Syed, Keith Hall, Cyril Allauzen, Shankar Kumar, Kishore Papineni and Richard Zens, that I have developed the skills needed to become a research scientist. I look forward to many future collaborations with them.

I am also indebted to my lab colleagues and, nevertheless, friends: Giulia DeSalvo, Scott Yang, Marius Kloft and Vitaly Kuznetsov, for countless hours of machine learning discussions, research suggestions and talk preparations.

I want to thank my friends at NYU: Juan Calvo, Edgar Costa, Henrique Moyses and, of course, our volleyball team. These 5 years would not have been the same without our study groups and academic exchanges. But most of all, I am grateful for knowing that I was never alone in this program and that all of us had to deal with the same issues. I also want to thank my friends outside of the program for keeping me sane throughout these years. In particular Miguel Angel, Marinie and Armando, some of my oldest friends, who in spite of the distance always offered me support and advice when I most needed it.

I also want to thank my parents and my brother, who have supported me since childhood and have always believed in me. I dedicate all my achievements to you, and will continue to do so.

Finally, I want to thank Will Shaw, the love of my life. Over the past four years you have stood by my side in the good times and the bad. You have shared my successes and my failures. You have put up with my insane schedule and have always been there for me when I needed you the most; and for that, I cannot thank you enough.


Abstract

A common assumption in machine learning is that training and testing data are i.i.d. realizations from the same distribution. However, this assumption is often violated in practice: for instance, test and training distributions can be related but different, such as when training a face recognition system on a carefully curated data set which will be deployed in the real world. As a different example, consider training a spam classifier when the type of emails considered spam changes as a function of time.

The first problem described above is known as domain adaptation and the second one is called learning under drifting distributions. The first part of this thesis presents theory and algorithms for these problems. For domain adaptation, we provide tight learning bounds based on the novel concept of generalized discrepancy. These bounds strongly motivate our learning algorithm, and it is shown, both theoretically and empirically, that this algorithm can significantly improve upon the current state of the art. We extend the theoretical results of domain adaptation to the more challenging scenario of learning under drifting distributions. Moreover, we establish a deep connection between on-line learning and this problem. In particular, we provide a novel on-line to batch conversion that motivates a learning algorithm with excellent empirical performance.

The second part of this thesis studies a crucial problem at the intersection of learning and game theory: revenue optimization in repeated auctions. More precisely, we study second-price and generalized second-price auctions with reserve. These auction mechanisms have become extremely popular in recent years due to the advent of online advertisement. Both types of auctions are characterized by a reserve price representing the minimum price at which the seller is willing to part with the object in question. Therefore, selecting an optimal reserve price is crucial to achieving the largest possible revenue. We cast this problem as a learning problem and provide the first theoretical analysis of learning optimal reserve prices from samples for both second-price and generalized second-price auctions. These results, however, assume that buyers do not react strategically to changes in reserve prices. Therefore, in the last chapter of this thesis we analyze the possible strategies for buyers and show that if the seller is more patient than the buyer, it is not in the best interest of the buyer to behave strategically.


Contents

Dedication
Acknowledgements
Abstract
List of Figures
List of Tables
List of Appendices
Introduction
  0.1 Notation

I Domain Adaptation

1 Domain Adaptation
  1.1 Introduction
  1.2 Learning Scenario
  1.3 Discrepancy
  1.4 Learning Guarantees
  1.5 Algorithm
  1.6 Generalized Discrepancy
  1.7 Optimization Solution
  1.8 Experiments
  1.9 Conclusion

2 Drifting
  2.1 Introduction
  2.2 Preliminaries
  2.3 Drifting PAC Scenario
  2.4 Drifting Tracking Scenario
  2.5 On-line to Batch Conversion
  2.6 Algorithm
  2.7 Experiments
  2.8 Conclusion


II Auctions

3 Learning in Auctions
  3.1 Introduction
  3.2 Setup
  3.3 Previous Work
  3.4 Learning Guarantees
  3.5 Algorithms
  3.6 Experiments
  3.7 Conclusion

4 Generalized Second-Price Auctions
  4.1 Introduction
  4.2 Model
  4.3 Previous Work
  4.4 Learning Algorithms
  4.5 Convergence of Empirical Equilibria
  4.6 Experiments
  4.7 Conclusion

5 Learning Against Strategic Adversaries
  5.1 Introduction
  5.2 Setup
  5.3 Monotone Algorithms
  5.4 A Nearly Optimal Algorithm
  5.5 Lower Bound
  5.6 Empirical Results
  5.7 Conclusion

6 Conclusion

Appendices

Bibliography


List of Figures

1.1 Example of reweighting to make source and target distributions more similar.

1.2 Examples of favorable (c) and unfavorable (a and b) scenarios for adaptation. (a) Instance distributions are disjoint. (b) Instance distributions match but labeling functions are different; therefore, joint distributions are not similar. (c) Instance distributions can be made similar through reweighting and the labeling function is the same.

1.3 Figure depicting the difference between the L1 distance and the discrepancy. In the left figure, the L1 distance is given by twice the measure of the green rectangle. In the right figure, $P(h(x) \neq h'(x))$ is equal to the measure of the blue rectangle and $Q(h(x) \neq h'(x))$ is the measure of the purple rectangle. The two measures are equal, thus disc(P, Q) = 0.

1.4 Depiction of the distances $\eta_H(f_P, f_Q)$ and $\delta_P^1(f_P, H'')$.

1.5 Illustration of the sampling process on the set H″.

1.6 (a) Hypotheses obtained by training on source (green circles), target (red triangles) and using the DM (dashed blue) and GDM (solid blue) algorithms. (b) Objective functions for the source and target distributions as well as for the GDM and DM algorithms. The vertical lines show the minimizer for each algorithm. The set H and the surrogate hypothesis set H″ ⊆ H are shown at the bottom.

1.7 (a) MSE performance for different adaptation algorithms when adapting from kin-8fh to the three other kin-8xy domains. (b) Relative error of DM over GDM as a function of the ratio r/Λ.

2.1 Barplot of estimated discrepancies for the continuous drifting and alternating drifting scenarios.

2.2 MSE of different algorithms for the continuous drifting scenario. Different plots represent different cycle sizes: k = 100 (top-left), k = 200 (bottom-left), k = 400 (top-right) and k = 600 (bottom-right).

2.3 MSE of different algorithms for the alternating drifting scenario. Different plots represent different cycle sizes: k = 100 (top-left), k = 200 (bottom-left), k = 400 (top-right) and k = 600 (bottom-right).


3.1 (a) Plot of the loss function $r \mapsto L(r, \mathbf{b})$ for fixed values of $b^{(1)}$ and $b^{(2)}$. (b) Functions $l_1$ on the left and $l_2$ on the right.

3.2 (a) Piecewise linear convex surrogate loss $L_p$. (b) Comparison of the sum of real losses $\sum_{i=1}^{m} L(\cdot, \mathbf{b}_i)$ for m = 500 with the sum of convex surrogate losses. Note that the minimizers are significantly different.

3.3 (a) Comparison of the true loss L with the surrogate loss $L_\gamma$ on the left and the surrogate loss $L_\gamma$ on the right, for γ = 0.1. (b) Comparison of $\sum_{i=1}^{500} L(r, \mathbf{b}_i)$ and $\sum_{i=1}^{500} L_\gamma(r, \mathbf{b}_i)$.

3.4 (a) Prototypical v-function. (b) Illustration of the fact that the definition of $V_i(r, \mathbf{b}_i)$ does not change on an interval $[n_k, n_{k+1}]$.

3.5 Pseudocode of our DC-programming algorithm.

3.6 Plots of expected revenue against sample size for different algorithms: DC algorithm (DC), convex surrogate (CVX), ridge regression (Reg) and the algorithm that uses no feature to set reserve prices (NF). For (a)-(c), bids are generated with different noise standard deviations: (a) 0, (b) 0.25, (c) 0.5. The bids in (d) were generated using a generative model.

3.7 Distribution of reserve prices for each algorithm. The algorithms were trained on 800 samples using noisy bids with standard deviation 0.5.

3.8 Results on the eBay data set. Comparison of our algorithm (DC) against a convex surrogate (CVX), using no features (NF), setting no reserve (NR) and setting the reserve price to the highest bid (HB).

4.1 Depiction of the loss $L_{i,s}$. Notice that the loss in fact resembles a broken "V".

4.2 (a) Empirical verification of Assumption 2. Values were generated using a uniform distribution over [0, 1] and the parameters of the auction were N = 3, s = 2. The blue line corresponds to the quantity $\max_i \Delta\beta_i$ for different values of n. In red we plot the desired upper bound for C = 1/2.

4.3 Approximation of the empirical bidding function $\widehat\beta$ to the true solution β. The true solution is shown in red and the shaded region represents the confidence interval of $\widehat\beta$ when simulating the discrete GSP 10 times with a sample of size 200. Here N = 3, S = 2, $c_1$ = 1, $c_2$ = 0.5 and values were sampled uniformly from [0, 1].

4.4 Bidding function for our experiments in blue and the identity function in red. Since β is a strictly increasing function, it follows from (Gomes and Sweeney, 2014) that this GSP admits an equilibrium.

4.5 Comparison of methods for estimating valuations from bids. (a) Histogram of true valuations. (b) Valuations estimated under the SNE assumption. (c) Density estimation algorithm.


5.1 (a) Tree T(3) associated with the algorithm proposed in (Kleinberg and Leighton, 2003a). (b) Modified tree T′(3) with r = 2.

5.2 Comparison of the monotone algorithm and PFSr for different choices of γ and v. The regret of each algorithm is plotted as a function of the number of rounds when γ is not known to the algorithms (first two figures) and when its value is made accessible to the algorithms (last two figures).

C.1 Depiction of the envelope function.

D.1 Regret curves for PFSr and monotone for different values of v and γ. The value of γ is not known to the algorithms.

D.2 Regret curves for PFSr and monotone for different values of v and γ. The value of γ is known to both algorithms.


List of Tables

1.1 Notation table.

1.2 Adaptation from books (B), kitchen (K), electronics (E) and dvd (D) to all other domains. Normalized results: the MSE of training on the unweighted source data is equal to 1. Results in bold represent the algorithm with the lowest MSE.

1.3 Adaptation from caltech256 (C), imagenet (I), sun (S) and bing (B). Normalized results: the MSE of training on the unweighted source data is equal to 1.

2.1 Results of different algorithms in the financial data set. Bold results are significant at the 0.05 level.

4.1 Mean revenue of both our algorithms.


List of Appendices

Appendix to Chapter 1
Appendix to Chapter 3
Appendix to Chapter 4
Appendix to Chapter 5


Introduction

This thesis presents a complete theoretical and algorithmic analysis of two important problems in machine learning. The first part studies a problem rooted at the very foundation of this field: that of learning under non-standard distribution assumptions. In an ideal learning scenario, a learner uses an i.i.d. training sample from a distribution D to select a hypothesis h which is to be tested on the same distribution D. In practice, however, this assumption is not always satisfied. Consider, for instance, the following examples:

• A hospital in New York City collects patient information in order to predict the probability of a patient developing a particular disease. After training a learning algorithm on this dataset, the resulting system is deployed in all hospitals in the United States.

• A political analyst trying to predict the winner of an election collects social media information over the span of a month. This data is used to train a sentiment classification algorithm that will help analyze people's preferences over candidates.

In the first example the training sample used by the learning algorithm represents only a biased fraction of the target population. Therefore, we cannot guarantee that the predictions made by the learner will be accurate in the whole U.S. population. This is a prototypical instance of the problem of domain adaptation, where the learner must train on a source distribution different from the intended target. The second example suffers from a similar issue: the population's sentiment towards a candidate can change on a daily basis. Thus, the information collected at the beginning of the month might mislead the learner, making its predictions at the end of the month inaccurate. This is an example of how the distribution of the data drifts over time. For this reason, this learning scenario is commonly known as learning with drifting distributions.

These examples motivate the following question: can we still learn without the standard i.i.d. assumptions of statistical learning theory? It is clear that if the training and target distributions are vastly different, for instance if their supports are disjoint, domain adaptation cannot succeed. Similarly, if the distributions drastically drift over time there is no hope for learning. At the core of these problems is therefore a notion of distance between distributions. Several measures of divergence between distributions exist, such as the L1 distance and the KL-divergence. In Chapter 1 we discuss the disadvantages of these divergence measures and introduce a novel distance tailored for the problems of domain adaptation and drifting. In Chapters 1 and 2 we derive learning bounds for these problems, which help us design algorithms with excellent theoretical properties as well as state-of-the-art results.

Part II of this thesis presents the first learning-based analysis of revenue optimization in auctions. Traditionally studied in economics, auctions have recently become an important object of analysis in computer science, mainly due to the important role auctions play nowadays in electronic markets. Indeed, online advertisement is today one of the fastest growing markets in the world, comprising billions of dollars worth of transactions. The mechanisms by which most of the space for these advertisements is sold are the second-price auction with reserve and the generalized second-price auction with reserve. In a second-price auction buyers bid for an object and the highest bidder wins. However, the winner is not obligated to pay his bid but rather the bid of the second highest bidder. In order to avoid selling its inventory at a low price when the second bid is too small, the seller sets a reserve price below which the object will not be sold. The winner of the auction therefore must pay the maximum of the second highest bid and the reserve price. An appropriate selection of the reserve price is therefore crucial: if it is set too low, the seller might not extract all possible revenue; if, on the other hand, the reserve price is too high, buyers might not bid above it and the seller would procure no revenue. In Chapter 3 we propose a data-driven algorithm for selecting the optimal reserve price. This approach poses several theoretical challenges since the loss function associated with this problem does not possess the properties of traditional loss functions used in machine learning, such as convexity or continuity. In spite of this, we provide learning guarantees for this problem and give an efficient algorithm for setting optimal reserve prices as a function of the features of the auction. We further extend these results to the more complicated generalized second-price auction.

Trying to learn optimal reserve prices has a secondary effect inherent to auctions: buyers have the possibility to react to our actions. In particular, if buyers realize that their bids are used to train a learning algorithm, they could modify their bidding behavior in an attempt to misguide the learner and obtain a more beneficial reserve price in the future. The interactions between a seller trying to optimize his revenue and a strategic buyer are analyzed in Chapter 5. We show that, under some natural assumptions, a seller can in fact find the optimal reserve price even in the challenging scenario where the buyer has access to the learning algorithm used by the seller. This is a remarkable result that justifies the use of historical data as a means to optimize revenue.

Let us emphasize that, with the exception of the work of Cesa-Bianchi et al. (2013), the use of machine learning techniques in the study of auctions presented here is completely novel. Indeed, auction mechanisms have traditionally been studied only from a game-theoretic perspective. Given the large amount of historical data collected by major online advertising companies, we believe that the use of machine learning techniques is crucial for better understanding and optimizing these mechanisms. In fact, our work has inspired novel revenue optimization algorithms in more general settings (Morgenstern and Roughgarden, 2015).

0.1 Notation

We present the notation and basic concepts that will be used throughout this thesis. We will consider an input space X and an output space Y. Unless otherwise stated we will assume that Y ⊂ R. Let H denote a hypothesis set, that is, a collection of functions h : X → Y′ for Y′ ⊂ R. For the most part we will consider the case Y′ = Y. A loss function L : Y′ × Y → R will be used to measure the quality of an algorithm. For a distribution D over X × Y we denote the expected loss of a hypothesis by

$$L_D(h) = \mathbb{E}_{(x,y)\sim D}\big[L(h(x), y)\big].$$

When the distribution D is understood from context, it will be omitted from the notation. Similarly, for a sample $((x_1, y_1), \dots, (x_m, y_m)) \in (X \times Y)^m$ we denote the empirical loss of a hypothesis by

$$L_{\widehat D}(h) = \frac{1}{m}\sum_{i=1}^{m} L(h(x_i), y_i).$$

If there exists a labeling function f : X → Y, we define the expected loss of a hypothesis h with respect to f as

$$L_{D_x}(h, f) = \mathbb{E}_{x\sim D_x}\big[L(h(x), f(x))\big],$$

where $D_x$ denotes the marginal distribution of D over the input space X. We use an analogous notation for the empirical loss of a hypothesis h with respect to f. Matrices will be denoted by upper-case bold letters (A, B, ...) and vectors by lower-case bold letters (u, v, ...).

The following capacity concepts will be used repeatedly in this thesis.

Definition 1. Let Z be any set and let G be a family of functions mapping Z to R. Given an i.i.d. sample $S = (z_1, \dots, z_m) \in Z^m$ from a distribution D, we define the empirical Rademacher complexity of G as

$$\widehat{\mathfrak{R}}_S(G) = \frac{1}{m}\,\mathbb{E}_{\boldsymbol\sigma}\Big[\sup_{g\in G}\sum_{i=1}^{m}\sigma_i\, g(z_i)\Big],$$

where $\boldsymbol\sigma = (\sigma_1, \dots, \sigma_m)$ and the $\sigma_i$'s are independent uniform random variables over the set $\{-1, 1\}$. The random variables $\sigma_i$ are called Rademacher variables. The Rademacher complexity of the class G is the expectation of the empirical Rademacher complexity:

$$\mathfrak{R}_m(G) = \mathbb{E}_S\big[\widehat{\mathfrak{R}}_S(G)\big].$$

The Rademacher complexity measures the correlation of the function class G with noise and is therefore a way of quantifying the size of the function class G. When the functions in G take values in the set $\{-1, 1\}$, the Rademacher complexity can be related to the growth function.

Definition 2. Let G be a family of functions mapping Z to $\{-1, 1\}$. The growth function $\Pi_G : \mathbb{N} \to \mathbb{N}$ for the family G is defined by:

$$\Pi_G(m) = \max_{\{z_1,\dots,z_m\}\subset Z}\Big|\big\{\big(g(z_1), \dots, g(z_m)\big) : g\in G\big\}\Big|.$$

The growth function counts the number of ways a sample of size m can be classified using the function class G. Similar to the Rademacher complexity, the growth function is a measure of the size of a function class. However, it is a purely combinatorial measure, unlike the Rademacher complexity, which takes into account the distribution of the data. The following lemma, due to Massart, relates both quantities.

Lemma 1 (Massart's Lemma). Let G be a family of functions taking values in $\{-1, 1\}$. Then the following holds for any sample S:

$$\widehat{\mathfrak{R}}_S(G) \le \sqrt{\frac{2\log\Pi_G(m)}{m}}.$$

We now introduce the combinatorial concept of VC-dimension.

Definition 3. Let G be a family of functions taking values in $\{-1, 1\}$. The VC-dimension of G is the largest value d such that $\Pi_G(d) = 2^d$. We denote this value as VCdim(G).

The VC-dimension is thus the size of the largest sample that can be classified in all possible ways by G. The last lemma of this section provides a bound on the growth function in terms of the VC-dimension.

Lemma 2 (Sauer's Lemma). Let G be a family of functions with VCdim(G) = d. Then for all m ≥ d,

$$\Pi_G(m) \le \sum_{i=0}^{d}\binom{m}{i} \le \Big(\frac{em}{d}\Big)^{d}.$$

In particular, if the class G has VCdim(G) = d, then $\mathfrak{R}_m(G)$ is in $O\Big(\sqrt{\frac{d\log(m/d)}{m}}\Big)$.

For an extensive treatment of these complexity measures we refer the reader to Mohri et al. (2012). We conclude this section by presenting a concept analogous to the VC-dimension for a class of functions G taking values in R.

Definition 4. Let G be a family of functions taking values in R. We define the pseudo-dimension of G, denoted Pdim(G), as:

$$\mathrm{Pdim}(G) = \mathrm{VCdim}(G'),$$

where $G' = \big\{z \mapsto \mathbb{1}_{g(z)-t>0} \,:\, g\in G,\ t\in\mathbb{R}\big\}$.
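As a concrete instance (our illustration, not from the thesis): for the class of affine functions $G = \{z \mapsto \mathbf{w}\cdot z + b : \mathbf{w}\in\mathbb{R}^d,\, b\in\mathbb{R}\}$ on $Z = \mathbb{R}^d$, the threshold t in the definition of $G'$ can be absorbed into the offset b, so $G'$ is exactly the class of half-space indicators on $\mathbb{R}^d$ and therefore $\mathrm{Pdim}(G) = \mathrm{VCdim}(G') = d + 1$.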


Part I

Domain Adaptation


Chapter 1

Domain Adaptation

In this chapter we study the problem of domain adaptation. In machine learning, domain adaptation refers to the problem of obtaining the most accurate hypothesis on a test distribution different from the training distribution. This challenging problem has received a great deal of attention from the learning community over the past decade. Here, we present an overview of the theory of domain adaptation as well as a description of the current state-of-the-art adaptation algorithm: discrepancy minimization (DM). We discuss the main drawbacks of DM and introduce a new algorithm based on the novel concept of generalized discrepancy. We provide learning guarantees for our proposed algorithm and show empirically that it consistently outperforms DM and several other commonly used algorithms.

1.1 IntroductionA standard assumption in much of learning theory and algorithms is that training andtest data are sampled from the same distribution. In practice, however, this assumptionoften does not hold. The learner then faces the more challenging problem of domainadaptation where the source and target distributions are distinct. This problem arisesin a variety of applications such as natural language processing and computer vision(Dredze et al., 2007; Blitzer et al., 2007; Jiang and Zhai, 2007; Leggetter and Woodland,1995; Martınez, 2002; Hoffman et al., 2014) and many others. A common trait amongthe aforementioned examples is that, albeit different, the source and target distributionsare somehow related as otherwise learning would be impossible. Indeed, as shown byBen-David et al. (2010), if the supports of the source and target distributions differ toomuch adaptation cannot succeed.

More surprisingly, Ben-David and Urner (2012) showed that even in the favorable scenario where the source and target distributions admit the same support, a sample of size in the order of the cardinality of the target support is needed in order to solve the domain adaptation problem. As pointed out by the authors, the problem becomes trivially intractable when the hypothesis set contains no candidate with good performance on the training set. However, the adaptation tasks found in applications seem to be often more favorable than such worst cases, and in fact several algorithms over the past decade have empirically demonstrated that adaptation can indeed succeed.

Figure 1.1: Example of reweighting to make source and target distributions more similar.

We can distinguish two broad families of adaptation algorithms. Some consist of finding a new feature representation. The core idea behind these algorithms is to map the source and target data into a new feature space where the difference between source and target distributions is reduced. Transfer Component Analysis (TCA) (Pan et al., 2011) and the work on frustratingly easy domain adaptation (FE) (Daumé III, 2007) both belong to this family of algorithms. While some empirical evidence for the effectiveness of these algorithms has been reported in the literature, we are not aware of any theoretical guarantees in support of these techniques.

Many other adaptation algorithms can be viewed as reweighting techniques. Originating in the statistics literature on sample bias correction, these techniques attempt to correct the difference between distributions by multiplying every training example by a positive weight. Most of the classical algorithms, such as KMM (Huang et al., 2006), KLIEP (Sugiyama et al., 2007) and uLSIF (Kanamori et al., 2009), fall in this category.

The weights for the KMM algorithm are selected in order to minimize the difference between the means of the reweighted source and target feature vectors under an appropriate feature map. A different approach is taken by the KLIEP algorithm, where weights are selected to minimize the KL-divergence between the source and target distributions. Finally, the uLSIF algorithm chooses these positive weights in order to minimize the squared distance between these distributions.

As the previous description shows, the underlying idea behind common reweighting techniques is that of minimizing the distance between the reweighted empirical source and target distributions. A crucial component of these learning algorithms is thus the choice of a divergence between probability measures. The KLIEP algorithm is based on the minimization of the KL-divergence, while algorithms such as KMM or the algorithm of Zhang et al. (2013) use the maximum mean discrepancy (MMD) (Gretton et al., 2007) as the divergence to be minimized. While these are natural divergence measures commonly used in statistics, the aforementioned algorithms do not provide any learning guarantees. Instead, when the source and target distributions admit densities q(x) and p(x) respectively, it can be shown that the weight on the sample point $x_i$ will converge to the importance ratio $p(x_i)/q(x_i)$. The use of this ratio is commonly known as importance weighting, and reweighting instances by this ratio provides an unbiased estimate of the expected loss on the target distribution. While this unbiasedness gives motivation for this kind of reweighting algorithm, it has been shown both empirically and theoretically that importance weighting algorithms can fail in the common case where the importance ratio becomes unbounded, unless its second moment is bounded, an assumption that cannot be tested in general (Cortes et al., 2010).
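As a toy numeric illustration of the unbiasedness just mentioned (our own sketch, not from the thesis; the Gaussian source/target pair and the loss are ours), when the densities p and q are known one can estimate a target expectation from source samples alone:

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

rng = np.random.default_rng(0)
m = 100_000

# Source Q = N(0, 1), target P = N(1, 1); evaluate the loss L(x) = x**2,
# whose target expectation is E_P[x^2] = 1 + 1^2 = 2.
x_q = rng.normal(0.0, 1.0, size=m)
w = normal_pdf(x_q, 1.0, 1.0) / normal_pdf(x_q, 0.0, 1.0)  # ratios p(x)/q(x)

print(np.mean(w * x_q ** 2))                       # importance-weighted: ~2
print(np.mean(rng.normal(1.0, 1.0, size=m) ** 2))  # direct target sample: ~2
```

When the ratio p/q is heavy-tailed (e.g., when the target mean is moved far from the source), the variance of the weighted estimate blows up, which is exactly the failure mode described above.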

The discussion above shows some of the problems inherent to general-purpose divergence measures not tailored for domain adaptation. In this chapter, therefore, we present the Y-discrepancy as an alternative divergence to address these issues. The discrepancy is a generalization of the dA-distance, a crucial notion in the development of domain adaptation theory introduced by Ben-David et al. (2006). We provide a survey of discrepancy-based learning guarantees later in this chapter, as well as a description of the discrepancy minimization (DM) algorithm proposed by Mansour et al. (2009) and later enhanced by Cortes and Mohri (2011) and Cortes and Mohri (2013).

In spite of its theoretical guarantees, the DM algorithm has some drawbacks, which we discuss before proposing the novel notion of generalized discrepancy to address them. We provide learning guarantees based on this new distance which motivate the design of a generalized discrepancy minimization (GDM) algorithm. Unlike previous algorithms, we do not consider a fixed reweighting of the losses over the training sample. Instead, the weights assigned to training sample losses vary as a function of the hypothesis h. This helps us ensure that, for every hypothesis h, the empirical loss on the source distribution is as close as possible to the empirical loss on the target distribution for that particular h.

The chapter is organized as follows: we describe the learning scenario considered (Section 1.2); we then define the notion of discrepancy distance and compare the discrepancy against other common divergences used in practice (Section 1.3). We also provide learning guarantees for the domain adaptation problem based on the discrepancy and describe in detail the DM algorithm. Having established the basic results for domain adaptation, we present a description of our GDM algorithm and show that it can be formulated as a convex optimization problem (Section 1.5). Next, we analyze the theoretical properties of our algorithm, which will guide the choice of parameters defining our algorithm (Section 1.6). In Section 1.7, we further analyze our optimization problem and derive an equivalent form that can be handled by a standard convex optimization solver. Finally, in Section 1.8, we report the results of experiments demonstrating that our algorithm improves upon the DM algorithm in several tasks.


Table 1.1: Notation table.

  X                Input space                      Y                Output space
  P                Target distribution              Q                Source distribution
  P̂                Empirical target distribution    Q̂                Empirical source distribution
  T                Target unlabeled sample          S                Labeled source sample
  T′               Small target labeled sample      S_X              Unlabeled source sample
  f_P              Target labeling function         f_Q              Source labeling function
  L_P(h, f_P)      Expected target loss             L_Q(h, f_Q)      Expected source loss
  L_P̂(h, f_P)      Empirical target loss            L_Q̂(h, f_Q)      Empirical source loss
  disc(P, Q)       Discrepancy                      DISC(P̂, U)       Generalized discrepancy
  disc_H″(P, Q)    Local discrepancy                disc_Y(P, Q)     Y-discrepancy
  q_min            DM solution                      Q_h              GDM solution

1.2 Learning Scenario

This section defines the learning scenario of domain adaptation we consider, which coincides with that of Blitzer et al. (2007), Mansour et al. (2009), or Cortes and Mohri (2013), and introduces the definitions and concepts needed for the following sections. For the most part, we follow the definitions and notation of Cortes and Mohri (2013).

Let X denote the input space and Y ⊆ R the output space. We define a domain as a distribution over the set X and a labeling function mapping X to Y. Throughout this chapter we consider a source domain (Q, f_Q) and a target domain (P, f_P). We will abuse notation and denote by P and Q also the joint distributions induced by the labeling functions on the space X × Y.

In the scenario of domain adaptation we consider, the learner receives two samples: a labeled sample of m points from the source domain, $S = ((x_1, y_1), \dots, (x_m, y_m)) \in (X \times Y)^m$, with $S_X = (x_1, \dots, x_m)$ drawn i.i.d. according to Q and $y_i = f_Q(x_i)$ for $i \in [1, m]$; and an unlabeled sample $T = (x'_1, \dots, x'_n) \in X^n$ of size n drawn i.i.d. according to the target distribution P. We denote by $\widehat Q$ the empirical distribution corresponding to $S_X$ and by $\widehat P$ the empirical distribution corresponding to T. We will in fact be more interested in the scenario commonly encountered in practice where, in addition to these two samples, a small amount of labeled data from the target domain, $T' = ((x''_1, y''_1), \dots, (x''_s, y''_s)) \in (X \times Y)^s$, is received by the learner.

We consider a loss function $L : Y \times Y \to \mathbb{R}_+$ which, unless otherwise stated, we assume to be jointly convex in its two arguments. The $L_p$ losses commonly used in regression and defined by $L_p(y, y') = |y' - y|^p$ for p ≥ 1 are special instances of this definition. For any distribution D over X × Y we denote by $L_D(h)$ the expected loss of h:

$$L_D(h) = \mathbb{E}_{(x,y)\sim D}\big[L(h(x), y)\big].$$


Similarly, for any two functions h, h′ and a distribution D over X we let

$$L_D(h, h') = \mathbb{E}_{x\sim D}\big[L(h(x), h'(x))\big].$$

The learning problem consists of selecting a hypothesis h out of a hypothesis set H with a small expected loss $L_P(h, f_P)$ with respect to the target domain. We further extend this notation to arbitrary functions $q : X \to \mathbb{R}$ with finite support as follows: $L_q(h, h') = \sum_{x\in X} q(x)\, L(h(x), h'(x))$. The function q is known as a reweighting function.

Given a reweighting function $q : S_X \to \mathbb{R}$, we will be interested in studying regularized weighted risk minimization. That is, given a positive semi-definite (PSD) kernel K, we analyze algorithms that return a hypothesis solving the following optimization problem:

$$\min_{h\in H}\ \lambda\|h\|_K^2 + L_q(h, f_Q), \tag{1.1}$$

where $\|\cdot\|_K$ is the norm on the reproducing kernel Hilbert space H induced by the kernel K. This family of algorithms is commonly used in practice and includes algorithms such as support vector machines (SVM), kernel ridge regression (KRR) and support vector regression (SVR), just to name a few.
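For concreteness, the following minimal Python sketch (ours, not the thesis's; it instantiates (1.1) with a Gaussian kernel and the squared loss, and all names and parameters are illustrative) solves the weighted kernel ridge regression special case in closed form:

```python
import numpy as np

def weighted_krr(X, y, q, lam=0.1, gamma=1.0):
    """Weighted kernel ridge regression, one instance of problem (1.1).

    Minimizes lam * ||h||_K^2 + sum_i q_i * (h(x_i) - y_i)^2 over the RKHS
    of a Gaussian kernel. Setting q_i = 1/m recovers standard KRR; a
    reweighting function q can instead emphasize source points that
    resemble the target sample.
    """
    # Gaussian kernel matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2)
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-gamma * sq)
    W = np.diag(q)
    # First-order condition for the squared loss (assuming K is invertible):
    # (lam * I + W K) alpha = W y
    alpha = np.linalg.solve(lam * np.eye(len(y)) + W @ K, W @ y)

    def h(Xtest):
        sq_t = np.sum((Xtest[:, None, :] - X[None, :, :]) ** 2, axis=-1)
        return np.exp(-gamma * sq_t) @ alpha

    return h

# Minimal usage: fit on a uniformly weighted source sample.
X = np.random.default_rng(0).normal(size=(50, 2))
y = X[:, 0] + 0.1 * np.random.default_rng(1).normal(size=50)
h = weighted_krr(X, y, q=np.full(50, 1 / 50))
print(h(X[:5]))
```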

1.3 Discrepancy

A crucial component of domain adaptation is, of course, a notion of divergence between distributions. As already mentioned, if the divergence between the source and target distributions is too large, then adaptation is not possible. In Figure 1.2(a) we show a source and a target distribution over X with disjoint supports. If only data from the source distribution Q is used for training, one should not expect adaptation to succeed. Similarly, Figure 1.2(b) depicts the source and target distributions as well as the labeling functions f_P and f_Q. Even though the distributions over the instance space are close, the labeling functions differ, making the joint distributions P and Q far from each other. On the other hand, Figure 1.2(c) presents a scenario where reweighting the source sample would likely be useful. Indeed, by assigning more weight to the examples close to 0 we can make the source and target distributions become similar, thereby helping adaptation.

The source and target distributions in the first two examples are far apart, whereas in the third example a reweighting technique could make the source and target distributions close. In order to provide a formal analysis of domain adaptation, an appropriate notion of divergence between distributions is therefore necessary.

Figure 1.2: Examples of favorable (c) and unfavorable (a and b) scenarios for adaptation. (a) Instance distributions are disjoint. (b) Instance distributions match but labeling functions are different; therefore, joint distributions are not similar. (c) Instance distributions can be made similar through reweighting and the labeling function is the same.

A natural choice for this divergence is the total variation or L1 distance between two distributions:

$$\|P - Q\|_1 := 2\sup_{A}\big|P(A) - Q(A)\big| = \sum_{(x,y)\in X\times Y}\big|P(x, y) - Q(x, y)\big|,$$

where the supremum is taken over all measurable sets A ⊂ X × Y. Another common measure of divergence between distributions is the KL-divergence or relative entropy:

$$\mathrm{KL}(P\,\|\,Q) = \sum_{(x,y)\in X\times Y} P(x, y)\log\Big(\frac{P(x, y)}{Q(x, y)}\Big).$$

Notice however that these divergence measures do not take into consideration the loss function or the hypothesis set, both of which are crucial for learning. Therefore, there can exist distributions that are favorable for learning and yet are assigned a large L1 distance. Consider, for instance, the following toy example.

Figure 1.3: Figure depicting the difference between the L1 distance and the discrepancy. In the left figure, the L1 distance is given by twice the measure of the green rectangle. In the right figure, $P(h(x) \neq h'(x))$ is equal to the measure of the blue rectangle and $Q(h(x) \neq h'(x))$ is the measure of the purple rectangle. The two measures are equal, thus disc(P, Q) = 0.

Let P and Q be distributions over $\mathbb{R}^2$, where P is uniform on the rectangle $R_1$ defined by the vertices (−1, R), (1, R), (1, −1), (−1, −1) and Q is uniform on the rectangle $R_2$ spanned by (−1, −R), (1, −R), (−1, 1), (1, 1). These distributions are depicted in Figure 1.3. Let H be a set of threshold functions on $x_1$; that is, $h_t(x_1, x_2) = 1$ if $x_1 \le t$ and 0 otherwise. Notice that the value of $h_t$ only changes as a function of the first coordinate and that the marginals of P and Q along the first coordinate agree. In view of this, an algorithm trained on Q should have the same performance as an algorithm trained on P. We therefore expect the divergence between these two distributions to be small. However, the L1 distance between these probability measures is given by twice the measure of the green rectangle, i.e., $\|P - Q\|_1 = \frac{2(R-1)}{R+1}$. This distance goes to 2 as R → ∞. Moreover, the KL-divergence KL(P‖Q) faces a major challenge in this case: if the support of P is not included in that of Q, this divergence becomes uninterestingly infinite. One could argue that the L1 distance between the marginals over the first coordinate is in fact 0. However, in practice, finding appropriate subsets of the supports of P and Q over which to use the L1 distance might not be as trivial a task as in this example.

This simple example shows that the problem of domain adaptation requires a finer measure of divergence between distributions. Notice that for a hypothesis h we are ultimately interested in the difference $|L_P(h) - L_Q(h)|$. It is therefore natural to define the divergence as the supremum of this quantity over h ∈ H.

Definition 5. Given distributions P and Q over X × Y, we define the Y-discrepancy between P and Q as

$$\mathrm{disc}_{\mathcal Y}(P, Q) = \sup_{h\in H}\big|L_P(h) - L_Q(h)\big|.$$

Equivalently, if the labeling functions $f_P$ and $f_Q$ are deterministic, we have:

$$\mathrm{disc}_{\mathcal Y}(P, Q) = \sup_{h\in H}\big|L_P(h, f_P) - L_Q(h, f_Q)\big|.$$

It is clear from its definition that the Y-discrepancy takes into account the hypothesis set and the loss function. It is also symmetric and satisfies the triangle inequality. However, in general, the Y-discrepancy is not a metric, as it may be 0 even when P ≠ Q. It is easy to see that the Y-discrepancy is a finer measure than the L1 distance. Indeed, if the loss function L is bounded by M, the following holds:

$$\mathrm{disc}_{\mathcal Y}(P, Q) = \sup_{h\in H}\big|L_P(h) - L_Q(h)\big| \le \sup_{h\in H}\sum_{x\in X,\, y\in Y}\big|P(x, y) - Q(x, y)\big|\,\big|L(h(x), y)\big| \le M\|P - Q\|_1.$$

By Pinsker's inequality, we also have

$$\|P - Q\|_1 \le \sqrt{2\,\mathrm{KL}(P\,\|\,Q)}.$$

Therefore, the Y-discrepancy is also finer than the KL-divergence. Furthermore, as the following theorem shows, the Y-discrepancy can be estimated from a finite set of samples. This is in stark contrast with the L1 distance, which, in order to be estimated, requires a sample of size in the order of the size of the support of the distribution. In particular, for distributions with infinite support, the L1 distance cannot be estimated (Valiant, 2011).

Theorem 1. Let $H_L = \{(x, y) \mapsto L(h(x), y) : h \in H\}$ and let P, Q be distributions over X × Y. Then, for any δ > 0, with probability at least 1 − δ over the choice of the samples $S = ((x_1, y_1), \dots, (x_m, y_m)) \in (X\times Y)^m$ drawn from Q and $S' = ((x'_1, y'_1), \dots, (x'_n, y'_n)) \in (X\times Y)^n$ drawn according to P, the following inequalities hold:

$$\mathrm{disc}_{\mathcal Y}(P, Q) \le \mathrm{disc}_{\mathcal Y}(\widehat P, \widehat Q) + 2\mathfrak{R}_n(H_L) + 2\mathfrak{R}_m(H_L) + M\bigg(\sqrt{\frac{\log(2/\delta)}{2n}} + \sqrt{\frac{\log(2/\delta)}{2m}}\bigg),$$

$$\mathrm{disc}_{\mathcal Y}(\widehat P, \widehat Q) \le \mathrm{disc}_{\mathcal Y}(P, Q) + 2\mathfrak{R}_n(H_L) + 2\mathfrak{R}_m(H_L) + M\bigg(\sqrt{\frac{\log(2/\delta)}{2n}} + \sqrt{\frac{\log(2/\delta)}{2m}}\bigg).$$

Proof. Notice that the Y-discrepancy between the distribution Q and its empirical counterpart $\widehat Q$ is simply the maximum generalization error over all hypotheses h ∈ H. Therefore, by a standard learning bound, with probability at least 1 − δ over the choice of S:

$$\mathrm{disc}_{\mathcal Y}(Q, \widehat Q) \le 2\mathfrak{R}_m(H_L) + M\sqrt{\frac{\log(1/\delta)}{2m}},$$

and a similar bound can be given for P. Furthermore, by the triangle inequality we have:

$$\mathrm{disc}_{\mathcal Y}(Q, P) \le \mathrm{disc}_{\mathcal Y}(Q, \widehat Q) + \mathrm{disc}_{\mathcal Y}(\widehat Q, \widehat P) + \mathrm{disc}_{\mathcal Y}(\widehat P, P).$$

The result now follows immediately from these inequalities as well as the union bound.

For µ-Lipschitz continuous loss functions, the Rademacher complexity $\mathfrak{R}_n(H_L)$ can be bounded by $\mu\,\mathfrak{R}_n(H)$, and for the 0-1 loss we have $\mathfrak{R}_n(H_L) \le \frac{1}{2}\mathfrak{R}_n(H)$. Furthermore, if H has finite VC-dimension, the Rademacher complexity $\mathfrak{R}_n(H)$ can be shown to be in $O(1/\sqrt{n})$. Therefore, estimating the discrepancy between distributions has the same statistical complexity as PAC learning. Notice, however, that the accuracy of these estimates depends on the sizes of both the source and target samples. But in practice, only a small labeled sample from the target data might be available, and therefore the Y-discrepancy cannot be accurately estimated. Nevertheless, under certain conditions, we can estimate an upper bound on the Y-discrepancy using only unlabeled samples.

In a deterministic scenario, when $f_P = f_Q = f$, the Y-discrepancy reduces to

$$\sup_{h\in H}\big|L_P(h, f) - L_Q(h, f)\big|.$$

Furthermore, when the labeling function f belongs to the hypothesis set H, we may upper bound the above quantity by taking a supremum over h′ ∈ H. This gives the following definition of discrepancy.

Definition 6 (Mansour et al. (2009)). The discrepancy disc between two distributions P and Q over X is defined as

$$\mathrm{disc}(P, Q) = \sup_{h, h'\in H}\big|L_P(h, h') - L_Q(h, h')\big|.$$

The discrepancy admits properties similar to those of the Y-discrepancy: it is a lower bound on the L1 distance and can be accurately estimated from finite samples. Moreover, since the labeling function is not required for its definition, these samples can be unlabeled.

In general, the discrepancy and the Y-discrepancy are not directly comparable. However, as pointed out before, when $f_P = f_Q = f \in H$, the discrepancy is an upper bound on the Y-discrepancy. A detailed analysis of the relationship between these two measures is given in Section 1.4.

Let us now evaluate the discrepancy between the two distributions P and Q previously introduced in our toy example. Let L be the 0-1 loss; using the fact that the hypotheses $h_t \in H$ are threshold functions, we have

$$
\begin{aligned}
\mathrm{disc}(P, Q) &= \sup_{h_t, h_s\in H}\Big|P\big(h_t(x_1, x_2) \neq h_s(x_1, x_2)\big) - Q\big(h_t(x_1, x_2) \neq h_s(x_1, x_2)\big)\Big|\\
&= \sup_{s, t\in\mathbb{R}}\Big|P\big([s, t]\times\mathbb{R}\big) - Q\big([s, t]\times\mathbb{R}\big)\Big|\\
&= \sup_{s, t\in\mathbb{R}}\Big|\frac{|t - s|}{2} - \frac{|t - s|}{2}\Big| = 0,
\end{aligned}
$$

which is what we expected for this scenario. In the next section we show that learning guarantees can be given where the discrepancy only appears as an additive term. Therefore, for this particular example we can guarantee that training on the source distribution is equivalent to training on the target.
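A quick numeric sanity check of this example (our own sketch, not from the thesis; the sample sizes, grid resolution and R = 10 are arbitrary choices) estimates the empirical discrepancy by brute force over intervals [s, t], and a crude histogram estimate of the L1 distance:

```python
import numpy as np

rng = np.random.default_rng(0)
R, m = 10.0, 10_000

# P uniform on [-1, 1] x [-1, R], Q uniform on [-1, 1] x [-R, 1].
P = np.column_stack([rng.uniform(-1, 1, m), rng.uniform(-1, R, m)])
Q = np.column_stack([rng.uniform(-1, 1, m), rng.uniform(-R, 1, m)])

# Empirical discrepancy for thresholds on x1 under the 0-1 loss:
# sup over intervals [s, t] of |P([s,t] x R) - Q([s,t] x R)|.
grid = np.linspace(-1, 1, 51)
disc = max(
    abs(np.mean((s <= P[:, 0]) & (P[:, 0] <= t))
        - np.mean((s <= Q[:, 0]) & (Q[:, 0] <= t)))
    for i, s in enumerate(grid) for t in grid[i:]
)
print(disc)  # near 0: the marginals along x1 agree

# Histogram estimate of ||P - Q||_1 = 2(R-1)/(R+1) ≈ 1.64 for R = 10.
bins = [np.linspace(-1, 1, 11), np.linspace(-R, R, 101)]
hP, _, _ = np.histogram2d(P[:, 0], P[:, 1], bins=bins)
hQ, _, _ = np.histogram2d(Q[:, 0], Q[:, 1], bins=bins)
print(np.sum(np.abs(hP - hQ)) / m)  # roughly 2(R-1)/(R+1)
```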

We conclude this section with some historical remarks on the notion of discrepancy. The first theoretical analysis of domain adaptation for classification was done by Ben-David et al. (2006), where the so-called dA-distance was introduced. The dA-distance is in fact a special case of the discrepancy when the loss function is the 0-1 loss. Later on, the discrepancy disc was introduced as a generalization of the dA-distance to arbitrary loss functions by Mansour et al. (2009). As we will later see, the notion of discrepancy was pivotal in the design of a theoretically founded algorithm for domain adaptation (Cortes and Mohri, 2011, 2013). Finally, the Y-discrepancy was used by Mohri and Muñoz (2012) to analyze the related problem of learning with drifting distributions (see Chapter 2).

1.4 Learning Guarantees

Here, we present two discrepancy-based guarantees: a tight learning bound based on the Rademacher complexity and a pointwise bound derived from a stability analysis. We will assume that the loss function L is µ-admissible.

Definition 7. A loss function is said to be µ-admissible if there exists µ > 0 such that

$$|L(h(x), y) - L(h'(x), y)| \le \mu\,|h(x) - h'(x)| \tag{1.2}$$

holds for all (x, y) ∈ X × Y and h′, h ∈ H.

Notice that the notion of µ-admissibility is somewhat weaker than requiring µ-Lipschitzness with respect to the first argument. The $L_p$ losses commonly used in regression, p ≥ 1, verify this condition (see Appendix A.3).


Our first generalization bound is given in terms of the Y-discrepancy. Its derivation follows a similar analysis to the one given by Mohri and Muñoz (2012) to derive generalization bounds for learning with drifting distributions.

Proposition 1. Let $H_Q$ and $H_P$ be the families of functions defined as follows: $H_Q := \{x \mapsto L(h(x), f_Q(x)) : h \in H\}$ and $H_P := \{x \mapsto L(h(x), f_P(x)) : h \in H\}$. Define $M_Q$ and $M_P$ as $M_Q = \sup_{x\in X,\, h\in H} L(h(x), f_Q(x))$ and $M_P = \sup_{x\in X,\, h\in H} L(h(x), f_P(x))$. Then, for any δ > 0,

1. with probability at least 1 − δ over the choice of a labeled sample S of size m, the following inequality holds for all h ∈ H:

$$L_P(h, f_P) \le L_{\widehat Q}(h, f_Q) + \mathrm{disc}_{\mathcal Y}(P, Q) + 2\mathfrak{R}_m(H_Q) + M_Q\sqrt{\frac{\log\frac{1}{\delta}}{2m}}; \tag{1.3}$$

2. with probability at least 1 − δ over the choice of a sample T of size n, the following inequality holds for all h ∈ H and any distribution q over the sample $S_X$:

$$L_P(h, f_P) \le L_q(h, f_Q) + \mathrm{disc}_{\mathcal Y}(\widehat P, q) + 2\mathfrak{R}_n(H_P) + M_P\sqrt{\frac{\log\frac{1}{\delta}}{2n}}. \tag{1.4}$$

Proof. Let $\Phi(S)$ denote $\sup_{h \in H} \big(\mathcal{L}_P(h, f_P) - \mathcal{L}_{\widehat{Q}}(h, f_Q)\big)$. Changing one point in $S$ changes $\Phi(S)$ by at most $\frac{M_Q}{m}$. Thus, by McDiarmid's inequality, the following inequality holds:
\[
\Pr\big(\Phi(S) - \mathbb{E}[\Phi(S)] > \epsilon\big) \le e^{-2m\epsilon^2 / M_Q^2}.
\]
Therefore, for any $\delta > 0$, with probability at least $1 - \delta$, the following holds for all $h \in H$:
\[
\mathcal{L}_P(h, f_P) \le \mathcal{L}_{\widehat{Q}}(h, f_Q) + \mathbb{E}[\Phi(S)] + M_Q \sqrt{\frac{\log\frac{1}{\delta}}{2m}}.
\]
Next, we can bound $\mathbb{E}[\Phi(S)]$ as follows:
\[
\begin{aligned}
\mathbb{E}[\Phi(S)] &= \mathbb{E}\Big[\sup_{h \in H} \mathcal{L}_P(h, f_P) - \mathcal{L}_{\widehat{Q}}(h, f_Q)\Big] \\
&\le \mathbb{E}\Big[\sup_{h \in H} \mathcal{L}_Q(h, f_Q) - \mathcal{L}_{\widehat{Q}}(h, f_Q)\Big] + \sup_{h \in H} \big(\mathcal{L}_P(h, f_P) - \mathcal{L}_Q(h, f_Q)\big) \\
&\le 2\mathfrak{R}_m(H_Q) + \mathrm{disc}_{\mathcal{Y}}(P, Q),
\end{aligned}
\]
where the last inequality follows from a standard symmetrization inequality in terms of the Rademacher complexity and from the definition of $\mathrm{disc}_{\mathcal{Y}}(P, Q)$.

The second learning bound can be shown as follows. Starting with a standard Rademacher complexity bound for $H_P$, for any $\delta > 0$, with probability at least $1 - \delta$, the following holds for all $h \in H$:
\[
\begin{aligned}
\mathcal{L}_P(h, f_P) &\le \mathcal{L}_{\widehat{P}}(h, f_P) + 2\mathfrak{R}_n(H_P) + M_P \sqrt{\frac{\log\frac{1}{\delta}}{2n}} \\
&\le \mathcal{L}_q(h, f_Q) + \mathcal{L}_{\widehat{P}}(h, f_P) - \mathcal{L}_q(h, f_Q) + 2\mathfrak{R}_n(H_P) + M_P \sqrt{\frac{\log\frac{1}{\delta}}{2n}} \\
&\le \mathcal{L}_q(h, f_Q) + \mathrm{disc}_{\mathcal{Y}}(\widehat{P}, q) + 2\mathfrak{R}_n(H_P) + M_P \sqrt{\frac{\log\frac{1}{\delta}}{2n}},
\end{aligned} \tag{1.5}
\]
where the last two inequalities hold for any distribution $q$. This completes the proof.

Observe that these bounds are tight as a function of the divergence measure (discrepancy) we use: in the absence of adaptation, the following tight Rademacher complexity learning bound holds:
\[
\mathcal{L}_P(h, f_P) \le \mathcal{L}_{\widehat{P}}(h, f_P) + 2\mathfrak{R}_n(H_P) + M_P \sqrt{\frac{\log\frac{1}{\delta}}{2n}}.
\]
Our second adaptation bound differs from this inequality only in that $\mathcal{L}_{\widehat{P}}(h, f_P)$ is replaced with $\mathcal{L}_q(h, f_Q) + \mathrm{disc}_{\mathcal{Y}}(\widehat{P}, q)$. But, by definition of the $\mathcal{Y}$-discrepancy, there exists an $h \in H$ such that $|\mathcal{L}_{\widehat{P}}(h, f_P) - \mathcal{L}_q(h, f_Q)| = \mathrm{disc}_{\mathcal{Y}}(\widehat{P}, q)$. Therefore, our bound cannot be improved in the worst case. A similar analysis shows that our first bound is also tight.

Given a labeled sample $S$ from the source domain, Proposition 1 suggests choosing a distribution $q$ with support $S_X$ that minimizes the right-hand side of (1.4). However, the quantity $\mathrm{disc}_{\mathcal{Y}}(\widehat{P}, q)$ depends, by definition, on the unknown labels from the target domain and therefore cannot be minimized. Thus, we will instead upper bound the $\mathcal{Y}$-discrepancy in terms of quantities that can be estimated.

The bound based on the discrepancy disc given in the previous section required the labeling functions to be equal and to belong to the hypothesis set $H$. In order to relax this condition we introduce the following term quantifying the difference between the source and target labeling functions:
\[
\eta_H(f_P, f_Q) = \min_{h_0 \in H} \Big( \max_{x \in \mathrm{supp}(\widehat{P})} |f_P(x) - h_0(x)| + \max_{x \in \mathrm{supp}(\widehat{Q})} |f_Q(x) - h_0(x)| \Big).
\]

Proposition 2. The following inequality holds for all distributions $q$ over $S_X$:
\[
\mathrm{disc}_{\mathcal{Y}}(\widehat{P}, q) \le \mathrm{disc}(\widehat{P}, q) + \mu\,\eta_H(f_P, f_Q).
\]

Proof. By the triangle inequality and the $\mu$-admissibility of the loss, the following inequality holds for all $h_0 \in H$:
\[
\begin{aligned}
\mathrm{disc}_{\mathcal{Y}}(\widehat{P}, q) &= \sup_{h \in H} |\mathcal{L}_q(h, f_Q) - \mathcal{L}_{\widehat{P}}(h, f_P)| \\
&\le \sup_{h \in H} \big( |\mathcal{L}_{\widehat{P}}(h, h_0) - \mathcal{L}_{\widehat{P}}(h, f_P)| + |\mathcal{L}_q(h, f_Q) - \mathcal{L}_q(h, h_0)| \big) + \sup_{h \in H} |\mathcal{L}_q(h, h_0) - \mathcal{L}_{\widehat{P}}(h, h_0)| \\
&\le \mu \Big( \sup_{x \in \mathrm{supp}(\widehat{P})} |h_0(x) - f_P(x)| + \sup_{x \in \mathrm{supp}(\widehat{Q})} |f_Q(x) - h_0(x)| \Big) + \mathrm{disc}(\widehat{P}, q).
\end{aligned}
\]
Minimizing over all $h_0 \in H$ gives $\mathrm{disc}_{\mathcal{Y}}(\widehat{P}, q) \le \mu\,\eta_H(f_P, f_Q) + \mathrm{disc}(\widehat{P}, q)$ and completes the proof.

The following corollary is an immediate consequence of Propositions 1 and 2.

Corollary 1. Under the notation of Proposition 1, for any distribution $q$ over $S_X$ and for any $\delta > 0$, with probability at least $1 - \delta$:
\[
\mathcal{L}_P(h, f_P) \le \mathcal{L}_q(h, f_Q) + \mathrm{disc}(\widehat{P}, q) + \mu\,\eta_H(f_P, f_Q) + 2\mathfrak{R}_n(H_P) + M_P \sqrt{\frac{\log(1/\delta)}{2n}}. \tag{1.6}
\]

Corollary 1 motivates the following algorithm: select a distribution $q$ and a hypothesis $h$ minimizing the right-hand side of (1.6). This optimization problem is however not jointly convex in $q$ and $h$. Instead, Cortes and Mohri (2013) propose a two-step discrepancy minimization algorithm. That is, first a distribution $q_{\min}$ is found satisfying $q_{\min} = \mathrm{argmin}_{q \in \Delta(S_X)} \mathrm{disc}(\widehat{P}, q)$, where $\Delta(S_X) \subset [0, 1]^{S_X}$ denotes the set of all probability distributions over $S_X$. Given $q_{\min}$, the final hypothesis is chosen to minimize the following objective function:
\[
\lambda \|h\|_K^2 + \mathcal{L}_{q_{\min}}(h, f_Q). \tag{1.7}
\]
This is the discrepancy minimization algorithm of Mansour et al. (2009). The algorithm is further motivated by the following theorem, which provides a bound on the distance between the ideal solution $h^*$, obtained by training on the target distribution, and the hypothesis obtained by training on a reweighted source sample.

Theorem 2 (Cortes and Mohri (2013)). Let $q$ be an arbitrary distribution over $S_X$ and let $h^*$ and $h_q$ be the hypotheses minimizing $\lambda\|h\|_K^2 + \mathcal{L}_{\widehat{P}}(h, f_P)$ and $\lambda\|h\|_K^2 + \mathcal{L}_q(h, f_Q)$ respectively. Then, the following inequality holds:
\[
\lambda \|h^* - h_q\|_K^2 \le \mu\,\eta_H(f_P, f_Q) + \mathrm{disc}(\widehat{P}, q). \tag{1.8}
\]


It is immediate that the choice of $q_{\min}$ precisely minimizes the right-hand side of the previous bound. This is the discrepancy minimization (DM) algorithm of Cortes and Mohri (2013). To our knowledge, discrepancy minimization is the only adaptation algorithm that has been derived from learning guarantees. Besides its theoretical motivation, the DM algorithm has been shown to outperform several adaptation algorithms in different tasks (Cortes and Mohri, 2013). In the next section, however, we discuss an important drawback of the DM algorithm. Moreover, we propose a new, robust algorithm that addresses this limitation while still enjoying sound theoretical guarantees.
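To make the first DM step concrete for the $L_2$ loss with bounded linear hypotheses: in that case $\mathcal{L}_q(h, h') - \mathcal{L}_{\widehat{P}}(h, h') = (h - h')^\top (\Sigma_q - \Sigma_{\widehat{P}})(h - h')$, where $\Sigma_q = \sum_i q(x_i)\,x_i x_i^\top$, so minimizing the discrepancy reduces, up to a constant factor, to minimizing the spectral norm of the difference of second-moment matrices (Cortes and Mohri, 2011). The following is a minimal sketch of that step with cvxpy; the function name and solver choice are illustrative assumptions, not the authors' implementation.

```python
import cvxpy as cp
import numpy as np

def dm_weights(X_src: np.ndarray, X_tgt: np.ndarray) -> np.ndarray:
    """Sketch of the first DM step for the L2 loss with linear hypotheses:
    find a distribution q over the source points minimizing the spectral
    norm of the difference of second-moment matrices."""
    m = X_src.shape[0]
    n = X_tgt.shape[0]
    C_tgt = X_tgt.T @ X_tgt / n            # target second-moment matrix
    q = cp.Variable(m, nonneg=True)
    C_src = X_src.T @ cp.diag(q) @ X_src   # reweighted source matrix, linear in q
    objective = cp.Minimize(cp.norm(C_src - C_tgt, 2))   # spectral norm
    cp.Problem(objective, [cp.sum(q) == 1]).solve()
    return q.value

# Toy usage with random data.
rng = np.random.default_rng(0)
q_min = dm_weights(rng.normal(size=(30, 3)), rng.normal(size=(40, 3)))
print(q_min.round(3))
```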

1.5 Algorithm

In the previous section we showed that the discrepancy is the right measure of divergence between distributions in domain adaptation. Moreover, by selecting $q_{\min}$ as a reweighting function we could ensure that the solution of the discrepancy minimization algorithm and the ideal solution are close. However, by using a fixed reweighting, we implicitly assumed a worst-case scenario. That is, we defined the discrepancy as a maximum over all pairs of hypotheses. But the maximizing pair of hypotheses may not even be among the candidates ever considered by the learning algorithm. Thus, a learning algorithm based on discrepancy minimization tends to be too conservative. More precisely, the bound given by Theorem 2 can, in some favorable cases, become loose. We attempt to address this issue by using a hypothesis-dependent reweighting.

1.5.1 Main idea

Our algorithm is motivated by the following observation: in the absence of a domain adaptation problem, the learner would have access to the labels of the points in $T$. Therefore, he would return the hypothesis $h^*$ solution of the optimization problem $\min_{h \in H} F(h)$, where $F$ is the convex function defined for all $h \in H$ by
\[
F(h) = \lambda \|h\|_K^2 + \mathcal{L}_{\widehat{P}}(h, f_P). \tag{1.9}
\]
In view of that, we can formulate our objective, in the presence of a domain adaptation problem, as that of finding a hypothesis $h$ whose loss $\mathcal{L}_{\widehat{P}}(h, f_P)$ with respect to the target domain is as close as possible to $\mathcal{L}_{\widehat{P}}(h^*, f_P)$. To do so, we will in fact seek a hypothesis $h$ that is as close as possible to $h^*$, which would imply the closeness of the losses with respect to the target domain. We do not have access to $f_P$ and can only access the labels of the training sample $S$. Thus, we must resort to using in our objective function, instead of $\mathcal{L}_{\widehat{P}}(h, f_P)$, a reweighted empirical loss over the training sample $S$. Thus far, the motivation behind our algorithm matches the ideas behind the DM algorithm. However, instead of using a fixed set of weights, the main idea behind our algorithm is to define, for any $h \in H$, a reweighting function $Q_h : S_X = \{x_1, \ldots, x_m\} \to \mathbb{R}$ such that the objective function $G$ defined for all $h \in H$ by
\[
G(h) = \lambda \|h\|_K^2 + \mathcal{L}_{Q_h}(h, f_Q) \tag{1.10}
\]
is uniformly close to $F$, thereby resulting in close minimizers. Since the first terms of (1.9) and (1.10) coincide, the idea consists equivalently of seeking $Q_h$ such that $\mathcal{L}_{Q_h}(h, f_Q)$ and $\mathcal{L}_{\widehat{P}}(h, f_P)$ are as close as possible. Observe that this departs from the standard reweighting methods: instead of reweighting the training sample with some fixed set of weights, we allow the weights to vary as a function of the hypothesis $h$. Note that we have further relaxed the condition, commonly adopted by reweighting techniques, that the weights be non-negative and sum to one. Allowing the weights to live in a richer space than the space of probability distributions over $S_X$ could raise over-fitting concerns, but we will later see that this in fact does not affect our learning guarantees and leads to good empirical results.

Of course, searching for $Q_h$ to directly minimize $|\mathcal{L}_{Q_h}(h, f_Q) - \mathcal{L}_{\widehat{P}}(h, f_P)|$ is, in general, not possible since we do not have access to $f_P$. But it is instructive to consider the imaginary case where the average loss $\mathcal{L}_{\widehat{P}}(h, f_P)$ is known to us for any $h \in H$. $Q_h$ could then be determined via
\[
Q_h = \underset{q \in \mathcal{F}(S_X, \mathbb{R})}{\mathrm{argmin}}\ |\mathcal{L}_q(h, f_Q) - \mathcal{L}_{\widehat{P}}(h, f_P)|, \tag{1.11}
\]
where $\mathcal{F}(S_X, \mathbb{R})$ is the set of real-valued functions defined over $S_X$. For any $h$, we can in fact select $Q_h$ such that $\mathcal{L}_{Q_h}(h, f_Q) = \mathcal{L}_{\widehat{P}}(h, f_P)$, since $\mathcal{L}_q(h, f_Q)$ is a linear function of $q$ and thus the optimization problem (1.11) reduces to solving a simple linear equation. With this choice of $Q_h$, the objective functions $F$ and $G$ coincide and by minimizing $G$ we can recover the ideal solution $h^*$. Note that, in general, the DM algorithm could not recover that ideal solution. Even a finer discrepancy minimization algorithm exploiting the knowledge of $\mathcal{L}_{\widehat{P}}(h, f_P)$ for all $h$ and seeking a distribution $q'_{\min}$ minimizing $\max_{h \in H} |\mathcal{L}_q(h, f_Q) - \mathcal{L}_{\widehat{P}}(h, f_P)|$ could not, in general, recover the ideal solution, since we could not have $\mathcal{L}_{q'_{\min}}(h, f_Q) = \mathcal{L}_{\widehat{P}}(h, f_P)$ for all $h \in H$.

Of course, in practice access to $\mathcal{L}_{\widehat{P}}(h, f_P)$ is unfeasible since the sample $T$ is unlabeled. Instead, we will consider a non-empty convex set of candidate hypotheses $H'' \subseteq H$ that could contain a good approximation of $f_P$. Using $H''$ as a set of surrogate labeling functions leads to the following definition of $Q_h$ instead of (1.11):
\[
Q_h = \underset{q \in \mathcal{F}(S_X, \mathbb{R})}{\mathrm{argmin}}\ \sup_{h'' \in H''} |\mathcal{L}_q(h, f_Q) - \mathcal{L}_{\widehat{P}}(h, h'')|. \tag{1.12}
\]
The choice of the subset $H''$ is of course key. Our choice will be based on the theoretical analysis of Section 1.6. Nevertheless, in the following section, we present the formulation of the optimization problem for an arbitrary choice of the convex subset $H''$.


1.5.2 Formulation of the optimization problem

The following result provides a more explicit expression for $\mathcal{L}_{Q_h}(h, f_Q)$, leading to a simpler formulation of the optimization problem defining our algorithm.

Proposition 3. For any $h \in H$, let $Q_h$ be defined by (1.12). Then, the following identity holds for any $h \in H$:
\[
\mathcal{L}_{Q_h}(h, f_Q) = \frac{1}{2} \Big( \max_{h'' \in H''} \mathcal{L}_{\widehat{P}}(h, h'') + \min_{h'' \in H''} \mathcal{L}_{\widehat{P}}(h, h'') \Big).
\]

Proof. For any $h \in H$, the equation $\mathcal{L}_q(h, f_Q) = l$ with $l \in \mathbb{R}$ admits a solution $q \in \mathcal{F}(S_X, \mathbb{R})$. Thus, we have $\{\mathcal{L}_q(h, f_Q) \mid q \in \mathcal{F}(S_X, \mathbb{R})\} = \mathbb{R}$ and, for any $h \in H$, we can write
\[
\begin{aligned}
\mathcal{L}_{Q_h}(h, f_Q) &= \underset{l \in \{\mathcal{L}_q(h, f_Q)\,:\,q \in \mathcal{F}(S_X, \mathbb{R})\}}{\mathrm{argmin}}\ \max_{h'' \in H''} |l - \mathcal{L}_{\widehat{P}}(h, h'')| \\
&= \underset{l \in \mathbb{R}}{\mathrm{argmin}}\ \max_{h'' \in H''} |l - \mathcal{L}_{\widehat{P}}(h, h'')| \\
&= \underset{l \in \mathbb{R}}{\mathrm{argmin}}\ \max_{h'' \in H''} \max\big\{\mathcal{L}_{\widehat{P}}(h, h'') - l,\ l - \mathcal{L}_{\widehat{P}}(h, h'')\big\} \\
&= \underset{l \in \mathbb{R}}{\mathrm{argmin}}\ \max\Big\{\max_{h'' \in H''} \mathcal{L}_{\widehat{P}}(h, h'') - l,\ l - \min_{h'' \in H''} \mathcal{L}_{\widehat{P}}(h, h'')\Big\} \\
&= \frac{1}{2}\Big( \max_{h'' \in H''} \mathcal{L}_{\widehat{P}}(h, h'') + \min_{h'' \in H''} \mathcal{L}_{\widehat{P}}(h, h'') \Big),
\end{aligned}
\]
since the minimizing $l$ is obtained for $\max_{h'' \in H''} \mathcal{L}_{\widehat{P}}(h, h'') - l = l - \min_{h'' \in H''} \mathcal{L}_{\widehat{P}}(h, h'')$.

In view of this proposition, with our choice of $Q_h$ based on (1.12), the objective function $G$ of our algorithm (1.10) can be equivalently written for all $h \in H$ as follows:
\[
G(h) = \lambda \|h\|_K^2 + \frac{1}{2}\Big( \max_{h'' \in H''} \mathcal{L}_{\widehat{P}}(h, h'') + \min_{h'' \in H''} \mathcal{L}_{\widehat{P}}(h, h'') \Big). \tag{1.13}
\]
The function $h \mapsto \max_{h'' \in H''} \mathcal{L}_{\widehat{P}}(h, h'')$ is convex as a pointwise maximum of the convex functions $h \mapsto \mathcal{L}_{\widehat{P}}(h, h'')$. Since the loss function $L$ is jointly convex, so is $\mathcal{L}_{\widehat{P}}$; therefore, the function obtained by partial minimization over a non-empty convex set $H''$ in one of the arguments, $h \mapsto \min_{h'' \in H''} \mathcal{L}_{\widehat{P}}(h, h'')$, also defines a convex function (Boyd and Vandenberghe, 2004). Thus, $G$ is a convex function as a sum of convex functions.


1.6 Generalized Discrepancy

Here we introduce the notion of generalized discrepancy, which will be used to derive learning guarantees for our algorithm.

Let $\mathcal{A}(H)$ be the set of all functions $U : h \mapsto U_h$ mapping $H$ to $\mathcal{F}(S_X, \mathbb{R})$ such that, for all $h \in H$, $h \mapsto \mathcal{L}_{U_h}(h, f_Q)$ is a convex function. $\mathcal{A}(H)$ contains all constant functions $U$ such that $U_h = q$ for all $h \in H$, where $q$ is a distribution over $S_X$. We will abuse notation and denote these functions also by $q$. By Proposition 3, $\mathcal{A}(H)$ also includes the function $Q : h \mapsto Q_h$ used by our algorithm.

Definition 8 (Generalized discrepancy). For any $U \in \mathcal{A}(H)$, we define the notion of generalized discrepancy between $\widehat{P}$ and $U$ as the quantity $\mathrm{DISC}(\widehat{P}, U)$ given by
\[
\mathrm{DISC}(\widehat{P}, U) = \max_{h \in H,\,h'' \in H''} |\mathcal{L}_{\widehat{P}}(h, h'') - \mathcal{L}_{U_h}(h, f_Q)|. \tag{1.14}
\]
We also denote by $d_1^{\widehat{P}}(f_P, H'')$ the following distance of $f_P$ to $H''$:
\[
d_1^{\widehat{P}}(f_P, H'') = \min_{h_0 \in H''} \mathbb{E}_{\widehat{P}}\big[|h_0(x) - f_P(x)|\big]. \tag{1.15}
\]

We now provide an upper bound on the $\mathcal{Y}$-discrepancy in terms of the generalized discrepancy as well as the term $d_1^{\widehat{P}}(f_P, H'')$. This bound can be used with Proposition 1 to derive a generalization bound based on the generalized discrepancy.

Proposition 4. For any distribution $q$ over $S_X$ and any set $H''$, the following inequality holds:
\[
\mathrm{disc}_{\mathcal{Y}}(\widehat{P}, q) \le \mathrm{DISC}(\widehat{P}, q) + \mu\,d_1^{\widehat{P}}(f_P, H'').
\]

Proof. For any $h_0 \in H''$, by the triangle inequality, we can write
\[
\begin{aligned}
\mathrm{disc}_{\mathcal{Y}}(\widehat{P}, q) &= \sup_{h \in H} |\mathcal{L}_q(h, f_Q) - \mathcal{L}_{\widehat{P}}(h, f_P)| \\
&\le \sup_{h \in H} |\mathcal{L}_q(h, f_Q) - \mathcal{L}_{\widehat{P}}(h, h_0)| + \sup_{h \in H} |\mathcal{L}_{\widehat{P}}(h, h_0) - \mathcal{L}_{\widehat{P}}(h, f_P)| \\
&\le \sup_{h \in H} \sup_{h'' \in H''} |\mathcal{L}_q(h, f_Q) - \mathcal{L}_{\widehat{P}}(h, h'')| + \sup_{h \in H} |\mathcal{L}_{\widehat{P}}(h, h_0) - \mathcal{L}_{\widehat{P}}(h, f_P)|.
\end{aligned}
\]
By the $\mu$-admissibility of the loss, the last term can be bounded as follows:
\[
\sup_{h \in H} |\mathcal{L}_{\widehat{P}}(h, h_0) - \mathcal{L}_{\widehat{P}}(h, f_P)| \le \mu\,\mathbb{E}_{\widehat{P}}\big[|f_P(x) - h_0(x)|\big].
\]



Figure 1.4: Depiction of the distances $\eta_H(f_P, f_Q)$ and $d_1^{\widehat{P}}(f_P, H'')$.

Using this inequality and minimizing over $h_0 \in H''$ yields
\[
\begin{aligned}
\mathrm{disc}_{\mathcal{Y}}(\widehat{P}, q) &\le \sup_{h \in H} \sup_{h'' \in H''} |\mathcal{L}_q(h, f_Q) - \mathcal{L}_{\widehat{P}}(h, h'')| + \mu\,d_1^{\widehat{P}}(f_P, H'') \\
&= \mathrm{DISC}(\widehat{P}, q) + \mu\,d_1^{\widehat{P}}(f_P, H''),
\end{aligned}
\]
which completes the proof.

Corollary 2. Let $H'' \subseteq H$ be a convex set and $q$ a distribution over $S_X$. Then, for any $\delta > 0$, with probability at least $1 - \delta$:
\[
\mathcal{L}_P(h, f_P) \le \mathcal{L}_q(h, f_Q) + \mathrm{DISC}(\widehat{P}, q) + \mu\,d_1^{\widehat{P}}(f_P, H'') + 2\mathfrak{R}_n(H_P) + M_P \sqrt{\frac{\log(1/\delta)}{2n}}. \tag{1.16}
\]

In general, the generalized discrepancy bound given by (1.16) and the discrepancy bound derived in (1.6) are not comparable. Indeed, as we shall see, the generalized discrepancy is always tighter than the discrepancy. However, as depicted in Figure 1.4, $d_1^{\widehat{P}}(f_P, H'')$ can be larger than $\eta_H(f_P, f_Q)$. Nonetheless, when $L$ is an $L_p$ loss for some $p \ge 1$, we can show the existence of a set $H''$ for which (1.16) is a tighter bound than (1.6). The result is expressed in terms of the local discrepancy, defined by
\[
\mathrm{disc}_{H''}(\widehat{P}, q) = \sup_{h \in H,\,h'' \in H''} |\mathcal{L}_{\widehat{P}}(h, h'') - \mathcal{L}_q(h, h'')|,
\]
which is a finer measure than the standard discrepancy, for which the supremum is taken over pairs of hypotheses both in $H \supseteq H''$.

Theorem 3. Let $q$ be an arbitrary distribution over $S_X$ and let $L$ be the $L_p$ loss for some $p \ge 1$. If $\mathscr{H} := \{B(r) : r \ge 0\}$ denotes the family of balls defined by $B(r) = \{h'' \in H \mid \mathcal{L}_q(h'', f_Q) \le r^p\}$, then there exists $H'' \in \mathscr{H}$ such that the following holds:
\[
\mathrm{DISC}(\widehat{P}, q) + \mu\,d_1^{\widehat{P}}(f_P, H'') \le \mathrm{disc}_{H''}(\widehat{P}, q) + \mu\,\eta_H(f_P, f_Q).
\]

Proof. Fix a distribution $q$ over $S_X$. Let $h_0^*$ be an element of $\mathrm{argmin}_{h_0 \in H} \big( \mathcal{L}_{\widehat{P}}(h_0, f_P)^{\frac{1}{p}} + \mathcal{L}_q(h_0, f_Q)^{\frac{1}{p}} \big)$. Choose $H'' \in \mathscr{H}$ as $H'' = \{h'' \in H \mid \mathcal{L}_q(h'', f_Q) \le r^p\}$ with $r = \mathcal{L}_q(h_0^*, f_Q)^{\frac{1}{p}}$. Then, by definition, $h_0^*$ is in $H''$. Furthermore, for the $L_p$ loss, it is not hard to show that for all $h, h'' \in H$, $|\mathcal{L}_q(h, h'') - \mathcal{L}_q(h, f_Q)| \le \mu\,[\mathcal{L}_q(h'', f_Q)]^{\frac{1}{p}}$ (see Appendix A.3). In view of this inequality, we can write:
\[
\begin{aligned}
\mathrm{DISC}(\widehat{P}, q) &= \sup_{h \in H,\,h'' \in H''} |\mathcal{L}_{\widehat{P}}(h, h'') - \mathcal{L}_q(h, f_Q)| \\
&\le \sup_{h \in H,\,h'' \in H''} |\mathcal{L}_{\widehat{P}}(h, h'') - \mathcal{L}_q(h, h'')| + \sup_{h \in H,\,h'' \in H''} |\mathcal{L}_q(h, h'') - \mathcal{L}_q(h, f_Q)| \\
&\le \mathrm{disc}_{H''}(\widehat{P}, q) + \sup_{h'' \in H''} \mu\,\mathcal{L}_q(h'', f_Q)^{\frac{1}{p}} \\
&\le \mathrm{disc}_{H''}(\widehat{P}, q) + \mu r = \mathrm{disc}_{H''}(\widehat{P}, q) + \mu\,\mathcal{L}_q(h_0^*, f_Q)^{\frac{1}{p}}.
\end{aligned}
\]
Using this inequality, Jensen's inequality, and the fact that $h_0^*$ is in $H''$, we can write
\[
\begin{aligned}
\mu\,d_1^{\widehat{P}}(f_P, H'') &+ \mathrm{DISC}(\widehat{P}, q) \\
&\le \mu \min_{h_0 \in H''} \mathbb{E}_{\widehat{P}}\big[|f_P(x) - h_0(x)|\big] + \mu\,\mathcal{L}_q(h_0^*, f_Q)^{\frac{1}{p}} + \mathrm{disc}_{H''}(\widehat{P}, q) \\
&\le \mu \min_{h_0 \in H''} \mathbb{E}_{\widehat{P}}\big[|f_P(x) - h_0(x)|^p\big]^{\frac{1}{p}} + \mu\,\mathcal{L}_q(h_0^*, f_Q)^{\frac{1}{p}} + \mathrm{disc}_{H''}(\widehat{P}, q) \\
&\le \mu\,\mathcal{L}_{\widehat{P}}(h_0^*, f_P)^{\frac{1}{p}} + \mu\,\mathcal{L}_q(h_0^*, f_Q)^{\frac{1}{p}} + \mathrm{disc}_{H''}(\widehat{P}, q) \\
&= \mu \min_{h_0 \in H} \big( \mathcal{L}_{\widehat{P}}(h_0, f_P)^{\frac{1}{p}} + \mathcal{L}_q(h_0, f_Q)^{\frac{1}{p}} \big) + \mathrm{disc}_{H''}(\widehat{P}, q) \\
&\le \mu \min_{h_0 \in H} \Big( \max_{x \in \mathrm{supp}(\widehat{P})} |f_P(x) - h_0(x)| + \max_{x \in \mathrm{supp}(\widehat{Q})} |f_Q(x) - h_0(x)| \Big) + \mathrm{disc}_{H''}(\widehat{P}, q) \\
&= \mu\,\eta_H(f_P, f_Q) + \mathrm{disc}_{H''}(\widehat{P}, q),
\end{aligned}
\]
which concludes the proof.

The previous theorem shows that the generalized discrepancy is in fact a finer measure than the discrepancy. Therefore, an algorithm minimizing (1.16) would benefit from a learning guarantee superior to that of the DM algorithm. However, the problem defined by (1.16) is again not jointly convex in $q$ and $h$. Therefore, we proceed as in the case of discrepancy minimization and show that our proposed algorithm minimizes a bound on the distance between the solution of our algorithm and the ideal hypothesis $h^*$.


Theorem 4. Let $U$ be an arbitrary element of $\mathcal{A}(H)$ and let $h^*$ and $h_U$ be the hypotheses minimizing $\lambda\|h\|_K^2 + \mathcal{L}_{\widehat{P}}(h, f_P)$ and $\lambda\|h\|_K^2 + \mathcal{L}_{U_h}(h, f_Q)$ respectively. Then, the following inequality holds for any convex set $H'' \subseteq H$:
\[
\lambda \|h^* - h_U\|_K^2 \le \mu\,d_1^{\widehat{P}}(f_P, H'') + \mathrm{DISC}(\widehat{P}, U). \tag{1.17}
\]

Proof. Fix $U \in \mathcal{A}(H)$ and let $G_P$ denote the function $h \mapsto \mathcal{L}_{\widehat{P}}(h, f_P)$ and $G_U$ the function $h \mapsto \mathcal{L}_{U_h}(h, f_Q)$. Since $h \mapsto \lambda\|h\|_K^2 + G_P(h)$ is convex and differentiable and since $h^*$ is its minimizer, the gradient is zero at $h^*$, that is $2\lambda h^* = -\nabla G_P(h^*)$. Similarly, since $h \mapsto \lambda\|h\|_K^2 + G_U(h)$ is convex, it admits a sub-differential at any $h \in H$. Since $h_U$ is a minimizer, its sub-differential at $h_U$ must contain $0$. Thus, there exists a sub-gradient $g_0 \in \partial G_U(h_U)$ such that $2\lambda h_U = -g_0$, where $\partial G_U(h_U)$ denotes the sub-differential of $G_U$ at $h_U$. Using these two equalities, we can write
\[
\begin{aligned}
2\lambda \|h^* - h_U\|_K^2 &= \langle h^* - h_U,\ g_0 - \nabla G_P(h^*) \rangle \\
&= \langle g_0,\ h^* - h_U \rangle - \langle \nabla G_P(h^*),\ h^* - h_U \rangle \\
&\le G_U(h^*) - G_U(h_U) + G_P(h_U) - G_P(h^*) \\
&= \mathcal{L}_{\widehat{P}}(h_U, f_P) - \mathcal{L}_{U_{h_U}}(h_U, f_Q) + \mathcal{L}_{U_{h^*}}(h^*, f_Q) - \mathcal{L}_{\widehat{P}}(h^*, f_P) \\
&\le 2 \sup_{h \in H} |\mathcal{L}_{\widehat{P}}(h, f_P) - \mathcal{L}_{U_h}(h, f_Q)|,
\end{aligned}
\]
where, for the first inequality, we used the convexity of $G_U$ combined with the sub-gradient property of $g_0 \in \partial G_U(h_U)$, and the convexity of $G_P$. For any $h \in H$, using the $\mu$-admissibility of the loss, we can upper bound the operand of the supremum as follows:
\[
\begin{aligned}
|\mathcal{L}_{\widehat{P}}(h, f_P) - \mathcal{L}_{U_h}(h, f_Q)| &\le |\mathcal{L}_{\widehat{P}}(h, f_P) - \mathcal{L}_{\widehat{P}}(h, h_0)| + |\mathcal{L}_{\widehat{P}}(h, h_0) - \mathcal{L}_{U_h}(h, f_Q)| \\
&\le \mu\,\mathbb{E}_{\widehat{P}}\big[|f_P(x) - h_0(x)|\big] + \sup_{h'' \in H''} |\mathcal{L}_{\widehat{P}}(h, h'') - \mathcal{L}_{U_h}(h, f_Q)|,
\end{aligned}
\]
where $h_0$ is an arbitrary element of $H''$. Since this bound holds for all $h_0 \in H''$, it follows immediately that
\[
\lambda \|h^* - h_U\|_K^2 \le \mu \min_{h_0 \in H''} \mathbb{E}_{\widehat{P}}\big[|f_P(x) - h_0(x)|\big] + \sup_{h \in H} \sup_{h'' \in H''} |\mathcal{L}_{\widehat{P}}(h, h'') - \mathcal{L}_{U_h}(h, f_Q)|,
\]
which concludes the proof.

It is now clear that our choice of $Q : h \mapsto Q_h$ minimizes the right-hand side of (1.17) among all functions $U \in \mathcal{A}(H)$. Indeed, for any $U$ we have
\[
\begin{aligned}
\mathrm{DISC}(\widehat{P}, U) &= \sup_{h \in H} \sup_{h'' \in H''} |\mathcal{L}_{\widehat{P}}(h, h'') - \mathcal{L}_{U_h}(h, f_Q)| \\
&\ge \sup_{h \in H} \min_{q \in \mathcal{F}(S_X, \mathbb{R})} \sup_{h'' \in H''} |\mathcal{L}_{\widehat{P}}(h, h'') - \mathcal{L}_q(h, f_Q)| \\
&= \sup_{h \in H} \sup_{h'' \in H''} |\mathcal{L}_{\widehat{P}}(h, h'') - \mathcal{L}_{Q_h}(h, f_Q)| = \mathrm{DISC}(\widehat{P}, Q).
\end{aligned}
\]

In view of Theorem 3, for any constant function $U \in \mathcal{A}(H)$ with $U_h = q$ for some fixed distribution $q$ over $S_X$, the right-hand side of the bound of Theorem 2 is lower bounded by the right-hand side of the bound of Theorem 4, since the local discrepancy is a finer quantity than the discrepancy: $\mathrm{disc}_{H''}(\widehat{P}, q) \le \mathrm{disc}(\widehat{P}, q)$. Thus, our algorithm benefits from a more favorable guarantee than the DM algorithm for that particular choice of $H''$, especially since our choice of $Q$ is based on the minimization over all elements of $\mathcal{A}(H)$ and not just over the subset of constant functions mapping to a distribution. The following pointwise guarantee follows directly from Theorem 4.

Corollary 3. Let $h^*$ be a minimizer of $\lambda\|h\|_K^2 + \mathcal{L}_{\widehat{P}}(h, f_P)$ and $h_Q$ a minimizer of $\lambda\|h\|_K^2 + \mathcal{L}_{Q_h}(h, f_Q)$. Then, the following holds for any convex set $H'' \subseteq H$ and for all $(x, y) \in \mathcal{X} \times \mathcal{Y}$:
\[
|L(h_Q(x), y) - L(h^*(x), y)| \le \mu R \sqrt{\frac{\mu\,d_1^{\widehat{P}}(f_P, H'') + \mathrm{DISC}(\widehat{P}, Q)}{\lambda}}, \tag{1.18}
\]
where $R^2 = \sup_{x \in \mathcal{X}} K(x, x)$.

Proof. By the $\mu$-admissibility of the loss, the reproducing property of the RKHS, and the Cauchy-Schwarz inequality, the following holds for all $x \in \mathcal{X}$ and $y \in \mathcal{Y}$:
\[
|L(h_Q(x), y) - L(h^*(x), y)| \le \mu |h_Q(x) - h^*(x)| = \mu |\langle h_Q - h^*, K(x, \cdot) \rangle_K| \le \mu \|h_Q - h^*\|_K \sqrt{K(x, x)} \le \mu R \|h_Q - h^*\|_K.
\]
Upper bounding $\|h_Q - h^*\|_K$ using Theorem 4 and using the fact that $Q : h \mapsto Q_h$ is a minimizer of the bound over all choices of $U \in \mathcal{A}(H)$ yields the desired result.

The pointwise loss guarantee just presented can be directly used to bound the difference between the expected losses of $h^*$ and $h_Q$ in terms of the same upper bound, e.g.,
\[
\mathcal{L}_P(h_Q, f_P) \le \mathcal{L}_P(h^*, f_P) + \mu R \sqrt{\frac{\mu\,d_1^{\widehat{P}}(f_P, H'') + \mathrm{DISC}(\widehat{P}, Q)}{\lambda}}. \tag{1.19}
\]

Similarly, Theorem 3 directly implies the following corollary.


Algorithm 1 Generalized Discrepancy Minimization (GDM) for $L_p$ losses
Require: source sample $S$, target sample $T$, radius $r$
  Set $q_{\min} = \mathrm{DM}(S, T)$;
  Set $H'' = \{h'' \in H \mid \mathcal{L}_{q_{\min}}(h'', f_Q) \le r^p\}$;
  Let $Q_h = \mathrm{argmin}_{q \in \mathbb{R}^m} \sup_{h'' \in H''} |\mathcal{L}_q(h, f_Q) - \mathcal{L}_{\widehat{P}}(h, h'')|$;
  Let $h_Q = \mathrm{argmin}_{h \in H} \lambda\|h\|^2 + \mathcal{L}_{Q_h}(h, f_Q)$;
  return $h_Q$

Corollary 4. Let $h^*$ be a minimizer of $\lambda\|h\|_K^2 + \mathcal{L}_{\widehat{P}}(h, f_P)$ and $h_Q$ a minimizer of $\lambda\|h\|_K^2 + \mathcal{L}_{Q_h}(h, f_Q)$. Let $\sup_{x \in \mathcal{X}} K(x, x) = R^2$. Then, there exists a choice of $H'' \in \mathscr{H}$ for which the following inequality holds uniformly over $(x, y) \in \mathcal{X} \times \mathcal{Y}$:
\[
|L(h_Q(x), y) - L(h^*(x), y)| \le \mu R \sqrt{\frac{\mu\,\eta_H(f_P, f_Q) + \mathrm{disc}_{H''}(\widehat{P}, q_{\min})}{\lambda}},
\]
where $q_{\min}$ is the solution of the DM algorithm.

The choice of the set $H''$ defining our algorithm is strongly motivated by the theoretical results of this section. Indeed, in view of Theorem 3, we restrict our choice of $H''$ to the family $\mathscr{H}$, parametrized only by the radius $r$. Notice that $\mathrm{DISC}(\widehat{P}, Q)$ and $d_1^{\widehat{P}}(f_P, H'')$ then become functions of $r$. Therefore, we may select this parameter as a minimizer of (1.19). This can be done by using as a validation set a small amount of labeled data from the target domain, which is typically available in practice.

1.6.1 Comparison with other learning bounds

We now compare the learning bounds just derived for our algorithm with those of some common reweighting techniques. In particular, we compare our bounds with those of Cortes et al. (2008) for the KMM algorithm. A similar comparison can however be derived for other algorithms based on importance weighting, such as KLIEP or uLSIF.

Assume that $P$ and $Q$ admit densities $p$ and $q$ respectively. For every $x \in \mathcal{X}$, we denote by $\beta(x) = \frac{p(x)}{q(x)}$ the importance ratio and by $\beta = \beta|_{S_X}$ its restriction to $S_X$. We also let $\widehat{\beta}$ be the solution to the optimization problem solved by the KMM algorithm. Let $h_\beta$ denote the solution to
\[
\min_{h \in H}\ \lambda\|h\|^2 + \mathcal{L}_\beta(h, f_Q), \tag{1.20}
\]
and $h_{\widehat{\beta}}$ the solution to
\[
\min_{h \in H}\ \lambda\|h\|^2 + \mathcal{L}_{\widehat{\beta}}(h, f_Q). \tag{1.21}
\]
The following proposition, due to Cortes et al. (2008), relates the errors of these hypotheses. The proposition requires the kernel $K$ to be a strictly positive definite universal kernel, with Gram matrix $\mathbf{K}$ given by $\mathbf{K}_{ij} = K(x_i, x_j)$.

Proposition 5. Assume that $L(h(x), y) \le 1$ for all $(x, y) \in \mathcal{X} \times \mathcal{Y}$ and $h \in H$. For any $\delta > 0$, with probability at least $1 - \delta$ we have:
\[
|\mathcal{L}_P(h_{\widehat{\beta}}, f_P) - \mathcal{L}_P(h_\beta, f_P)| \le \frac{\mu^2 R^2 \lambda_{\max}^{\frac{1}{2}}(\mathbf{K})}{\lambda} \Bigg( \frac{\epsilon B'}{\sqrt{m}} + \frac{\kappa^{1/2}}{\lambda_{\min}^{1/2}(\mathbf{K})} \sqrt{\frac{B'^2}{m} + \frac{1}{n}} \Big( 1 + \sqrt{2 \log \frac{2}{\delta}} \Big) \Bigg), \tag{1.22}
\]
where $\epsilon$ and $B'$ are the hyperparameters defining the KMM algorithm.

This bound and the one obtained in (1.19) are of course not comparable, since the dependence on $\mu$, $R$ and $\lambda$ is different: for some values of these parameters the dependency is more favorable in (1.22), whereas for other values (1.19) provides a better bound. Moreover, (1.22) depends on the condition number of $\mathbf{K}$, which can be very large in practice. A crucial difference between these bounds, however, is that (1.19) is given in terms of the ideal hypothesis $h^*$, whereas (1.22) is given in terms of $h_\beta$, which, in view of the results of Cortes et al. (2010), is not guaranteed to perform well on the target distribution. Therefore, (1.22) does not provide an informative bound in general.

1.6.2 Scenario of additional labeled data

Here, we consider a rather common scenario in practice where, in addition to the labeled sample $S$ drawn from the source domain and the unlabeled sample $T$ from the target domain, the learner receives a small amount of labeled data from the target domain, $T' = ((x''_1, y''_1), \ldots, (x''_s, y''_s)) \in (\mathcal{X} \times \mathcal{Y})^s$. This sample is typically too small to be used on its own to train an algorithm and achieve good performance. However, it can be useful in at least two ways, which we discuss here.

One important benefit of $T'$ is to serve as a validation set for determining the parameter $r$ that defines the convex set $H''$ used by our algorithm. Indeed, as the size of $T'$ increases, our confidence in the choice of the best value of $r$ increases, thereby reducing the test error. The sample $T'$ can also be used to enhance the discrepancy minimization algorithm, as we now show. Let $\widehat{P}'$ denote the empirical distribution associated with $T'$. To take advantage of $T'$, the DM algorithm can be trained on the sample of size $(m + s)$ obtained by combining $S$ and $T'$, which corresponds to the new empirical distribution $\widehat{Q}' = \frac{m}{m+s}\widehat{Q} + \frac{s}{m+s}\widehat{P}'$. Note that for a fixed $m$ and large values of $s$, $\widehat{Q}'$ essentially ignores the points from the source distribution $Q$, which corresponds to the standard supervised learning scenario in the absence of adaptation. Let $q'_{\min}$ denote the discrepancy minimization solution when using $\widehat{Q}'$. Since $\mathrm{supp}(\widehat{Q}') \supseteq \mathrm{supp}(\widehat{Q})$, the discrepancy using $q'_{\min}$ is a lower bound on the discrepancy using $q_{\min}$:
\[
\mathrm{disc}(q'_{\min}, \widehat{P}) = \min_{\mathrm{supp}(q) \subseteq \mathrm{supp}(\widehat{Q}')} \mathrm{disc}(\widehat{P}, q) \le \min_{\mathrm{supp}(q) \subseteq \mathrm{supp}(\widehat{Q})} \mathrm{disc}(\widehat{P}, q) = \mathrm{disc}(q_{\min}, \widehat{P}).
\]
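In code, forming $\widehat{Q}'$ simply amounts to assigning every point of the combined sample the weight $\frac{1}{m+s}$, since each source point carries weight $\frac{m}{m+s} \cdot \frac{1}{m}$ and each point of $T'$ carries $\frac{s}{m+s} \cdot \frac{1}{s}$. A minimal sketch (function name is illustrative):

```python
import numpy as np

def combined_weights(m: int, s: int) -> np.ndarray:
    """Weights of the mixture Q' = m/(m+s) * Q_hat + s/(m+s) * P'_hat.

    Each of the m source points has weight (m/(m+s)) * (1/m) = 1/(m+s),
    and each of the s labeled target points has weight
    (s/(m+s)) * (1/s) = 1/(m+s): Q' is uniform over the combined sample.
    """
    return np.full(m + s, 1.0 / (m + s))

print(combined_weights(3, 2))  # [0.2 0.2 0.2 0.2 0.2]
```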

1.7 Optimization Solution

As shown in Section 1.5.2, the function $G$ defining our algorithm is convex and therefore expression (1.13) is a convex optimization problem. Nevertheless, its formulation does not admit a simple algorithmic solution. Indeed, evaluating the term $\max_{h'' \in H''} \mathcal{L}_{\widehat{P}}(h, h'')$ defining our objective requires solving a non-convex optimization problem, which can be hard. Here, we exploit the structure of this problem to cast it as a semi-definite program (SDP) for the case of the $L_2$ loss.

1.7.1 SDP formulation

As discussed in Section 1.6, the choice of $H''$ is a key component of our algorithm. In view of Corollary 4, we will consider the set $H'' = \{h'' \mid \mathcal{L}_{q_{\min}}(h'', f_Q) \le r^2\}$. Equivalently, as a result of the reproducing property of $H$ and the representer theorem, $H''$ may be defined as $\big\{a \in \mathbb{R}^m \mid \sum_{j=1}^m q_{\min}(x_j) \big( \sum_{i=1}^m a_i\,q_{\min}(x_i)^{1/2} K(x_i, x_j) - y_j \big)^2 \le r^2 \big\}$. By the representer theorem, again, we know that the solution to (1.13) will be of the form $h = n^{-1/2} \sum_{i=1}^n b_i K(x'_i, \cdot)$, for $b_i \in \mathbb{R}$. Therefore, given normalized kernel matrices $\mathbf{K}_t$, $\mathbf{K}_s$ and $\mathbf{K}_{st}$ defined respectively by $\mathbf{K}_t^{ij} = n^{-1} K(x'_i, x'_j)$, $\mathbf{K}_s^{ij} = q_{\min}(x_i)^{1/2} q_{\min}(x_j)^{1/2} K(x_i, x_j)$ and $\mathbf{K}_{st}^{ij} = n^{-1/2} q_{\min}(x_j)^{1/2} K(x'_i, x_j)$, problem (1.13) is equivalent to
\[
\min_{b \in \mathbb{R}^n}\ \lambda\,b^\top \mathbf{K}_t b + \frac{1}{2} \Bigg( \max_{\substack{a \in \mathbb{R}^m \\ \|\mathbf{K}_s a - y\|^2 \le r^2}} \|\mathbf{K}_{st} a - \mathbf{K}_t b\|^2 + \min_{\substack{a \in \mathbb{R}^m \\ \|\mathbf{K}_s a - y\|^2 \le r^2}} \|\mathbf{K}_{st} a - \mathbf{K}_t b\|^2 \Bigg), \tag{1.23}
\]
where $y = (q_{\min}(x_1)^{1/2} y_1, \ldots, q_{\min}(x_m)^{1/2} y_m)^\top$ is the vector of normalized labels.

Lemma 3. The Lagrangian dual of the problem
\[
\max_{\substack{a \in \mathbb{R}^m \\ \|\mathbf{K}_s a - y\|^2 \le r^2}}\ \frac{1}{2}\|\mathbf{K}_{st} a\|^2 - b^\top \mathbf{K}_t \mathbf{K}_{st} a
\]
is given by
\[
\begin{aligned}
\min_{\eta \ge 0,\,\gamma}\ & \gamma \\
\text{s.t.}\ & \begin{pmatrix} -\frac{1}{2}\mathbf{K}_{st}^\top \mathbf{K}_{st} + \eta \mathbf{K}_s^2 & \frac{1}{2}\mathbf{K}_{st}^\top \mathbf{K}_t b - \eta \mathbf{K}_s y \\[2pt] \frac{1}{2} b^\top \mathbf{K}_t \mathbf{K}_{st} - \eta y^\top \mathbf{K}_s & \eta(\|y\|^2 - r^2) + \gamma \end{pmatrix} \succeq 0.
\end{aligned}
\]
Furthermore, the duality gap for these problems is zero.

Figure 1.5: Illustration of the sampling process on the set $H''$.

The proof of the lemma is given in Appendix A.1. The lemma helps us derive the following equivalent SDP formulation of our original optimization problem. The proof of the following proposition is also given in Appendix A.1.

Proposition 6. Optimization problem (1.23) is equivalent to the following SDP:
\[
\begin{aligned}
\max_{\alpha, \beta, \nu, \mathbf{Z}, z}\ & \frac{1}{2}\mathrm{Tr}(\mathbf{K}_{st}^\top \mathbf{K}_{st} \mathbf{Z}) - \beta - \alpha \\
\text{s.t.}\ & \begin{pmatrix} \nu \mathbf{K}_s^2 + \frac{1}{2}\mathbf{K}_{st}^\top \mathbf{K}_{st} - \frac{1}{4}\mathbf{K} & \nu \mathbf{K}_s y + \frac{1}{4}\mathbf{K} z \\[2pt] \nu y^\top \mathbf{K}_s + \frac{1}{4} z^\top \mathbf{K} & \alpha + \nu(\|y\|^2 - r^2) \end{pmatrix} \succeq 0 \\
& \begin{pmatrix} \lambda \mathbf{K}_t + \mathbf{K}_t^2 & \frac{1}{2}\mathbf{K}_t \mathbf{K}_{st} z \\[2pt] \frac{1}{2} z^\top \mathbf{K}_{st}^\top \mathbf{K}_t & \beta \end{pmatrix} \succeq 0 \\
& \begin{pmatrix} \mathbf{Z} & z \\ z^\top & 1 \end{pmatrix} \succeq 0\ \wedge\ \nu \ge 0\ \wedge\ \mathrm{Tr}(\mathbf{K}_s^2 \mathbf{Z}) - 2 y^\top \mathbf{K}_s z + \|y\|^2 \le r^2,
\end{aligned}
\]
where $\mathbf{K} = \mathbf{K}_{st}^\top \mathbf{K}_t (\lambda \mathbf{K}_t + \mathbf{K}_t^2)^\dagger \mathbf{K}_t \mathbf{K}_{st}$ and $\mathbf{A}^\dagger$ denotes the pseudo-inverse of the matrix $\mathbf{A}$.

Although this problem can be solved in polynomial time with a standard convex optimization solver, in practice solving even a moderately sized SDP can be very slow. Therefore, in the next section we propose a more efficient solution to the optimization problem based on sampling, which reduces the problem to a simple QP.
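To illustrate, the inner maximization can be evaluated through the dual SDP of Lemma 3 with an off-the-shelf solver. The sketch below is a direct transcription of that dual into cvxpy under the assumption that the matrices $\mathbf{K}_s$, $\mathbf{K}_{st}$, $\mathbf{K}_t$ and the vectors $b$, $y$ have already been computed; it is an illustration, not the implementation used in the experiments.

```python
import cvxpy as cp
import numpy as np

def inner_max_via_dual(Ks, Kst, Kt, b, y, r):
    """Sketch: value of max_{||Ks a - y||^2 <= r^2} 0.5*||Kst a||^2 - b'Kt Kst a,
    computed through the dual SDP of Lemma 3 (zero duality gap)."""
    m = Ks.shape[0]
    eta = cp.Variable(nonneg=True)
    gamma = cp.Variable()
    top_left = -0.5 * Kst.T @ Kst + eta * (Ks @ Ks)
    top_right = cp.reshape(0.5 * (Kst.T @ (Kt @ b)) - eta * (Ks @ y), (m, 1))
    bottom_right = cp.reshape(eta * (float(y @ y) - r ** 2) + gamma, (1, 1))
    lmi = cp.bmat([[top_left, top_right],
                   [top_right.T, bottom_right]])
    cp.Problem(cp.Minimize(gamma), [lmi >> 0]).solve()
    return gamma.value

# Toy usage with tiny random matrices (m = n = 4).
rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
Ks = A @ A.T                                   # PSD source kernel matrix
Kst, Kt = rng.normal(size=(4, 4)), np.eye(4)
print(inner_max_via_dual(Ks, Kst, Kt, rng.normal(size=4), rng.normal(size=4), 1.0))
```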


1.7.2 QP formulation

The SDP formulation described in the previous section applies to a specific choice of $H''$. In this section, we present an analysis that holds for an arbitrary convex set $H''$. First, notice that the problem of minimizing $G$ (expression (1.13)) is related to the minimum enclosing ball (MEB) problem. For a set $D \subseteq \mathbb{R}^d$, the MEB problem is defined as follows:
\[
\min_{u \in \mathbb{R}^d} \max_{v \in D} \|u - v\|^2.
\]
Omitting the regularization and the min terms from (1.13) leads to a problem similar to the MEB. Thus, we could benefit from the extensive literature and algorithmic study available for this problem (Welzl, 1991; Kumar et al., 2003; Schonherr, 2002; Fischer et al., 2003; Yildirim, 2008). However, to the best of our knowledge, there is currently no solution available to this problem for an infinite set $D$, as in the case of our problem. Instead, we present a solution for an approximation of (1.13) based on sampling.

Let $h_1, \ldots, h_k$ be a set of hypotheses on the boundary $\partial H''$ of $H''$, and let $C = C(h_1, \ldots, h_k)$ denote their convex hull. The following is the sampling-based approximation of (1.13) that we consider:
\[
\min_{h \in H}\ \lambda\|h\|_K^2 + \frac{1}{2}\max_{i = 1, \ldots, k} \mathcal{L}_{\widehat{P}}(h, h_i) + \frac{1}{2}\min_{h' \in C} \mathcal{L}_{\widehat{P}}(h, h'). \tag{1.24}
\]
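Note that (1.24) can also be attacked directly, since the maximum over finitely many convex losses and the joint minimization over $h$ and the convex-hull element $h'$ are both convex. The sketch below does this for the $L_2$ loss with a kernel expansion of $h$, representing each sampled $h_i$ by its vector of values on the target sample; all names are illustrative assumptions, and this is not the QP route derived below.

```python
import cvxpy as cp
import numpy as np

def solve_sampled_objective(Kt, U, lam):
    """Sketch of (1.24) for the L2 loss. Kt: n x n target kernel matrix;
    U: k x n matrix whose i-th row holds the values of h_i on the target
    sample; lam: regularization parameter lambda."""
    k, n = U.shape
    b = cp.Variable(n)              # expansion coefficients of h
    w = cp.Variable(k)              # convex-combination weights for h' in C
    preds = Kt @ b                  # values of h on the target points
    # max_i L(h, h_i): pointwise max of k convex quadratics.
    max_term = cp.max(cp.hstack([cp.sum_squares(preds - U[i]) / n
                                 for i in range(k)]))
    # min_{h' in C} L(h, h'): minimized jointly over b and w.
    min_term = cp.sum_squares(preds - U.T @ w) / n
    # Small jitter keeps the numerical PSD check of quad_form happy.
    objective = (lam * cp.quad_form(b, Kt + 1e-8 * np.eye(n))
                 + 0.5 * max_term + 0.5 * min_term)
    cp.Problem(cp.Minimize(objective), [w >= 0, cp.sum(w) == 1]).solve()
    return b.value

# Toy usage: 20 target points, 5 sampled boundary hypotheses.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
Kt = np.exp(-((X[:, None] - X[None]) ** 2).sum(-1))   # Gaussian kernel
b = solve_sampled_objective(Kt, rng.normal(size=(5, 20)), lam=0.1)
```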

Proposition 7. Let $\mathbf{Y} = (\mathbf{Y}_{ij}) \in \mathbb{R}^{k \times n}$ be the matrix defined by $\mathbf{Y}_{ij} = n^{-1/2} h_i(x'_j)$ and $y' = (y'_1, \ldots, y'_k)^\top \in \mathbb{R}^k$ the vector defined by $y'_i = n^{-1} \sum_{j=1}^n h_i(x'_j)^2$. Then, the dual problem of (1.24) is given by
\[
\begin{aligned}
\max_{\alpha, \gamma, \beta}\ & -\Big( \mathbf{Y}^\top \alpha + \frac{\gamma}{2} \Big)^\top \mathbf{K}_t \Big( \lambda \mathbf{I} + \frac{1}{2}\mathbf{K}_t \Big)^{-1} \Big( \mathbf{Y}^\top \alpha + \frac{\gamma}{2} \Big) - \frac{1}{2}\gamma^\top \mathbf{K}_t \mathbf{K}_t^\dagger \gamma + \alpha^\top y' - \beta \\
\text{s.t.}\ & \mathbf{1}^\top \alpha = \frac{1}{2},\quad \beta\mathbf{1} \ge -\mathbf{Y}\gamma,\quad \alpha \ge 0,
\end{aligned} \tag{1.25}
\]
where $\mathbf{1}$ is the vector in $\mathbb{R}^k$ with all components equal to $1$. Furthermore, the solution $h$ of (1.24) can be recovered from a solution $(\alpha, \gamma, \beta)$ of (1.25) by $h(x) = \sum_{i=1}^n a_i K(x'_i, x)$ for all $x$, where $a = \big(\lambda \mathbf{I} + \frac{1}{2}\mathbf{K}_t\big)^{-1}\big(\mathbf{Y}^\top \alpha + \frac{1}{2}\gamma\big)$.

The proof of the proposition is given in Appendix A.2. The result shows that, given a finite sample $h_1, \ldots, h_k$ from the boundary of $H''$, (1.24) is in fact equivalent to a standard QP. Hence, a solution can be found efficiently with one of the many off-the-shelf algorithms for quadratic programming.

We now describe the process of sampling from the boundary of the set $H''$, which is a necessary step for defining problem (1.24). We consider compact sets of the form $H'' := \{h'' \in H \mid g_i(h'') \le 0\}$, where the functions $g_i$ are continuous and convex. For instance, we could consider the set $H''$ defined in the previous section. More generally, we can consider a family of sets $H''_p = \{h'' \in H \mid \sum_{i=1}^m q_{\min}(x_i) |h''(x_i) - y_i|^p \le r^p\}$. Assume that there exists $h_0$ satisfying $g_i(h_0) < 0$ for all $i$. Our sampling process is illustrated by Figure 1.5 and works as follows: pick a random direction $\bar{h}$ and define $\lambda_i$ to be the minimal solution of the system
\[
(\lambda \ge 0) \wedge (g_i(h_0 + \lambda \bar{h}) = 0).
\]
Set $\lambda_i = \infty$ if no solution is found and define $\lambda^* = \min_i \lambda_i$. By the convexity and compactness of $H''$ we can guarantee that $\lambda^* < \infty$. Furthermore, the hypothesis $h = h_0 + \lambda^* \bar{h}$ satisfies $h \in H''$ and $g_j(h) = 0$ for $j$ such that $\lambda_j = \lambda^*$. The latter is straightforward; to verify the former, assume that $g_i(h_0 + \lambda^* \bar{h}) > 0$ for some $i$. The continuity of $g_i$ would then imply the existence of $\lambda'_i$ with $0 < \lambda'_i < \lambda^* \le \lambda_i$ such that $g_i(h_0 + \lambda'_i \bar{h}) = 0$. This would contradict the choice of $\lambda_i$; thus, the inequality $g_i(h_0 + \lambda^* \bar{h}) \le 0$ must hold for all $i$.

Since a point $h_0$ with $g_i(h_0) < 0$ can be obtained by solving a convex program, and solving the equations defining the $\lambda_i$ is, in general, simple, the process described provides an efficient way of sampling points from the boundary of the convex set $H''$.
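The following sketch implements this ray-shooting procedure for linear hypotheses in $\mathbb{R}^d$ with a single quadratic constraint $g(h) = \sum_i q_i (h \cdot x_i - y_i)^2 - r^2 \le 0$, for which the equation $g(h_0 + \lambda \bar{h}) = 0$ is a scalar quadratic in $\lambda$ solvable in closed form. The data and parameter names are illustrative assumptions.

```python
import numpy as np

def sample_boundary(h0, X, q, y, r, rng, k=10):
    """Ray-shooting sampler: from an interior point h0 of
    H'' = {h : sum_i q_i (h.x_i - y_i)^2 <= r^2}, shoot k random
    directions and return the boundary points that are hit."""
    samples = []
    for _ in range(k):
        hbar = rng.normal(size=h0.shape)
        hbar /= np.linalg.norm(hbar)          # uniform random direction
        # g(h0 + lam*hbar) = A*lam^2 + B*lam + C with coefficients below.
        u, v = X @ hbar, X @ h0 - y
        A = q @ (u ** 2)
        B = 2 * q @ (u * v)
        C = q @ (v ** 2) - r ** 2             # C < 0 since h0 is interior
        # Minimal nonnegative root (exists: A > 0 and C < 0 give one
        # positive and one negative root).
        lam = (-B + np.sqrt(B ** 2 - 4 * A * C)) / (2 * A)
        samples.append(h0 + lam * hbar)
    return np.array(samples)

# Toy usage: weighted source sample in R^2; h0 is the least-squares
# solution, interior for a large enough radius r.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
y = X @ np.array([1.0, -1.0]) + 0.1 * rng.normal(size=30)
q = np.full(30, 1 / 30)
h0 = np.linalg.lstsq(X, y, rcond=None)[0]
H2 = sample_boundary(h0, X, q, y, r=2.0, rng=rng)
print(H2.round(3))
```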

1.7.3 Implementation for the L2 Loss

We now describe how to fully implement our sampling-based algorithm for the case where $L$ is the $L_2$ loss. In view of the results of Section 1.6, we let $H'' = \{h'' \mid \|h''\|_K \le \Lambda \wedge \mathcal{L}_{q_{\min}}(h'', f_Q) \le r^2\}$. We first describe the steps needed to find a point $h_0 \in H''$. Let $h_\Lambda$ be such that $\|h_\Lambda\|_K = \Lambda$ and let $\lambda_r \in \mathbb{R}_+$ be such that the solution $h_r$ of the optimization problem
\[
\min_{h \in H}\ \lambda_r \|h\|_K^2 + \mathcal{L}_{q_{\min}}(h, f_Q)
\]
satisfies $\mathcal{L}_{q_{\min}}(h_r, f_Q) = r^2$. It is easy to verify that the existence of $\lambda_r$ is guaranteed for $\min_{h \in H} \mathcal{L}_{q_{\min}}(h, f_Q) \le r^2 \le \sum_{i=1}^m q_{\min}(x_i) y_i^2$. Furthermore, the convexity of the norm, as well as of the loss, implies that the point $h_0 = \frac{1}{2}(h_r + h_\Lambda)$ is in the interior of $H''$. Of course, finding $\lambda_r$ with the desired properties in closed form is not possible. However, since $r$ is chosen via validation, we do not need to find $\lambda_r$ as a function of $r$. Instead, we can simply select $\lambda_r$, and not $r$, through cross-validation.

In order to complete the sampling process, we must have an efficient way of selecting a random direction $\bar{h}$. If $H \subset \mathbb{R}^d$ is a set of linear hypotheses, a direction $\bar{h}$ can be sampled uniformly by letting $\bar{h} = \frac{\xi}{\|\xi\|}$, where $\xi$ is a standard Gaussian random variable in $\mathbb{R}^d$. If $H$ is a subset of a RKHS, by the representer theorem, we only need to consider hypotheses of the form $h = \sum_{i=1}^m \alpha_i K(x_i, \cdot)$. Therefore, we can sample a direction $\bar{h} = \sum_{i=1}^m \alpha'_i K(x_i, \cdot)$, where the vector $\alpha' = (\alpha'_1, \ldots, \alpha'_m)$ is drawn uniformly from the unit sphere in $\mathbb{R}^m$.



Figure 1.6: (a) Hypotheses obtained by training on source (green circles), target (red triangles) and using the DM (dashed blue) and GDM (solid blue) algorithms. (b) Objective functions for the source and target distributions as well as for the GDM and DM algorithms. The vertical lines show the minimizer for each algorithm. The set $H$ and the surrogate hypothesis set $H'' \subseteq H$ are shown at the bottom.

A full implementation of our algorithm thus consists of the following steps:
• find the distribution $q_{\min} = \mathrm{argmin}_{q \in \Delta(S_X)} \mathrm{disc}(\widehat{P}, q)$; this can be done by using the smooth approximation algorithm of Cortes and Mohri (2013);
• sample points from the boundary of the set $H''$ using the sampling process described above;
• solve the QP introduced in Section 1.7.2.

smooth approximation algorithm of Cortes and Mohri (2013);• sample points from the set H ′′ using the sampling process described above;• solve the QP introduced in Section 1.7.2.

Given that our algorithm only requires solving a simple QP, its complexity is similar to that of algorithms such as KMM and DM, which also require solving a QP.

1.8 Experiments

Here, we report the results of extensive comparisons between GDM and several other adaptation algorithms, which demonstrate the benefits of our algorithm. We use the implementation described in the previous section. The source code for our algorithm, as well as for all the other baselines described in this section, can be found at http://cims.nyu.edu/~munoz.

1.8.1 Synthetic Data Set

To compare the performances of the GDM and DM algorithms, we considered the following synthetic one-dimensional task, similar to the one considered by Huang et al. (2006): the source domain examples were sampled from the uniform distribution over the interval $[.2, 1]$ and the target instances were sampled uniformly over $[0, .25]$. The labels were given by the map $x \mapsto -x + x^3 + \xi$, where $\xi$ is a Gaussian random variable with mean $0$ and standard deviation $0.1$. Our hypothesis set was defined by the family of linear functions without an offset. Figure 1.6(a) shows the regression hypotheses obtained by training the DM and GDM algorithms as well as those obtained by training on the source and target distributions. The ideal hypothesis is shown in red. Notice how the GDM solution gives a closer approximation than DM to the ideal solution. In order to better understand the difference between the solutions of these algorithms, Figure 1.6(b) depicts the objective function minimized by each algorithm as a function of the slope $w$ of the linear function, the only variable of the hypothesis. The vertical lines show the value of the minimizing hypothesis for each loss. Keeping in mind that the regularization parameter $\lambda$ used in ridge regression corresponds to a Lagrange multiplier for the constraint $w^2 \le \Lambda^2$ for some $\Lambda$ (Cortes and Mohri, 2013, Lemma 1), the hypothesis set $H = \{w \mid w^2 \le \Lambda^2\}$ is depicted at the bottom of this plot. The shaded region represents the set $H'' = H \cap \{h'' \mid \mathcal{L}_{q_{\min}}(h'', f_Q) \le r\}$. It is clear from this plot that reweighting the sample using $q_{\min}$ helps approximate the target loss function. Nevertheless, it is our hypothesis-dependent reweighting that allows both objective functions to be uniformly close. This should come as no surprise, since our algorithm was designed precisely to achieve that.
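The synthetic task above is easy to reproduce; the following sketch generates the data and fits the two reference ridge regressors (source-trained and target-trained) that appear in Figure 1.6(a). The regularization value is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
m = n = 200
lam = 1e-3                                   # illustrative regularization

def labels(x):
    # Labeling function of the synthetic task: -x + x^3 plus Gaussian noise.
    return -x + x ** 3 + rng.normal(0, 0.1, size=x.shape)

x_src = rng.uniform(0.2, 1.0, m)             # source domain
x_tgt = rng.uniform(0.0, 0.25, n)            # target domain
y_src, y_tgt = labels(x_src), labels(x_tgt)

def ridge_slope(x, y):
    # 1-D ridge regression without offset: w = <x, y> / (<x, x> + lam).
    return x @ y / (x @ x + lam)

w_src, w_tgt = ridge_slope(x_src, y_src), ridge_slope(x_tgt, y_tgt)
print(f"slope trained on source: {w_src:+.3f}")   # poor fit on [0, .25]
print(f"slope trained on target: {w_tgt:+.3f}")   # close to the ideal fit
```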

′′) ≤ r. It is clear from this plot that reweightingthe sample using qmin helps approximate the target loss function. Nevertheless, it is ourhypothesis-dependent reweighting that allows both objective functions to be uniformlyclose. This should come as no surprise since our algorithm was precisely designed toachieve that.

1.8.2 Adaptation Data Sets

We now present the results of evaluating our algorithm against several other adaptation algorithms. GDM is compared against DM and against training on the uniform distribution. The following baselines were also used:
1. The KMM algorithm (Huang et al., 2006), which reweights examples from the source distribution in order to match the means of the source and target data in a feature space induced by a universal kernel. The hyper-parameters of this algorithm were set to the recommended values of $B = 1000$ and $\epsilon = \frac{\sqrt{m}}{\sqrt{m} - 1}$.
2. KLIEP (Sugiyama et al., 2007). This algorithm estimates the importance ratio of the source and target distributions by modeling this ratio as a mixture of basis functions and learning the mixture coefficients from the data. Gaussian kernels were used as basis functions for this algorithm and for KMM. The bandwidth for the kernel was selected from the set $\{\sigma d : \sigma = 2^{-5}, \ldots, 2^5\}$ via validation on the test set, where $d$ is the mean distance between points sampled from the source domain.
3. FE (Daume III, 2007). This algorithm maps source and target data into a common high-dimensional feature space where the difference between the distributions is expected to shrink.
In addition to these algorithms, we compare GDM to the ideal hypothesis obtained by training on $T$, which we denote by Tar.



Figure 1.7: (a) MSE performance for different adaptation algorithms when adapting from kin-8fh to the three other kin-8xy domains. (b) Relative error of DM over GDM as a function of the ratio $r/\Lambda$.

We selected the set of linear functions as our hypothesis set. The learning algorithm used for all tasks was ridge regression, and performance was evaluated by the mean squared error. We follow the setup of Cortes and Mohri (2011): for all adaptation algorithms, we selected the parameter $\lambda$ via 10-fold cross-validation over the training data by using a grid search over the set of values $\lambda \in \{2^{-10}, \ldots, 2^{10}\}$. The results of training on the target distribution are presented for a parameter $\lambda$ tuned via 10-fold cross-validation over the target data. We used the QP implementation of our algorithm with the sampling set $H''$ and the sampling mechanism defined at the end of Section 1.7.2, where the parameter $\lambda_r$ was chosen from the same set as $\lambda$ via cross-validation on a small amount of data from the target distribution. Whereas there exist other validation techniques, such as transfer cross-validation (Zhong et al., 2010), these techniques rely on importance weighting and as such suffer from the issues previously mentioned.

In order to achieve a fair comparison, all the other algorithms were allowed to use the small amount of labeled target data too. Since, with the exception of FE, the other baselines do not prescribe a way of dealing with labeled data from the target distribution, we simply added this data to the training set and ran the algorithms on the extended source data, as discussed in Section 1.6.2.

The first task we considered is given by the four kin-8xy Delve data sets (Rasmussen et al., 1996). These data sets are variations of the same model: a realistic simulation of the forward dynamics of an 8-link all-revolute robot arm. The task in all data sets consists of predicting the distance of the end-effector from a target. The data sets differ by the degree of non-linearity (fairly linear, x=f, or non-linear, x=n) and the amount of noise in the output (moderate, y=m, or high, y=h). The data set defines 4 different domains, that is 12 pairs of different distributions and labeling functions. A sample of 200 points from each domain was used for training and 10 labeled points from the target distribution were used to select $H''$. The experiment was carried out 10 times and the results of testing on a sample of 400 points from the target domain are reported in Figure 1.7(a). The bars represent the median performance of each algorithm, and the error bars the .25 and .75 quantiles respectively. All results were normalized in such a way that the median performance of training on the source is equal to 1. Notice that the performance of all algorithms is comparable when adapting to kin8-fm, since both labeling functions are fairly linear, yet only GDM is able to reasonably adapt to the two data sets with different labeling functions. In order to better understand the advantages of GDM over DM, we plot the relative error of DM against GDM as a function of the ratio $r/\Lambda$ in Figure 1.7(b), where $r$ is the radius defining $H''$. Notice that when the ratio $r/\Lambda$ is small, both algorithms behave similarly, which is the case most of the time for the adaptation task fh to fm. On the other hand, a better performance for GDM can be obtained when the ratio is larger. This is due to the fact that $r/\Lambda$ measures the effective size of the set $H''$: a small ratio means that $H''$ is small and therefore the hypothesis returned by GDM will be close to that of DM, whereas if $H''$ is large then GDM has the possibility of finding a better hypothesis.

Table 1.2: Sentiment task: adaptation from books (B), kitchen (K), electronics (E) and dvd (D) to all other domains. Normalized results: the MSE of training on the unweighted source data is equal to 1.

S→T | GDM | DM | Unif | Tar | KMM | KLIEP | FE
B→K | 0.763±(0.222) | 1.056±(0.289) | 1.00 | 0.517±(0.152) | 3.328±(0.845) | 3.494±(1.144) | 0.942±(0.093)
B→E | 0.574±(0.211) | 1.018±(0.206) | 1.00 | 0.367±(0.124) | 3.018±(0.319) | 3.022±(0.318) | 0.857±(0.135)
B→D | 0.936±(0.256) | 1.215±(0.255) | 1.00 | 0.623±(0.152) | 2.842±(0.492) | 2.764±(0.446) | 0.936±(0.110)
K→B | 0.854±(0.119) | 1.258±(0.117) | 1.00 | 0.665±(0.085) | 2.784±(0.244) | 2.642±(0.218) | 1.047±(0.047)
K→E | 0.975±(0.131) | 1.460±(0.633) | 1.00 | 0.653±(0.201) | 2.408±(0.582) | 2.157±(0.255) | 0.969±(0.131)
K→D | 0.884±(0.101) | 1.174±(0.140) | 1.00 | 0.665±(0.071) | 2.771±(0.157) | 2.620±(0.210) | 1.111±(0.059)
E→B | 0.723±(0.138) | 1.016±(0.187) | 1.00 | 0.551±(0.109) | 3.433±(0.694) | 3.290±(0.583) | 1.035±(0.059)
E→K | 1.030±(0.312) | 1.277±(0.283) | 1.00 | 0.636±(0.176) | 2.173±(0.249) | 2.223±(0.293) | 0.955±(0.199)
E→D | 0.731±(0.171) | 1.005±(0.166) | 1.00 | 0.518±(0.117) | 3.363±(0.402) | 3.231±(0.483) | 0.974±(0.102)
D→B | 0.992±(0.191) | 1.026±(0.090) | 1.00 | 0.740±(0.138) | 2.571±(0.616) | 2.475±(0.400) | 0.986±(0.041)
D→K | 0.870±(0.212) | 1.062±(0.318) | 1.00 | 0.557±(0.137) | 2.755±(0.375) | 2.741±(0.347) | 0.940±(0.087)
D→E | 0.674±(0.135) | 0.994±(0.171) | 1.00 | 0.478±(0.098) | 2.939±(0.501) | 2.878±(0.418) | 0.907±(0.081)

For our next experiment we considered the cross-domain sentiment analysis data set of Blitzer et al. (2007). This data set consists of consumer reviews from 4 different domains: books, kitchen, electronics and dvds. We used the top 1000 unigrams and bigrams as the features for this task. For each pair of adaptation tasks we sampled 700 points from the source distribution and 700 unlabeled points from the target. Only 50 labeled points from the target distribution were used to tune the parameter $r$ of our algorithm. The final evaluation is done on a test set of 1000 points. The mean results and standard deviations for this task are shown in Table 1.2, where the MSE values have been normalized in such a way that the performance of training on the source without reweighting is always 1.

Table 1.3: Image task: adaptation from caltech256 (C), imagenet (I), sun (S) and bing (B). Normalized results: the MSE of training on the unweighted source data is equal to 1.

S→T | GDM | DM | Unif | Tar | KMM | KLIEP | FE
C→I | 0.927±(0.051) | 1.005±(0.010) | 1.00 | 0.879±(0.048) | 2.752±(3.820) | 0.936±(0.016) | 0.959±(0.035)
C→S | 0.938±(0.064) | 0.993±(0.018) | 1.00 | 0.840±(0.057) | 0.827±(0.017) | 0.835±(0.020) | 0.947±(0.025)
C→B | 0.909±(0.040) | 1.003±(0.013) | 1.00 | 0.886±(0.052) | 0.945±(0.022) | 0.942±(0.017) | 0.947±(0.019)
I→C | 1.011±(0.015) | 0.951±(0.011) | 1.00 | 0.802±(0.040) | 0.989±(0.036) | 1.009±(0.042) | 0.971±(0.024)
I→S | 1.006±(0.030) | 0.992±(0.016) | 1.00 | 0.871±(0.030) | 0.930±(0.018) | 0.936±(0.016) | 0.973±(0.017)
I→B | 0.987±(0.022) | 1.009±(0.010) | 1.00 | 0.986±(0.028) | 1.011±(0.028) | 1.011±(0.028) | 0.994±(0.018)
S→C | 1.022±(0.037) | 0.982±(0.035) | 1.00 | 0.759±(0.033) | 1.172±(0.043) | 1.201±(0.038) | 0.938±(0.036)
S→I | 0.924±(0.049) | 0.998±(0.030) | 1.00 | 0.831±(0.047) | 3.868±(4.231) | 1.227±(0.039) | 0.947±(0.028)
S→B | 0.898±(0.072) | 1.003±(0.044) | 1.00 | 0.821±(0.053) | 1.240±(0.039) | 1.248±(0.041) | 0.945±(0.021)
B→C | 1.010±(0.014) | 0.956±(0.017) | 1.00 | 0.777±(0.031) | 1.028±(0.033) | 1.032±(0.031) | 0.980±(0.019)
B→I | 1.012±(0.010) | 1.004±(0.007) | 1.00 | 0.966±(0.009) | 2.785±(3.803) | 0.981±(0.018) | 1.000±(0.004)
B→S | 1.009±(0.018) | 0.988±(0.010) | 1.00 | 0.850±(0.035) | 0.930±(0.022) | 0.934±(0.024) | 0.983±(0.013)

Finally, we considered a novel domain adaptation task (Tommasi et al., 2014) of paramount importance in the computer vision community. The domains correspond to 4 well-known collections of images: bing, caltech256, sun and imagenet. These data sets have been standardized so that they all share the same feature representation and labeling function (Tommasi et al., 2014). We sampled 800 labeled points from the source distribution and 800 unlabeled points from the target distribution, as well as 50 labeled target points to be used for the validation of $r$. The results of testing on 1000 points from the target domain are presented in Table 1.3, where, again, the results were normalized in such a way that the performance of training on the source data is always 1.

After analyzing the results of this section, we can conclude that the GDM algorithm consistently outperforms DM and achieves a performance similar to or better than that of all the other common adaptation algorithms. It is worth noticing that, in some cases, other algorithms perform even worse than training on the unweighted sample. This deficiency of the KLIEP algorithm had already been pointed out by Sugiyama et al. (2007), but here we observe that this problem can also affect the KMM algorithm. Finally, let us point out that even though the FE algorithm also achieved performances similar to GDM on the sentiment and image adaptation tasks, its performance was far from optimal on the kin-8xy task. Since there is a lack of theoretical understanding of this algorithm, it is hard to characterize the scenarios where FE would perform better than GDM.

1.9 Conclusion

We presented a new, theoretically well-founded domain adaptation algorithm seeking to minimize a less conservative quantity than the DM algorithm. We presented an SDP solution for the particular case of the $L_2$ loss, which can be solved in polynomial time. Our empirical results show that our new algorithm always performs better than or on par with the otherwise state-of-the-art DM algorithm. We also provided tight generalization bounds for the domain adaptation problem based on the $\mathcal{Y}$-discrepancy. As pointed out in Section 1.4, an algorithm minimizing the $\mathcal{Y}$-discrepancy would benefit from the best possible guarantees. However, the lack of labeled data from the target distribution makes such an algorithm not viable. In future research, we would like to analyze a richer scenario where the learner is allowed to ask for a limited number of labels from the target distribution. This setup, which is related to active learning, seems in fact to be the closest one to real-life applications and has started to receive attention from the research community (Berlind and Urner, 2015). We believe that the discrepancy measure will again play a central role in the analysis of this scenario.


Chapter 2

Drifting

We extend the results of the previous chapter to the more challenging task of learning with drifting distributions. We prove learning bounds based on the Rademacher complexity of the hypothesis set and the $\mathcal{Y}$-discrepancy of distributions, both for a natural extension of the standard PAC learning scenario and for a tracking scenario that will be described later. Our bounds are always tighter and in some cases substantially improve upon previous ones based on the $L_1$ distance. We also present a generalization of the standard on-line to batch conversion to the drifting scenario in terms of the discrepancy and arbitrary convex combinations of hypotheses. We introduce a new algorithm exploiting these learning guarantees, which we show can be formulated as a simple QP. The chapter concludes with an extensive empirical evaluation of our proposed algorithm.

2.1 Introduction

In Chapter 1 we presented theory and algorithms for solving the problem of domain adaptation. These results are crucial, since adaptation is constantly encountered in fields such as natural language processing and computer vision. Domain adaptation, however, deals only with the problem of training on a fixed source distribution and testing on an equally fixed target distribution. Therefore, there are common learning tasks that do not fit into this framework. For instance, in spam detection, political sentiment analysis, financial market prediction under mildly fluctuating economic conditions, or news stories, the learning environment is not stationary and there is a continuous drift of its parameters over time. This means in particular that the training distribution is in fact not fixed.

There is a large body of literature devoted to the study of related problems, both in the on-line and in the batch learning scenarios. In the on-line scenario, the target function is typically assumed to be fixed but no distributional assumption is made; thus, input points may be chosen adversarially (Cesa-Bianchi and Lugosi, 2006). Variants of this model where the target is allowed to change a fixed number of times have also been studied (Cesa-Bianchi and Lugosi, 2006; Herbster and Warmuth, 1998, 2001; Cavallanti et al., 2007). In the batch scenario, the case of a fixed input distribution with a drifting target was originally studied by Helmbold and Long (1994). A more general scenario was introduced by Bartlett (1992), where the joint distribution over the inputs and labels could drift over time under the assumption that the $L_1$ distance between the distributions at two consecutive time steps is bounded by $\Delta$. Both generalization bounds and lower bounds have been given for this scenario (Long, 1999; Barve and Long, 1997). In particular, Long (1999) showed that if the $L_1$ distance between two consecutive distributions is at most $\Delta$, then a generalization error of $O((d\Delta)^{1/3})$ is achievable, and Barve and Long (1997) proved this bound to be tight. Further improvements were presented by Freund and Mansour (1997) under the assumption of a constant rate of change for the drift. Other settings allowing arbitrary but infrequent changes of the target have also been studied (Bartlett et al., 2000). An intermediate model of drift based on a near relationship was also recently introduced and analyzed by Crammer et al. (2010), where consecutive distributions may change arbitrarily, modulo the restriction that the region of disagreement between nearby functions be assigned only limited distribution mass at any time.

This chapter deals with the analysis of learning in the presence of drifting distributions in the batch setting. We consider both the general drift model introduced by Bartlett (1992) and a related drifting PAC model that we will later describe. We present new generalization bounds for both models (Sections 2.3 and 2.4). Unlike the $L_1$ distance used by previous authors to measure the difference between distributions in the drifting scenario, our bounds are based on the $\mathcal{Y}$-discrepancy between distributions. As shown in Chapter 1, the $\mathcal{Y}$-discrepancy is a finer measure than the $L_1$ distance. Furthermore, it can be estimated from finite samples, unlike the $L_1$ distance (see, for example, the lower bounds on the sample complexity of testing closeness given by Valiant (2011)).

The learning bounds we present in Sections 2.3 and 2.4 are tighter than previous bounds, both because they are given in terms of the discrepancy distance and because they are expressed in terms of the Rademacher complexity instead of the VC-dimension. Additionally, our proofs are often simpler and more concise. We also present a generalization of the standard on-line to batch conversion to the scenario of drifting distributions in terms of the discrepancy measure (Section 2.5). Our guarantees hold for convex combinations of the hypotheses generated by an on-line learning algorithm. These bounds lead to the definition of a natural meta-algorithm, which consists of selecting the convex combination of weights so as to minimize the discrepancy-based learning bound (Section 2.6). We show that this optimization problem can be formulated as a simple QP and report the results of several experiments demonstrating its benefits. Finally, we discuss the practicality of our algorithm in some natural scenarios.


2.2 Preliminaries

In this section, we introduce some preliminary notation and key definitions that will be used throughout the chapter. In addition, we describe the learning scenarios that we will consider.

Let X denote the input space, Y the output space, and H a hypothesis set. We consider a loss function L : Y × Y → R+ bounded by some constant M > 0. For any two functions h, h′ : X → Y and any distribution D over X × Y, we keep the notation of the previous chapter and denote by L_D(h) the expected loss of h and by L_D(h, h′) the expected loss of h with respect to h′:

    L_D(h) = E_{(x,y)∼D}[L(h(x), y)]  and  L_D(h, h′) = E_{x∼D_X}[L(h(x), h′(x))],   (2.1)

where D_X is the marginal distribution over X derived from D. We adopt the standard definition of the empirical Rademacher complexity, yet we adapt the definition of Rademacher complexity to our drifting scenario. This definition is related to the notion of sequential Rademacher complexity of Rakhlin et al. (2010).

Definition 9. Let G be a family of functions mapping from a set Z to R and S = (z_1, . . . , z_T) a fixed sample of size T with elements in Z. The empirical Rademacher complexity of G for the sample S is defined by:

    R_S(G) = E_σ[ sup_{g∈G} (1/T) ∑_{t=1}^T σ_t g(z_t) ],   (2.2)

where σ = (σ_1, . . . , σ_T)^⊤, with the σ_t's independent uniform random variables taking values in {−1, +1}. The Rademacher complexity of G is the expectation of R_S(G) over all samples S = (z_1, . . . , z_T) of size T drawn according to the product distribution D = ⊗_{t=1}^T D_t:

    R_T(G) = E_{S∼D}[R_S(G)].   (2.3)

Note that this coincides with the standard Rademacher complexity when the distributions D_t, t ∈ [1, T], all coincide.
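For intuition, the quantity (2.2) can be approximated numerically by sampling Rademacher vectors when G is a small finite class. The following sketch (in Python with numpy) is purely illustrative and not part of the analysis; the matrix of function values it takes as input is a hypothetical stand-in for G.

    import numpy as np

    def empirical_rademacher(values, n_draws=10000, seed=0):
        """Monte Carlo estimate of R_S(G) in (2.2).

        `values` is a (num_functions, T) matrix whose rows list
        (g(z_1), ..., g(z_T)) for each g in a finite class G.
        """
        rng = np.random.default_rng(seed)
        _, T = values.shape
        total = 0.0
        for _ in range(n_draws):
            sigma = rng.choice([-1.0, 1.0], size=T)  # Rademacher variables
            total += np.max(values @ sigma) / T      # sup_g (1/T) sum_t sigma_t g(z_t)
        return total / n_draws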

Similar to domain adaptation, a key question for the analysis of learning with a drifting scenario is a measure of the difference between two distributions D and D′. The distance used by previous authors is the L1 distance. However, as previously discussed, the L1 distance is not helpful in this context since it can be large even in some rather favorable situations. In view of this, we instead use the Y-discrepancy (Definition 5 in the previous chapter) to measure the distance between two consecutive distributions.

We will present our learning guarantees in terms of the Y-discrepancy disc_Y. That is the most general definition, since guarantees in terms of the discrepancy disc can be straightforwardly derived from them. The advantage of the latter bounds is the fact that the discrepancy can in that case be estimated from finite unlabeled samples.

We will consider two different scenarios for the analysis of learning with drifting distributions: the drifting PAC scenario and the drifting tracking scenario.

The drifting PAC scenario is a natural extension of the PAC scenario, where the objective is to select a hypothesis h out of a hypothesis set H with a small expected loss according to the distribution D_{T+1}, after receiving a sample of T ≥ 1 instances drawn from the product distribution ⊗_{t=1}^T D_t. Thus, the focus in this scenario is the performance of the hypothesis h with respect to the environment distribution after receiving the training sample.

The drifting tracking scenario we consider is based on the scenario originally introduced by Bartlett (1992) for the zero-one loss and is used to measure the performance of an algorithm A (as opposed to that of any hypothesis h). In that learning model, the performance of an algorithm is determined based on its average predictions at each time for a sequence of distributions. We will generalize its definition by using the notion of discrepancy and extending it to other loss functions. The following definitions are the key concepts defining this model.

Definition 10. For any sample S = ((x_t, y_t))_{t=1}^T of size T, we denote by h_{T−1} ∈ H the hypothesis returned by an algorithm A after receiving the first T − 1 examples, and by M_T its loss or mistake on x_T: M_T = L(h_{T−1}(x_T), y_T). For a product distribution D = ⊗_{t=1}^T D_t on (X × Y)^T, we denote by M_T(D) the expected mistake of A:

    M_T(D) = E_{S∼D}[M_T] = E_{S∼D}[L(h_{T−1}(x_T), y_T)].

Definition 11. Let ∆ > 0 and let M_T be the supremum of M_T(D) over all distribution sequences D = (D_t) with disc_Y(D_t, D_{t+1}) < ∆. Algorithm A is said to (∆, ε)-track H if there exists t_0 such that for T > t_0 we have M_T < inf_{h∈H} L_{D_T}(h) + ε.

As suggested by its name, the focus in this scenario is tracking the best hypothesis at every time. An analysis of the tracking scenario with the L1 distance used to measure the divergence of distributions instead of the discrepancy was carried out by Long (1999) and Barve and Long (1997), including both upper and lower bounds for M_T in terms of ∆. Their analysis makes use of an algorithm very similar to empirical risk minimization, which we will also use in our theoretical analysis of both scenarios.

2.3 Drifting PAC Scenario

In this section, we present guarantees for the drifting PAC scenario in terms of the discrepancies of D_t and D_{T+1}, t ∈ [1, T], and the Rademacher complexity of the hypothesis set.


Let us emphasize that learning bounds in the drifting scenario should of course not be expected to converge to zero as a function of the sample size, but to depend instead on the divergence between distributions.

Theorem 5. Assume that the loss function L is bounded by M. Let D_1, . . . , D_{T+1} be a sequence of distributions and let H_L = {(x, y) 7→ L(h(x), y) : h ∈ H}. Then, for any δ > 0, with probability at least 1 − δ, the following holds for all h ∈ H:

    L_{D_{T+1}}(h) ≤ (1/T) ∑_{t=1}^T L(h(x_t), y_t) + 2R_T(H_L) + (1/T) ∑_{t=1}^T disc_Y(D_t, D_{T+1}) + M √(log(1/δ)/(2T)).

Proof. We denote by D the product distribution ⊗_{t=1}^T D_t. Let Φ be the function defined over any sample S = ((x_1, y_1), . . . , (x_T, y_T)) ∈ (X × Y)^T by

    Φ(S) = sup_{h∈H} [ L_{D_{T+1}}(h) − (1/T) ∑_{t=1}^T L(h(x_t), y_t) ].

Let S and S′ be two samples differing by one labeled point, say (x_t, y_t) in S and (x′_t, y′_t) in S′. Then:

    Φ(S′) − Φ(S) ≤ sup_{h∈H} (1/T) [ L(h(x′_t), y′_t) − L(h(x_t), y_t) ] ≤ M/T.

Thus, by McDiarmid's inequality, the following holds:¹

    P_{S∼D}[ Φ(S) − E_{S∼D}[Φ(S)] > ε ] ≤ exp(−2Tε²/M²).

¹Note that McDiarmid's inequality does not require the points to be drawn according to the same distribution, but only that they be drawn independently.


We now bound E_{S∼D}[Φ(S)] by first rewriting it as follows:

    E[ sup_{h∈H} L_{D_{T+1}}(h) − (1/T) ∑_{t=1}^T L_{D_t}(h) + (1/T) ∑_{t=1}^T L_{D_t}(h) − (1/T) ∑_{t=1}^T L(h(x_t), y_t) ]
      ≤ E[ sup_{h∈H} ( L_{D_{T+1}}(h) − (1/T) ∑_{t=1}^T L_{D_t}(h) ) ] + E[ sup_{h∈H} ( (1/T) ∑_{t=1}^T L_{D_t}(h) − (1/T) ∑_{t=1}^T L(h(x_t), y_t) ) ]
      ≤ E[ (1/T) ∑_{t=1}^T sup_{h∈H} ( L_{D_{T+1}}(h) − L_{D_t}(h) ) + sup_{h∈H} (1/T) ∑_{t=1}^T ( L_{D_t}(h) − L(h(x_t), y_t) ) ]
      ≤ (1/T) ∑_{t=1}^T disc_Y(D_t, D_{T+1}) + E[ sup_{h∈H} (1/T) ∑_{t=1}^T ( L_{D_t}(h) − L(h(x_t), y_t) ) ].

It is not hard to see, using a symmetrization argument as in the non-sequential case, that the second term can be bounded by 2R_T(H_L).

Observe that the bound of Theorem 5 is tight as a function of the divergence measure (discrepancy) we are using. Consider for example the case where D_1 = · · · = D_{T+1}; then a standard Rademacher complexity generalization bound holds for all h ∈ H:

    L_{D_{T+1}}(h) ≤ (1/T) ∑_{t=1}^T L(h(x_t), y_t) + 2R_T(H_L) + O(1/√T).

Now, our generalization bound for L_{D_{T+1}}(h) includes only the additive term (1/T) ∑_{t=1}^T disc_Y(D_t, D_{T+1}); but, by definition of the discrepancy, for any ε > 0 there exists h ∈ H such that the inequality |L_{D_{T+1}}(h) − L_{D_t}(h)| ≥ disc_Y(D_t, D_{T+1}) − ε holds.

Next, we present PAC learning bounds for empirical risk minimization. Let h*_T be a best-in-class hypothesis in H, that is, one with the best expected loss. By a similar reasoning as in Theorem 5, we can show that with probability 1 − δ/2 we have

    (1/T) ∑_{t=1}^T L(h*_T(x_t), y_t) ≤ L_{D_{T+1}}(h*_T) + 2R_T(H_L) + (1/T) ∑_{t=1}^T disc_Y(D_t, D_{T+1}) + M √(log(2/δ)/(2T)).

Let h_T be a hypothesis returned by empirical risk minimization (ERM). The last inequality, along with the bound of Theorem 5 and the union bound, implies that with probability 1 − δ the following holds:

    L_{D_{T+1}}(h_T) − L_{D_{T+1}}(h*_T) ≤ 4R_T(H_L) + (2/T) ∑_{t=1}^T disc_Y(D_t, D_{T+1}) + 2M √(log(2/δ)/(2T)),   (2.4)


where we used the fact that ∑_{t=1}^T L(h_T(x_t), y_t) − ∑_{t=1}^T L(h*_T(x_t), y_t) ≤ 0 since h_T is an empirical minimizer.

This learning bound indicates a trade-off: larger values of the sample size T guarantee smaller first and third terms; however, as T increases, the average discrepancy term is likely to grow as well, thereby making learning increasingly challenging. This suggests an algorithm similar to empirical risk minimization, but limited to the last m examples, with m < T, instead of the whole sample, as sketched below. This algorithm was previously used by Barve and Long (1997) for the study of the tracking scenario. We will use it here to prove several theoretical guarantees in the PAC learning model.
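The meta-algorithm itself is a one-liner; the following Python sketch assumes a hypothetical fit_erm subroutine standing in for an empirical risk minimizer over H:

    def window_erm(sample, m, fit_erm):
        """ERM restricted to the most recent m examples.

        `sample` is the sequence ((x_1, y_1), ..., (x_T, y_T)) and `fit_erm`
        is a hypothetical subroutine returning an empirical risk minimizer
        over H for whatever data it receives.
        """
        assert 0 < m < len(sample)
        return fit_erm(sample[-m:])  # discard the first T - m examples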

Proposition 8. Let ∆ ≥ 0. Assume that (D_t)_{t≥0} is a sequence of distributions such that disc_Y(D_t, D_{t+1}) ≤ ∆ for all t ≥ 0. Fix m ≥ 1 and let h_T denote the hypothesis returned by the algorithm A that minimizes ∑_{t=T−m}^T L(h(x_t), y_t) after receiving T > m examples. Then, for any δ > 0, with probability at least 1 − δ, the following learning bound holds:

    L_{D_{T+1}}(h_T) − inf_{h∈H} L_{D_{T+1}}(h) ≤ 4R_m(H_L) + (m + 1)∆ + 2M √(log(2/δ)/(2m)).   (2.5)

Proof. The proof is straightforward. Notice that the algorithm discards the first T − m examples and considers exactly m instances. Thus, as in inequality (2.4), we have:

    L_{D_{T+1}}(h_T) − L_{D_{T+1}}(h*_T) ≤ 4R_m(H_L) + (2/m) ∑_{t=T−m}^T disc(D_t, D_{T+1}) + 2M √(log(2/δ)/(2m)).

Finally, we can use the triangle inequality to bound disc(D_t, D_{T+1}) by (T + 1 − t)∆. Thus, the sum of the discrepancy terms can be bounded by (m + 1)∆.

To obtain the best learning guarantee, we can select m to minimize the bound just presented. This requires expressing the Rademacher complexity in terms of m. The following is the result obtained when using a VC-dimension upper bound for the Rademacher complexity.

Corollary 5. Fix ∆ > 0. Let H be a hypothesis set with VC-dimension d such that, for all m ≥ 1, R_m(H_L) ≤ (C/4) √(d/m) for some constant C > 0. Assume that (D_t)_{t>0} is a sequence of distributions such that disc_Y(D_t, D_{t+1}) ≤ ∆ for all t ≥ 0. Then, there exists an algorithm A such that, for any δ > 0, the hypothesis h_T it returns after receiving T > [(C + C′)/2]^{2/3} (d/∆²)^{1/3} instances, where C′ = 2M √(log(2/δ)/(2d)), satisfies the following with probability at least 1 − δ:

    L_{D_{T+1}}(h_T) − inf_{h∈H} L_{D_{T+1}}(h) ≤ 3 [(C + C′)/2]^{2/3} (d∆)^{1/3} + ∆.   (2.6)


Proof. Fix δ > 0. Replacing R_m(H_L) by the upper bound (C/4) √(d/m) in (2.5) yields

    L_{D_{T+1}}(h_T) − inf_{h∈H} L_{D_{T+1}}(h) ≤ (C + C′) √(d/m) + (m + 1)∆.   (2.7)

Choosing m = [(C + C′)/2]^{2/3} (d/∆²)^{1/3} to minimize the right-hand side gives exactly (2.6).

When H has finite VC-dimension d, it is known that R_m(H_L) can be bounded by C √(d/m) for some constant C > 0, by using a chaining argument (Dudley, 1984; Pollard, 1984; Talagrand, 2005). Thus, the assumption of the previous corollary holds for many loss functions L when H has finite VC-dimension.
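The bound-minimizing window size from the proof of Corollary 5 is easy to compute in practice; the following tiny helper is only illustrative (its arguments are the constants C, C′, d, ∆ named above):

    import math

    def optimal_window(C, Cp, d, Delta):
        """Window size m = ((C + C')/2)^(2/3) (d / Delta^2)^(1/3) minimizing
        the right-hand side of (2.7), and the resulting bound value of (2.6)."""
        scale = ((C + Cp) / 2.0) ** (2.0 / 3.0)
        m = scale * (d / Delta ** 2) ** (1.0 / 3.0)
        bound = 3.0 * scale * (d * Delta) ** (1.0 / 3.0) + Delta
        return math.ceil(m), bound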

2.4 Drifting Tracking Scenario

In this section, we present a simpler proof of the bounds given by Long (1999) for the agnostic case, demonstrating that using the discrepancy as a measure of the divergence between distributions leads to tighter and more informative bounds than using the L1 distance.

Proposition 9. Let ∆ > 0 and let (D_t)_{t≥0} be a sequence of distributions such that disc_Y(D_t, D_{t+1}) ≤ ∆ for all t ≥ 0. Let m > 1 and let h_T be as in Proposition 8. Then,

    E_D[M_{T+1}] − inf_h L_{D_{T+1}}(h) ≤ 4R_m(H_L) + 2M √(π/m) + (m + 1)∆.   (2.8)

Proof. Let D = ⊗_{t=1}^{T+1} D_t and D′ = ⊗_{t=1}^T D_t. By Fubini's theorem we can write:

    E_D[M_{T+1}] − inf_h L_{D_{T+1}}(h) = E_{D′}[ L_{D_{T+1}}(h_T) − inf_h L_{D_{T+1}}(h) ].   (2.9)

Let φ^{−1}(δ) = 4R_m(H_L) + (m + 1)∆ + 2M √(log(2/δ)/(2m)). By inequality (2.5), for β > 4R_m(H_L) + (m + 1)∆, the following holds:

    P_{D′}[ L_{D_{T+1}}(h_T) − inf_h L_{D_{T+1}}(h) > β ] < φ(β).

Thus, the expectation on the right-hand side of (2.9) can be bounded as follows:

    E_{D′}[ L_{D_{T+1}}(h_T) − inf_h L_{D_{T+1}}(h) ] ≤ 4R_m(H_L) + (m + 1)∆ + ∫_{4R_m(H_L)+(m+1)∆}^∞ φ(β) dβ.

Using the change of variable δ = φ(β), we see that the last integral is equal to 2M ∫_0^2 √(log(2/δ)/(2m)) dδ ≤ 2M √(π/m), which concludes the proof.


The following corollary can be shown using the same proof as that of Corollary 5.

Corollary 6. Fix ∆ > 0. Let H be a hypothesis set with VC-dimension d such that, for all m > 1, 4R_m(H_L) ≤ C √(d/m). Let (D_t)_{t>0} be a sequence of distributions over X × Y such that disc_Y(D_t, D_{t+1}) ≤ ∆. Let C′ = 2M √(π/d) and K = 3 [(C + C′)/2]^{2/3}. Then, for T > [(C + C′)/2]^{2/3} (d/∆²)^{1/3}, the following inequality holds:

    E_D[M_{T+1}] − inf_h L_{D_{T+1}}(h) < K (d∆)^{1/3} + ∆.

In terms of Definition 11, this corollary shows that algorithm A (∆, K(d∆)^{1/3} + ∆)-tracks H. This result is similar to a result of Long (1999) which states that, given ε > 0, if ∆ = O(dε³) then A (∆, ε)-tracks H. However, in (Long, 1999), ∆ is an upper bound on the L1 distance and not the discrepancy, making our bound tighter. It is worth noting that the results of Long (1999) guarantee that the dependency on ∆ cannot be improved, since on the example considered in their lower bound the L1 distance and the discrepancy agree.

2.5 On-line to Batch Conversion

In this section, we present learning guarantees for drifting distributions in terms of the regret of an on-line learning algorithm A. The algorithm processes a sample (x_t)_{t≥1} sequentially by receiving a sample point x_t ∈ X, generating a hypothesis h_t, and incurring a loss L(h_t(x_t), y_t), with y_t ∈ Y. We denote by R_T the regret of algorithm A after processing T ≥ 1 sample points:

    R_T = ∑_{t=1}^T L(h_t(x_t), y_t) − inf_{h∈H} ∑_{t=1}^T L(h(x_t), y_t).

The standard setting of on-line learning assumes an adversarial scenario with no distributional assumption. Nevertheless, when the data is generated according to some distribution, the hypotheses returned by an on-line algorithm A can be combined to define a hypothesis with strong learning guarantees in the distributional setting when the regret R_T is in O(√T) (which is attainable by several regret minimization algorithms) (Littlestone, 1989; Cesa-Bianchi et al., 2001). Here, we extend these results to the drifting scenario and the case of a convex combination of the hypotheses generated by the algorithm. The following lemma will be needed for the proof of our main result.

Lemma 4. Let S = ((x_t, y_t))_{t=1}^T be a sample drawn from the distribution D = ⊗_{t=1}^T D_t and let (h_t)_{t=1}^T be the sequence of hypotheses returned by an on-line algorithm sequentially processing S. Let w = (w_1, . . . , w_T)^⊤ be a vector of non-negative weights verifying ∑_{t=1}^T w_t = 1. If the loss function L is bounded by M then, for any δ > 0, with probability at least 1 − δ, each of the following inequalities holds:

    ∑_{t=1}^T w_t L_{D_{T+1}}(h_t) ≤ ∑_{t=1}^T w_t L(h_t(x_t), y_t) + ∆(w, T) + M‖w‖_2 √(2 log(1/δ))

    ∑_{t=1}^T w_t L(h_t(x_t), y_t) ≤ ∑_{t=1}^T w_t L_{D_{T+1}}(h_t) + ∆(w, T) + M‖w‖_2 √(2 log(1/δ)),

where ∆(w, T) denotes the average discrepancy ∑_{t=1}^T w_t disc_Y(D_t, D_{T+1}).

Proof. Consider the random process Z_t = w_t L(h_t(x_t), y_t) − w_t L_{D_t}(h_t), and let F_t denote the filtration associated to the sample process. We have |Z_t| ≤ M w_t and

    E_D[Z_t | F_{t−1}] = E_D[w_t L(h_t(x_t), y_t) | F_{t−1}] − E_{D_t}[w_t L(h_t(x_t), y_t)] = 0.

The second equality holds because h_t is determined at time t − 1 and (x_t, y_t) is independent of F_{t−1}. Thus, by the Azuma-Hoeffding inequality, for any δ > 0, with probability at least 1 − δ the following holds:

    ∑_{t=1}^T w_t L_{D_t}(h_t) ≤ ∑_{t=1}^T w_t L(h_t(x_t), y_t) + M‖w‖_2 √(2 log(1/δ)).   (2.10)

By definition of the discrepancy, the following inequality holds for any t ∈ [1, T]:

    L_{D_{T+1}}(h_t) ≤ L_{D_t}(h_t) + disc_Y(D_t, D_{T+1}).

Summing up these inequalities, each multiplied by w_t, and using (2.10) to bound ∑_{t=1}^T w_t L_{D_t}(h_t) proves the first statement. The second statement can be proven in a similar way.

The following theorem is the main result of this section.

Theorem 6. Assume that L is bounded by M and convex with respect to its first argument. Let h_1, . . . , h_T be the hypotheses returned by A when sequentially processing ((x_t, y_t))_{t=1}^T, and let h̄ be the hypothesis defined by h̄ = ∑_{t=1}^T w_t h_t, where w_1, . . . , w_T are arbitrary non-negative weights verifying ∑_{t=1}^T w_t = 1. Then, for any δ > 0, with probability at least 1 − δ, h̄ satisfies each of the following learning guarantees:

    L_{D_{T+1}}(h̄) ≤ ∑_{t=1}^T w_t L(h_t(x_t), y_t) + ∆(w, T) + M‖w‖_2 √(2 log(1/δ))

    L_{D_{T+1}}(h̄) ≤ inf_{h∈H} L_{D_{T+1}}(h) + R_T/T + 2∆(w, T) + M‖w − u_0‖_1 + 2M‖w‖_2 √(2 log(2/δ)),   (2.11)

where w = (w_1, . . . , w_T)^⊤, ∆(w, T) = ∑_{t=1}^T w_t disc_Y(D_t, D_{T+1}), and u_0 ∈ R^T is the vector with all its components equal to 1/T.

Observe that when the weights are all equal to 1/T, the result we obtain is similar to the learning guarantee obtained in Theorem 5 when the Rademacher complexity of H_L is in O(1/√T). Also, if the learning scenario is i.i.d., then the term involving the discrepancy ∆ vanishes and it can be seen straightforwardly that, to minimize the right-hand side of (2.11), we need to set w_t = 1/T, which results in the known i.i.d. guarantees for on-line to batch conversion (Littlestone, 1989; Cesa-Bianchi et al., 2001).

Proof. Since L is convex with respect to its first argument, by Jensen's inequality, we have L_{D_{T+1}}(∑_{t=1}^T w_t h_t) ≤ ∑_{t=1}^T w_t L_{D_{T+1}}(h_t). Thus, by Lemma 4, for any δ > 0, the following holds with probability at least 1 − δ:

    L_{D_{T+1}}(∑_{t=1}^T w_t h_t) ≤ ∑_{t=1}^T w_t L(h_t(x_t), y_t) + ∆(w, T) + M‖w‖_2 √(2 log(1/δ)).   (2.12)

This proves the first statement of the theorem. To prove the second claim, we bound the empirical error in terms of the regret. For any h* ∈ H, using inf_{h∈H} (1/T) ∑_{t=1}^T L(h(x_t), y_t) ≤ (1/T) ∑_{t=1}^T L(h*(x_t), y_t), we can write:

    ∑_{t=1}^T w_t L(h_t(x_t), y_t) − ∑_{t=1}^T w_t L(h*(x_t), y_t)
      = ∑_{t=1}^T (w_t − 1/T)[L(h_t(x_t), y_t) − L(h*(x_t), y_t)] + (1/T) ∑_{t=1}^T [L(h_t(x_t), y_t) − L(h*(x_t), y_t)]
      ≤ M‖w − u_0‖_1 + (1/T) ∑_{t=1}^T L(h_t(x_t), y_t) − inf_h (1/T) ∑_{t=1}^T L(h(x_t), y_t)
      ≤ M‖w − u_0‖_1 + R_T/T.

Now, by definition of the infimum, for any ε > 0, there exists h* ∈ H such that L_{D_{T+1}}(h*) ≤ inf_{h∈H} L_{D_{T+1}}(h) + ε. For that choice of h*, in view of (2.12), with probability at least 1 − δ/2, the following holds:

    L_{D_{T+1}}(h̄) ≤ ∑_{t=1}^T w_t L(h*(x_t), y_t) + M‖w − u_0‖_1 + R_T/T + ∆(w, T) + M‖w‖_2 √(2 log(2/δ)).

By the second statement of Lemma 4, for any δ > 0, with probability at least 1 − δ/2,

    ∑_{t=1}^T w_t L(h*(x_t), y_t) ≤ L_{D_{T+1}}(h*) + ∆(w, T) + M‖w‖_2 √(2 log(2/δ)).

Combining these last two inequalities, by the union bound, with probability at least 1 − δ the following holds, with B(w, δ) = M‖w − u_0‖_1 + R_T/T + 2M‖w‖_2 √(2 log(2/δ)):

    L_{D_{T+1}}(h̄) ≤ L_{D_{T+1}}(h*) + 2∆(w, T) + B(w, δ)
                   ≤ inf_{h∈H} L_{D_{T+1}}(h) + ε + 2∆(w, T) + B(w, δ).

The last inequality holds for all ε > 0, therefore also for ε = 0 by taking the limit.

The above inequality can be made to hold uniformly over w at the additional cost of a log log_2 ‖w − u_0‖_1^{−1} term.

Corollary 7. Under the conditions of the previous theorem, the following inequality is satisfied uniformly for all w such that ‖w − u_0‖_1 ≤ 1:

    ∑_{t=1}^T w_t L_{D_{T+1}}(h_t) ≤ ∑_{t=1}^T w_t L(h_t(x_t), y_t) + ∆(w, T) + 4M(‖w‖_2 + ‖w − u_0‖_1)
        + M(‖w‖_2 + ‖w − u_0‖_1)(√(2 log(3/δ)) + √(log log_2 (2‖w − u_0‖_1^{−1}))).

Proof. Let (w(k))_{k=0}^∞ be any sequence of positive weight vectors and define ε_k = M‖w(k)‖_2 (ε + 2√(log k)) + ∆(w(k), T). If h_k = ∑_{t=1}^T w(k)_t L(h_t(x_t), y_t), then by Lemma 4 and the union bound we have:

    P( ∃k s.t. ∑_{t=1}^T w(k)_t L_{D_{T+1}}(h_t) − ∑_{t=1}^T w(k)_t L(h_t(x_t), y_t) ≥ ε_k )
      ≤ ∑_{k=0}^∞ exp( −(ε_k − ∆(w(k), T))² / (2M²‖w(k)‖_2²) ) = ∑_{k=0}^∞ e^{−(ε + 2√(log k))²/2} ≤ e^{−ε²/2} ∑_{k=0}^∞ 1/k² ≤ 3e^{−ε²/2}.


Let us select w(k) in such a way that ‖w(k) − u_0‖_1 ≤ 1/2^k, and let w satisfy ‖w − u_0‖_1 ≤ 1. There exists k such that

    ‖w(k) − u_0‖_1 ≤ ‖w − u_0‖_1 ≤ ‖w(k − 1) − u_0‖_1 = 2‖w(k) − u_0‖_1.   (2.13)

Therefore, we have

    ∑_{t=1}^T w_t (L_{D_{T+1}}(h_t) − L(h_t(x_t), y_t))   (2.14)
      ≤ ∑_{t=1}^T w(k)_t (L_{D_{T+1}}(h_t) − L(h_t(x_t), y_t)) + M‖w − w(k)‖_1
      ≤ ∑_{t=1}^T w(k)_t (L_{D_{T+1}}(h_t) − L(h_t(x_t), y_t)) + 2M‖w − u_0‖_1,

where the last inequality follows from the triangle inequality and the fact that ‖w(k) − u_0‖_1 ≤ ‖w − u_0‖_1. Similarly, ∆(w, T) ≤ ∆(w(k), T) + 2M‖w − u_0‖_1 and

    ‖w‖_2 ≤ ‖w(k)‖_2 + ‖w(k) − u_0‖_2 ≤ ‖w(k)‖_2 + ‖w(k) − u_0‖_1.   (2.15)

Finally, by definition of w(k) and inequality (2.13), we have

    k = log_2 ‖w(k) − u_0‖_1^{−1} ≤ log_2 (2/‖w − u_0‖_1).   (2.16)

Let ε_w = M(‖w‖_2 + ‖w − u_0‖_1)(ε + 2√(log log_2 (2‖w − u_0‖_1^{−1}))) + 4M(‖w‖_2 + ‖w − u_0‖_1) + ∆(w, T). In view of (2.16) and (2.14), we see that:

    P( sup_{w : ‖w−u_0‖_1 ≤ 1} ∑_{t=1}^T w_t L_{D_{T+1}}(h_t) − ∑_{t=1}^T w_t L(h_t(x_t), y_t) − ε_w ≥ 0 )
      ≤ P( ∃k s.t. ∑_{t=1}^T w(k)_t L_{D_{T+1}}(h_t) − ∑_{t=1}^T w(k)_t L(h_t(x_t), y_t) − ε_k ≥ 0 ) ≤ 3e^{−ε²/2}.

Setting the right-hand side equal to δ and solving for ε, we see that with probability at least 1 − δ,

    ∑_{t=1}^T w_t L_{D_{T+1}}(h_t) ≤ ∑_{t=1}^T w_t L(h_t(x_t), y_t) + ∆(w, T) + 4M(‖w‖_2 + ‖w − u_0‖_1)
        + M(‖w‖_2 + ‖w − u_0‖_1)(√(2 log(3/δ)) + √(log log_2 (2‖w − u_0‖_1^{−1}))).

2.6 Algorithm

The results of the previous section suggest a natural algorithm based on the values of the discrepancy between distributions. Let (h_t)_{t=1}^T be the sequence of hypotheses generated by an on-line algorithm. Theorem 6 provides a learning guarantee for any convex combination of these hypotheses. The convex combination based on the weight vector w minimizing the bound of Theorem 6 benefits from the most favorable guarantee. This leads to an algorithm for determining w based on the following convex optimization problem:

    min_w  λ‖w‖_2² + ∑_{t=1}^T w_t ( disc_Y(D_t, D_{T+1}) + L(h_t(x_t), y_t) )   (2.17)
    subject to: ∑_{t=1}^T w_t = 1 and w_t ≥ 0 for all t ∈ [1, T],

where λ ≥ 0 is a regularization parameter. This is a standard QP problem that can be efficiently solved using a variety of techniques and available software.
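For illustration, (2.17) can be solved directly with an off-the-shelf convex solver. The sketch below uses the cvxpy package; the arrays of discrepancy estimates and observed losses are assumed to be given, and their names are placeholders:

    import cvxpy as cp
    import numpy as np

    def optimal_weights(disc, losses, lam):
        """Solve the QP (2.17): minimize lam * ||w||_2^2 + sum_t w_t (d_t + l_t)
        over the probability simplex, where d_t and l_t are the discrepancy
        estimate and the observed loss of hypothesis h_t at round t."""
        c = np.asarray(disc) + np.asarray(losses)
        w = cp.Variable(len(c))
        problem = cp.Problem(
            cp.Minimize(lam * cp.sum_squares(w) + c @ w),
            [w >= 0, cp.sum(w) == 1],
        )
        problem.solve()
        return w.value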

In practice, the discrepancy values disc_Y(D_t, D_{T+1}) are not available since they require labeled samples. But, in the deterministic scenario where the labeling function f is in H, we have disc_Y(D_t, D_{T+1}) ≤ disc(D_t, D_{T+1}). Thus, the discrepancy values disc(D_t, D_{T+1}) can be used instead in our learning bounds and in the optimization (2.17). This also holds approximately when f is not in H but is close to some h ∈ H.

As shown in the previous chapter, given two (unlabeled) samples of size n from D_t and D_{T+1}, the discrepancy disc(D_t, D_{T+1}) can be estimated within O(1/√n) when R_n(H_L) = O(1/√n). In many realistic settings, for tasks such as spam filtering, the distribution D_t does not change within a day. This gives us the opportunity to collect an independent unlabeled sample of size n from each distribution D_t. If we choose n ≫ T, by the union bound, with high probability all of our estimated discrepancies will be within O(1/√T) of their exact counterparts disc(D_t, D_{T+1}).

Additionally, in many cases, the distribution D_t remains unchanged over longer periods (cycles) which may be known to us. This in fact typically holds for tasks such as spam filtering, political sentiment analysis, some financial market prediction problems, and other problems. For example, in the absence of any major political event such as a debate, a speech, or a prominent measure, we can expect the political sentiment to remain stable. In such scenarios, it should be even easier to collect an unlabeled sample from each distribution. More crucially, we then do not need to estimate the discrepancy for all t ∈ [1, T] but only once for each cycle.

Figure 2.1: Barplot of estimated discrepancies for the continuous drifting and alternating drifting scenarios.

2.7 Experiments

Here, we report the performance of our algorithm on both synthetic and real-world data sets for the task of regression. Throughout this section, we let H denote the set of linear hypotheses x 7→ v^⊤x. As our base on-line algorithm A we use the Widrow-Hoff algorithm (Widrow and Hoff, 1988), which minimizes the square loss using stochastic gradient descent. We set the learning rate of this algorithm to η_t = 0.001/√t, and the parameter λ of our algorithm is chosen through validation over the set {2^k | k = −20, . . . , 20}.

As discussed in the previous section, we consider scenarios where drifting occurs in cycles. In view of this, we compare our algorithm against the following natural baselines: the algorithm that averages all hypotheses returned by A, which we denote by avg, and the algorithm that only averages the hypotheses obtained over the last cycle (last).
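For concreteness, here is a minimal sketch of the base on-line algorithm A, the Widrow-Hoff (least-mean-squares) update with the learning rate η_t = 0.001/√t used above; the interface is illustrative:

    import numpy as np

    def widrow_hoff(X, y):
        """Process the stream (x_t, y_t) with stochastic gradient descent on
        the squared loss and return the hypotheses h_1, ..., h_T (one weight
        vector per round, each determined before seeing its own example)."""
        T, N = X.shape
        v = np.zeros(N)
        hypotheses = []
        for t in range(1, T + 1):
            hypotheses.append(v.copy())      # h_t predicts on x_t
            eta = 0.001 / np.sqrt(t)         # learning rate eta_t
            err = v @ X[t - 1] - y[t - 1]    # prediction error on (x_t, y_t)
            v = v - eta * err * X[t - 1]     # gradient step on (v.x - y)^2 / 2
        return hypotheses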

2.7.1 Synthetic data sets

We create 8 different data sets in the following way: we generate T = 9600 examples in R^50. These examples are divided into different cycles, each cycle having the same distribution. For each experiment, we select the cycle size to be k ∈ {100, 200, 400, 600}, and instances in cycle i have a Gaussian distribution with mean μ_i and covariance matrix 0.5 I, where I is the identity matrix. The labeling function for all examples was given by x 7→ (0.2 ∑_{j=1}^{50} x_j + 1)² − 1. To introduce drifting on the distributions, we let μ_1 = (0, . . . , 0)^⊤ and μ_i = μ_{i−1} + ε_i, where the drift ε_i is given by one of the following:

1. Continuous drifting. We let ε_i be a uniform random variable in [−.1, .1]^50.

2. Alternating drifting. We select ε_0 = (.1, . . . , .1)^⊤ and define ε_i = (−1)^i ε_0.

As suggested by their names, the continuous drifting scenario keeps changing the distribution at every cycle, whereas the alternating scenario maintains the same distribution for all even cycles and all odd cycles respectively. The discrepancies needed for our algorithm were estimated from 6000 unlabeled samples of each distribution pair.
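A minimal sketch of this data-generation process (assuming numpy; the function name and defaults are illustrative):

    import numpy as np

    def make_drifting_data(k, T=9600, dim=50, drift="continuous", seed=0):
        """Generate cycles of size k with Gaussian inputs of covariance 0.5*I
        and a mean that drifts across cycles as described above."""
        rng = np.random.default_rng(seed)
        mu = np.zeros(dim)                       # mu_1 = (0, ..., 0)
        eps0 = 0.1 * np.ones(dim)
        blocks = []
        for i in range(T // k):
            if i > 0:
                if drift == "continuous":
                    mu = mu + rng.uniform(-0.1, 0.1, size=dim)
                else:                            # alternating drift
                    mu = mu + ((-1) ** i) * eps0
            blocks.append(mu + np.sqrt(0.5) * rng.standard_normal((k, dim)))
        X = np.vstack(blocks)
        y = (0.2 * X.sum(axis=1) + 1) ** 2 - 1   # labeling function
        return X, y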

We replicated each experiment 50 times and the results are presented in Figure 2.2 and Figure 2.3, where we report the MSE of the final hypotheses output by each algorithm. The first thing to notice is that our proposed algorithm is never significantly worse than the two other algorithms. For the continuous drifting scenario, there is in fact no statistically significant difference between the performance of our algorithm and that of last. This can be explained by the fact that, in the continuous drifting scenario, the last cycle had the smallest discrepancy with respect to the testing distribution and, in most cases, our algorithm therefore considered only hypotheses from this cycle.

In the alternating scenario, however, we can see in Figure 2.1 that the discrepancy between the distribution of the last cycle and the testing distribution is large. Therefore, using only hypotheses from the last round should be detrimental to learning. This effect can in fact be seen in Figure 2.3, where our algorithm instead learns to use hypotheses from alternating cycles. In fact, as the size of the cycle increases, the gap in the performance of the algorithms increases too. This can be explained by the fact that a larger per-cycle sample allows the last algorithm to learn a better hypothesis for the wrong distribution, thus making its performance suffer.

2.7.2 Real-world data sets

We test our algorithm on a financial data set consisting of the following stock tickers: IBM, GOOG, MSFT, INTC and AAPL. Our goal is to predict the price of GOOG as a function of the other four prices. We consider one day of trading as a cycle and for each day we sample the price of these stocks every 2 minutes. This yields 195 points per cycle. These points were used to estimate the discrepancy disc between the cycles, which is required by our algorithm. The data consists of 20 days of trading beginning on February 9th 2015 and is available at http://cims.nyu.edu/˜munoz. We report the median performance of each algorithm on days 5, 10, 15 and 20. The results are shown in Table 2.1.

Figure 2.2: MSE of different algorithms for the continuous drifting scenario. Different plots represent different cycle sizes: k = 100 (top-left), k = 200 (bottom-left), k = 400 (top-right) and k = 600 (bottom-right).

Figure 2.3: MSE of different algorithms for the alternating drifting scenario. Different plots represent different cycle sizes: k = 100 (top-left), k = 200 (bottom-left), k = 400 (top-right) and k = 600 (bottom-right).

Notice that, albeit having similar performance to last and avg on the initial rounds, our algorithm seems to have a lot less variability in its results and, in general, to adapt much better to this difficult task.


Day    Weighted    Avg       Last
5      0.558       0.552     0.663
10     38.328      82.001    37.939
15     4.149       60.203    6.981
20     10.345      13.365    251.941

Table 2.1: Results of different algorithms on the financial data set. Bold results are significant at the 0.05 level.

2.8 Conclusion

We presented a theoretical analysis of the problem of learning with drifting distributions in the batch setting. Our learning guarantees improve upon previous ones based on the L1 distance, in some cases substantially, and our proofs are simpler and more concise. These bounds benefit from the notion of discrepancy, which seems to be the natural measure of the divergence between distributions in a drifting scenario. This work motivates a number of related studies, in particular a discrepancy-based analysis of the scenario introduced by Crammer et al. (2010) and further improvements of the algorithm we presented, in particular by exploiting the specific on-line learning algorithm used.


Part II

Auctions


Chapter 3

Learning in Auctions

Second-price auctions with reserve play a critical role in the revenue of modern search engine and popular online site companies, since the revenue of these companies often directly depends on the outcome of such auctions. The choice of the reserve price is the main mechanism through which the auction revenue can be influenced in these electronic markets. We cast the problem of selecting the reserve price to optimize revenue as a learning problem and present a full theoretical analysis dealing with the complex properties of the corresponding loss function. We further give novel algorithms for solving this problem and report the results of several experiments on both synthetic and real-world data demonstrating their effectiveness.

3.1 Introduction

Over the past few years, advertisement has gradually moved away from the traditional printed promotion to the more tailored and directed online publicity. The advantages of online advertisement are clear: since most modern search engine and popular online site companies such as Microsoft, Facebook, Google, eBay, or Amazon may collect information about the users' behavior, advertisers can better target the population sector their brand is intended for.

More recently, a new method for selling advertisements has gained momentum. Unlike the standard contracts between publishers and advertisers where some amount of impressions is required to be fulfilled by the publisher, an Ad Exchange works in a way similar to a financial exchange where advertisers bid and compete against each other for an ad slot. The winner then pays the publisher and his ad is displayed.

The design of such auctions and their properties are crucial since they generate a large fraction of the revenue of popular online sites. These questions have motivated extensive research on the topic of auctioning in the last decade or so, particularly in the theoretical computer science and economic theory communities. Much of this work has focused on the analysis of mechanism design, either to prove some useful property of an existing auctioning mechanism, to analyze its computational efficiency, or to search for an optimal revenue maximization truthful mechanism (see Muthukrishnan (2009) for a good discussion of key research problems related to Ad Exchange and references to a fast growing literature therein).

One particularly important problem is that of determining an auction mechanism that achieves optimal revenue (Muthukrishnan, 2009). In the ideal scenario where the valuations of the bidders are drawn i.i.d. from a given continuous distribution, this is known to be achievable for a single-item, one-shot auction (see for example (Myerson, 1981)). Extensions to different scenarios have been made and have produced a series of interesting results, including (Riley and Samuelson, 1981; Milgrom and Weber, 1982; Myerson, 1981; Nisan et al., 2007), all of them based on some assumptions such as buyers having i.i.d. valuations.

The results of these publications have set the basis for most Ad Exchanges in practice: the mechanism widely adopted for selling ad slots is that of a Vickrey auction (Vickrey, 1961), or second-price auction with reserve price (Easley and Kleinberg, 2010). In such auctions, the winning bidder (if any) pays the maximum of the second-place bid and the reserve price. The reserve price can be set by the publisher or automatically by the exchange. The popularity of these auctions relies on the fact that they are incentive-compatible, i.e., bidders bid exactly what they are willing to pay. It is clear that the revenue of the publisher depends greatly on how the reserve price is set: if set too low, the winner of the auction might end up paying only a small amount, even if his bid was high; on the other hand, if it is set too high, then bidders may not bid higher than the reserve price and the ad slot will not be sold.

We propose a learning approach to the problem of determining the reserve price to optimize revenue in such auctions. The general idea is to leverage the information gained from past auctions to predict a beneficial reserve price. Since every transaction on an Exchange is logged, it is natural to seek to exploit that data. This could be used to estimate the probability distribution of the bidders, which can then be used indirectly to come up with the optimal reserve price (Myerson, 1981; Ostrovsky and Schwarz, 2011). Instead, we will seek a discriminative method making use of the loss function related to the problem and taking advantage of existing user features.

Learning methods have been used in the past for the related problems of designing incentive-compatible auction mechanisms (Balcan et al., 2008; Blum et al., 2004), for algorithmic bidding (Langford et al., 2010; Amin et al., 2012), and even for predicting bid landscapes (Cui et al., 2011). Another closely related problem for which machine learning solutions have been proposed is that of revenue optimization for sponsored search ads and click-through rate predictions (Zhu et al., 2009; He et al., 2014; Devanur and Kakade, 2009). But, to our knowledge, no prior work has used historical data in combination with user features for the sole purpose of revenue optimization in this context. In fact, the only publications we are aware of that are directly related to our objective are (Ostrovsky and Schwarz, 2011) and (Cesa-Bianchi et al., 2013), which considers a more general case than (Ostrovsky and Schwarz, 2011).

The scenario studied by Cesa-Bianchi et al. (2013) is that of censored information, which motivates their use of a regret minimization algorithm to optimize the revenue of the seller. Our analysis assumes instead access to full information. We argue that this is a more realistic scenario since most companies do in fact have access to the full historical data. The learning scenario we consider is also more general since it includes the use of features, as is standard in supervised learning. Since user information is communicated to advertisers and bids are made based on that information, it is only natural to include user features in the formulation of the learning problem. A special case of our analysis coincides with the no-feature scenario considered by Cesa-Bianchi et al. (2013), assuming full information. But, our results further extend those of this paper even in that scenario. In particular, we present an O(m log m) algorithm for solving a key optimization problem used as a subroutine by these authors, for which they do not seem to give an algorithm. We also do not assume that buyers' bids are sampled i.i.d. from a common distribution. Instead, we only assume that the full outcome of each auction is independent and identically distributed. This subtle distinction makes our scenario closer to reality as it is unlikely for all bidders to follow the same underlying value distribution. Moreover, even though our scenario does not take into account a possible strategic behavior of bidders between rounds, it allows for bidders to be correlated, which is common in practice.

This chapter is organized as follows: in Section 3.2, we describe the setup and give a formal description of the learning problem. We discuss the relations between the scenario we consider and previous work on learning in auctions in Section 3.3. In particular, we show that, unlike previous work, our problem can be cast as that of minimizing the expected value of a loss function, which is a standard learning problem. Unlike most work in this field, however, the loss function naturally associated to this problem does not admit favorable properties such as convexity or Lipschitz continuity. In fact, the loss function is discontinuous. Therefore, the theoretical and algorithmic analysis of this problem raises several non-trivial technical issues. Nevertheless, we use a decomposition of the loss to derive generalization bounds for this problem (see Section 3.4). These bounds suggest the use of structural risk minimization to determine a learning solution benefiting from strong guarantees. This, however, poses a new challenge: solving a highly non-convex optimization problem. Similar algorithmic problems have of course been previously encountered in the learning literature, most notably when seeking to minimize a regularized empirical 0-1 loss in binary classification. A standard method in machine learning for dealing with such issues consists of resorting to a convex surrogate loss (such as the hinge loss commonly used in linear classification). However, we show in Section 3.4.2 that no convex loss function is calibrated for the natural loss function for this problem. That is, minimizing a convex surrogate could in fact be detrimental to learning. This fact is further empirically verified in Section 4.6.

The impossibility results of Section 3.4.2 prompt us to search for surrogate loss functions with weaker regularity properties such as Lipschitz continuity. We describe a loss function with precisely that property, which we further show to be consistent with the original loss. We also provide finite sample learning guarantees for that loss function, which suggest minimizing its empirical value while controlling the complexity of the hypothesis set. This leads to an optimization problem which, albeit non-convex, admits a favorable decomposition as a difference of two convex functions (DC-programming). Thus, we suggest using the DC-programming algorithm (DCA) introduced by Tao and An (1998) to solve our optimization problem. This algorithm admits favorable convergence guarantees to a local minimum. To further improve upon DCA, we propose a combinatorial algorithm to cycle through different local minima with the guarantee of reducing the objective function at every iteration. Finally, in Section 4.6, we show that our algorithm outperforms several different baselines in various synthetic and real-world revenue optimization tasks.

3.2 Setup

We start with the description of the problem and our formulation of the learning setup. We study second-price auctions with reserve, the type of auctions adopted in many Ad Exchanges. In such auctions, the bidders submit their bids simultaneously and the winner, if any, pays the maximum of the value of the second-place bid and a reserve price r set by the seller. This type of auction benefits from the same truthfulness property as second-price auctions (or Vickrey auctions) (Vickrey, 1961): truthful bidding can be shown to be a dominant strategy in such auctions. The choice of the reserve price r is the only mechanism through which the seller can influence the auction revenue. Its choice is thus critical: if set too low, the amount paid by the winner could be too small; if set too high, the ad slot could be lost. How can we select the reserve price to optimize revenue?

We consider the problem of learning to set the reserve prices to optimize revenue in second-price auctions with reserve. The outcome of an auction can be encoded by the highest and second-highest bids, which we denote by a vector b = (b^{(1)}, b^{(2)}) ∈ B ⊂ R²₊. We will assume that there exists an upper bound M ∈ (0, +∞) for the bids: sup_{b∈B} b^{(1)} = M. For a given reserve price r and bid pair b, by definition, the revenue of an auction is given by

    Revenue(r, b) = b^{(2)} 1_{r<b^{(2)}} + r 1_{b^{(2)}≤r≤b^{(1)}}.
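In code, this rule reads as follows (a direct transcription of the definition; the function and argument names are illustrative):

    def revenue(r, b1, b2):
        """Revenue(r, b) of a second-price auction with reserve price r,
        where b1 >= b2 are the highest and second-highest bids."""
        if r < b2:
            return b2    # reserve below the second bid: winner pays b2
        if r <= b1:
            return r     # reserve between the two bids: winner pays r
        return 0.0       # reserve above the highest bid: the slot is not sold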

We consider the general scenario where a feature vector x ∈ X ⊂ R^N is associated with each auction. In the auction theory literature, this feature vector is commonly referred to as public information. In the context of online advertisement, this could be for example information about the user's location, gender, or age. The learning problem can thus be formulated as that of selecting, out of a hypothesis set H of functions mapping X to R, a hypothesis h with high expected revenue

    E_{(x,b)∼D}[Revenue(h(x), b)],   (3.1)

where D is an unknown distribution according to which pairs (x, b) are drawn and where we have made the implicit assumption that bids are independent of the reserve price. Instead of the revenue, we will consider a loss function L defined for all (r, b) by L(r, b) = −Revenue(r, b), and will equivalently seek a hypothesis h with small expected loss L(h) := E_{(x,b)∼D}[L(h(x), b)]. As in standard supervised learning scenarios, we assume access to a training sample S = ((x_1, b_1), . . . , (x_m, b_m)) of size m ≥ 1 drawn i.i.d. according to D. We will denote by L_S(h) the empirical loss L_S(h) = (1/m) ∑_{i=1}^m L(h(x_i), b_i). Notice that we only assume that the auction outcomes are i.i.d., and not that bidders are independent of each other with the same underlying bid distribution, as in some previous work (Cesa-Bianchi et al., 2013; Ostrovsky and Schwarz, 2011). In the next sections, we will present a detailed study of this learning problem, starting with a review of the related literature.

3.3 Previous work

Here, we briefly discuss some previous work related to the study of auctions from a learning standpoint. One of the earliest contributions in this literature is that of Blum et al. (2004), where the authors studied a posted-price auction mechanism in which a seller offers some good at a certain price and the buyer decides to either accept that price or reject it. It is not hard to see that this type of auction is equivalent to a second-price auction with reserve with a single buyer. The authors consider a scenario of repeated interactions with different buyers where the goal is to design an incentive-compatible method of setting prices that is competitive with the best fixed-price strategy in hindsight. A fixed-price strategy is one that simply offers the same price to all buyers. Using a variant of the EXP3 algorithm of Auer et al. (2002), the authors designed a pricing algorithm achieving a (1 + ε)-approximation to the best fixed-price strategy. This same scenario was also studied by Kleinberg and Leighton (2003b), who gave an online algorithm whose regret after T rounds is in O(T^{2/3}).

A step further in the design of optimal pricing strategies was proposed by Balcan et al. (2008). One of the problems considered by the authors was that of setting prices for n buyers in a posted-price auction as a function of their public information. Unlike the on-line scenario of Blum et al. (2004), Balcan et al. (2008) considered a batch scenario where all buyers are known in advance. However, the comparison class considered was no longer that of simple fixed-price strategies but functions mapping public information to prices. This makes the problem more challenging and in fact closer to the scenario we consider. The authors showed that finding a (1 + ε)-optimal truthful mechanism is equivalent to finding an algorithm to optimize the empirical risk associated to the loss function we consider (in the case b^{(2)} ≡ 0). There are multiple connections between this work and our results. In particular, the authors pointed out that the discontinuity and asymmetry of the loss function presented several challenges to their analysis. We will see that, in fact, the same problems appear in the derivation of our learning guarantees. But, we will present an algorithm for minimizing the empirical risk, which was a crucial element missing in their results.

A different line of work by Cui et al. (2011) focused on predicting the highest bid of a second-price auction. To estimate the distribution of the highest bid, the authors partitioned the space of advertisers based on their campaign objectives and estimated the distribution for each partition. Within each partition, the distribution of the highest bid was modeled as a mixture of log-normal distributions, where the means and standard deviations of the mixtures were estimated as a function of the data features. While it may seem natural to seek to predict the highest bid, we show that this is not necessary and that, in fact, accurate predictions of the highest bid do not necessarily translate into algorithms achieving large revenue (see Section 4.6).

As already mentioned, the closest previous work to ours is that of Cesa-Bianchi et al. (2013), who studied the problem of directly optimizing the revenue under a partial information setting where the learner can only observe the value of the second-highest bid, if it is higher than the reserve price. In particular, the highest bid remains unknown to the learner. This is a natural scenario for auctions such as those of eBay where only the price at which an object is sold is reported. To do so, the authors expressed the expected revenue in terms of the quantity q(t) = P[b^{(2)} > t]. This can be done as follows:

    E_b[Revenue(r, b)] = E_{b^{(2)}}[b^{(2)} 1_{r<b^{(2)}}] + r P[b^{(2)} ≤ r ≤ b^{(1)}]   (3.2)
      = ∫_0^{+∞} P[b^{(2)} 1_{r<b^{(2)}} > t] dt + r P[b^{(2)} ≤ r ≤ b^{(1)}]
      = ∫_0^r P[r < b^{(2)}] dt + ∫_r^∞ P[b^{(2)} > t] dt + r P[b^{(2)} ≤ r ≤ b^{(1)}]
      = ∫_r^∞ P[b^{(2)} > t] dt + r ( P[b^{(2)} > r] + 1 − P[b^{(2)} > r] − P[b^{(1)} < r] )
      = ∫_r^∞ P[b^{(2)} > t] dt + r P[b^{(1)} ≥ r].
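As a quick numerical sanity check of (3.2), one can compare a Monte Carlo estimate of the expected revenue with a plug-in estimate of the last expression on a sample of bid pairs. The sketch below is illustrative only; it truncates the integral at the largest observed bid, which is harmless since the empirical q(t) vanishes beyond it:

    import numpy as np

    def check_identity(r, bids, grid=1000):
        """Compare the sample mean of Revenue(r, b) with the plug-in estimate
        of int_r^infty P[b2 > t] dt + r P[b1 >= r] on i.i.d. bid pairs.

        `bids` is an (m, 2) array with columns (b1, b2), b1 >= b2."""
        b1, b2 = bids[:, 0], bids[:, 1]
        lhs = np.mean(np.where(r < b2, b2, np.where(r <= b1, r, 0.0)))
        ts = np.linspace(r, b1.max(), grid)
        q = np.array([np.mean(b2 > t) for t in ts])  # empirical q(t)
        rhs = np.trapz(q, ts) + r * np.mean(b1 >= r)
        return lhs, rhs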

The main observation of Cesa-Bianchi et al. (2013) was that the quantity q(t) can be estimated from the observed outcomes of previous auctions. Furthermore, if the buyers' bids are i.i.d., then one can express P[b^{(1)} ≥ r] as a function of the estimated value of q(r). This implies that the right-hand side of (3.2) can be accurately estimated and therefore an optimal reserve price can be selected. Their algorithm makes calls to a procedure that maximizes the empirical revenue. The authors, however, did not describe an algorithm for that maximization. A by-product of our work is an efficient algorithm for that procedure. The guarantees of Cesa-Bianchi et al. (2013) are similar to those presented in the next section in the special case of learning without features. However, our derivation is different since we consider a batch scenario while Cesa-Bianchi et al. (2013) treated an online setup for which they presented regret guarantees.
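Our algorithm is presented later in this chapter; as a hedged illustration of why an O(m log m) running time is plausible, note that the empirical revenue r 7→ ∑_i Revenue(r, b_i) is piecewise linear in r, with non-negative slope between breakpoints, and all breakpoints lie at observed bid values, so it suffices to evaluate it at those values. With sorting, prefix sums, and binary search, each evaluation costs O(log m). The following sketch implements this idea and is not necessarily the procedure described later:

    import numpy as np

    def best_reserve(bids):
        """Maximize the empirical revenue r -> sum_i Revenue(r, b_i).

        `bids` is an (m, 2) array with columns (b1, b2), b1 >= b2. Sorting is
        O(m log m) and each of the 2m + 1 candidate evaluations is O(log m)."""
        b1 = np.sort(bids[:, 0].astype(float))
        b2 = np.sort(bids[:, 1].astype(float))
        m = len(b1)
        suffix = np.zeros(m + 1)
        suffix[:-1] = np.cumsum(b2[::-1])[::-1]  # suffix[k] = sum_{i >= k} b2[i]

        def total_revenue(r):
            k = np.searchsorted(b2, r, side="right")  # #{i : b2_i <= r}
            j = np.searchsorted(b1, r, side="left")   # #{i : b1_i <  r}
            # auctions with b2_i > r pay b2_i; those with b2_i <= r <= b1_i pay r
            return suffix[k] + r * (k - j)

        candidates = np.concatenate([b1, b2, [0.0]])
        values = [total_revenue(r) for r in candidates]
        best = int(np.argmax(values))
        return candidates[best], values[best]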

3.4 Learning Guarantees

The problem we consider is an instance of the well-known family of supervised learning problems. However, the loss function L does not admit any of the properties, such as convexity or Lipschitz continuity, often assumed in the analysis of the generalization error, as shown by Figure 3.1(a). Furthermore, L is discontinuous and, unlike the 0-1 loss function whose discontinuity point is independent of the label, its discontinuity depends on the outcome b of the auction. Thus, the problem of learning with the loss function L requires a new analysis.

3.4.1 Generalization bound

To analyze the complexity of the family of functions L_H mapping X × B to R defined by L_H = {(x, b) 7→ L(h(x), b) : h ∈ H}, we decompose L as a sum of two loss functions l_1 and l_2 with more favorable properties than L. We have L = l_1 + l_2, with l_1 and l_2 defined for all (r, b) ∈ R × B by

    l_1(r, b) = −b^{(2)} 1_{r<b^{(2)}} − r 1_{b^{(2)}≤r≤b^{(1)}} − b^{(1)} 1_{r>b^{(1)}}
    l_2(r, b) = b^{(1)} 1_{r>b^{(1)}}.

These functions are shown in Figure 3.1(b). Note that, for a fixed b, the function r 7→ l_1(r, b) is 1-Lipschitz since the slope of the lines defining the function is at most 1. We will consider the corresponding families of loss functions: l_1H = {(x, b) 7→ l_1(h(x), b) : h ∈ H} and l_2H = {(x, b) 7→ l_2(h(x), b) : h ∈ H}, and use the notion of pseudo-dimension as well as those of empirical and average Rademacher complexity to measure their complexities. The pseudo-dimension is a standard complexity measure (Pollard, 1984) extending the notion of VC-dimension to real-valued functions (see also Mohri et al. (2012)). For a family of functions G and a finite sample S = (z_1, . . . , z_m) of size m, the empirical Rademacher complexity is defined by R_S(G) = E_σ[ sup_{g∈G} (1/m) ∑_{i=1}^m σ_i g(z_i) ], where σ = (σ_1, . . . , σ_m)^⊤, with the σ_i's independent uniform random variables taking values in {−1, +1}. The Rademacher complexity of G is defined as R_m(G) = E_{S∼D^m}[R_S(G)].
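A direct transcription of the decomposition above (names illustrative), which can be used to check numerically that l_1 + l_2 = L = −Revenue pointwise:

    def l1_loss(r, b1, b2):
        """First component of L = l1 + l2; as a function of r it is
        piecewise linear with slopes in {0, -1}, hence 1-Lipschitz."""
        if r < b2:
            return -b2
        if r <= b1:
            return -r
        return -b1

    def l2_loss(r, b1, b2):
        """Second component: a single upward jump of size b1 at r = b1."""
        return b1 if r > b1 else 0.0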


Figure 3.1: (a) Plot of the loss function r 7→ L(r, b) for fixed values of b^{(1)} and b^{(2)}; (b) the functions l_1 (left) and l_2 (right).

To bound the complexity of L_H, we will first bound the complexity of the families of loss functions l_1H and l_2H. Since l_1 is 1-Lipschitz, the complexity of the class l_1H can be readily bounded by that of H, as shown by the following proposition.

Proposition 10. For any sample S = ((x_1, b_1), . . . , (x_m, b_m)), the empirical Rademacher complexity of l_1H can be bounded as follows:

    R_S(l_1H) ≤ R_S(H).

Proof. By definition of the empirical Rademacher complexity, we can write

    R_S(l_1H) = (1/m) E_σ[ sup_{h∈H} ∑_{i=1}^m σ_i l_1(h(x_i), b_i) ] = (1/m) E_σ[ sup_{h∈H} ∑_{i=1}^m σ_i (ψ_i ∘ h)(x_i) ],

where, for all i ∈ [1, m], ψ_i is the function defined by ψ_i : r 7→ l_1(r, b_i). For any i ∈ [1, m], ψ_i is 1-Lipschitz; thus, by the contraction lemma of Appendix B.1 (Lemma 13), the following inequality holds:

    R_S(l_1H) ≤ (1/m) E_σ[ sup_{h∈H} ∑_{i=1}^m σ_i h(x_i) ] = R_S(H),

which completes the proof.

As shown by the following proposition, the complexity of l_2H can be bounded in terms of the pseudo-dimension of H.

Proposition 11. Let d = Pdim(H) denote the pseudo-dimension of H. Then, for any sample S = ((x_1, b_1), . . . , (x_m, b_m)), the empirical Rademacher complexity of l_2H can be bounded as follows:

    R_S(l_2H) ≤ M √( (2d log(em/d)) / m ).


Proof. By definition of the empirical Rademacher complexity, we can write

    R_S(l_2H) = (1/m) E_σ[ sup_{h∈H} ∑_{i=1}^m σ_i b_i^{(1)} 1_{h(x_i)>b_i^{(1)}} ] = (1/m) E_σ[ sup_{h∈H} ∑_{i=1}^m σ_i Ψ_i(1_{h(x_i)>b_i^{(1)}}) ],

where, for all i ∈ [1, m], Ψ_i is the M-Lipschitz function x 7→ b_i^{(1)} x. Thus, by Lemma 13 combined with Massart's lemma (see for example Mohri et al. (2012)), we can write

    R_S(l_2H) ≤ (M/m) E_σ[ sup_{h∈H} ∑_{i=1}^m σ_i 1_{h(x_i)>b_i^{(1)}} ] ≤ M √( (2d′ log(em/d′)) / m ),

where d′ = VCdim({(x, b) 7→ 1_{h(x)−b^{(1)}>0} : h ∈ H}) over X × B. Since the second bid component b^{(2)} plays no role in this definition, d′ coincides with the VC-dimension of the class {(x, b^{(1)}) 7→ 1_{h(x)−b^{(1)}>0} : h ∈ H} over X × B_1, where B_1 is the projection of B ⊆ R² onto its first component; it is thus upper-bounded by VCdim({(x, t) 7→ 1_{h(x)−t>0} : h ∈ H}) over X × R, that is, by the pseudo-dimension of H.

Propositions 10 and 11 can be used to derive the following generalization bound for the learning problem we consider.

Theorem 7. For any δ > 0, with probability at least 1 − δ over the draw of an i.i.d. sample S of size m, the following inequality holds for all h ∈ H:

    L(h) ≤ L_S(h) + 2R_m(H) + 2M √( (2d log(em/d)) / m ) + M √( log(1/δ) / (2m) ).

Proof. By a standard property of the Rademacher complexity, since $L = \ell_1 + \ell_2$, the following inequality holds: $\mathfrak R_m(L^H) \le \mathfrak R_m(\ell_1^H) + \mathfrak R_m(\ell_2^H)$. Thus, in view of Propositions 10 and 11, the Rademacher complexity of $L^H$ can be bounded via
$$\mathfrak R_m(L^H) \le \mathfrak R_m(H) + M\sqrt{\frac{2d\log\frac{em}{d}}{m}}.$$
The result then follows by the application of a standard Rademacher complexity bound (Koltchinskii and Panchenko, 2002).

This learning bound invites us to consider an algorithm seeking $h \in H$ to minimize the empirical loss $\widehat{\mathcal L}_S(h)$, while controlling the complexity (Rademacher complexity and pseudo-dimension) of the hypothesis set $H$. In the following section, we discuss the computational problem of minimizing the empirical loss and suggest the use of a surrogate loss leading to a more tractable problem.
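As a quick numeric illustration of how the complexity terms of Theorem 7 decay with the sample size, the following sketch (ours; the values of $d$, $M$, and the presumed bound on $\mathfrak R_m(H)$ are illustrative assumptions) evaluates the slack on the right-hand side of the bound:

```python
import math

def theorem7_slack(m, d, M, rad_m, delta=0.05):
    """Complexity terms of the Theorem 7 bound (everything except the empirical loss)."""
    pdim_term = 2 * M * math.sqrt(2 * d * math.log(math.e * m / d) / m)
    conf_term = M * math.sqrt(math.log(1 / delta) / (2 * m))
    return 2 * rad_m + pdim_term + conf_term

# Example: d = 10, M = 1, assuming R_m(H) decays as sqrt(d/m).
for m in [100, 1000, 10000]:
    print(m, theorem7_slack(m, d=10, M=1.0, rad_m=math.sqrt(10 / m)))
```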


Figure 3.2: (a) Piecewise linear convex surrogate loss $L_p$. (b) Comparison of the sum of true losses $\sum_{i=1}^m L(\cdot,\mathbf b_i)$ for $m = 500$ with the sum of convex surrogate losses. Note that the minimizers are significantly different.

3.4.2 Surrogate Loss

As pointed out in Section 3.4, the loss function $L$ does not admit most properties of traditional loss functions used in machine learning: for any fixed $\mathbf b$, $L(\cdot,\mathbf b)$ is not differentiable (at two points), it is not convex nor Lipschitz, and in fact it is discontinuous. For any fixed $\mathbf b$, $L(\cdot,\mathbf b)$ is quasi-convex,¹ a property that is often desirable since several solution methods exist for quasi-convex optimization problems. However, in general, a sum of quasi-convex functions, such as the sum $\sum_{i=1}^m L(\cdot,\mathbf b_i)$ appearing in the definition of the empirical loss, is not quasi-convex and a fortiori not convex.² In fact, in general, such a sum may admit exponentially many local minima. This leads us to seek a surrogate loss function with more favorable optimization properties.

A standard method in machine learning consists of replacing the loss function $L$ with a convex upper bound (Bartlett et al., 2006). A natural candidate in our case is the piecewise linear function $L_p$ shown in Figure 3.2(a). While this is a convex loss function, and thus convenient for optimization, it is not calibrated: it is possible for $r_p \in \operatorname{argmin}_r \operatorname{E}_{\mathbf b}[L_p(r,\mathbf b)]$ to have a large expected true loss. Therefore, it does not provide us with a useful surrogate. The calibration problem is illustrated by Figure 3.2(b) in dimension one, where the true objective function to be minimized, $\sum_{i=1}^m L(r,\mathbf b_i)$, is compared with the sum of the surrogate losses. The next theorem shows that this problem in fact affects any non-constant convex surrogate. It is expressed in terms of the loss $L\colon \mathbb R \times \mathbb R_+ \to \mathbb R$ defined by $L(r,b) = -r\,1_{r\le b}$, which coincides with $L$ when the second bid is 0.

¹A function $f\colon \mathbb R \to \mathbb R$ is said to be quasi-convex if for any $\alpha \in \mathbb R$ the sub-level set $\{x : f(x) \le \alpha\}$ is convex.

²It is known that, under some separability condition, if a finite sum of quasi-convex functions on an open convex set is quasi-convex, then all but perhaps one of them is convex (Debreu and Koopmans, 1982).


Definition 12. We say that a function $L_c\colon [0,M]\times[0,M] \to \mathbb R$ is consistent with $L$ if, for any distribution $D$, there exists a minimizer $r^* \in \operatorname{argmin}_r \operatorname{E}_{b\sim D}[L_c(r,b)]$ such that $r^* \in \operatorname{argmin}_r \operatorname{E}_{b\sim D}[L(r,b)]$.

Definition 13. We say that a sequence of convex functions $(L_n)_{n\in\mathbb N}$ mapping $[0,M]\times[0,M]$ to $\mathbb R$ is weakly consistent with $L$ if there exists a sequence $(r_n)_{n\in\mathbb N}$ in $\mathbb R$, with $r_n \in \operatorname{argmin}_r \operatorname{E}_{b\sim D}[L_n(r,b)]$ for all $n \in \mathbb N$, such that $\lim_{n\to+\infty} r_n = r^*$ with $r^* \in \operatorname{argmin}_r \operatorname{E}_{b\sim D}[L(r,b)]$.

Proposition 12 (Convex surrogates). Let $L_c\colon [0,M]\times[0,M] \to \mathbb R$ be a bounded function, convex with respect to its first argument. If $L_c$ is consistent with $L$, then $L_c(\cdot,b)$ is constant for any $b \in [0,M]$.

Proof. The idea behind the proof is the following: for any two bids $b_1 < b_2$, there exists a distribution $D$ with support $\{b_1, b_2\}$ such that $\operatorname{E}_{b\sim D}[L(r,b)]$ is minimized at both $r = b_1$ and $r = b_2$. We show that this implies that $\operatorname{E}_{b\sim D}[L_c(r,b)]$ must attain a minimum at both points too. By convexity of $L_c$, it follows that $\operatorname{E}_{b\sim D}[L_c(r,b)]$ must be constant on the interval $[b_1, b_2]$. The main part of the proof consists of showing that this implies that the function $L_c(\cdot,b_1)$ must also be constant on the interval $[b_1, b_2]$. Finally, since the value of $b_2$ was chosen arbitrarily, it will follow that $L_c(\cdot,b_1)$ is constant.

Let $0 < b_1 < b_2 < M$ and, for any $\mu \in [0,1]$, let $D_\mu$ denote the probability distribution with support included in $\{b_1, b_2\}$ defined by $D_\mu(b_1) = \mu$, and let $\operatorname{E}_\mu$ denote the expectation with respect to this distribution. A straightforward calculation shows that the unique minimizer of $\operatorname{E}_\mu[L(r,b)]$ is given by $b_2$ if $\mu > \frac{b_2-b_1}{b_2}$ and by $b_1$ if $\mu < \frac{b_2-b_1}{b_2}$. Therefore, if $F_\mu(r) = \operatorname{E}_\mu[L_c(r,b)]$, it must be the case that $b_2$ is a minimizer of $F_\mu$ for $\mu > \frac{b_2-b_1}{b_2}$ and $b_1$ is a minimizer of $F_\mu$ for $\mu < \frac{b_2-b_1}{b_2}$.

For a convex function $f\colon \mathbb R \to \mathbb R$, we denote by $f^-$ its left-derivative and by $f^+$ its right-derivative, which are guaranteed to exist. We will also denote here, for any $b \in \mathbb R$, by $g^-(\cdot,b)$ and $g^+(\cdot,b)$ the left- and right-derivatives of the function $g(\cdot,b)$, and by $g'(\cdot,b)$ its derivative, when it exists. Recall that, for a convex function $f$, if $x_0$ is a minimizer, then $f^-(x_0) \le 0 \le f^+(x_0)$. In view of that and the minimizing properties of $b_1$ and $b_2$, the following inequalities hold:

$$0 \ge F_\mu^-(b_2) = \mu L_c^-(b_2,b_1) + (1-\mu) L_c^-(b_2,b_2) \quad \text{for } \mu > \tfrac{b_2-b_1}{b_2}, \qquad (3.3)$$
$$0 \le F_\mu^+(b_1) \le F_\mu^-(b_2) \quad \text{for } \mu < \tfrac{b_2-b_1}{b_2}, \qquad (3.4)$$
where the second inequality in (3.4) holds by convexity of $F_\mu$ and the fact that $b_1 < b_2$. By setting $\mu = \frac{b_2-b_1}{b_2}$, it follows from inequalities (3.3) and (3.4) that $F_\mu^-(b_2) = 0$ and $F_\mu^+(b_1) = 0$. By convexity of $F_\mu$, it follows that $F_\mu$ is constant on the interval $(b_1, b_2)$.

We now show this may only happen if $L_c(\cdot,b_1)$ is also constant. By rearranging terms in (3.3) and plugging in the expression of $\mu$, we obtain the equivalent condition
$$(b_2 - b_1)\, L_c^-(b_2,b_1) = -b_1\, L_c^-(b_2,b_2).$$
Since $L_c$ is a bounded function, $L_c^-(b_2,b_1)$ is bounded for any $b_1, b_2 \in (0,M)$; therefore, as $b_1 \to b_2$, we must have $b_2 L_c^-(b_2,b_2) = 0$, which implies $L_c^-(b_2,b_2) = 0$ for all $b_2 > 0$. In view of this, inequality (3.3) may only be satisfied if $L_c^-(b_2,b_1) \le 0$. However, the convexity of $L_c$ implies $L_c^-(b_2,b_1) \ge L_c^-(b_1,b_1) = 0$. Therefore, $L_c^-(b_2,b_1) = 0$ must hold for all $b_2 > b_1 > 0$. Similarly, by definition of $F_\mu$, the first inequality in (3.4) implies
$$\mu L_c^+(b_1,b_1) + (1-\mu) L_c^+(b_1,b_2) \ge 0. \qquad (3.5)$$
Nevertheless, for any $b_2 > b_1$ we have $0 = L_c^-(b_1,b_1) \le L_c^+(b_1,b_1) \le L_c^-(b_2,b_1) = 0$. Consequently, $L_c^+(b_1,b_1) = 0$ for all $b_1 > 0$. Furthermore, $L_c^+(b_1,b_2) \le L_c^+(b_2,b_2) = 0$. Therefore, for inequality (3.5) to be satisfied, we must have $L_c^+(b_1,b_2) = 0$ for all $b_1 < b_2$.

Thus far, we have shown that for any $b > 0$, if $r \ge b$, then $L_c^-(r,b) = 0$, while $L_c^+(r,b) = 0$ for $r \le b$. A simple convexity argument shows that $L_c(\cdot,b)$ is then differentiable and $L_c'(r,b) = 0$ for all $r \in (0,M)$, which in turn implies that $L_c(\cdot,b)$ is a constant function.

The result of the previous proposition can be considerably strengthened, as shownby the following theorem. As in the proof of the previous proposition, to simplify thenotation, for any b ∈ R, we will denote by g′(·, b) the derivative of a differentiablefunction g(·, b).

Theorem 8. Let $(L_n)_{n\in\mathbb N}$ denote a sequence of functions mapping $[0,M]\times[0,M]$ to $\mathbb R$ that are convex and differentiable with respect to their first argument and satisfy the following conditions:
• $\sup_{b\in[0,M],\, n\in\mathbb N} \max\big(|L_n'(0,b)|, |L_n'(M,b)|\big) = K < \infty$;
• $(L_n)_n$ is weakly consistent with $L$;
• $L_n(0,b) = 0$ for all $n \in \mathbb N$ and for all $b$.

If the sequence $(L_n)_n$ converges pointwise to a function $L_c$, then $L_n(\cdot,b)$ converges uniformly to $L_c(\cdot,b) \equiv 0$.

We defer the proof of this theorem to Appendix B.2 and present here only a sketch. We first show that the convexity of the functions $L_n$ implies that the convergence to $L_c$ must be uniform and that $L_c$ is convex with respect to its first argument. This fact and the weak consistency of the sequence $(L_n)_n$ will then imply that $L_c$ is consistent with $L$ and therefore must be constant by Proposition 12.

The theorem just presented shows that even a weakly consistent sequence of convex losses is uniformly close to a constant function and therefore not helpful for tackling the learning task we consider. This suggests searching for surrogate losses that satisfy weaker regularity conditions, such as Lipschitz continuity.

Perhaps the most natural surrogate loss function is then $\widetilde L_\gamma$, an upper bound on $L$ defined for all $\gamma > 0$ by
$$\widetilde L_\gamma(r,\mathbf b) = -b^{(2)} 1_{r\le b^{(2)}} - r\, 1_{b^{(2)} < r \le ((1-\gamma)b^{(1)}) \vee b^{(2)}} + \Big(\frac{1-\gamma}{\gamma} \vee \frac{b^{(2)}}{b^{(1)} - b^{(2)}}\Big)\big(r - b^{(1)}\big)\, 1_{((1-\gamma)b^{(1)}) \vee b^{(2)} < r \le b^{(1)}},$$

where $c \vee d = \max(c,d)$. The plot of this function is shown in Figure 3.3(a). The max terms ensure that the function is well defined if $(1-\gamma)b^{(1)} < b^{(2)}$. However, this also turns out to be a poor choice, as $\widetilde L_\gamma$ is a loose upper bound on $L$ in the most critical region, that is, around the minimum of the loss $L$. Thus, instead, we will consider, for any $\gamma > 0$, the loss function $L_\gamma$ defined as follows:

$$L_\gamma(r,\mathbf b) = -b^{(2)} 1_{r\le b^{(2)}} - r\, 1_{b^{(2)} < r \le b^{(1)}} + \frac1\gamma\big(r - (1+\gamma)b^{(1)}\big)\, 1_{b^{(1)} < r \le (1+\gamma)b^{(1)}}, \qquad (3.6)$$

whose plot is shown in Figure 3.3(a).³ A comparison between the sum of $L$-losses and the sum of $L_\gamma$-losses is shown in Figure 3.3(b). Observe that the fit is considerably better than that of the piecewise linear convex surrogate loss shown in Figure 3.2(b). A possible concern associated with the loss function $L_\gamma$ is that it is a lower bound for $L$; one might think then that minimizing it would not lead to an informative solution. However, we argue that this problem arises more significantly with upper bounding losses such as the convex surrogate, which we showed not to lead to a useful minimizer, or $\widetilde L_\gamma$, which is a poor approximation of $L$ near its minimum. By matching the original loss $L$ in the region of interest, around the minimal value, the loss function $L_\gamma$ leads to more informative solutions for this problem. In fact, we show that the expected loss $\mathcal L_\gamma(h) := \operatorname{E}_{x,\mathbf b}[L_\gamma(h(x),\mathbf b)]$ admits a minimizer close to the minimizer of $\mathcal L(h)$. Since $L_\gamma \to L$ as $\gamma \to 0$, this result may seem trivial. However, this convergence is not uniform and therefore calibration is not guaranteed.
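To make the definitions concrete, here is a small sketch (ours, not from the thesis) implementing the true loss $L$ and the surrogate $L_\gamma$ of (3.6); summing over a sample of bid pairs reproduces comparisons like the one in Figure 3.3(b). All names and the sampling choices are illustrative assumptions.

```python
import numpy as np

def true_loss(r, b1, b2):
    """L(r, b): negative revenue of a second-price auction with reserve r."""
    return -b2 * (r <= b2) - r * ((b2 < r) & (r <= b1))

def surrogate_loss(r, b1, b2, gamma):
    """L_gamma of (3.6): equals L except on (b1, (1+gamma) b1], where it
    climbs back to 0 with slope 1/gamma instead of jumping."""
    return (-b2 * (r <= b2) - r * ((b2 < r) & (r <= b1))
            + (r - (1 + gamma) * b1) / gamma * ((b1 < r) & (r <= (1 + gamma) * b1)))

# Compare empirical minimizers of the two objectives on random bid pairs.
rng = np.random.default_rng(0)
b1 = rng.uniform(0.5, 1.0, size=500)
b2 = b1 * rng.uniform(size=500)                 # second bid below the first
grid = np.linspace(0.0, 1.2, 400)
f_true = [np.sum(true_loss(r, b1, b2)) for r in grid]
f_gam = [np.sum(surrogate_loss(r, b1, b2, gamma=0.1)) for r in grid]
print(grid[np.argmin(f_true)], grid[np.argmin(f_gam)])
```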

Theorem 9. Let $H$ be a closed, convex subset of a linear space of functions containing 0, and let $h^*_\gamma$ denote a minimizer of $\mathcal L_\gamma$ over $H$. Then, the following inequality holds for all $\gamma \ge 0$:
$$\mathcal L(h^*_\gamma) - \mathcal L_\gamma(h^*_\gamma) \le \gamma M.$$

Notice that, since $\mathcal L \ge \mathcal L_\gamma$ for all $\gamma \ge 0$, the theorem implies that $\lim_{\gamma\to 0} \mathcal L(h^*_\gamma) = \mathcal L(h^*)$. Indeed, let $h^*$ denote the best-in-class hypothesis for the loss function $L$. Then,

³Technically, the theoretical and algorithmic results we present for $L_\gamma$ could be developed in a somewhat similar way for $\widetilde L_\gamma$.


Figure 3.3: (a) Comparison of the true loss $L$ with the surrogate loss $\widetilde L_\gamma$ on the left and the surrogate loss $L_\gamma$ on the right, for $\gamma = 0.1$. (b) Comparison of $\sum_{i=1}^{500} L(r,\mathbf b_i)$ and $\sum_{i=1}^{500} L_\gamma(r,\mathbf b_i)$.

the following straightforward inequalities hold:
$$\mathcal L(h^*) \le \mathcal L(h^*_\gamma) \le \mathcal L_\gamma(h^*_\gamma) + \gamma M \le \mathcal L_\gamma(h^*) + \gamma M \le \mathcal L(h^*) + \gamma M.$$
By letting $\gamma \to 0$, we see that $\mathcal L(h^*_\gamma) \to \mathcal L(h^*)$. This is a remarkable result, as it not only provides a convergence guarantee but also gives an explicit rate of convergence. We will later exploit this fact to derive an optimal choice for $\gamma$.

The proof of Theorem 9 is based on the following partitioning of $X \times B$ into four regions on each of which $L_\gamma$ is an affine function:
$$I_1 = \{(x,\mathbf b) \mid h^*_\gamma(x) \le b^{(2)}\}, \qquad I_2 = \{(x,\mathbf b) \mid h^*_\gamma(x) \in (b^{(2)}, b^{(1)}]\},$$
$$I_3 = \{(x,\mathbf b) \mid h^*_\gamma(x) \in (b^{(1)}, (1+\gamma)b^{(1)}]\}, \qquad I_4 = \{(x,\mathbf b) \mid h^*_\gamma(x) > (1+\gamma)b^{(1)}\}.$$
Notice that $L_\gamma$ and $L$ differ only on $I_3$. Therefore, we only need to bound the measure of this set, which can be done as in Lemma 14 (see Appendix B.3).

Proof of Theorem 9. We can express the difference as
$$\operatorname{E}_{x,\mathbf b}\big[L(h^*_\gamma(x),\mathbf b) - L_\gamma(h^*_\gamma(x),\mathbf b)\big] = \sum_{k=1}^4 \operatorname{E}_{x,\mathbf b}\big[\big(L(h^*_\gamma(x),\mathbf b) - L_\gamma(h^*_\gamma(x),\mathbf b)\big) 1_{I_k}(x,\mathbf b)\big]$$
$$= \operatorname{E}_{x,\mathbf b}\big[\big(L(h^*_\gamma(x),\mathbf b) - L_\gamma(h^*_\gamma(x),\mathbf b)\big) 1_{I_3}(x,\mathbf b)\big] = \operatorname{E}_{x,\mathbf b}\Big[\frac1\gamma\big((1+\gamma)b^{(1)} - h^*_\gamma(x)\big) 1_{I_3}(x,\mathbf b)\Big]. \qquad (3.7)$$

Furthermore, for $(x,\mathbf b) \in I_3$, we know that $b^{(1)} < h^*_\gamma(x)$. Thus, we can bound (3.7) by $\operatorname{E}_{x,\mathbf b}[h^*_\gamma(x) 1_{I_3}(x,\mathbf b)]$, which, by Lemma 14 in Appendix B.3, is upper bounded by $\gamma \operatorname{E}_{x,\mathbf b}[h^*_\gamma(x) 1_{I_2}(x,\mathbf b)]$. Thus, the following inequalities hold:
$$\operatorname{E}_{x,\mathbf b}\big[L(h^*_\gamma(x),\mathbf b)\big] - \operatorname{E}_{x,\mathbf b}\big[L_\gamma(h^*_\gamma(x),\mathbf b)\big] \le \gamma \operatorname{E}_{x,\mathbf b}\big[h^*_\gamma(x) 1_{I_2}(x,\mathbf b)\big] \le \gamma \operatorname{E}_{x,\mathbf b}\big[b^{(1)} 1_{I_2}(x,\mathbf b)\big] \le \gamma M,$$
using the fact that $h^*_\gamma(x) \le b^{(1)}$ for $(x,\mathbf b) \in I_2$.

The $\frac1\gamma$-Lipschitzness of $L_\gamma$ can be used to prove the following generalization bound.

Theorem 10. Fix $\gamma \in (0,1]$ and let $S$ denote a sample of size $m$. Then, for any $\delta > 0$, with probability at least $1-\delta$ over the choice of the sample $S$, for all $h \in H$, the following holds:
$$\mathcal L_\gamma(h) \le \widehat{\mathcal L}_{\gamma,S}(h) + \frac2\gamma \mathfrak R_m(H) + M\sqrt{\frac{\log\frac1\delta}{2m}}. \qquad (3.8)$$

Proof. Let $L_\gamma^H$ denote the family of functions $\{(x,\mathbf b) \mapsto L_\gamma(h(x),\mathbf b) : h \in H\}$. The loss function $L_\gamma$ is $\frac1\gamma$-Lipschitz since the slope of the lines defining it is at most $\frac1\gamma$. Thus, using the contraction lemma (Lemma 13) as in the proof of Proposition 10 gives $\mathfrak R_m(L_\gamma^H) \le \frac1\gamma \mathfrak R_m(H)$. The application of a standard Rademacher complexity bound to the family of functions $L_\gamma^H$ then shows that, for any $\delta > 0$, with probability at least $1-\delta$, for any $h \in H$, the following holds:
$$\mathcal L_\gamma(h) \le \widehat{\mathcal L}_{\gamma,S}(h) + \frac2\gamma \mathfrak R_m(H) + M\sqrt{\frac{\log\frac1\delta}{2m}}.$$

We conclude this section by showing that $L_\gamma$ admits a stronger form of consistency. More precisely, we prove that the generalization error of the empirical minimizer of $\widehat{\mathcal L}_{\gamma,S}$, $\widehat h_\gamma := \operatorname{argmin}_{h\in H} \widehat{\mathcal L}_{\gamma,S}(h)$, is close to that of the best-in-class hypothesis, $\mathcal L^* := \mathcal L(h^*)$.

Theorem 11. Let $M = \sup_{\mathbf b\in B} b^{(1)}$ and let $H$ be a hypothesis set with pseudo-dimension $d = \operatorname{Pdim}(H)$. Then, for any $\delta > 0$ and a fixed value of $\gamma > 0$, with probability at least $1-\delta$ over the choice of a sample $S$ of size $m$, the following inequality holds:
$$\mathcal L^* \le \mathcal L(\widehat h_\gamma) \le \mathcal L^* + \frac{2\gamma+2}{\gamma}\mathfrak R_m(H) + \gamma M + 2M\sqrt{\frac{2d\log\frac{em}{d}}{m}} + 2M\sqrt{\frac{\log\frac2\delta}{2m}}.$$


Proof. By Theorem 7, with probability at least $1-\delta/2$, the following holds:
$$\mathcal L(\widehat h_\gamma) \le \widehat{\mathcal L}_S(\widehat h_\gamma) + 2\mathfrak R_m(H) + 2M\sqrt{\frac{2d\log\frac{em}{d}}{m}} + M\sqrt{\frac{\log\frac2\delta}{2m}}. \qquad (3.9)$$
Applying Lemma 14 with the empirical distribution induced by the sample, we can bound $\widehat{\mathcal L}_S(\widehat h_\gamma)$ by $\widehat{\mathcal L}_{\gamma,S}(\widehat h_\gamma) + \gamma M$. The first term of this expression is at most $\widehat{\mathcal L}_{\gamma,S}(h^*_\gamma)$ by definition of $\widehat h_\gamma$. Moreover, the same analysis used in the proof of Theorem 10 shows that, with probability $1-\delta/2$,
$$\widehat{\mathcal L}_{\gamma,S}(h^*_\gamma) \le \mathcal L_\gamma(h^*_\gamma) + \frac2\gamma \mathfrak R_m(H) + M\sqrt{\frac{\log\frac2\delta}{2m}}.$$
Finally, by definition of $h^*_\gamma$ and using the fact that $\mathcal L$ is an upper bound on $\mathcal L_\gamma$, we can write $\mathcal L_\gamma(h^*_\gamma) \le \mathcal L_\gamma(h^*) \le \mathcal L(h^*)$. Thus,
$$\widehat{\mathcal L}_S(\widehat h_\gamma) \le \mathcal L(h^*) + \frac2\gamma \mathfrak R_m(H) + M\sqrt{\frac{\log\frac2\delta}{2m}} + \gamma M.$$

Plugging this inequality into (3.9) and applying the union bound yields the result.

This bound can be extended to hold uniformly over all $\gamma$ at the price of an additional term in $O\Big(\sqrt{\frac{\log\log_2\frac1\gamma}{m}}\Big)$. Thus, for appropriate choices of $\gamma$ as a function of $m$ (for instance $\gamma = 1/m^{1/4}$), we can guarantee the convergence of $\mathcal L(\widehat h_\gamma)$ to $\mathcal L^*$, a stronger form of consistency (see Appendix B.3).

These results are reminiscent of standard margin bounds, with $\gamma$ playing the role of a margin. The situation here is, however, somewhat different. Our learning bounds suggest, for a fixed $\gamma \in (0,1]$, seeking a hypothesis $h$ minimizing the empirical loss $\widehat{\mathcal L}_{\gamma,S}(h)$ while controlling a complexity term upper bounding $\mathfrak R_m(H)$, which, in the case of a family of linear hypotheses, could be $\|h\|^2_K$ for some PSD kernel $K$. Since the bound can hold uniformly for all $\gamma$, we can use it to select $\gamma$ out of a finite set of possible grid search values. Alternatively, $\gamma$ can be set via cross-validation. In the next section, we present algorithms for solving this regularized empirical risk minimization problem.

3.5 Algorithms

In this section, we show how to minimize the empirical risk under two regimes: first, we analyze the no-feature scenario considered in Cesa-Bianchi et al. (2013); then, we present an algorithm to solve the more general feature-based revenue optimization problem.

Figure 3.4: (a) Prototypical v-function. (b) Illustration of the fact that the definition of $V_i(r,\mathbf b_i)$ does not change on an interval $[n_k, n_{k+1}]$.

3.5.1 No-Feature Case

We now present a general algorithm to optimize sums of functions similar to $L_\gamma$ or $L$ in the one-dimensional case.

Definition 14. We say that a function $V\colon \mathbb R \times B \to \mathbb R$ is a v-function if it admits the following form:
$$V(r,\mathbf b) = -a^{(1)} 1_{r\le b^{(2)}} - a^{(2)} r\, 1_{b^{(2)} < r \le b^{(1)}} + \big(a^{(3)} r - a^{(4)}\big) 1_{b^{(1)} < r < (1+\eta)b^{(1)}},$$
with $a^{(1)} > 0$ and $\eta > 0$ constants, and $a^{(2)}, a^{(3)}, a^{(4)}$ defined by $a^{(1)} = \eta a^{(3)} b^{(2)}$, $a^{(2)} = \eta a^{(3)}$, and $a^{(4)} = a^{(3)}(1+\eta)b^{(1)}$.

Figure 3.4(a) illustrates this family of loss functions. A v-function is a generalization of $L_\gamma$ and $L$. Indeed, any v-function $V$ satisfies $V(r,\mathbf b) \le 0$ and attains its minimum at $b^{(1)}$. Finally, as can be seen straightforwardly from Figure 3.3, $L_\gamma$ is a v-function for any $\gamma > 0$. We consider the following general problem of minimizing a sum of v-functions:
$$\min_{r\ge 0}\; F(r) := \sum_{i=1}^m V_i(r,\mathbf b_i). \qquad (3.10)$$

Observe that this is not a trivial problem since, for any fixed $\mathbf b_i$, $V_i(\cdot,\mathbf b_i)$ is non-convex and, in general, a sum of $m$ such functions may admit many local minima. Of course, we could seek a solution that is $\epsilon$-close to the optimal reserve via a grid search over the points $r_i = i\epsilon$; however, the guarantees for that approach would depend on the continuity of the function and, in particular, it might fail for the loss $L$. Instead, we exploit the particular structure of a v-function to exactly minimize $F$. The following proposition, which is proven in Appendix B.4, shows that the minimum is attained at one of the highest bids, which matches intuition. Notice that for the loss function $L$ this is immediate: if $r$ is not a highest bid, one can raise the reserve price without increasing any of the component losses.

Proposition 13. Problem (3.10) admits a solution $r^*$ that satisfies $r^* = b_i^{(1)}$ for some $i \in [1,m]$.

Problem (3.10) can thus be reduced to examining the value of the function at the $m$ points $b_i^{(1)}$, $i \in [1,m]$. This yields a straightforward method for solving the optimization, which consists of computing $F(b_i^{(1)})$ for all $i$ and taking the minimum. But, since the computation of each $F(b_i^{(1)})$ takes $O(m)$, the overall computational cost is in $O(m^2)$, which can be prohibitive for even moderately large values of $m$.

Instead, we present a combinatorial algorithm to solve the optimization problem (3.10) in $O(m\log m)$. Let $\mathcal N = \bigcup_i \{b_i^{(1)}, b_i^{(2)}, (1+\eta)b_i^{(1)}\}$ denote the set of all boundary points associated with the functions $V(\cdot,\mathbf b_i)$. The algorithm proceeds as follows: first, sort the set $\mathcal N$ to obtain the ordered sequence $(n_1, \ldots, n_{3m})$, which can be achieved in $O(m\log m)$ using a comparison-based sorting algorithm. Next, evaluate $F(n_1)$ and compute $F(n_{k+1})$ from $F(n_k)$ for all $k$.

The main idea of the algorithm is the following: since the definition of $V_i(\cdot,\mathbf b_i)$ can only change at boundary points (see also Figure 3.4(b)), computing $F(n_{k+1})$ from $F(n_k)$ can be achieved in constant time. Indeed, since $n_k$ and $n_{k+1}$ are the only boundary points between them, the definition of at most two of the functions $V_i$ changes, so we can compute $F(n_{k+1})$ from $F(n_k)$ by recomputing $V$ for only two values of $\mathbf b_i$, which can be done in constant time. We now give a more detailed description and proof of correctness of our algorithm.

Proposition 14. There exists an algorithm to solve the optimization problem (3.10) in $O(m\log m)$.

Proof. The pseudocode of the algorithm is given in Algorithm 2, where $a_i^{(1)}, \ldots, a_i^{(4)}$ denote the parameters defining the functions $V_i(r,\mathbf b_i)$. We will prove that, after running Algorithm 2, we can compute $F(n_j)$ in constant time using
$$F(n_j) = c_j^{(1)} + c_j^{(2)} n_j + c_j^{(3)} n_j + c_j^{(4)}. \qquad (3.11)$$
This holds trivially for $n_1$ since, by definition, $n_1 \le b_i^{(2)}$ for all $i$ and therefore $F(n_1) = -\sum_{i=1}^m a_i^{(1)}$. Now, assume that (3.11) holds for $j$; we prove that it must also hold for $j+1$. Suppose $n_j = b_i^{(2)}$ for some $i$ (the cases $n_j = b_i^{(1)}$ and $n_j = (1+\eta)b_i^{(1)}$ can be handled in the same way). Then $V_i(n_j,\mathbf b_i) = -a_i^{(1)}$ and we can write
$$\sum_{k\ne i} V_k(n_j,\mathbf b_k) = F(n_j) - V_i(n_j,\mathbf b_i) = \big(c_j^{(1)} + c_j^{(2)} n_j + c_j^{(3)} n_j + c_j^{(4)}\big) + a_i^{(1)}.$$


Algorithm 2: Sorting

$\mathcal N := \bigcup_{i=1}^m \{b_i^{(1)}, b_i^{(2)}, (1+\eta)b_i^{(1)}\}$;
$(n_1, \ldots, n_{3m}) = \text{Sort}(\mathcal N)$;
Set $c_i := (c_i^{(1)}, c_i^{(2)}, c_i^{(3)}, c_i^{(4)}) = 0$ for $i = 1, \ldots, 3m$;
Set $c_1^{(1)} = -\sum_{i=1}^m a_i^{(1)}$;
for $j = 2, \ldots, 3m$ do
    Set $c_j = c_{j-1}$;
    if $n_{j-1} = b_i^{(2)}$ for some $i$ then
        $c_j^{(1)} = c_j^{(1)} + a_i^{(1)}$;   $c_j^{(2)} = c_j^{(2)} - a_i^{(2)}$;
    else if $n_{j-1} = b_i^{(1)}$ for some $i$ then
        $c_j^{(2)} = c_j^{(2)} + a_i^{(2)}$;   $c_j^{(3)} = c_j^{(3)} + a_i^{(3)}$;   $c_j^{(4)} = c_j^{(4)} - a_i^{(4)}$;
    else
        $c_j^{(3)} = c_j^{(3)} - a_i^{(3)}$;   $c_j^{(4)} = c_j^{(4)} + a_i^{(4)}$;
    end if
end for


Thus, by construction we would have
$$c_{j+1}^{(1)} + c_{j+1}^{(2)} n_{j+1} + c_{j+1}^{(3)} n_{j+1} + c_{j+1}^{(4)} = c_j^{(1)} + a_i^{(1)} + \big(c_j^{(2)} - a_i^{(2)}\big) n_{j+1} + c_j^{(3)} n_{j+1} + c_j^{(4)}$$
$$= \big(c_j^{(1)} + c_j^{(2)} n_{j+1} + c_j^{(3)} n_{j+1} + c_j^{(4)}\big) + a_i^{(1)} - a_i^{(2)} n_{j+1} = \sum_{k\ne i} V_k(n_{j+1},\mathbf b_k) - a_i^{(2)} n_{j+1},$$
where the last equality holds since the definition of $V_k(r,\mathbf b_k)$ does not change for $r \in [n_j, n_{j+1}]$ and $k \ne i$. Finally, since $n_j$ was a boundary point, the definition of $V_i(r,\mathbf b_i)$ must change from $-a_i^{(1)}$ to $-a_i^{(2)} r$; thus, the last expression is indeed equal to $F(n_{j+1})$. A similar argument can be given if $n_j = b_i^{(1)}$ or $n_j = (1+\eta)b_i^{(1)}$.

We proceed to analyze the complexity of the algorithm: sorting the set $\mathcal N$ can be performed in $O(m\log m)$ and each iteration takes only constant time. Thus, the evaluation of all points can be achieved in linear time and, clearly, the minimum can then also be obtained in linear time. Therefore, the overall time complexity of the algorithm is in $O(m\log m)$.

The algorithm just proposed can be straightforwardly extended to minimize $F$ over a set of $r$-values bounded by $\Lambda$, that is $\{r : 0 \le r \le \Lambda\}$. Indeed, we need only compute $F(b_i^{(1)})$ for $i \in [1,m]$ such that $b_i^{(1)} < \Lambda$, and of course also $F(\Lambda)$; thus, the computational complexity in this regularized case remains in $O(m\log m)$.

3.5.2 General Case

We now present our main algorithm for revenue optimization in the presence of features. This problem presents new challenges characteristic of non-convex optimization problems in higher dimensions. Therefore, our proposed algorithm can only guarantee convergence to a local minimum. Nevertheless, we provide a simple method for cycling through these local minima with the guarantee of reducing the objective function each time.

We consider the case of a hypothesis set $H$ of linear functions $x \mapsto \mathbf w \cdot x$, $\mathbf w \in \mathbb R^N$, with bounded norm $\|\mathbf w\| \le \Lambda$, for some $\Lambda \ge 0$. This can be immediately generalized to non-linear hypotheses by using a positive definite kernel.

The results of Theorem 10 suggest seeking, for a fixed $\gamma \ge 0$, the vector $\mathbf w$ solution of the following optimization problem: $\min_{\|\mathbf w\|\le\Lambda} \sum_{i=1}^m L_\gamma(\mathbf w\cdot x_i,\mathbf b_i)$. Replacing the original loss $L$ with $L_\gamma$ helped us remove the discontinuity of the loss, but we still face an optimization problem based on a sum of non-convex functions. This problem can be formulated as a DC-programming (difference of convex functions programming) problem, which is well studied in non-convex optimization. Indeed, $L_\gamma$ can


be decomposed as follows for all $(r,\mathbf b) \in \mathbb R \times B$: $L_\gamma(r,\mathbf b) = u(r,\mathbf b) - v(r,\mathbf b)$, with the convex functions $u$ and $v$ defined by
$$u(r,\mathbf b) = -r\, 1_{r < b^{(1)}} + \frac{r - (1+\gamma)b^{(1)}}{\gamma}\, 1_{r \ge b^{(1)}},$$
$$v(r,\mathbf b) = \big({-r} + b^{(2)}\big)\, 1_{r < b^{(2)}} + \frac{r - (1+\gamma)b^{(1)}}{\gamma}\, 1_{r > (1+\gamma)b^{(1)}}.$$

Using the decomposition $L_\gamma = u - v$, our optimization problem can be formulated as
$$\min_{\mathbf w\in\mathbb R^N}\; U(\mathbf w) - V(\mathbf w) \quad \text{subject to } \|\mathbf w\| \le \Lambda, \qquad (3.12)$$
where $U(\mathbf w) = \sum_{i=1}^m u(\mathbf w\cdot x_i,\mathbf b_i)$ and $V(\mathbf w) = \sum_{i=1}^m v(\mathbf w\cdot x_i,\mathbf b_i)$, which shows that it can be cast as a DC-programming problem. The global minimum of the optimization problem (3.12) can be found using a cutting plane method (Horst and Thoai, 1999), but that method only converges in the limit and does not admit known algorithmic convergence guarantees.⁴ There also exists a branch-and-bound algorithm with exponential convergence for DC-programming (Horst and Thoai, 1999) for finding the global minimum. Nevertheless, Tao and An (1997) pointed out that such combinatorial algorithms fail to solve real-world DC-programs in high dimensions. In fact, our implementation of this algorithm shows that its convergence in practice is extremely slow for even moderately high-dimensional problems. Another attractive solution for finding the global solution of a DC-programming problem over a polyhedral convex set is the combinatorial solution of Tuy (1964). However, this method requires explicitly specifying the slopes and offsets of the piecewise linear function corresponding to a sum of $L_\gamma$ losses and incurs an exponential cost in time and space.

An alternative consists of using the DC algorithm (DCA), a primal-dual sub-differential method of Dinh Tao and Hoai An (Tao and An, 1998) (see also Tao and An (1997) for a good survey). This algorithm is applicable when $u$ and $v$ are proper lower semi-continuous convex functions, as in our case. When $v$ is differentiable, the DC algorithm coincides with the CCCP algorithm of Yuille and Rangarajan (2003), which has been used in several contexts in machine learning and analyzed by Sriperumbudur and Lanckriet (2012).

The general proof of convergence of the DC algorithm was given by Tao and An (1998). In some special cases, the DC algorithm can be used to find the global minimum of the problem, as in the trust region problem (Tao and An, 1998), but, in general, the DC algorithm and its special case CCCP are only guaranteed to converge to a critical point (Tao and An, 1998; Sriperumbudur and Lanckriet, 2012). Nevertheless, the number of iterations of the DC algorithm is relatively small, and its convergence has in fact been shown to be linear for DC-programming problems such as ours (Yen et al., 2012). The algorithm we propose goes one step further than that of Tao and An (1998): we

⁴Some claims of Horst and Thoai (1999), e.g., Proposition 4.4 used in support of the cutting plane algorithm, are incorrect (Tuy, 2002).


use DCA to find a local minimum, but then restart our algorithm with a new seed that is guaranteed to reduce the objective function. Unfortunately, we are not in the same regime as in the trust region problem of Tao and An (1998), where the number of local minima is linear in the size of the input. Indeed, here, the number of local minima can be exponential in the number of dimensions of the feature space, and it is not clear to us how the combinatorial structure of the problem could help rule out some local minima faster and make the optimization more tractable.

In the following, we describe in more detail the solution we propose for solving the DC-programming problem (3.12). The functions $v$ and $V$ are not differentiable in our context, but they admit a sub-gradient at all points. We will denote by $\delta V(\mathbf w)$ an arbitrary element of the sub-gradient $\partial V(\mathbf w)$, which coincides with $\nabla V(\mathbf w)$ at points $\mathbf w$ where $V$ is differentiable. The DC algorithm then coincides with CCCP, modulo the replacement of the gradient of $V$ by $\delta V(\mathbf w)$. It consists of starting with a weight vector $\mathbf w_0$, $\|\mathbf w_0\| \le \Lambda$, and of iteratively solving a sequence of convex optimization problems obtained by replacing $V$ with its linear approximation, giving $\mathbf w_t$ as a function of $\mathbf w_{t-1}$, for $t = 1, \ldots, T$: $\mathbf w_t \in \operatorname{argmin}_{\|\mathbf w\|\le\Lambda} U(\mathbf w) - \delta V(\mathbf w_{t-1}) \cdot \mathbf w$. This problem can be rewritten in our context as follows:
$$\min_{\|\mathbf w\|^2\le\Lambda^2,\,\mathbf s}\; \sum_{i=1}^m s_i - \delta V(\mathbf w_{t-1}) \cdot \mathbf w \qquad (3.13)$$
$$\text{subject to } \big(s_i \ge -\mathbf w\cdot x_i\big) \wedge \Big[s_i \ge \frac1\gamma\big(\mathbf w\cdot x_i - (1+\gamma) b_i^{(1)}\big)\Big].$$

This problem is equivalent to a QP (quadratic program). Indeed, by convex duality, there exists a $\lambda > 0$ such that the above problem is equivalent to
$$\min_{\mathbf w\in\mathbb R^N,\,\mathbf s}\; \lambda\|\mathbf w\|^2 + \sum_{i=1}^m s_i - \delta V(\mathbf w_{t-1}) \cdot \mathbf w$$
$$\text{subject to } \big(s_i \ge -\mathbf w\cdot x_i\big) \wedge \Big[s_i \ge \frac1\gamma\big(\mathbf w\cdot x_i - (1+\gamma) b_i^{(1)}\big)\Big],$$

which is a simple QP that can be tackled by any of many off-the-shelf QP solvers. Of course, the value of $\lambda$ as a function of $\Lambda$ does not admit a simple expression. Instead, we select $\lambda$ through validation, which is then equivalent to choosing the optimal value of $\Lambda$ through validation.
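For concreteness, one such DCA iteration can be written with an off-the-shelf convex solver. The sketch below is ours, not the thesis': it uses cvxpy (an assumed dependency) and a subgrad_V helper implementing $\delta V$ from the piecewise definition of $v$.

```python
import cvxpy as cp
import numpy as np

def subgrad_V(w, X, b1, b2, gamma):
    """A subgradient of V(w) = sum_i v(w . x_i, b_i)."""
    r = X @ w
    g = np.where(r < b2, -1.0, 0.0) + np.where(r > (1 + gamma) * b1, 1.0 / gamma, 0.0)
    return X.T @ g

def dca_step(w_prev, X, b1, b2, gamma, lam):
    """One DCA iteration: solve the convexified problem (3.13) in its QP form."""
    m, N = X.shape
    w = cp.Variable(N)
    s = cp.Variable(m)
    grad = subgrad_V(w_prev, X, b1, b2, gamma)
    objective = lam * cp.sum_squares(w) + cp.sum(s) - grad @ w
    constraints = [s >= -X @ w,
                   s >= (X @ w - (1 + gamma) * b1) / gamma]
    cp.Problem(cp.Minimize(objective), constraints).solve()
    return w.value
```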

We now address the problem of the DC algorithm converging to a local minimum. A common practice is to restart the DC algorithm at a new random point. Instead, we propose an algorithm that iterates along different local minima, with the guarantee of reducing the objective function at every change of local minimum. The algorithm is simple and is based on the observation that the function $L_\gamma$ is positive homogeneous.


DC Algorithm
    $\mathbf w \leftarrow \mathbf w_0$    ▷ initialization
    while $\mathbf v \ne \mathbf w$ do
        $\mathbf v \leftarrow \text{DCA}(\mathbf w)$    ▷ DC algorithm
        $\mathbf u \leftarrow \mathbf v / \|\mathbf v\|$
        $\eta^* \leftarrow \operatorname{argmin}_{0\le\eta\le\Lambda} \sum_{\mathbf u\cdot x_i > 0} L_\gamma(\eta\,\mathbf u\cdot x_i,\mathbf b_i)$
        $\mathbf w \leftarrow \eta^* \mathbf u$
    end while

Figure 3.5: Pseudocode of our DC-programming algorithm.

Indeed, for any $\eta > 0$ and $(r,\mathbf b)$,
$$L_\gamma(\eta r, \eta\mathbf b) = -\eta b^{(2)} 1_{\eta r\le \eta b^{(2)}} - \eta r\, 1_{\eta b^{(2)} < \eta r \le \eta b^{(1)}} + \frac{\eta r - (1+\gamma)\eta b^{(1)}}{\gamma}\, 1_{\eta b^{(1)} < \eta r \le \eta(1+\gamma)b^{(1)}} = \eta L_\gamma(r,\mathbf b).$$
Minimizing the objective function of (3.12) in a fixed direction $\mathbf u$, $\|\mathbf u\| = 1$, can be reformulated as $\min_{0\le\eta\le\Lambda} \sum_{i=1}^m L_\gamma(\eta\,\mathbf u\cdot x_i,\mathbf b_i)$. Since for $\mathbf u\cdot x_i \le 0$ the function $\eta \mapsto L_\gamma(\eta\,\mathbf u\cdot x_i,\mathbf b_i)$ is constant and equal to $-b_i^{(2)}$, the problem is equivalent to solving
$$\min_{0\le\eta\le\Lambda} \sum_{\mathbf u\cdot x_i > 0} L_\gamma(\eta\,\mathbf u\cdot x_i,\mathbf b_i).$$
Furthermore, since $L_\gamma$ is positive homogeneous, for all $i \in [1,m]$ with $\mathbf u\cdot x_i > 0$, $L_\gamma(\eta\,\mathbf u\cdot x_i,\mathbf b_i) = (\mathbf u\cdot x_i) L_\gamma(\eta, \mathbf b_i/(\mathbf u\cdot x_i))$. But $\eta \mapsto (\mathbf u\cdot x_i) L_\gamma(\eta, \mathbf b_i/(\mathbf u\cdot x_i))$ is a v-function, and thus the problem can be efficiently optimized using the combinatorial algorithm for the no-feature case (Section 3.5.1). This leads to the optimization algorithm described in Figure 3.5. The last step of each iteration of our algorithm can be viewed as a line search, and this is in fact the step that reduces the objective function the most in practice. This is because we are then precisely minimizing the objective function, even though this is for some fixed direction. Since in general this line search does not find a local minimum (we are likely to decrease the objective value in directions other than the one in which the line search was performed), running DCA helps us find a better direction for the next iteration of the line search.
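A minimal sketch of this line search follows (ours; names are assumptions). Rather than the $O(m\log m)$ routine of Section 3.5.1, it directly evaluates the candidate set $\{b_i^{(1)}/(\mathbf u\cdot x_i)\}$ suggested by Proposition 13 applied to these v-functions, which is $O(m^2)$ but returns the same minimizer.

```python
import numpy as np

def L_gamma_sum(eta, U, b1, b2, gamma):
    """Objective sum_i L_gamma(eta * u.x_i, b_i) over auctions with u.x_i > 0."""
    r = eta * U
    return np.sum(-b2 * (r <= b2) - r * ((b2 < r) & (r <= b1))
                  + (r - (1 + gamma) * b1) / gamma * ((b1 < r) & (r <= (1 + gamma) * b1)))

def line_search(u, X, b1, b2, gamma, Lam):
    """Exact minimization over eta in [0, Lam] along direction u (||u|| = 1)."""
    U = X @ u
    pos = U > 0
    U, b1p, b2p = U[pos], b1[pos], b2[pos]
    candidates = b1p / U                      # candidate minimizers (Proposition 13)
    candidates = np.append(candidates[candidates <= Lam], Lam)
    vals = [L_gamma_sum(eta, U, b1p, b2p, gamma) for eta in candidates]
    return candidates[int(np.argmin(vals))]
```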

3.6 Experiments

In this section, we report the results of several experiments with synthetic and real-world data demonstrating the benefits of our algorithm. Since the use of features for reserve price optimization has not been previously studied in the literature, we are not aware of any baseline for comparison with our algorithm. Therefore, its performance is measured


against three natural strategies that we now describe.

As mentioned before, a standard solution for this problem would be the use of a convex surrogate loss. In view of that, we compare against the solution of the regularized empirical risk minimization of the convex surrogate loss $L_\alpha$ shown in Figure 3.2(a), parametrized by $\alpha \in [0,1]$ and defined by
$$L_\alpha(r,\mathbf b) = \begin{cases} -r & \text{if } r < b^{(1)} + \alpha(b^{(2)} - b^{(1)}) \\[2pt] \Big(\dfrac{(1-\alpha)b^{(1)} + \alpha b^{(2)}}{\alpha(b^{(1)} - b^{(2)})}\Big)\big(r - b^{(1)}\big) & \text{otherwise.} \end{cases}$$
A second alternative consists of using ridge regression to estimate the first bid and of using its prediction as the reserve price. A third algorithm consists of minimizing the loss while ignoring the feature vectors $x_i$, i.e., solving the problem $\min_{r\le\Lambda} \sum_{i=1}^n L(r,\mathbf b_i)$. It is worth mentioning that this third approach is very similar to what advertisement exchanges currently use to suggest reserve prices to publishers. By using the empirical version of equation (3.2), we see that this algorithm is equivalent to finding the empirical distribution of bids and optimizing the expected revenue with respect to this empirical distribution, as in (Ostrovsky and Schwarz, 2011) and (Cesa-Bianchi et al., 2013).

3.6.1 Artificial Data Sets

We generated 4 different synthetic data sets with different correlation levels between features and bids. For all our experiments, the feature vectors $\mathbf x \in \mathbb R^{21}$ were generated as follows: $\bar{\mathbf x} \in \mathbb R^{20}$ was sampled from a standard Gaussian distribution and $\mathbf x = (\bar{\mathbf x}, 1)$ was created by adding an offset feature. We now describe the bid generating process for each of the experiments as a function of the feature vector $\mathbf x$. For our first three experiments, shown in Figure 3.6(a)-(c), the highest bid and second highest bid were set to
$$\Big(\max\Big(\Big|\sum_{i=1}^{21} x_i\Big| + \epsilon_1,\; \Big|\sum_{i=1}^{21} \frac{x_i}{2}\Big| + \epsilon_2\Big)\Big)_+ \quad \text{and} \quad \Big(\min\Big(\Big|\sum_{i=1}^{21} x_i\Big| + \epsilon_1,\; \Big|\sum_{i=1}^{21} \frac{x_i}{2}\Big| + \epsilon_2\Big)\Big)_+$$
respectively, where $\epsilon_i$ is a Gaussian random variable with mean 0. The standard

deviation of the Gaussian noise was varied over the set $\{0, 0.25, 0.5\}$.

For our last artificial experiment, we used a generative model motivated by previous empirical observations (Ostrovsky and Schwarz, 2011; Lahaie and Pennock, 2007): bids were generated by sampling two values from log-normal distributions with means $\mathbf x\cdot\mathbf w$ and $\frac{\mathbf x\cdot\mathbf w}{2}$ and standard deviation 0.5, with $\mathbf w$ a random vector sampled from a standard Gaussian distribution.

For all our experiments, the parameters $\lambda$, $\gamma$, and $\alpha$ were selected respectively from the sets $\{2^i \mid i \in [-5,5]\}$, $\{0.1, 0.01, 0.001\}$, and $\{0.1, 0.2, \ldots, 0.9\}$, via validation over a set consisting of the same number of examples as the training set. Our algorithm was initialized using the best solution of the convex surrogate optimization problem. The test set consisted of 5,000 examples drawn from the same distribution as the training set. Each experiment was repeated 10 times and the mean revenue of each algorithm is


shown in Figure 3.6. The plots are normalized in such a way that the revenue obtained by setting no reserve price is equal to 0 and the maximum possible revenue (obtained by setting the reserve price equal to the highest bid) is equal to 1. The performance of the ridge regression algorithm is not included in Figure 3.6(d), as it was far inferior to that of the other algorithms.

By inspecting the results in Figure 3.6(a), we see that, even in the simplest noiseless scenario, our algorithm outperforms all other techniques. The reader could argue that these results are in fact not surprising, since the bids were generated by a locally linear function of the feature vectors, thereby ensuring the success of our algorithm. Nevertheless, one would expect this to be the case too for the other algorithms that leverage features, such as the convex surrogate and ridge regression; one can see that this is in fact not true, even for low levels of noise. It is also worth noticing that using ridge regression is actually worse than setting the reserve price to 0. This fact can be easily understood by noticing that the square loss used in regression is symmetric; we can therefore expect several reserve prices to be above the highest bid, making the revenue of these auctions equal to zero. Another notable feature is that, as the noise level increases, the performance of the feature-based algorithms decreases. This is true for any learning algorithm: if the features are not relevant to the prediction task, the performance of the algorithm will suffer. For the convex surrogate algorithm, however, a more critical issue occurs: its performance actually decreases as the sample size increases, which shows that in general learning with this convex surrogate is not possible. This is an empirical verification of the inconsistency result provided in Section 3.4.2. This lack of calibration can also be seen in Figure 3.6(d), where the performance of this algorithm in fact approaches that of using no reserve price.

In order to better understand the performance discrepancy between the feature-based algorithms, we analyze the reserve prices offered by each algorithm. In Figure 3.7, we see that the convex surrogate algorithm tends to offer lower reserve prices. This should be intuitively clear, as high reserve prices are over-penalized by the chosen convex surrogate, as shown in Figure 3.2(b). On the other hand, the reserve prices suggested by the regression algorithm are concentrated and symmetric around their mean; we can therefore infer that about 50% of the reserve prices offered will be higher than the highest bid, thereby yielding zero revenue. Finally, our algorithm generally offers higher prices. This suggests that the increase in revenue comes from auctions where the highest bid is large but the second bid is small. This bidding phenomenon is in fact commonly observed in practice (Amin et al., 2013).

3.6.2 Real-world Data Sets

Due to proprietary data and confidentiality reasons, we cannot present empirical results for AdExchange data. However, we were able to procure an eBay data set consisting of approximately 70,000 second-price auctions of collector sports cards.


Figure 3.6: Plots of expected revenue against sample size for different algorithms: DC algorithm (DC), convex surrogate (CVX), ridge regression (Reg) and the algorithm that uses no features to set reserve prices (NF). For (a)-(c), bids are generated with noise standard deviation (a) 0, (b) 0.25, (c) 0.5. The bids in (d) were generated using the generative model.

The full data set can be accessed at the following URL: http://cims.nyu.edu/~munoz/data. Some other sources of auction data are accessible (e.g., http://modelingonlineauctions.com/datasets), but features are not available for those data sets. To the best of our knowledge, with the exception of the one used here, there is no publicly available data set for online auctions including features that could be readily used with our algorithm. The features used here include information about the seller, such as positive feedback percent, seller rating and seller country, as well as information about the card, such as whether the player is in the sport's Hall of Fame. The final dimension of the feature vectors is 78. The values of these features are both continuous and categorical. For our experiments, we also included an extra offset feature.

Since the highest bid is not reported by eBay, our algorithm cannot be straightforwardly used on this data set. In order to generate highest bids, we calculated the mean price of each object (each card was generally sold more than once) and set the highest bid to be the maximum of this average and the second highest bid.

Figure 3.8 shows the revenue gained using different algorithms, including our DC algorithm, the convex surrogate, and the algorithm that ignores features. It also shows the results obtained by using no reserve price (NR) and the highest possible revenue, obtained by setting the reserve price to the highest bid (HB). We randomly sampled 2,000 examples for training, 2,000 examples for validation and 2,000 examples for testing. This experiment was repeated 10 times.


Figure 3.7: Distribution of reserve prices for each algorithm (DC, CVX, and regression). The algorithms were trained on 800 samples using noisy bids with standard deviation 0.5.


Figure 3.8: Results on the eBay data set. Comparison of our algorithm (DC) against a convex surrogate (CVX), using no features (NF), setting no reserve (NR) and setting the reserve price to the highest bid (HB).

Figure 3.8(b) shows the mean revenue for each algorithm and its standard deviation. The results of this experiment show that the use of features is crucial for revenue optimization. Indeed, setting a single optimal reserve price for all objects seems to achieve the same revenue as setting no reserve price. In contrast, our algorithm achieves a 22% increase over the revenue obtained by not setting a reserve price, whereas the non-calibrated convex surrogate algorithm only obtains a 3% revenue improvement. Furthermore, our algorithm is able to obtain as much as 70% of the revenue achievable with knowledge of the highest bid.

3.7 Conclusion

We presented a comprehensive theoretical and algorithmic analysis of the learning problem of revenue optimization in second-price auctions with reserve. The specific properties of the loss function for this problem required a new analysis and new learning guarantees. The algorithmic solutions we presented are practically applicable to revenue optimization problems for this type of auction in most realistic settings. Our


experimental results further demonstrate their effectiveness. Much of the analysis and algorithms presented, in particular our study of calibration questions, can also be of interest in other learning problems. In particular, as we will see in Chapter 4, they are relevant to the study of learning problems arising in generalized second-price auctions.


Chapter 4

Generalized Second-Price Auctions

We present an extensive analysis of the key problem of learning optimal reserve pricesfor generalized second-price auctions. We describe two algorithms for this task: onebased on density estimation, and a novel algorithm benefiting from solid theoreticalguarantees and with a very favorable running-time complexity ofO(nS log(nS)), wheren is the sample size and S the number of slots. Our theoretical guarantees are more fa-vorable than those previously presented in the literature. Additionally, we show thateven if bidders do not play at an equilibrium, our second algorithm is still well definedand minimizes a quantity of interest. To our knowledge, this is the first attempt to ap-ply learning algorithms to the problem of reserve price optimization in GSP auctions.Finally, we present the first convergence analysis of empirical equilibrium bidding func-tions to the unique Bayesian-Nash equilibrium of a GSP.

4.1 IntroductionThe Generalized Second-Price (GSP) auction is currently the standard mechanism usedfor selling sponsored search advertisement. As suggested by the name, this mechanismgeneralizes the standard second-price auction of Vickrey (1961) to multiple items. In thecase of sponsored search advertisement, these items correspond to ad slots which havebeen ranked by their position. Given this ranking, the GSP auction works as follows:first, each advertiser places a bid; next, the seller, based on the bids placed, assigns ascore to each bidder. The highest scored advertiser is assigned to the slot in the bestposition, that is, the one with the highest likelihood of being clicked on. The second-highest score obtains the second best item and so on, until all slots have been allocated orall advertisers have been assigned to a slot. As with second-price auctions, the bidder’spayment is independent of his bid. Instead, it depends solely on the bid of the advertiserassigned to the position below.

In spite of its similarity with second-price auctions, the GSP auction is not anincentive-compatible mechanism, that is, bidders have an incentive to lie about their


valuations. This is in stark contrast with second-price auctions, where truth revealing is in fact a dominant strategy. It is for this reason that predicting the behavior of bidders in a GSP auction is challenging. This is further worsened by the fact that these auctions are repeated multiple times a day; the study of all possible equilibria of this repeated game is at the very least difficult. While incentive-compatible generalizations of the second-price auction exist, namely the Vickrey-Clarke-Groves (VCG) mechanism, the simplicity of the payment rule for GSP auctions, as well as the large revenue generated by them, has made the adoption of VCG mechanisms unlikely.

Since its introduction by Google, GSP auctions have generated billions of dollars across different online advertisement companies. It is therefore not surprising that they have become a topic of great interest in diverse fields such as Economics, Algorithmic Game Theory and, more recently, Machine Learning.

The first analysis of GSP auctions was carried out independently by Edelman et al. (2007) and Varian (2007). Both publications considered a full information scenario, that is, one where the advertisers' valuations are publicly known. This assumption is weakly supported by the fact that repeated interactions allow advertisers to infer their adversaries' valuations. Varian (2007) studied the so-called Symmetric Nash Equilibria (SNE), a subset of the Nash equilibria with several favorable properties. In particular, Varian showed that any SNE induces an efficient allocation, that is, an allocation where the highest positions are assigned to advertisers with high values. Furthermore, the revenue achieved by the seller when advertisers play an SNE is always at least that obtained by VCG. The author also presented some empirical results showing that some bidders indeed play an SNE. However, no theoretical justification can be given for the choice of this subset of equilibria (Borgers et al., 2013; Edelman and Schwarz, 2010). A finer analysis of the full information scenario was given by Lucier et al. (2012), who proved that, excluding the payment of the highest bidder, the revenue achieved at any Nash equilibrium is at least one half of that of the VCG auction.

Since the assumption of full information can be unrealistic, a more modern line of research has instead considered a Bayesian scenario for this auction. In a Bayesian setting, it is assumed that advertisers' valuations are i.i.d. samples drawn from a common distribution. Gomes and Sweeney (2014) characterized all symmetric Bayes-Nash equilibria and showed that any symmetric equilibrium must be efficient. This work was later extended by Sun et al. (2014) to account for the quality score of each advertiser; the main contribution of that work was the design of an algorithm for the crucial problem of revenue optimization for the GSP auction. Lahaie and Pennock (2007) studied different squashing ranking rules for advertisers commonly used in practice and showed that none of these rules is necessarily optimal in equilibrium. Lucier et al. (2012) showed that the GSP auction with an optimal reserve price achieves at least 1/6 of the optimal revenue (of any auction) in a Bayesian equilibrium. Most recently, Thompson and Leyton-Brown (2013) compared different allocation rules and showed that an anchoring allocation rule is optimal when valuations are sampled i.i.d. from a uniform distribution.


With the exception of (Sun et al., 2014), none of these authors has proposed an algorithmfor revenue optimization using historical data.

Zhu et al. (2009) introduced a ranking algorithm to learn an optimal allocation rule.The proposed ranking is a convex combination of a quality score based on the featuresof the advertisement as well as a revenue score which depends on the value of the bids.This work was later extended in (He et al., 2014) where, in addition to the rankingfunction, a behavioral model of the advertisers is learned by the authors.

The rest of this chapter is organized as follows. In Section 4.2, we give a learningformulation of the problem of selecting reserve prices in a GSP auction. In Section 4.3,we discuss previous work related to this problem. Next, we present and analyze twolearning algorithms for this problem in Section 4.4, one based on density estimationextending to this setting an algorithm of Guerre et al. (2000), and a novel discriminativealgorithm taking into account the loss function and benefiting from favorable learningguarantees. Section 4.5 provides a convergence analysis of the empirical equilibriumbidding function to the true equilibrium bidding function in a GSP. On its own, thisresult is of great interest as it justifies the common assumption of buyers playing asymmetric Bayes-Nash equilibrium. Finally, in Section 4.6, we report the results ofexperiments comparing our algorithms and demonstrating in particular the benefits ofthe second algorithm.

4.2 Model

For the most part, we will use the model defined by Sun et al. (2014) for GSP auctions with incomplete information. We consider $N$ bidders competing for $S$ slots, with $N \ge S$. Let $v_i \in [0,1]$ and $b_i \in [0,1]$ denote the per-click valuation of bidder $i$ and his bid, respectively. Let the position factor $c_s \in [0,1]$ represent the probability of a user noticing an ad in position $s$, and let $e_i \in [0,1]$ denote the expected click-through rate of advertiser $i$; that is, $e_i$ is the probability of ad $i$ being clicked on given that it was noticed by the user. We will adopt the common assumption that $c_s > c_{s+1}$ (Gomes and Sweeney, 2014; Lahaie and Pennock, 2007; Sun et al., 2014; Thompson and Leyton-Brown, 2013). Define the score of bidder $i$ to be $s_i = e_i v_i$. Following (Sun et al., 2014), we assume that $s_i$ is an i.i.d. realization of a random variable with distribution $F$ and density function $f$. Finally, we assume that advertisers bid in an efficient symmetric Bayes-Nash equilibrium. This is motivated by the fact that, even though advertisers may not infer the valuations of their adversaries from repeated interactions, they can certainly estimate the distribution $F$.

Define $\pi\colon s \mapsto \pi(s)$ as the function mapping slots to advertisers, i.e., $\pi(s) = i$ if advertiser $i$ is allocated to position $s$. For a vector $\mathbf x = (x_1, \ldots, x_N) \in \mathbb R^N$, we use the notation $x^{(s)} := x_{\pi(s)}$. Finally, denote by $r_i$ the reserve price for advertiser $i$. An advertiser may participate in the auction only if $b_i \ge r_i$. In this chapter, we present an


analysis of the two most common ranking rules (Qin et al., 2014):

1. Rank-by-bid. Advertisers who bid above their reserve price are ranked in descending order of their bids, and the payment of advertiser $\pi(s)$ is equal to $\max(r^{(s)}, b^{(s+1)})$.

2. Rank-by-revenue. Each advertiser is assigned a quality score $q_i := q_i(b_i) = e_i b_i 1_{b_i \ge r_i}$, and the ranking is done by sorting these scores in descending order. The payment of advertiser $\pi(s)$ is given by $\max\big(r^{(s)}, \frac{q^{(s+1)}}{e^{(s)}}\big)$.

In both setups, only advertisers bidding above their reserve price are considered. Notice that rank-by-bid is a particular case of rank-by-revenue where all click-through rates $e_i$ are equal to 1. Given a vector of reserve prices $\mathbf r$ and a bid vector $\mathbf b$, we define the revenue function to be
$$\operatorname{Rev}(\mathbf r, \mathbf b) = \sum_{s=1}^S c_s \Big(\frac{q^{(s+1)}}{e^{(s)}}\, 1_{q^{(s+1)} \ge e^{(s)} r^{(s)}} + r^{(s)}\, 1_{q^{(s+1)} < e^{(s)} r^{(s)} \le q^{(s)}}\Big).$$
Using the notation of (Mohri and Medina, 2014), we define the loss function $L(\mathbf r, \mathbf b) = -\operatorname{Rev}(\mathbf r, \mathbf b)$.

Given an i.i.d. sample $\mathcal S = (\mathbf b_1, \ldots, \mathbf b_n)$ of realizations of an auction, our objective will be to find a reserve price vector $\mathbf r^*$ that maximizes the expected revenue. Equivalently, $\mathbf r^*$ should be a solution of the following optimization problem:
$$\min_{\mathbf r \in [0,1]^N} \operatorname{E}_{\mathbf b}[L(\mathbf r, \mathbf b)]. \qquad (4.1)$$

4.3 Previous Work

It has been shown, both theoretically and empirically, that reserve prices can increase the revenue of an auction (Myerson, 1981; Ostrovsky and Schwarz, 2011). The choice of an appropriate reserve price therefore becomes crucial. If it is chosen too low, the seller might lose revenue. On the other hand, if it is set too high, then the advertisers may not wish to bid above that value and the seller will not obtain any revenue from the auction.

Mohri and Medina (2014) gave a learning algorithm using historical data to estimate the optimal reserve price for a second-price auction in a very general setting. An extension of this work to the GSP auction is not straightforward. Indeed, as we will show later, the optimal reserve price vector depends on the distribution of the advertisers' valuations. In a second-price auction, these valuations are observed since it is an incentive-compatible mechanism; this does not hold for GSP auctions. Moreover, in (Mohri and Medina, 2014), only one reserve price had to be estimated. In contrast, our model requires the estimation of up to $N$ parameters with intricate dependencies between them.


The problem of estimating valuations from observed bids in a non-incentive-compatible mechanism has been previously analyzed. Guerre et al. (2000) described a way of estimating valuations from observed bids in a first-price auction. We will show that this method can be extended to the GSP auction. The rate of convergence of this algorithm, however, will in general be worse than the standard learning rate of $O\big(\frac{1}{\sqrt n}\big)$.

Sun et al. (2014) showed that, for advertisers playing an efficient equilibrium, theoptimal reserve price is given by ri = r

eiwhere r satisfies

r =1− F (r)

f(r).

The authors suggest learning r via a maximum likelihood technique over some paramet-ric family to estimate f and F , and use these estimates in the above expression. Thereare two main drawbacks for this algorithm. The first is a standard problem of paramet-ric statistics: there are no guarantees on the convergence of their estimation procedurewhen the density function f is not part of the parametric family considered. While thisproblem can be addressed by the use of a non-parametric estimation algorithm such askernel density estimation, the fact remains that the function f is the density for the un-observable scores si and therefore cannot be properly estimated. The solution proposedby the authors assumes that the bids in fact form a perfect SNE and so advertisers valua-tions may be recovered using the process described in (Varian, 2007). There is howeverno justification for this assumption and, in fact, we show in Section 4.6 that bids playedin a Bayes-Nash equilibrium do not in general form a SNE.

4.4 Learning AlgorithmsHere, we present and analyze two algorithms for learning the optimal reserve price fora GSP auction when advertisers play a symmetric equilibrium.

4.4.1 Density estimation algorithmFirst, we derive an extension of the algorithm of (Guerre et al., 2000) to GSP auctions.To do so, we first derive a formula for the bidding strategy at equilibrium. Let zs(v)denote the probability of winning position s given that the advertiser’s valuation is v. Itis not hard to verify that

zs(v) =

(N − 1

s− 1

)(1− F (v))s−1F p(v),

where p = N − s. Indeed, in an efficient equilibrium, the bidder with the s-th highestvaluation must be assigned to the s-th highest position. Therefore an advertiser with

90

Page 104: cims.nyu.edumunoz/papers/thesis.pdfAcknowledgements First and foremost, I want to thank my advisor Mehryar Mohri who I recall once said to me he considers each student an investment

valuation v is assigned to position s if and only if s− 1 bidders have a higher valuationand p have a lower valuation.

For a rank-by-bid auction, Gomes and Sweeney (2014) showed the following results.

Theorem 12 (Gomes and Sweeney (2014)). A GSP auction has a unique efficient sym-metric Bayes-Nash equilibrium with bidding strategy β(v) if and only if β(v) is strictlyincreasing and satisfies the following integral equation

S∑

s=1

cs

∫ v

0

dzs(t)

dttdt =

S∑

s=1

cs

(N − 1

s− 1

)(1− F (v))s−1

∫ v

0

β(t)pF p−1(t)f(t)dt. (4.2)

Furthermore, the optimal reserve price r∗ satisfies

r∗ =1− F (r∗)

f(r∗). (4.3)

The authors show that, if the click probabilities cs are sufficiently diverse, then, β isguaranteed to be strictly increasing. When ranking is done by revenue, Sun et al. (2014)gave the following theorem.

Theorem 13 (Sun et al. (2014)). Let β be defined by the previous theorem. If advertisersbid in a Bayes-Nash equilibrium then bi = β(vi)

ei. Moreover, the optimal reserve price

vector r∗ is given by r∗i = rei

where r satisfies equation (4.3).

We are now able to present the foundation of our first algorithm. Instead of assumingthat the bids constitute an SNE we follow the ideas of (Guerre et al., 2000) and infer thescores si only from observables bi. Our result is presented for the rank-by-bid GSPauction but an extension to the rank-by-revenue mechanism is trivial.

Lemma 5. Let v1, . . . , vn be an i.i.d. sample of valuations with distribution F and letbi = β(vi) be the bid played at equilibrium. Then the random variables bi are i.i.d. withdistribution G(b) = F (β−1(b)) and density g(b) = f(β−1(b))

β′(β−1(b)). Furthermore,

vi = β−1(bi) =

∑Ss=1 cs

(N−1s−1

)(1−G(bi))

s−1bipG(bi)p−1g(bi)∑S

s=1 cs(N−1s−1

)dzdb

(bi)(4.4)

−∑S

s=1 cs(s− 1)(1−G(bi))s−2g(bi)

∫ bi0pG(u)p−1ug(u)du

∑Ss=1 cs

(N−1s−1

)dzdb

(bi),

where zs(b) := zs(β−1(b)) and is given by

(N−1s−1

)(1−G(b))s−1G(b)p−1.

Proof. By definition bi = β(vi), is a function of only vi. Since β does not depend on theother samples either, it follows that (bi)

Ni=1 must be an i.i.d. sample. Using the fact that

91

Page 105: cims.nyu.edumunoz/papers/thesis.pdfAcknowledgements First and foremost, I want to thank my advisor Mehryar Mohri who I recall once said to me he considers each student an investment

β is a strictly increasing function we also have G(b) = P (bi ≤ b) = P (vi ≤ β−1(b)) =F (β−1(b)) and a simple application of the chain rule gives us the expression for thedensity g(b). To prove the second statement observe that by the change of variablev = β−1(b), the right-hand side of (4.2) is equal to

S∑

s=1

(N−1

s−1

)(1−G(b))s−1

∫ β−1(b)

0

pβ(t)F p−1(t)f(t)dt

=S∑

s=1

(N−1

s−1

)(1−G(b))s−1

∫ b

0

puG(u)p−1(u)g(u)du.

The last equality follows by the change of variable t = β(u) and from the fact thatg(b) = f(β−1(b))

β′(β−1(b)). The same change of variables applied to the left-hand side of (4.2)

yields the following integral equation:

S∑

s=1

(N−1

s−1

)∫ b

0

β−1(u)dz

du(u)du

=S∑

s=1

(N−1

s−1

)(1−G(b))s−1

∫ b

0

upG(u)p−1(u)g(u)du.

Taking the derivative with respect to b of both sides of this equation and rearrangingterms lead to the desired expression.

The previous Lemma shows that we may recover the valuation of an advertiser fromits bid. We therefore propose the following algorithm for estimating the value of r.

1. Assumin a symmetric Bayes-Nash equilibrium, use the sample S to estimate G andg.

2. Plug this estimates in (4.4) to obtain approximate samples from the distribution F .3. Use the approximate samples to find estimates f and F of the valuations density

and cumulative distribution functions respectively.4. Use F and f to estimate r.

In order to avoid the use of parametric methods, a kernel density estimation algorithmcan be used to estimate g and f . While this algorithm addresses both drawbacks of thealgorithm proposed by Sun et al. (2014), it can be shown (Guerre et al., 2000)[Theorem2] that if f isR times continuously differentiable, then, after seeing n samples, ‖f−f‖∞is in Ω

(1

nR/(2R+3)

)independently of the algorithm used to estimate f . In particular, note

that for R = 1 the rate is in Ω(

1n1/4

). This unfavorable rate of convergence can be

attributed to the fact that a two-step estimation algorithm is being used (estimation ofg and f ). But, even with access to bidder valuations, the rate can only be improved toΩ(

1nR/(2R+1)

)(Guerre et al., 2000). Furthermore, a small error in the estimation of f

92

Page 106: cims.nyu.edumunoz/papers/thesis.pdfAcknowledgements First and foremost, I want to thank my advisor Mehryar Mohri who I recall once said to me he considers each student an investment

affects the denominator of the equation defining r and can result in a large error on theestimate of r.

4.4.2 Discriminative algorithmIn view of the problems associated with density estimation, we propose to use empiricalrisk minimization to find an approximation to the optimal reserve price. In particular,we are interested in solving the following optimization problem:

minr∈[0,1]N

n∑

i=1

L(r,bi). (4.5)

We first show that, when bidders play in equilibrium, the optimization problem (4.1) canbe considerably simplified.

Proposition 15. If advertisers play a symmetric Bayes-Nash equilibrium then

minr∈[0,1]N

Eb

[L(r,b)] = minr∈[0,1]

Eb

[L(r,b)],

where qi := qi(bi) = eibi and

L(r,b) = −S∑

s=1

cse(s)

(q(s+1)1q(s+1)≥r + r1q(s+1)<r≤q(s)

).

Proof. Since advertisers play a symmetric Bayes-Nash equilibrium, the optimal reserveprice vector r∗ is of the form r∗i = r

ei. Therefore, letting D = r|ri = r

ei, r ∈ [0, 1] we

have minr∈[0,1]N Eb[L(r,b)] = minr∈D Eb[L(r,b)]. Furthermore, when restricted toD,the objective function L is given by

−S∑

s=1

cse(s)

(q(s+1)1q(s+1)≥r + r1q(s+1)<r≤q(s)

).

Thus, we are left with showing that replacing q(s) with q(s) in this expression does notaffect its value. Let r ≥ 0, since qi = qi1qi≥r, in general the equality q(s) = q(s) doesnot hold. Nevertheless, if s0 denotes the largest index less than or equal to S satisfyingq(s0) > 0, then q(s) ≥ r for all s ≤ s0 and q(s) = q(s). On the other hand, for S ≥ s > s0,

93

Page 107: cims.nyu.edumunoz/papers/thesis.pdfAcknowledgements First and foremost, I want to thank my advisor Mehryar Mohri who I recall once said to me he considers each student an investment

1q(s)≥r = 1q(s)≥r = 0. Thus,

S∑

s=1

cse(s)

(q(s+1)1q(s+1)≥r + r1q(s+1)<r≤q(s)

)

=

s0∑

s=1

cse(s)

(q(s+1)1q(s+1)≥r + r1q(s+1)<r≤q(s)

)

=

s0∑

s=1

cse(s)

(q(s+1)1q(s+1)≥r + r1q(s+1)<r≤q(s)

)

= −L(r,b),

which completes the proof.

In view of this proposition, we can replace the challenging problem of solving anoptimization problem in RN with solving the following simpler empirical risk mini-mization problem

minr∈[0,1]

n∑

i=1

L(r,bi) = minr∈[0,1]

n∑

i=1

S∑

s=1

Ls,i(r, q(s), q(s+1)), (4.6)

where Ls,i(r, q(s)), q(s+1)) := − cse(s)

(q(s+1)i 1

q(s+1)i ≥r − r1

q(s+1)i <r≤q(s)

i). In order to effi-

ciently minimize this highly non-convex function, we draw upon the work of the previ-ous chapter on minimization of sums of v-functions.

Definition 15. A function V : R3 → R is a v-function if it admits the following form:

V (r, q1, q2) = −a(1)1r≤q2 − a(2)r1q2<r≤q1 +(1

ηr − a(3)

)1q1<r<(1+η)q1 ,

with 0 ≤ a(1), a(2), a(3), η ≤ ∞ constants satisfying a(1) = a(2)q2, −a(2)q11η>0 =(1ηq1 − a(3)

)1η>0. Under the convention that 0 · ∞ = 0.

As suggested by their name, these functions admit a characteristic “V shape”. It isclear from Figure 4.1 that Ls,i is a v-function with a(1) = cs

e(s)q

(s+1)i , a(2) = cs

e(s)and

η = 0. Thus, we can apply the optimization algorithm given on the previous chapter tominimize (4.6) in O(nS log nS) time. The adaptation of this general algorithm to ourproblem is presented in Algorithm 3.

We conclude this section by presenting learning guarantees for our algorithm. Ourbounds are given in terms of the Rademacher complexity and the VC-dimension.

Definition 16. Let X be a set and let G := g : X → R be a family of functions.Given a sample S = (x1, . . . , xn) ∈ X , the empirical Rademacher complexity of G is

94

Page 108: cims.nyu.edumunoz/papers/thesis.pdfAcknowledgements First and foremost, I want to thank my advisor Mehryar Mohri who I recall once said to me he considers each student an investment

Algorithm 3 Minimization algorithm

Require: Scores (q(s)i ), i ∈ 1, . . . , n and s ∈ 1, . . . , S.

Define (p(1)is , p

(2)is ) = (q

(s)i , q

(s+1)i );

Set m = nS;N :=

⋃ni=1

⋃Ss=1p

(1)is , p

(2)is ;

(n1, ..., n2m) = Sort(N );Set di := (d1, d2) = 0

Set d1 = −∑ni=1

∑Ss=1

cseip

(2)is ;

Set r∗ = −1 and L∗ = inffor j = 2, . . . , 2m do

if nj−1 = p(2)is then

d1 = d1 + cseip

(2)is

d2 = d2 − csei

else if nj−1 = p(1)is then

d2 = d2 + cses

end ifL = d1 − nj ∗ d2;if L < L∗ thenL∗ = L;r∗ = nj;

end ifend forreturn r∗;

defined by

RS(G) =1

nEσ

[supg∈G

1

n

n∑

i=1

σig(xi)],

where σi is a random variable distributed uniformly over the set −1, 1.

Proposition 16. Let m = mini ei > 0 and M =∑S

s=1 cs. Then, for any δ > 0, withprobability at least 1 − δ over the draw of a sample S of size n, each of the followinginequalities holds for all r ∈ [0, 1]:

E[L(r,b)] ≤ 1

n

n∑

i=1

L(r,bi) +M

m

( 1√n

+

√log(en)

n

)+

√M log(1/δ)

mn(4.7)

95

Page 109: cims.nyu.edumunoz/papers/thesis.pdfAcknowledgements First and foremost, I want to thank my advisor Mehryar Mohri who I recall once said to me he considers each student an investment

- cs bi(s+1)

bi(s+1) bi

(s)

Figure 4.1: Depiction of the loss Li,s. Notice that the loss in fact resembles a broken“V”

and

1

n

n∑

i=1

L(r,bi) ≤ E[L(r,b)] +M

m

( 1√n

+

√log(en)

n

)+

√M log(1/δ)

2mn. (4.8)

Proof. Let Ψ: S 7→ supr∈[0,1]1n

∑ni=1 L(r,bi)−E[L(r,b)]. Let S i be a sample obtained

from S by replacing bi with b′i. It is not hard to verify that |Ψ(S)−Ψ(S i)| ≤ Mnm

. Thus,it follows from a standard learning bound that, with probability at least 1− δ,

E[L(r,b)] ≤ 1

n

n∑

i=1

L(r,bi) + RS(R) +

√M log(1/δ)

2mn.

where R = Lr : b 7→ L(r,b)|r ∈ [0, 1]. We proceed to bound the empiricalRademacher complexity of the class R. For q1 > q2 ≥ 0 let L(r, q1, q2) = q21q2>r +r1q1≥r≥q2 . By definition of Rademacher complexity we have

RS(R) =1

nEσ

[supr∈[0,1]

n∑

i=1

σiLr(bi)]

=1

nEσ

[supr∈[0,1]

n∑

i=1

σi

S∑

s=1

csesL(r, q

(s)i , q

(s+1)i )

]

≤ 1

nEσ

[ S∑

s=1

supr∈[0,1]

n∑

i=1

σiψs(L(r, q(s)i , q

(s+1)i ))

],

where ψs is the csm

-Lipschitz function mapping x 7→ cse(s)x. Therefore, by Talagrand’s

96

Page 110: cims.nyu.edumunoz/papers/thesis.pdfAcknowledgements First and foremost, I want to thank my advisor Mehryar Mohri who I recall once said to me he considers each student an investment

contraction lemma (Ledoux and Talagrand, 2011), the last term is bounded by

S∑

s=1

csnm

[supr∈[0,1]

n∑

i=1

σiL(r, q(s)i , q

(s+1)i )

]=

S∑

s=1

csmRSs(R),

where Ss =((q

(s)1 , q

(s+1)1 ), . . . , (q

(s)n , q

(s+1)n )

)and R := L(r, ·, ·)|r ∈ [0, 1]. The loss

L(r, q(s), q(s+1)) in fact evaluates to the negative revenue of a second-price auction withhighest bid q(s) and second highest bid q(s+1) (Mohri and Medina, 2014). Therefore, byPropositions 9 and 10 of (Mohri and Medina, 2014) we can write

RSs(R) ≤ 1

nEσ

[supr∈[0,1]

n∑

i=1

rσi

]+

√2 log en

n

≤( 1√

n+

√2 log en

n

).

Corollary 8. Under the hypothesis of Proposition 16, let r denote the empirical min-imizer and r∗ the minimizer of the expected loss. With probability at least 1 − δ, wehave

E[L(r,b)]− E[L(r∗,b)] ≤ 2(√M log(2/δ)

2mn+

M

m

( 1√n

+

√log(en)

n

)).

Proof. By the union bound, we (4.7) and (4.8) hold simultaneously with probability atleast 1− δ if δ is replaced by δ/2 in these equations. Adding both inequalities we obtain

E[L(r,b)]− E[L(r∗,b)] ≤ 1

n

n∑

i=1

L(r,bi)− L(r∗,bi) + 2(√M log(2/δ)

2mn

+M

m

( 1√n

+

√log(en)

n

))

The result now follows by using the fact that r is an empirical minimizer and thereforethat the difference appearing on the right-hand side of this inequality is less than or equalto 0.

It is worth noting that our algorithm is well defined whether or not the buyers bid inequilibrium. Indeed, the algorithm consists of the minimization over r of an observablequantity. While we can guarantee convergence to a solution of 4.1 only when buyers

97

Page 111: cims.nyu.edumunoz/papers/thesis.pdfAcknowledgements First and foremost, I want to thank my advisor Mehryar Mohri who I recall once said to me he considers each student an investment

play a symmetric BNE, our algorithm will still find an approximate solution to

minr∈[0,1]

Eb[L(r, b)],

which remains a quantity of interest that can be close to 4.1 if buyers are close to theequilibrium.

4.5 Convergence of Empirical EquilibriaA crucial assumption in the study of GSP auctions, including this work, is that adver-tisers bid in a Bayes-Nash equilibrium (Lucier et al., 2012; Sun et al., 2014). Thisassumption is partially justified by the fact that advertisers can infer the underlying dis-tribution F using as observations the outcomes of the past repeated auctions and canthereby implement an efficient equilibrium.

In this section, we provide a stronger theoretical justification in support of this as-sumption: we quantify the difference between the bidding function calculated usingobserved empirical distributions and the true symmetric bidding function in equilibria.For the sake of notation simplicity, we will consider only the rank-by-bid GSP auction.

Let Sv = (v1, . . . , vn) be an i.i.d. sample of values drawn from a continuous distri-bution F with density function f . Assume without loss of generality that v1 ≤ . . . ≤ vnand let v denote the vector defined by vi = vi. Let F denote the empirical distributionfunction induced by Sv and let F ∈ Rn and G ∈ Rn be defined by Fi = F (vi) = i/nand Gi = 1− Fi.

We consider a discrete GSP auction where the advertiser’s valuations are i.i.d. sam-ples drawn from a distribution F . In the event where two or more advertisers admit thesame valuation, ties are broken randomly. Denote by β the bidding function for thisauction in equilibrium (when it exists). We are interested in characterizing β and inproviding guarantees on the convergence of β to β as the sample size increases.

We first introduce the notation used throughout this section.

Definition 17. Given a vector F ∈ Rn, the backwards difference operator ∆ : Rn → Rn

is defined as:∆Fi = Fi − Fi−1,

for i > 1 and ∆F1 = F1.

We will denote ∆∆Fi by ∆2Fi. Given any k ∈ N and a vector F, the vector Fk

is defined as Fki = (Fi)

k. Finally, we will use the following notation for multinomialcoefficients as follows: (

a

b1, . . . bn

)=

a!

b1! . . . bn!,

where a, b1, . . . , bn ∈ N and∑n

i=1 bi = a.

98

Page 112: cims.nyu.edumunoz/papers/thesis.pdfAcknowledgements First and foremost, I want to thank my advisor Mehryar Mohri who I recall once said to me he considers each student an investment

Let us now define the discrete analog of the function zs that quantifies the probabilityof winning slot s.

Proposition 17. In a symmetric efficient equilibrium of the discrete GSP the probabilityzs(v) that an advertiser with valuation v is assigned to slot s is given by

zs(v) =N−s∑

j=0

s−1∑

k=0

(N − 1

j, k,N−1−j−k

)Fji−1G

ki

(N − j − k)nN−1−j−k .

if v = vi and by

zs(v) =

(N − 1

s− 1

)limv′→v−

F (v′)p(1− F (v))s−1 =: z−s (v),

where p = N − s.

In particular, notice that z−s (vi) admits the simple expression

z−s (vi) =

(N − 1

s− 1

)Fpi−1G

s−1i−1 ,

which is the discrete version of the function zs. On the other hand, even though functionzs(vi) does not admit a closed-form, it is not hard to show that

zs(vi) =

(N − 1

s− 1

)Fpi−1G

s−1i +O

( 1

n

). (4.9)

This can again can be thought of as a discrete version of zs. The proof of this and allother propositions in this section are deferred to Appendix C. Let us now define thelower triangular matrix M(s) by:

Mij(s) = −(N − 1

s− 1

)n∆Fp

j∆Gsi

s,

for i > j and

Mii(s) =N−s−1∑

j=0

s−1∑

k=0

(N − 1

j, k,N−1−j−k

)Fji−1G

ki

(N − j − k)nN−1−j−k .

Proposition 18. If the discrete GSP auction admits a symmetric efficient equilibrium,then its bidding function β satisfies β(vi) = βi, where β is the solution of the followinglinear equation.

Mβ = u, (4.10)

99

Page 113: cims.nyu.edumunoz/papers/thesis.pdfAcknowledgements First and foremost, I want to thank my advisor Mehryar Mohri who I recall once said to me he considers each student an investment

with M =∑S

s=1 csM(s) and

ui =S∑

s=1

(cszs(vi)vi −

i∑

j=1

z−s (vj)∆vj

). (4.11)

To gain some insight on the relationship between β, defined by Proposition 18 andβ, defined in Theorem 12; we compare equations (4.10) and (4.2). An integration byparts of the right-hand side of (4.2) and the change of variable G(v) = 1 − F (v) showthat β satisfies

S∑

s=1

csvzs(v)−∫ v

0

dzs(t)

dttdt =

S∑

s=1

cs

(N − 1

s− 1

)G(v)s−1

∫ v

0

β(t)dF p. (4.12)

On the other hand, equation (4.10) implies that for all i

ui =S∑

s=1

cs

[Mii(s)βi −

(N − 1

s− 1

)n∆Gs

i

s

i−1∑

j=1

∆Fpjβj

](4.13)

Moreover, by Lemma 15 and Proposition 31 in Appendix C, the equalities −n∆Gsi

s=

Gs−1i +O

(1n

)and

Mii(s) =1

2n

(N − 1

s− 1

)pFp−1

i−1Gs−1i +O

( 1

n2

),

hold. Thus, equation (4.13) resembles a numerical scheme for solving (4.12) where theintegral on the right-hand side is approximated by the trapezoidal rule. Equation (4.12)is in fact a Volterra equation of the first kind with kernel

K(t, v) =S∑

s=1

(N − 1

s− 1

)G(v)s−1pF p−1(t).

Therefore, we could benefit from the extensive literature on the convergence analysisof numerical schemes for this type of equations (Baker, 1977; Kress et al., 1989; Linz,1985). However, equations of the first kind are in general ill-posed problems (Kresset al., 1989), that is small perturbations on the equation can produce large errors onthe solution. When the kernel K satisfies mint∈[0,1]K(t, t) > 0, there exists a standardtechnique to transform an equation of the first kind to an equation of the second kind,which is a well posed problem. This makes the convergence analysis for this type ofproblems much simpler. The kernel function appearing in (4.12) does not satisfy thisproperty and therefore we cannot use these results for our problem. To the best of

100

Page 114: cims.nyu.edumunoz/papers/thesis.pdfAcknowledgements First and foremost, I want to thank my advisor Mehryar Mohri who I recall once said to me he considers each student an investment

0

0.02

0.04

0.06

0.08

0.1

0 100 200 300 400 500

Sample size

max(βi - βi-1)C n-1/2

Figure 4.2: (a) Empirical verification of Assumption 2. Values were generated using auniform distribution over [0, 1] and the parameters of the auction were N = 3, s = 2.The blue line corresponds to the quantity maxi ∆βi for different values of n. In red weplot the desired upper bound for C = 1/2.

our knowledge, the problem of using a simple quadrature method to solve a Volterraequation of the first kind with vanishing kernel has not been addressed previously.

In addition to dealing with an uncommon integral equation, we need to address theproblem that the elements of (4.10) are not exact evaluations of the functions defining(4.12) but rather stochastic approximations of these functions. Finally, the grid pointsused for the numerical approximation are also random.

In order to prove convergence of the function β to β we will make the followingassumptions

Assumption 1. There exists a constant c > 0 such that f(x) > c for all x ∈ [0, 1].

This assumption is needed to ensure that the difference between consecutive samplesvi − vi−1 goes to 0 as n → ∞, which is a necessary condition for the convergence ofany numerical scheme.

Assumption 2. The solution β of (4.10) satisfies vi,βi ≥ 0 for all i and maxi ∆βi ≤C√n

, for some universal constant C.

Since βi is a bidding strategy in equilibrium, it is reasonable to expect that vi ≥βi ≥ 0. On the other hand, the assumption on ∆βi is related to the smoothness of thesolution. If the function β is smooth, we should expect the approximation β to be smoothtoo. Both assumptions can in practice be verified empirically; for example, by using adensity estimation algorithm for Assumption 1. On the other hand, Assumption 2 can beverified by calculating the desired statistics as in Figure 4.2, which depicts the quantitymaxi∈1,...,n ∆βi as a function of the sample size n.

Assumption 3. The solution β to (4.2) is twice continuously differentiable.

This is satisfied if for instance the distribution function F is twice continuouslydifferentiable. We can now present our main result.

101

Page 115: cims.nyu.edumunoz/papers/thesis.pdfAcknowledgements First and foremost, I want to thank my advisor Mehryar Mohri who I recall once said to me he considers each student an investment

Theorem 14. If Assumptions 1, 2 and 3 are satisfied, then, with probability at least 1−δover the draw of a sample of size n, the following bound holds for all i ∈ [1, n]:

|β(vi)− β(vi)| ≤ eC( log(2/δ)N/2√

nq(n, δ/2)3 +

Cq(n, δ/2)

n3/2

).

where q(n, δ) = 2c

log(nc/2δ) with c defined in Assumption 1, and whereC is a universalconstant.

The proof of this theorem is highly technical and we defer it to Appendix C.6 andpresent here only a sketch of the proof.1. We take the the discrete derivative of (4.10) to obtain the new system of equations

dMβ = dui (4.14)

where dMij = Mij −Mi,j−1 and dui = ui − ui−1. This step is standard in theanalysis of numerical methods for Volterra equations of the first kind.

2. Since dMii = Mii and these values go to 0 as(in

)N−S it follows that (4.14) is ill-conditioned and therefore a straight forward comparison of the solutions β and βwill not work. Instead, we analyze the vector ψ = v − β and show that it satisfiesthe equation

dMψ = p

for some vector p defined in Appendix C.6. Furthermore, we show that that ψi ≤C i2

n2 for some universal constant C; and similarly the function ψ(v) = v−β(v) willalso satisfy ψ(v) ≤ Cv2. Therefore |ψ(vi)− ψi| ≤ C i2

n2 . In particular for i ≤ n3/4

we have |ψ(vi)−ψi| = |β(vi)− βi| ≤ C√n.

3. Using the fact that |F (vi) − Fi| is in O( 1√n). We show the sequence of errors

εi = |β(vi)− βi| satisfy the following recurrence:

εi ≤ C( 1√

nq(n, δ) +

dMi,i−1

dMii

εi−1 +1

n

i−2∑

j=1

εj

)

It is not hard to prove that dMi,i−1

dMii∼ 1

i. Since convergence of this term to 0 is too

slow we cannot provide bound on εi based on this recurrence. Instead, we will weuse the fact that |dMiiψi−dMi,i−1ψi−1| ≤ C dMii√

nto bound the difference between

ψ and the solution ψ′ of the equation

dM′iψ′ = pi

Where dM′ii = 2dMii, dM′

i,i−1 = 0 and dM′ij = dMij otherwise. More precisely,

We show that ‖ψ −ψ′‖∞ ≤ Cn2 .

102

Page 116: cims.nyu.edumunoz/papers/thesis.pdfAcknowledgements First and foremost, I want to thank my advisor Mehryar Mohri who I recall once said to me he considers each student an investment

0.0 0.2 0.4 0.6 0.8

0.1

0.3

0.5

0.7

0.1

0.3

0.5

0.7

0.0 0.2 0.4 0.6 0.8

b

v

Figure 4.3: Approximation of the empirical bidding function β to the true solution β.The true solution is shown in red and the shaded region represents the confidence intervalof β when simulating the discreteGSP 10 times with a sample of size 200. HereN = 3,S = 2, c1 = 1, c2 = 0.5 and values were sampled uniformly from [0, 1]

4. We show that ε′i = |ψ(vi)−ψ′i| satisfies the recurrence

ε′i ≤ C1√nq(n, δ) +

1

n

i−2∑

j=1

ε′i.

Notice that the term decreasing as 1i

no longer appears in the recurrence. Therefore,we can conclude that ε′i must satisfy the bound given in Theorem 14, which in turnimplies that |β(vi)− βi| also satisfies the bound.

4.6 ExperimentsHere we present preliminary experiments showing the advantages of our algorithm. Wealso present empirical evidence showing that the procedure proposed in (Sun et al.,2014) to estimate valuations from bids is incorrect. In contrast, our density estimationalgorithm correctly recovers valuations from bids in equilibrium.

4.6.1 SetupLet F1 and F2 denote the distributions of two truncated log-normal random variableswith parameters µ1 = log(.5), σ1 = .8 and µ2 = log(2), σ = .1 respectively. Here F1

is truncated to have support in [0, 1.5] and the support of F2 = [0, 2.5]. We consider aGSP with N = 4 advertisers with S = 3 slots and position factors c1 = 1, c2 =, 45 andc3 = 1. Based on the results of Section 4.5 we estimate the bidding function β with asample of 2000 points and we show its plot in Figure 4.4. We proceed to evaluate themethod proposed by Sun et al. (2014) for recovering advertisers’ valuations from bids inequilibrium. The assumption made by the authors is that the advertisers play a SNE in

103

Page 117: cims.nyu.edumunoz/papers/thesis.pdfAcknowledgements First and foremost, I want to thank my advisor Mehryar Mohri who I recall once said to me he considers each student an investment

0

0.5

1

1.5

2

2.5

0 0.5 1 1.5 2 2.5

β(v)v

Figure 4.4: Bidding function for our experiments in blue and identity function in red.Since β is a strictly increasing function it follows from (Gomes and Sweeney, 2014) thatthis GSP admits an equilibrium.

which case valuations can be inferred by solving a simple system of inequalities definingthe SNE (Varian, 2007). Since the authors do not specify which SNE the advertisers areplaying we follow the work of Ostrovsky and Schwarz (2011) and choose the one thatsolves the SNE conditions with equality.

We generated a sample S consisting of n = 300 i.i.d. outcomes of our simulatedauction. Since, N = 4 the effective size of this sample is of 1200 points. We generatedthe outcome bid vectors bi, . . . ,bn by using the equilibrium bidding function β. As-suming that the bids constitute a SNE we estimated the valuations and Figure 4.5 showsa histogram of the original sample as well as the histogram of the estimated valuations.It is clear from this figure that this procedure does not accurately recover the distribu-tion of the valuations. By contrast the histogram of the estimated valuations using ourdensity estimation algorithm is shown in Figure 4.5(c). The kernel function used byour algorithm was a triangular kernel given by K(u) = (1 − |u|)1|u|≤1. Following theexperimental setup of (Guerre et al., 2000) the bandwidth h was set to h = 1.06σn1/5,where σ denotes the standard deviation of the sample of bids.

Finally, we use both our density estimation algorithm and discriminative learningalgorithm to infer the optimal value of r and the associated expected revenue. To testour algorithm we generated a test sample of size n = 500 with the procedure previouslydescribed. The results are shown in Table 4.1

Density estimation Discriminative1.42 ± 0.02 1.85 ± 0.02

Table 4.1: Mean revenue of both our algorithms.

Notice that the difference in revenue between these algorithms does not seem com-parable, even though both solve the same problem. The cause of this discrepancy isa sutil difference in the problems that the discriminative algorithm and the density es-timation algorithm solve. Indeed, the discriminative algorithm assumes that test data

104

Page 118: cims.nyu.edumunoz/papers/thesis.pdfAcknowledgements First and foremost, I want to thank my advisor Mehryar Mohri who I recall once said to me he considers each student an investment

0.0 0.5 1.0 1.5 2.0 2.5

0.0

0.2

0.4

0.6

0.8

1.0

0.0

0.2

0.4

0.6

0.8

1.0

0.0 0.5 1.0 1.5 2.0 2.5

True valuations

0.0 0.5 1.0 1.5 2.0 2.5 3.0

0.0

0.2

0.4

0.6

0.8

0.0

0.2

0.4

0.6

0.8

0.0 0.5 1.0 1.5 2.0 2.5 3.0

SNE estimates

0.0 0.5 1.0 1.5 2.0 2.5

0.0

0.2

0.4

0.6

0.8

0.0

0.2

0.4

0.6

0.8

0.0 0.5 1.0 1.5 2.0 2.5

Kernel estimates

(a) (b) (c)

Figure 4.5: Comparison of methods for estimating valuations from bids. (a) Histogramof true valuations. (b) Valuations estimated under the SNE assumption. (c) Densityestimation algorithm.

comes from the same distribution as the source data. However, as shown by Gomes andSweeney (2014), introducing a reserve price may induce a different equilibrium. Moreprecisely, a reserve price r induces an equilibrium function β(r, v). The reserve pricer∗dens obtained using density estimation is such that r∗dens is optimal for the bidding func-tion β(r∗dens, v) whereas the reserve price r∗disc selected by the discriminative algorithm isoptimal for the equilbrium function β(0, v). We believe that it is reasonable to assumethat bidders do not change their bidding function too fast under the presence of a newreserve price. Therefore, it is more plausible for bids on the test data to be given bybeta(0, v).

4.7 ConclusionWe proposed and analyzed two algorithms for learning optimal reserve prices for gen-eralized second-price auctions. Our first algorithm is based on density estimation andtherefore suffers from the standard problems associated with this family of algorithms.Furthermore, this algorithm is only well defined when bidders play in equilibrium. Oursecond algorithm is novel and is based on learning theory guarantees. We show that thealgorithm admits an efficient O(nS log(nS)) implementation. Furthermore, our theo-retical guarantees are more favorable than those presented for the previous algorithmof (Sun et al., 2014). Moreover, even though it is necessary for advertisers to play inequilibrium for our algorithm to converge to optimality, when bidders do not play anequilibrium, our algorithm is still well defined and minimizes a quantity of interest al-beit over a smaller set. We also presented preliminary experimental results showing theadvantages of our algorithm. To our knowledge, this is the first attempt to apply learn-ing algorithms to the problem of reserve price optimization in GSP auctions. We believethat the use of learning algorithms in revenue optimization is crucial and that this worksuggests a rich research agenda including the extension of this work to a general learning

105

Page 119: cims.nyu.edumunoz/papers/thesis.pdfAcknowledgements First and foremost, I want to thank my advisor Mehryar Mohri who I recall once said to me he considers each student an investment

setup where auctions and advertisers are represented by features. Additionally, in ouranalysis, we considered two different ranking rules. It would be interesting to combinethe algorithm of (Zhu et al., 2009) with this work to learn both a ranking rule and anoptimal reserve price for GSP auctions. Finally, we provided the first analysis of con-vergence of bidding functions in an empirical equilibrium to the true bidding function.This result on its own is of great importance as it better justifies the common assumptionof advertisers playing in a Bayes-Nash equilibrium.

106

Page 120: cims.nyu.edumunoz/papers/thesis.pdfAcknowledgements First and foremost, I want to thank my advisor Mehryar Mohri who I recall once said to me he considers each student an investment

Chapter 5

Learning Against Strategic Adversaries

The results of the previous chapter demonstrate the advantages of using machine learn-ing techniques in revenue optimization for auctions. These algorithms however, relyon the assumption that auctions outcomes are independent from each other. In practice,however, buyers may react strategically against a seller trying to optimize his revenue. Inparticular, they might under-bid in order to obtain a more favorable reserve price in thefuture. In this chapter we study the interactions between a seller attempting to optimizehis revenue and a strategic buyer. More precisely, we analyze the problem of revenueoptimization learning algorithms for posted-price auctions with strategic buyers. We an-alyze a very broad family of monotone regret minimization algorithms for this problem,which includes the previously best known algorithm, and show that no algorithm in thatfamily admits a strategic regret more favorable than Ω(

√T ). We then introduce a new

algorithm that achieves a strategic regret differing from the lower bound only by a factorin O(log T ), an exponential improvement upon the previous best algorithm. Our newalgorithm admits a natural analysis and simpler proofs, and the ideas behind its designare general. We also report the results of empirical evaluations comparing our algorithmwith the previous state of the art and show a consistent exponential improvement inseveral different scenarios.

5.1 IntroductionRevenue optimization algorithms such as the one proposed in the previous chapter, crit-ically rely on the assumption that the bids, that is, the outcomes of auctions, are drawni.i.d. according to some unknown distribution. However, this assumption may not holdin practice. In particular, with the knowledge that a revenue optimization algorithmis being used, an advertiser could seek to mislead the publisher by under-bidding. Infact, consistent empirical evidence of strategic behavior by advertisers has been foundby Edelman and Ostrovsky (2007). This motivates the analysis presented in this chap-ter of the interactions between sellers and strategic buyers, that is, buyers that may act

107

Page 121: cims.nyu.edumunoz/papers/thesis.pdfAcknowledgements First and foremost, I want to thank my advisor Mehryar Mohri who I recall once said to me he considers each student an investment

non-truthfully with the goal of maximizing their surplus.The scenario we consider is that of posted-price auctions, which, albeit simpler than

other mechanisms, in fact matches a common situation in AdExchanges where manyauctions admit a single bidder. In this setting, second-price auctions with reserve areequivalent to posted-price auctions: a seller sets a reserve price for a good and the buyerdecides whether or not to accept it (that is to bid higher than the reserve price). In orderto capture the buyer’s strategic behavior, we will analyze an online scenario: at eachtime t, a price pt is offered by the seller and the buyer must decide to either accept it orleave it. This scenario can be modeled as a two-player repeated non-zero sum game withincomplete information, where the seller’s objective is to maximize his revenue, whilethe advertiser seeks to maximize her surplus as described in more detail in Section 5.2.

The literature on non-zero sum games is very rich (Nachbar, 1997, 2001; Morris,1994), but much of the work in that area has focused on characterizing different typesof equilibria, which is not directly relevant to the algorithmic questions arising here.Furthermore, the problem we consider admits a particular structure that can be exploitedto design efficient revenue optimization algorithms.

From the seller’s perspective, this game can also be viewed as a bandit problem(Kuleshov and Precup, 2010; Robbins, 1985) since only the revenue (or reward) forthe prices offered is accessible to the seller. Kleinberg and Leighton (2003a) preciselystudied this continuous bandit setting under the assumption of an oblivious buyer, thatis, one that does not exploit the seller’s behavior (more precisely, the authors assume thatat each round the seller interacts with a different buyer). The authors presented a tightregret bound of Θ(log log T ) for the scenario of a buyer holding a fixed valuation and aregret bound of O(T

23 ) when facing an adversarial buyer by using an elegant reduction

to a discrete bandit problem. However, as argued by Amin et al. (2013), when dealingwith a strategic buyer, the usual definition of regret is no longer meaningful. Indeed,consider the following example: let the valuation of the buyer be given by v ∈ [0, 1] andassume that an algorithm with sublinear regret such as Exp3 (Auer et al., 2002) or UCB(Auer et al., 2002) is used for T rounds by the seller. A possible strategy for the buyer,knowing the seller’s algorithm, would be to accept prices only if they are smaller thansome small value ε, certain that the seller would eventually learn to offer only prices lessthan ε. If ε v, the buyer would considerably boost her surplus while, in theory, theseller would have not incurred a large regret since in hindsight, the best fixed strategywould have been to offer price ε for all rounds. This, however is clearly not optimalfor the seller. The stronger notion of policy regret introduced by Arora et al. (2012) hasbeen shown to be the appropriate one for the analysis of bandit problems with adaptiveadversaries. However, for the example just described, a sublinear policy regret can besimilarly achieved. Thus, this notion of regret is also not the pertinent one for the studyof our scenario.

We will adopt instead the definition of strategic-regret, which was introduced byAmin et al. (2013) precisely for the study of this problem. This notion of regret also

108

Page 122: cims.nyu.edumunoz/papers/thesis.pdfAcknowledgements First and foremost, I want to thank my advisor Mehryar Mohri who I recall once said to me he considers each student an investment

matches the concept of learning loss introduced by (Agrawal, 1995) when facing anoblivious adversary. Using this definition, Amin et al. (2013) presented both upper andlower bounds for the regret of a seller facing a strategic buyer and showed that thebuyer’s surplus must be discounted over time in order to be able to achieve sublinearregret (see Section 5.2). However, the gap between the upper and lower bounds theypresented is in O(

√T ). In the following, we analyze a very broad family of monotone

regret minimization algorithms for this problem (Section 5.3), which includes the algo-rithm of Amin et al. (2013), and show that no algorithm in that family admits a strategicregret more favorable than Ω(

√T ). Next, we introduce a nearly-optimal algorithm that

achieves a strategic regret differing from the lower bound at most by a factor inO(log T )(Section 5.4). This represents an exponential improvement upon the existing best algo-rithm for this setting. Our new algorithm admits a natural analysis and simpler proofs.A key idea behind its design is a method deterring the buyer from lying, that is rejectingprices below her valuation.

5.2 SetupWe consider the following game played by a buyer and a seller. A good, such as anadvertisement space, is repeatedly offered for sale by the seller to the buyer over Trounds. The buyer holds a private valuation v ∈ [0, 1] for that good. At each roundt = 1, . . . , T , a price pt is offered by the seller and a decision at ∈ 0, 1 is made by thebuyer. at takes value 1 when the buyer accepts to buy at that price, 0 otherwise. We willsay that a buyer lies whenever at = 0 while pt < v. At the beginning of the game, thealgorithm A used by the seller to set prices is announced to the buyer. Thus, the buyerplays strategically against this algorithm. The knowledge ofA is a standard assumptionin mechanism design and also matches the practice in AdExchanges.

For any γ ∈ (0, 1), define the discounted surplus of the buyer as follows:

Sur(A, v) =T∑

t=1

γt−1at(v − pt). (5.1)

The value of the discount factor γ indicates the strength of the preference of the buyerfor current surpluses versus future ones. The performance of a seller’s algorithm ismeasured by the notion of strategic-regret (Amin et al., 2013) defined as follows:

Reg(A, v) = Tv −T∑

t=1

atpt. (5.2)

The buyer’s objective is to maximize his discounted surplus, while the seller seeks tominimize his regret. Note that, in view of the discounting factor γ, the buyer is not fully

109

Page 123: cims.nyu.edumunoz/papers/thesis.pdfAcknowledgements First and foremost, I want to thank my advisor Mehryar Mohri who I recall once said to me he considers each student an investment

adversarial. The problem consists of designing algorithms achieving sublinear strategicregret (that is a regret in o(T )).

The motivation behind the definition of strategic-regret is straightforward: a seller,with access to the buyer’s valuation, can set a fixed price for the good ε close to thisvalue. The buyer, having no control on the prices offered, has no option but to acceptthis price in order to optimize his utility. The revenue per round of the seller is thereforev − ε. Since there is no scenario where higher revenue can be achieved, this is a naturalsetting to compare the performance of our algorithm.

To gain more intuition about the problem, let us examine some of the complicationsarising when dealing with a strategic buyer. Suppose the seller attempts to learn thebuyer’s valuation v by performing a binary search. This would be a natural algorithmwhen facing a truthful buyer. However, in view of the buyer’s knowledge of the algo-rithm, for γ 0, it is in her best interest to lie on the initial rounds, thereby quickly,in fact exponentially, decreasing the price offered by the seller. The seller would thenincur an Ω(T ) regret. A binary search approach is therefore “too aggressive”. Indeed,an untruthful buyer can manipulate the seller into offering prices less than v/2 by lyingabout her value even just once! This discussion suggests following a more conservativeapproach. In the next section, we discuss a natural family of conservative algorithms forthis problem.

5.3 Monotone AlgorithmsThe following conservative pricing strategy was introduced by Amin et al. (2013). Letp1 = 1 and β < 1. If price pt is rejected at round t, the lower price pt+1 = βpt is offeredat the next round. If at any time price pt is accepted, then this price is offered for all theremaining rounds. We will denote this algorithm by monotone. The motivation behindits design is clear: for a suitable choice of β, the seller can slowly decrease the pricesoffered, thereby pressing the buyer to reject many prices (which is not convenient forher) before obtaining a favorable price. The authors present an O(Tγ

√T ) regret bound

for this algorithm, with Tγ = 1/(1− γ). A more careful analysis shows that this boundcan be further tightened to O(

√TγT +

√T ) when the discount factor γ is known to the

seller.Despite its sublinear regret, the monotone algorithm remains sub-optimal for cer-

tain choices of γ. Indeed, consider a scenario with γ 1. For this setting, the buyerwould no longer have an incentive to lie, thus, an algorithm such as binary search wouldachieve logarithmic regret, while the regret achieved by the monotone algorithm isonly guaranteed to be in O(

√T ).

One may argue that the monotone algorithm is too specific since it admits a singleparameter β and that perhaps a more complex algorithm with the same monotonic ideacould achieve a more favorable regret. Let us therefore analyze a generic monotone

110

Page 124: cims.nyu.edumunoz/papers/thesis.pdfAcknowledgements First and foremost, I want to thank my advisor Mehryar Mohri who I recall once said to me he considers each student an investment

Algorithm 4 Monotone algorithms.Let p1 = 1 and pt ≤ pt−1 for t = 2, . . . T .t← 1p← ptOffer price pwhile (Buyer rejects p) and (t < T ) dot← t+ 1p← ptOffer price p

end whilewhile (t < T ) dot← t+ 1Offer price p

end while

Algorithm 5 Definition of Ar.n = the root of T (T )while Offered prices less than T do

Offer price pnif Accepted then

n = r(n)else

Offer price pn for r roundsn = l(n)

end ifend while

algorithm Am defined by Algorithm 4.

Definition 18. For any buyer’s valuation v ∈ [0, 1], define the acceptance time κ∗ =κ∗(v) as the first time a price offered by the seller using algorithm Am is accepted.

Proposition 19. For any decreasing sequence of prices (pt)Tt=1, there exists a truthful

buyer with valuation v0 such that algorithm Am suffers regret of at least

Reg(Am, v0) ≥ 1

4

√T −√T .

The proof of this proposition is based on the idea that in order to achieve low regretin rounds where the buyer accepts a price, the distance between pt and pt+1 must besmall. However, for this scenario, when facing a buyer with value v 1 the seller willaccumulate a large regret from rounds where the buyer rejects a price. This intuition isformalized in the following Lemma.

Lemma 6. Let (pt)Tt=1 be a decreasing sequence of prices. Assume that the seller faces

a truthful buyer. Then, if v is sampled uniformly at random in the interval [12, 1], the

following inequality holds:

E[κ∗] ≥ 1

32E[v − pκ∗ ].

The proof of this Lemma can be found in Appendix D.1. We proceed to proveProposition 19.

Proof. By definition of the regret, we have Reg(Am, v) = vκ∗+ (T −κ∗)(v− pκ∗). Wecan consider two cases: κ∗(v0) >

√T for some v0 ∈ [1/2, 1] and κ∗(v) ≤

√T for every

v ∈ [1/2, 1]. In the former case, we have Reg(Am, v0) ≥ v0

√T ≥ 1

2

√T , which implies

the statement of the proposition. Thus, we can assume the latter condition.

111

Page 125: cims.nyu.edumunoz/papers/thesis.pdfAcknowledgements First and foremost, I want to thank my advisor Mehryar Mohri who I recall once said to me he considers each student an investment

Let v be uniformly distributed over [12, 1]. In view of Lemma 6, we have

E[vκ∗] + E[(T − κ∗)(v − pκ∗)] ≥1

2E[κ∗] + (T −

√T )E[(v − pκ∗)]

≥ 1

2E[κ∗] +

T −√T

32E[κ∗].

The right-hand side is minimized for E[κ∗] =

√T−√T

4. Plugging in this value yields

E[Reg(Am, v)] ≥√T−√T

4, which implies the existence of v0 with Reg(Am, v0) ≥√

T−√T

4.

We have thus shown that any monotone algorithm Am suffers a regret of at leastΩ(√T ), even when facing a truthful buyer. A tighter lower bound can be given under a

mild condition on the prices offered.

Definition 19. A sequence (pt)Tt=1 is said to be convex if it verifies pt−pt+1 ≥ pt+1−pt+2

for t = 1, . . . , T − 2.

An instance of a convex sequence is given by the prices offered by the monotonealgorithm. A seller offering prices forming a decreasing convex sequence seeks to con-trol the number of lies of the buyer by slowly reducing prices. The following propositiongives a lower bound on the regret of any algorithm in this family.

Proposition 20. Let (pt)Tt=1 be a decreasing convex sequence of prices. There exists

a valuation v0 for the buyer such that the regret of the monotone algorithm defined bythese prices is Ω(

√TCγ +

√T ), where Cγ = γ

2(1−γ).

The full proof of this proposition is given in Appendix D.1.1. The proposition showsthat when the discount factor γ is known, the monotone algorithm is in fact asymptot-ically optimal in its class.

The results just presented suggest that the dependency on T cannot be improved byany monotone algorithm. In some sense, this family of algorithms is “too conservative”.Thus, to achieve a more favorable regret guarantee, an entirely different algorithmicidea must be introduced. In the next section, we describe a new algorithm that achievesa substantially more advantageous strategic regret by combining the fast convergenceproperties of a binary search-type algorithm (in a truthful setting) with a method penal-izing untruthful behaviors of the buyer.

5.4 A Nearly Optimal AlgorithmLet A be an algorithm for revenue optimization used against a truthful buyer. Denoteby T (T ) the tree associated to A after T rounds. That is, T (T ) is a full tree of height

112

Page 126: cims.nyu.edumunoz/papers/thesis.pdfAcknowledgements First and foremost, I want to thank my advisor Mehryar Mohri who I recall once said to me he considers each student an investment

1/2

1/4 3/4

1/16 5/16 9/16 13/16

1/2

1/4 3/4

13/16

(a) (b)

Figure 5.1: (a) Tree T (3) associated to the algorithm proposed in (Kleinberg and Leighton,2003a). (b) Modified tree T ′(3) with r = 2.

T with nodes n ∈ T (T ) labeled with the prices pn offered by A. The right and leftchildren of n are denoted by r(n) and l(n) respectively. The price offered when pn isaccepted by the buyer is the label of r(n) while the price offered by A if pn is rejectedis the label of l(n). Finally, we will denote the left and right subtrees rooted at node nby L (n) and R(n) respectively. Figure 5.1 depicts the tree generated by an algorithmproposed by Kleinberg and Leighton (2003a), which we will describe later.

Since the buyer holds a fixed valuation, we will consider algorithms that increaseprices only after a price is accepted and decrease it only after a rejection. This is for-malized in the following definition.

Definition 20. An algorithm A is said to be consistent if maxn′∈L (n) pn′ ≤ pn ≤minn′∈R(n) pn′ for any node n ∈ T (T ).

For any consistent algorithm A, we define a modified algorithm Ar, parametrizedby an integer r ≥ 1, designed to face strategic buyers. Algorithm Ar offers the sameprices asA, but it is defined with the following modification: when a price is rejected bythe buyer, the seller offers the same price for r rounds. The pseudocode of Ar is givenin Algorithm 5. The motivation behind the modified algorithm is given by the followingsimple observation: a strategic buyer will lie only if she is certain that rejecting a pricewill boost her surplus in the future. By forcing the buyer to reject a price for severalrounds, the seller ensures that the future discounted surplus will be negligible, therebycoercing the buyer to be truthful.

We proceed to formally analyze algorithm Ar. In particular, we will quantify theeffect of the parameter r on the choice of the buyer’s strategy. To do so, a measure ofthe spread of the prices offered by Ar is needed.

Definition 21. For any node n ∈ T (T ) define the right increment of n as δrn := pr(n) −pn. Similarly, define its left increment to be δln := maxn′∈L (n) pn − pn′ .

The prices offered by Ar define a path in T (T ). For each node in this path, we candefine time t(n) to be the number of rounds needed for this node to be reached by Ar.

113

Page 127: cims.nyu.edumunoz/papers/thesis.pdfAcknowledgements First and foremost, I want to thank my advisor Mehryar Mohri who I recall once said to me he considers each student an investment

Note that, since r may be greater than 1, the path chosen by Ar might not necessarilyreach the leaves of T (T ). Finally, let S : n 7→ S(n) be the function representing thesurplus obtained by the buyer when playing an optimal strategy against Ar after node nis reached.

Lemma 7. The function S satisfies the following recursive relation:

S(n) = max(γt(n)−1(v − pn) + S(r(n)),S(l(n))). (5.3)

Proof. Define a weighted tree T ′(T ) ⊂ T (T ) of nodes reachable by algorithmAr. Weassign weights to the edges in the following way: if an edge on T ′(T ) is of the form(n, r(n)), its weight is set to be γt(n)−1(v − pn), otherwise, it is set to 0. It is easy to seethat the function S evaluates the weight of the longest path from node n to the leafs ofT ′(T ). It thus follows from elementary graph algorithms that equation (5.3) holds.

The previous lemma immediately gives us necessary conditions for a buyer to rejecta price.

Proposition 21. For any reachable node n, if price pn is rejected by the buyer, then thefollowing inequality holds:

v − pn <γr

(1− γ)(1− γr)(δln + γδrn).

Proof. A direct implication of Lemma 7 is that price pn will be rejected by the buyer ifand only if

γt(n)−1(v − pn) + S(r(n)) < S(l(n)). (5.4)

However, by definition, the buyer’s surplus obtained by following any path in R(n) isbounded above by S(r(n)). In particular, this is true for the path which rejects pr(n) andaccepts every price afterwards. The surplus of this path is given by

∑Tt=t(n)+r+1 γ

t−1(v−pt) where (pt)

Tt=t(n)+r+1 are the prices the seller would offer if price pr(n) were rejected.

Furthermore, since algorithm Ar is consistent, we must have pt ≤ pr(n) = pn + δrn.Therefore, S(r(n)) can be bounded as follows:

S(r(n)) ≥T∑

t=t(n)+r+1

γt−1(v − pn − δrn) =γt(n)+r − γT

1− γ (v − pn − δrn). (5.5)

We proceed to upper bound S(l(n)). Since pn − p′n ≤ δln for all n′ ∈ L (n), v − pn′ ≤v − pn + δln and

S(l(n)) ≤T∑

t=tn+r

γt−1(v − pn + δln) =γt(n)+r−1 − γT

1− γ (v − pn + δln). (5.6)

114

Page 128: cims.nyu.edumunoz/papers/thesis.pdfAcknowledgements First and foremost, I want to thank my advisor Mehryar Mohri who I recall once said to me he considers each student an investment

Combining inequalities (5.4), (5.5) and (5.6) we conclude that

γt(n)−1(v − pn) +γt(n)+r − γT

1− γ (v − pn − δrn) ≤γt(n)+r−1 − γT

1− γ (v − pn + δln)

⇒ (v − pn)(

1 +γr+1 − γr

1− γ

)≤ γrδln + γr+1δrn − γT−t(n)+1(δrn + δln)

1− γ

⇒ (v − pn)(1− γr) ≤ γr(δln + γδrn)

1− γ .

Rearranging the terms in the above inequality yields the desired result.

Let us consider the following instantiation of algorithm A introduced in (Kleinbergand Leighton, 2003a). The algorithm keeps track of a feasible interval [a, b] initializedto [0, 1] and an increment parameter ε initialized to 1/2. The algorithm works in phases.Within each phase, it offers prices a+ε, a+2ε, . . . until a price is rejected. If price a+kεis rejected, then a new phase starts with the feasible interval set to [a+ (k− 1)ε, a+ kε]and the increment parameter set to ε2. This process continues until b−a < 1/T at whichpoint the last phase starts and price a is offered for the remaining rounds. It is not hard tosee that the number of phases needed by the algorithm is less than dlog2 log2 T e+ 1. Amore surprising fact is that this algorithm has been shown to achieve regretO(log log T )when the seller faces a truthful buyer. We will show that the modification Ar of thisalgorithm admits a particularly favorable regret bound. We will call this algorithm PFSr(penalized fast search algorithm).

Proposition 22. For any value of v ∈ [0, 1] and any γ ∈ (0, 1), the regret of algorithmPFSr admits the following upper bound:

Reg(PFSr, v) ≤ (vr + 1)(dlog2 log2 T e+ 1) +(1 + γ)γrT

2(1− γ)(1− γr) . (5.7)

Note that for r = 1 and γ → 0 the upper bound coincides with that of (Kleinbergand Leighton, 2003a).

Proof. Algorithm PFSr can accumulate regret in two ways: the price offered pn is re-jected, in which case the regret is v, or the price is accepted and its regret is v − pn.

Let K = dlog2 log2 T e+ 1 be the number of phases run by algorithm PFSr. Since atmostK different prices are rejected by the buyer (one rejection per phase) and each pricemust be rejected for r rounds, the cumulative regret of all rejections is upper boundedby vKr.

The second type of regret can also be bounded straightforwardly. For any phase i,let εi and [ai, bi] denote the corresponding search parameter and feasible interval respec-tively. If v ∈ [ai, bi], the regret accrued in the case where the buyer accepts a price inthis interval is bounded by bi − ai =

√εi. If, on the other hand v ≥ bi, then it readily

115

Page 129: cims.nyu.edumunoz/papers/thesis.pdfAcknowledgements First and foremost, I want to thank my advisor Mehryar Mohri who I recall once said to me he considers each student an investment

follows that v − pn < v − bi +√εi for all prices pn offered in phase i. Therefore, the

regret obtained in acceptance rounds is bounded by

K∑

i=1

Ni

((v − bi)1v>bi +

√εi

)≤

K∑

i=1

(v − bi)1v>biNi +K,

where Ni ≤ 1√εi

denotes the number of prices offered during the i-th round.Finally, notice that, in view of the algorithm’s definition, every bi corresponds to a

rejected price. Thus, by Proposition 21, there exist nodes ni (not necessarily distinct)such that pni = bi and

v − bi = v − pni ≤γr

(1− γ)(1− γr)(δlni + γδrni).

It is immediate that δrn ≤ 1/2 and δln ≤ 1/2 for any node n, thus, we can write

K∑

i=1

(v − bi)1v>biNi ≤γr(1 + γ)

2(1− γ)(1− γr)K∑

i=1

Ni ≤γr(1 + γ)

2(1− γ)(1− γr)T.

The last inequality holds since at most T prices are offered by our algorithm. Combiningthe bounds for both regret types yields the result.

When an upper bound on the discount factor γ is known to the seller, he can leveragethis information and optimize upper bound (5.7) with respect to the parameter r.

Theorem 15. Let 1/2 < γ < γ0 < 1 and r∗ =⌈

argminr≥1 r +γr0T

(1−γ0)(1−γr0)

⌉. For any

v ∈ [0, 1], if T > 4, the regret of PFSr∗ satisfies

Reg(PFSr∗ , v) ≤ (2vγ0Tγ0 log cT + 1 + v)(log2 log2 T + 1) + 4Tγ0 ,

where c = 4 log 2.

The proof of this theorem is fairly technical and is deferred to the Appendix. Thetheorem helps us define conditions under which logarithmic regret can be achieved.Indeed, if γ0 = e−1/ log T = O(1− 1

log T), using the inequality e−x ≤ 1− x+ x2/2 valid

for all x > 0 we obtain

1

1− γ0

≤ log2 T

2 log T − 1≤ log T.

It then follows from Theorem 15 that

Reg(PFSr∗ , v) ≤ (2v log T log cT + 1 + v)(log2 log2 T + 1) + 4 log T.

116

Page 130: cims.nyu.edumunoz/papers/thesis.pdfAcknowledgements First and foremost, I want to thank my advisor Mehryar Mohri who I recall once said to me he considers each student an investment

Let us compare the regret bound given by Theorem 15 with the one given by Aminet al. (2013). The above discussion shows that for certain values of γ, an exponentiallybetter regret can be achieved by our algorithm. It can be argued that the knowledge ofan upper bound on γ is required, whereas this is not needed for the monotone algo-rithm. However, if γ > 1− 1/

√T , the regret bound on monotone is super-linear, and

therefore uninformative. Thus, in order to properly compare both algorithms, we mayassume that γ < 1 − 1/

√T in which case, by Theorem 15, the regret of our algorithm

is O(√T log T ) whereas only linear regret can be guaranteed by the monotone algo-

rithm. Even under the more favorable bound of O(√TγT +

√T ), for any α < 1 and

γ < 1 − 1/Tα, the monotone algorithm will achieve regret O(Tα+1

2 ) while a strictlybetter regret O(Tα log T log log T ) is attained by ours.

5.5 Lower BoundThe following lower bounds have been derived in previous work.

Theorem 16 ((Amin et al., 2013)). Let γ > 0 be fixed. For any algorithmA, there existsa valuation v for the buyer such that Reg(A, v) ≥ 1

12Tγ .

This theorem is in fact given for the stochastic setting where the buyer’s valuationis a random variable taken from some fixed distribution D. However, the proof of thetheorem selects D to be a point mass, therefore reducing the scenario to a fixed pricedsetting.

Theorem 17 ( (Kleinberg and Leighton, 2003a)). Given any algorithm A to be playedagainst a truthful buyer, there exists a value v ∈ [0, 1] such that Reg(A, v) ≥ C log log Tfor some universal constant C.

Combining these results leads immediately to the following.

Corollary 9. Given any algorithm A, there exists a buyer’s valuation v ∈ [0, 1] suchthat Reg(A, v) ≥ max

(112Tγ, C log log T

), for a universal constant C.

We now compare the upper bounds given in the previous section with the boundof Corollary 9. For γ > 1/2, we have Reg(PFSr, v) = O(Tγ log T log log T ). Onthe other hand, for γ ≤ 1/2, we may choose r = 1, in which case, by Proposi-tion 22, Reg(PFSr, v) = O(log log T ). Thus, the upper and lower bounds match upto an O(log T ) factor.

5.6 Empirical ResultsIn this section, we present the result of simulations comparing the monotone algo-rithm and our algorithm PFSr. The experiments were carried out as follows: given

117

Page 131: cims.nyu.edumunoz/papers/thesis.pdfAcknowledgements First and foremost, I want to thank my advisor Mehryar Mohri who I recall once said to me he considers each student an investment

γ = .85, v = .75 γ = .95, v = .75 γ = .75, v = .25 γ = .80, v = .25

0

200

400

600

800

1000

1200

2 2.5 3 3.5 4 4.5

Reg

ret

Number of rounds (log-scale)

PFSmon

0

500

1000

1500

2000

2500

2 2.5 3 3.5 4 4.5

Reg

ret

Number of rounds (log-scale)

PFSmon

0

20

40

60

80

100

120

2 2.5 3 3.5 4 4.5

Reg

ret

Number of rounds (log-scale)

PFSmon

Figure 5.2: Comparison of the monotone algorithm and PFSr for different choices of γ and v.The regret of each algorithm is plotted as a function of the number rounds when γ is not knownto the algorithms (first two figures) and when its value is made accessible to the algorithms (lasttwo figures).

a buyer’s valuation v, a discrete set of false valuations v were selected out of the set.03, .06, . . . , v. Both algorithms were run against a buyer making the seller believeher valuation is v instead of v. The value of v achieving the best utility for the buyerwas chosen and the regret for both algorithms is reported in Figure 5.2.

We considered two sets of experiments. First, the value of parameter γ was left un-known to both algorithms and the value of r was set to log(T ). This choice is motivatedby the discussion following Theorem 15 since, for large values of T , we can expect toachieve logarithmic regret. The first two plots (from left to right) in Figure 5.2 depictthese results. The apparent stationarity in the regret of PFSr is just a consequence of thescale of the plots as the regret is in fact growing as log(T ). For the second set of exper-iments, we allowed access to the parameter γ to both algorithms. The value of r waschosen optimally based on the results of Theorem 15 and the parameter β of monotonewas set to 1−1/

√TTγ to ensure regret in O(

√TTγ +

√T ). It is worth noting that even

though our algorithm was designed under the assumption of some knowledge about thevalue of γ, the experimental results show that an exponentially better performance overthe monotone algorithm is still attainable and in fact the performances of the optimizedand unoptimized versions of our algorithm are comparable. A more comprehensive se-ries of experiments is presented in Appendix D.2.

5.7 ConclusionWe presented a detailed analysis of revenue optimization algorithms against strategicbuyers. In doing so, we reduced the gap between upper and lower bounds on strategicregret to a logarithmic factor. Furthermore, the algorithm we presented is simple toanalyze and reduces to the truthful scenario in the limit of γ → 0, an important propertythat previous algorithms did not admit. We believe that our analysis helps gain a deeperunderstanding of this problem and that it can serve as a tool for studying more complexscenarios such as that of strategic behavior in repeated second-price auctions, VCGauctions and general market strategies.

118

Page 132: cims.nyu.edumunoz/papers/thesis.pdfAcknowledgements First and foremost, I want to thank my advisor Mehryar Mohri who I recall once said to me he considers each student an investment

Chapter 6

Conclusion

We have provided an extensive theoretical analysis of two important problems in ma-chine learning. Our results have inspired new and exciting research questions. Indeed,not only have we given tight learning guarantees for the problems of domain adaptationand drifting, but we have introduced two novel divergence metrics: the Y-discrepancyand the generalized discrepancy, which have been shown to be crucial in other relatedadaptation tasks such as that of Germain et al. (2013).

We also believe that the use of Y-discrepancy can be of pivotal importance in theanalysis of active learning too. Indeed, as pointed out in (Dasgupta, 2011), a secondaryeffect of actively selecting points for learning is that of biasing the data used for training,thereby making active learning an instance of the sample bias correction problem. Itis thus natural to ask whether the notions of discrepancy can be used to analyze thischallenging learning scenario.

In this thesis, we also provided an original analysis for learning in auctions whenfeatures are used. Indeed, whereas learning tools had been used in the past to studythis problem, previous work had never considered the problem of learning in auctionswith features, which we have shown can drastically increase the revenue of an auctionin practice. Furthermore, we provided a first convergence analysis of empirical Bayes-Nash equilibria which can help better understand the behavior of buyers in generalizedsecond-price auctions. Finally, we have carefully analyzed the effects of dealing withstrategic buyers with fixed valuations and recently have extended these results to buyerswith random valuations (Mohri and Medina, 2015) .

It is worth mentioning that our work has in fact has inspired new connections be-tween the auction theory and learning communities (Cole and Roughgarden, 2014;Morgenster and Roughgarden, 2015). Furthermore, the work presented in this thesisrepresents only the basis for what we believe will be a series of exciting future researchchallenges. Indeed, several questions can be derived from this thesis such as: can weanalyze a more general scenario where the seller does not hold a monopoly and wherehe competes with other sellers to retain costumers? Similarly, what happens when there

119

Page 133: cims.nyu.edumunoz/papers/thesis.pdfAcknowledgements First and foremost, I want to thank my advisor Mehryar Mohri who I recall once said to me he considers each student an investment

exists more than one buyer? Do other participants force strategic buyers to bid truth-fully? Finally, our model makes strong assumptions about the knowledge of the buyer,what would happen if both (buyer and seller), must learn the environment at the sametime? These and other similar challenging questions could for the basis for a theory oflearning in auctioning that we believe can form a rich branch of computer science andmathematics.

120

Page 134: cims.nyu.edumunoz/papers/thesis.pdfAcknowledgements First and foremost, I want to thank my advisor Mehryar Mohri who I recall once said to me he considers each student an investment

Appendix A

Appendix to Chapter 1

A.1 SDP FormulationLemma 8. The Lagrangian dual of the problem

maxa∈Rm

‖Ksa−y‖2≤r2

1

2‖Ksta‖2 − b>KtKsta, (A.1)

is given by

minη≥0,γ

γ

s. t.

(−1

2K>stKst + ηK2

s12K>stKtb− ηKsy

12b>KtKst − ηy>Ks η(‖y‖2 − r2) + γ

) 0.

Furthermore, the duality gap for these problems is zero.

Proof. For η ≥ 0 the Lagrangian of (A.1) is given by

L(a, η) =1

2‖Ksta‖2 − b>KtKsta− η(‖Ksa− y‖2 − r2)

= a>(1

2K>stKst − ηK2

s

)a + (2ηKsy −K>stKtb)>a− η(‖y‖2 − r2).

Since the Lagrangian is a quadratic function of a and that the conjugate function of aquadratic can be expressed in terms of the pseudo-inverse, the dual is given by

minη≥0

1

4(2ηKsy −K>stKtb)>

(ηK2

s −1

2K>stKst

)†(2ηKsy −K>stKtb)− η(‖y‖2 − r2)

s. t. ηK2s −

1

2K>stKst 0.

121

Page 135: cims.nyu.edumunoz/papers/thesis.pdfAcknowledgements First and foremost, I want to thank my advisor Mehryar Mohri who I recall once said to me he considers each student an investment

Introducing the variable γ to replace the objective function yields the equivalent problem

minη≥0,γ

γ

s. t. ηK2s −

1

2K>stKst 0

γ− 1

4(2ηKsy−K>stKtb)>

(ηK2

s−1

2K>stKst

)†(2ηKsy−K>stKtb)+η(‖y‖2−r2) ≥ 0

Finally, by the properties of the Schur complement (Boyd and Vandenberghe, 2004), thetwo constraints above are equivalent to

( −12K>stKst + ηK2

s12K>stKtb− ηKsy(

12K>stKtb− ηKsy

)>η(‖y‖2 − r) + γ

) 0.

Since duality holds for a general QCQP with only one constraint (Boyd and Vanden-berghe, 2004)[Appendix B], the duality gap between these problems is 0.

Proposition 23. The optimization problem (1.23) is equivalent to the following SDP:

maxα,β,ν,Z,z

1

2Tr(K>stKstZ)− β − α

s. t

(νK2

s + 12K>stKst − 1

4K νKsy + 1

4Kz

νy>Ks + 14z>K α + ν(‖y‖2 − r2)

) 0

(Z zz> 1

) 0 ∧

(λKt + K2

t12KtKstz

12z>K>stKt β

) 0

Tr(K2sZ)− 2y>Ksz + ‖y‖2 ≤ r2 ∧ ν ≥ 0,

where K = K>stKt(λKt + K2t )†KtKst.

Proof. By Lemma 3, we may rewrite (1.23) as

mina,γ,η,b

b>(λKt + K2t )b +

1

2a>K>stKsta− a>K>stKtb + γ (A.2)

s. t.

(−1

2K>stKst + ηK2

s12K>stKtb− ηKsy

12b>KtKst − ηy>Ks η(‖y‖2 − r2) + γ

)∧ η ≥ 0

‖Ksa− y‖2 ≤ r2.

Let us apply the change of variables b = 12(λKt + K2

t )†KtKsta + v. The following

122

Page 136: cims.nyu.edumunoz/papers/thesis.pdfAcknowledgements First and foremost, I want to thank my advisor Mehryar Mohri who I recall once said to me he considers each student an investment

equalities can be easily verified.

b>(λKt + K2t )b =

1

4a>K>stKt(λKt + K2

t )†KtKsta

+ v>KtKsta + v>(λKt + K2t )v

a>K>stKtb =1

2a>K>stKt(λKt + K2

t )†KtKsta + v>KtKsta.

Thus, replacing b on (A.2) yields

mina,v,γ,η

v>(λKt + K2t )v + a>

(1

2K>stKst −

1

4K)a + γ

s. t.

(−1

2K>stKst + ηK2

s14Ka + 1

2K>stKtv − ηKsy

14a>K + 1

2v>KtKst − ηy>Ks η(‖y‖2 − r2) + γ

) 0

η ≥ 0 ∧ ‖Ksa− y‖2 ≤ r2.

Introducing the scalar multipliers µ, ν ≥ 0 and the matrix(

Z zz> z,

) 0

as a multiplier for the matrix constraint, we can form the Lagrangian:

L := v>(λKt + K2t )v + a>

(1

2K>stKst −

1

4K)a + γ − µη + ν(‖Ksa− y‖2 − r2)

− Tr

((Z zz z

)( −12K>stKst + ηK2

s14Ka + 1

2K>stKtv − ηKsy

14a>K + 1

2v>KtKst − ηy>Ks η(‖y‖2 − r2) + γ

)).

The KKT conditions ∂L∂η

= ∂L∂γ

= 0 trivially imply z = 1 and Tr(K2sZ) − 2y>Ksz +

‖y‖2 − r2 + µ = 0. These constraints on the dual variables guarantee that the primalvariables η and γ will vanish from the Lagrangian, thus yielding

L =1

2Tr(K>stKstZ) + ν(‖y‖2 − r2) + v>(λKt + K2

t )v> − z>K>stKtv

+ a>(νK2

s +1

2K>stKst −

1

4K)a−

(2νKsy +

1

2Kz)>

a.

This is a quadratic function on the primal variables a and v with minimizing solutions

a =1

2

(νK2

s+1

2K>stKst−

1

4K)†(

2νKsy+1

2Kz)

and v =1

2(λKt+K2

t )†KtKstz,

123

Page 137: cims.nyu.edumunoz/papers/thesis.pdfAcknowledgements First and foremost, I want to thank my advisor Mehryar Mohri who I recall once said to me he considers each student an investment

and optimal value equal to the objective of the Lagrangian dual:

1

2Tr(K>stKstZ) + ν(‖y‖2 − r2)− 1

4z>Kz

− 1

4

(2νKsy +

1

2Kz)>(

νK2s +

1

2K>stKst −

1

4K)†(

2νKsy +1

2Kz).

As in Lemma 3, we apply the properties of the Schur complement to show that the dualis given by

maxα,β,ν,Z,z

1

2Tr(K>stKstZ)− β − α

s. t

(νK2

s + 12K>stKst − 1

4K νKsy + 1

4Kz

νy>Ks + 14z>K α + ν(‖y‖2 − r2)

) 0

(Z zz> 1

) 0 ∧ Tr(K2

sZ)− 2y>Ksz + ‖y‖2 ≤ r2

β ≥ 1

4z>Kz ∧ ν ≥ 0

Finally, recalling the definition of K and using the Schur complement one more time wearrive to the final SDP formulation:

maxα,β,ν,Z,z

1

2Tr(K>stKstZ)− β − α

s. t

(νK2

s + 12K>stKst − 1

4K νKsy + 1

4Kz

νy>Ks + 14z>K α + ν(‖y‖2 − r2)

) 0

(Z zz> 1

) 0 ∧

(λKt + K2

t12KtKstz

12z>K>stKt β

) 0

Tr(K2sZ)− 2y>Ksz + ‖y‖2 ≤ r2 ∧ ν ≥ 0.

A.2 QP FormulationProposition 24. Let Y = (Yij) ∈ Rn×k be the matrix defined by Yij = n−1/2hj(x

′i) and

y′ = (y′1, . . . , y′k)> ∈ Rk the vector defined by y′i = n−1

∑nj=1 hi(x

′j)

2. Then, the dual

124

Page 138: cims.nyu.edumunoz/papers/thesis.pdfAcknowledgements First and foremost, I want to thank my advisor Mehryar Mohri who I recall once said to me he considers each student an investment

problem of (1.24) is given by

maxα,γ,β

−(Yα+

γ

2

)>Kt

(λI +

1

2Kt

)−1(Yα+

γ

2

)

− 1

2γ>KtK

†tγ +α>y′ − β (A.3)

s.t. 1>α =1

2, 1β ≥ −Y>γ, α ≥ 0,

where 1 is the vector in Rk with all components equal to 1. Furthermore, the solu-tion h of (1.24) can be recovered from a solution (α,γ, β) of (A.3) by ∀x, h(x) =∑n

i=1 aiK(xi, x), where a =(λI + 1

2Kt)

−1(Yα+ 12γ).

We will first prove a simplified version of the proposition for the case of linearhypotheses, i.e. we can represent hypotheses in H and elements of X as vectors w,x ∈Rd respectively. Define X′ = n−1/2(x′1, . . . ,x

′n) to be the matrix whose columns are

the normalized sample points from the target distribution. Let also w1, . . . ,wk be asample taken from ∂H ′′ and define W := (w1, . . . ,wk) ∈ Rd×k. Under this notation,problem (1.24) may be rewritten as

minw∈Rd

λ‖w‖2 +1

2maxi=1,...,k

‖X′>(w −wi)‖2 +1

2minw′∈C‖X′>(w −w′)‖2 (A.4)

Lemma 9. The Lagrange dual of problem (A.4) is given by

maxα,γ,β

−(Yα+

γ

2

)>X′>

(λI +

X′X′>

2

)−1

X′(Y α+

γ

2

)

− 1

2γ>X′>(X′X′>)†X′γ +α>y′ − β

s. t. 1>α =1

21β ≥ −Y>γ α ≥ 0,

where Y = X′>W and y′i = ‖X′>wi‖2.

Proof. By applying the change of variable u = w′ −w, problem (A.4) is can be madeequivalent to

minw∈Rdu∈C−w

λ‖w‖2 +1

2‖X′>w‖2 +

1

2‖X′>u‖2 +

1

2maxi=1,...,k

‖X′>wi‖2 − 2w>i X′X′>w.

By making the constraints on u explicit and replacing the maximization term with the

125

Page 139: cims.nyu.edumunoz/papers/thesis.pdfAcknowledgements First and foremost, I want to thank my advisor Mehryar Mohri who I recall once said to me he considers each student an investment

variable r the above problem becomes

minw,u,r,µ

λ‖w‖2 +1

2‖X′>w‖2 +

1

2‖X′>u‖2 +

1

2r

s. t. 1r ≥ y′ − 2Y>X′>w

1>µ = 1 µ ≥ 0 Wµ−w = u.

For α, δ ≥ 0, the Lagrangian of this problem is defined as

L(w,u,µ, r,α, β, δ,γ ′) = λ‖w‖2 +1

2‖X′>w‖2 +

1

2‖X′>u‖2 +

1

2r + β(1>µ− 1)

+α>(y′ − 2(X′Y)>w − 1r)− δ>µ+ γ ′>(Wµ−w − u).

Minimizing with respect to the primal variables yields the following KKT conditions:

1>α =1

21β = δ −W>γ ′. (A.5)

X′X′>u = γ ′ 2

(λI +

X′X′>

2

)w = 2(X′Y )α+ γ ′ (A.6)

Condition (A.5) implies that the terms involving r and µ will vanish from the La-grangian. Furthermore, the first equation in (A.6) implies that any feasible γ ′ mustsatisfy γ ′ = X′γ for some γ ∈ Rn. Finally, it is immediate that γ ′>u = u>X′X′>u

and 2w>(λI + X′X′>

2

)w = 2α>(X′Y)>w + γ ′>w. Thus, at the optimal point, the

Lagrangian becomes

−w>(λI +

1

2X′X′>

)w − 1

2u>X′X′>u +α>y′ − β

s. t. 1>α =1

21β = δ −W>γ ′ α ≥ 0 ∧ δ ≥ 0.

The positivity of δ implies that 1β ≥ −W>γ ′. Solving for w and u on (A.6) andapplying the change of variable X′γ = γ ′ we obtain the final expression for the dualproblem:

maxα,γ,β

−(Yα+

γ

2

)>X′>

(λI +

X′X′>

2

)−1

X′(Y α+

γ

2

)

− 1

2γ>X′>(X′X′>)†X′γ +α>y′ − β

s. t. 1>α =1

21β ≥ −Y>γ α ≥ 0,

where we have used the fact that Y>γ = WX′>γ to simplify the constraints. Notice

126

Page 140: cims.nyu.edumunoz/papers/thesis.pdfAcknowledgements First and foremost, I want to thank my advisor Mehryar Mohri who I recall once said to me he considers each student an investment

also that we can recover the solution w of problem (A.4) as w = (λI + 12X′>X′)−1

X′(Yα+ 12γ)

Using the matrix identities X′(λI + X′>X′)−1 = (λI + X′X′>)−1X′ andX′>X′(X′>X′)† = X′>(X′X′>)†X′, the proof of Proposition 7 is now immediate.

Proposition 7. We can rewrite the dual objective of the previous lemma in terms of theGram matrix X′>X′ alone as follows:

maxα,γ,β

−(Yα+

γ

2

)>X′>X′

(λI +

X′>X′

2

)−1(Y α+

γ

2

)

− 1

2γ>X′>X′(X′>X′)†γ +α>y′ − β

s. t. 1>α =1

21β ≥ −Y>γ α ≥ 0.

By replacing X′>X′ by the more general kernel matrix Kt (which corresponds to theGram matrix in the feature space) we obtain the desired expression for the dual. Addi-tionally, the same matrix identities applied to condition (A.6) imply that the optimal hy-pothesis h is given by h(x) =

∑ni=1 aiK(x′i, x) where a = (λI+ 1

2Kt)

−1(Yα+ γ2).

A.3 µ-admissibilityLemma 10 (Relaxed triangle inequality). For any p ≥ 1, let Lp be the loss defined overRN by Lp(x,y) = ‖y− x‖p for all x,y ∈ RN . Then, the following inequality holds forall x,y, z ∈ RN :

Lp(x, z) ≤ 2q−1[Lp(x,y) + Lp(y, z)].

Proof. Observe that

Lp(x, z) = 2p∥∥∥x− y

2+

y − z

2

∥∥∥p

.

For p ≥ 1, x 7→ xp is convex, thus,

Lp(x, z) ≤ 2p1

2

[‖(x− y)‖p + ‖(y − z)‖p

]= 2p−1[Lp(x, z) + Lp(y, z)],

which concludes the proof.

Lemma 11. Assume that Lp(h(x), y) ≤ M for all x ∈ X and y ∈ Y , then Lp isµ-admissible with µ = pMp−1.

127

Page 141: cims.nyu.edumunoz/papers/thesis.pdfAcknowledgements First and foremost, I want to thank my advisor Mehryar Mohri who I recall once said to me he considers each student an investment

Proof. Since x 7→ xp is p-Lipschitz over [0, 1] we can write

|L(h(x), y)− L(h′(x), y)| = Mp

∣∣∣∣( |h(x)− y|

M

)p−( |h′(x)− y|

M

)p∣∣∣∣≤ pMp−1|h(x)− y + y − h′(x)|= pMp−1|h(x)− h′(x)|,

which concludes the proof.

Lemma 12. Let L be the Lp loss for some p ≥ 1 and let h, h′, h′′ be functions satisfyingLp(h(x), h′(x)) ≤M and Lp(h′′(x), h′(x)) ≤M for all x ∈ X , for someM ≥ 0. Then,for any distribution D over X , the following inequality holds:

|LD(h, h′)− LD(h′′, h′)| ≤ pMp−1[LD(h, h′′)]1p . (A.7)

Proof. Proceeding as in the proof of Lemma 11, we obtain

|LD(h, h′)− LD(h′′, h′)| = | Ex∈D

[Lp(h(x), h′(x))− Lp(h′′(x), h′(x)

]|

≤ pMp−1 Ex∈D

[|h(x)− h′′(x)|

].

Since p ≥ 1, by Jensen’s inequality, we can write Ex∈D[|h(x) − h′′(x)|

]≤

Ex∈D[|h(x)− h′′(x)|p

]1/p= [LD(h, h′′)]

1p .

128

Page 142: cims.nyu.edumunoz/papers/thesis.pdfAcknowledgements First and foremost, I want to thank my advisor Mehryar Mohri who I recall once said to me he considers each student an investment

Appendix B

Appendix to Chapter 3

B.1 Contraction LemmaThe following is a version of Talagrand’s contraction lemma Ledoux and Talagrand(2011). Since our definition of Rademacher complexity does not use absolute values,we give an explicit proof below.

Lemma 13. LetH be a hypothesis set of functions mapping X to R and Ψ1, . . . ,Ψm, µ-Lipschitz functions for some µ > 0. Then, for any sample S ofm points x1, . . . , xm ∈ X ,the following inequality holds

1

mEσ

[suph∈H

m∑

i=1

σi(Ψi h)(xi)

]≤ µ

mEσ

[suph∈H

m∑

i=1

σih(xi)

]

= µ RS(H).

Proof. The proof is similar to the case where the functions Ψi are all equal. Fix asample S = (x1, . . . , xm). Then, we can rewrite the empirical Rademacher complexityas follows:

1

mEσ

[suph∈H

m∑

i=1

σi(Ψi h)(xi)]

=1

mE

σ1,...,σm−1

[Eσm

[suph∈H

um−1(h)+σm(Ψm h)(xm)]],

where um−1(h) =∑m−1

i=1 σi(Ψi h)(xi). Assume that the suprema can be attained andlet h1, h2 ∈ H be the hypotheses satisfying

um−1(h1) + Ψm(h1(xm)) = suph∈H

um−1(h) + Ψm(h(xm))

um−1(h2)−Ψm(h2(xm)) = suph∈H

um−1(h)−Ψm(h(xm)).

129

Page 143: cims.nyu.edumunoz/papers/thesis.pdfAcknowledgements First and foremost, I want to thank my advisor Mehryar Mohri who I recall once said to me he considers each student an investment

When the suprema are not reached, a similar argument to what follows can be given byconsidering instead hypotheses that are ε-close to the suprema for any ε > 0.

By definition of expectation, since σm uniform distributed over −1,+1, we canwrite

Eσm

[suph∈H

um−1(h) + σm(Ψm h)(xm)]

=1

2suph∈H

um−1(h) + (Ψm h)(xm) +1

2suph∈H

um−1(h)− (Ψm h)(xm)

=1

2[um−1(h1) + (Ψm h1)(xm)] +

1

2[um−1(h2)− (Ψm h2)(xm)].

Let s = sgn(h1(xm)− h2(xm)). Then, the previous equality implies

Eσm

[suph∈H

um−1(h) + σm(Ψm h)(xm)]

≤ 1

2[um−1(h1) + um−1(h2) + sµ(h1(xm)− h2(xm))]

=1

2[um−1(h1) + sµh1(xm)] +

1

2[um−1(h2)− sµh2(xm)]

≤ 1

2suph∈H

[um−1(h) + sµh(xm)] +1

2suph∈H

[um−1(h)− sµh(xm)]

= Eσm

[suph∈H

um−1(h) + σmµh(xm)],

where we used the µ−Lipschitzness of Ψm in the first inequality and the definition ofexpectation over σm for the last equality. Proceeding in the same way for all other σi’s(i 6= m) proves the lemma.

B.2 Proof of Theorem 8Proof. We first show that the functions Ln are uniformly bounded for any b:

|Ln(r, b)| =∣∣∣∫ r

0

L′n(r, b)dr∣∣∣ ≤

∫ M

0

max(∣∣∣L′n(0, b)

∣∣∣,∣∣∣L′n(M, b)

∣∣∣)dr

≤∫ M

0

Kdr = MK,

where the first inequality holds since, by convexity, the derivative of Ln with respect tor is an increasing function.

130

Page 144: cims.nyu.edumunoz/papers/thesis.pdfAcknowledgements First and foremost, I want to thank my advisor Mehryar Mohri who I recall once said to me he considers each student an investment

Next, we show that the sequence (Ln)n∈N is also equicontinuous. It will followthen by the theorem of Arzela-Ascoli that the sequence Ln(·, b) converges uniformly toLc(·, b). Let r1, r2 ∈ [0,M ], for any b ∈ [0,M ] we have

|Ln(r1, b)− Ln(r2, b)| ≤ supr∈[0,M ]

|L′n(r, b)| |r1 − r2|

= max (|L′n(0, b)| , |L′n(M, b))|) |r1 − r2|≤ K|r1 − r2|,

where, again, the convexity of Ln was used for the first equality. Let Fn(r) =Eb∼D[Ln(r, b)] and F (r) = Eb∼D[Lc(r, b)]. Fn is a convex function as the expecta-tion of a convex function. By the theorem of Arzela-Ascoli, the sequence (Fn)n admitsa uniformly convergent subsequence. Furthermore, by the dominated convergence the-orem, we have (Fn(r))n converges pointwise to F (r). Therefore, the uniform limit ofFn must be F . This implies that

minr∈[0,M ]

F (r) = limn→+∞

minr∈[0,M ]

Fn(r) = limn→+∞

Fn(rn) = F (r∗),

where the first and third equalities follow from the uniform convergence of Fn to F . Thelast equation implies that Lc is consistent with L. Furthermore, the function Lc(·, b) isconvex since it is the uniform limit of convex functions. It then follows by Proposition 12that Lc(·, b) ≡ Lc(0, b) = 0.

B.3 Consistency of LγLemma 14. Let H be a closed, convex subset of a linear space of functions containing0 and let h∗γ = argminh∈H Lγ(h). Then, the following inequality holds:

Ex,b

[h∗γ(x)1I2(x,b)

]≥ 1

γEx,b

[h∗γ(x)1I3(x,b)

].

Proof. Let 0 < λ < 1. Since H is a convex set, it follows that λh∗γ ∈ H . Furthermore,by the definition of h∗γ , we must have:

Ex,b

[Lγ(h

∗γ(x),b)

]≤ E

x,b

[Lγ(λh

∗γ(x),b)

]. (B.1)

If h∗γ(x) < 0, then Lγ(h∗γ(x),b) = Lγ(λh∗γ(x)) = −b(2) by definition of Lγ . If on

the other hand h∗γ(x) > 0, since λh∗γ(x) < h∗γ(x), we must have that for (x,b) ∈I1 Lγ(h

∗γ(x),b) = Lγ(λh

∗γ(x),b) = −b(2) too. Moreover, from the fact that Lγ ≤ 0

and Lγ(h∗γ(x),b) = 0 for (x,b) ∈ I4 it follows that Lγ(h∗γ(x),b) ≥ Lγ(λh∗γ(x),b) for

131

Page 145: cims.nyu.edumunoz/papers/thesis.pdfAcknowledgements First and foremost, I want to thank my advisor Mehryar Mohri who I recall once said to me he considers each student an investment

(x,b) ∈ I4, and therefore the following inequality trivially holds:

Ex,b

[Lγ(h

∗γ(x),b)(1I1(x,b) + 1I4(x,b))

]

≥ Ex,b

[Lγ(λh

∗γ(x),b)(1I1(x,b) + 1I4(x,b))

]. (B.2)

Subtracting (B.2) from (B.1) we obtain

Ex,b

[Lγ(h

∗γ(x),b)(1I2(x,b) + 1I3(x,b))

]

≤ Ex,b

[Lγ(λh

∗γ(x),b)(1I2(x,b) + 1I3(x,b))

].

Rearranging terms shows that this inequality is equivalent to

Ex,b

[(Lγ(λh

∗γ(x),b)− Lγ(h∗γ(x),b))1I2(x,b)

]

≥ Ex,b

[(Lγ(h

∗γ(x),b)− Lγ(λh∗γ(x),b))1I3(x,b)

](B.3)

Notice that if (x,b) ∈ I2, then Lγ(h∗γ(x),b) = −h∗γ(x). If λh∗γ(x) > b(2) too thenLγ(λh

∗γ(x),b) = −λh∗γ(x). On the other hand if λh∗γ(x) ≤ b(2) then Lγ(λh∗γ(x),b) =

−b(2) ≤ −λh∗γ(x). Thus

E(Lγ(λh∗γ(x),b)− Lγ(h∗γ(x),b))1I2(x,b)) ≤ (1− λ)E(h∗γ(x)1I2(x,b)) (B.4)

This gives an upper bound for the left-hand side of inequality (B.3). We now seek toderive a lower bound on the right-hand side. To do so, we analyze two different cases:1. λh∗γ(x) ≤ b(1);2. λh∗γ(x) > b(1).

In the first case, we know that Lγ(h∗γ(x),b) = 1γ(h∗γ(x) − (1 + γ)b(1)) > −b(1) (since

h∗γ(x) > b(1) for (x,b) ∈ I3). Furthermore, if λh∗γ(x) ≤ b(1), then, by definitionLγ(λh

∗γ(x),b) = min(−b(2),−λh∗γ(x)) ≤ −λh∗γ(x). Thus, we must have:

Lγ(h∗γ(x),b)− Lγ(λh∗γ(x),b) > λh∗γ(x)− b(1) > (λ− 1)b(1) ≥ (λ− 1)M, (B.5)

where we used the fact that h∗γ(x) > b(1) for the second inequality and the last inequalityholds since λ− 1 < 0.

We analyze the second case now. If λh∗γ(x) > b(1), then for (x,b) ∈ I3 we haveLγ(h

∗γ(x),b)−Lγ(λh∗γ(x),b)= 1

γ(1−λ)h∗γ(x). Thus, letting ∆(x,b)=Lγ(h

∗γ(x),b)−

132

Page 146: cims.nyu.edumunoz/papers/thesis.pdfAcknowledgements First and foremost, I want to thank my advisor Mehryar Mohri who I recall once said to me he considers each student an investment

Lγ(λh∗γ(x),b), we can lower bound the right-hand side of (B.3) as:

Ex,b

[∆(x,b)1I3(x,b)

]

= Ex,b

[∆(x,b)1I3(x,b)1λh∗γ(x)>b(1)

]+ E

x,b

[∆(x,b)1I3(x,b)1λh∗γ(x)≤b(1)

]

≥ 1− λγ

Ex,b

[h∗γ(x)1I3(x,b)1λh∗γ(x)>b(1)

]+ (λ− 1)M P

[h∗γ(x) > b(1) ≥ λh∗γ(x)

],

(B.6)

where we have used (B.5) to bound the second summand. Combining inequalities (B.3),(B.4) and (B.6) and dividing by (1− λ) we obtain the bound

Ex,b

[h∗γ(x)1I2(x,b)

]

≥ 1

γEx,b

[h∗γ(x)1I3(x,b)1λh∗γ(x)>b(1)

]−M P

[h∗γ(x) > b(1) ≥ λh∗γ(x)

].

Finally, taking the limit λ→ 1, we obtain

Ex,b

[h∗γ(x)1I2(x,b)

]≥ 1

γEx,b

[h∗γ(x)1I3(x,b)

].

Taking the limit inside the expectation is justified by the bounded convergence theoremand P[h∗γ(x) > b(1) ≥ λh∗γ(x)]→ 0 holds by the continuity of probability measures.

Proposition 25. For any δ > 0, with probability at least 1 − δ over the choice of asample S of size m, the following holds for all γ ∈ (0, 1] and h ∈ H:

Lγ(h) ≤ Lγ(h) +2

γRm(H) +M

[√log log2

m+

√log 1

δ

2m

].

Proof. Consider two sequences (γk)k≥1 and (εk)k≥1, with εk ∈ (0, 1). By theorem 10,for any fixed k ≥ 1,

P[Lγk(h)− Lγk(h)>

2

γkRm(H) +Mεk

]≤ exp(−2mε2k).

133

Page 147: cims.nyu.edumunoz/papers/thesis.pdfAcknowledgements First and foremost, I want to thank my advisor Mehryar Mohri who I recall once said to me he considers each student an investment

Choose εk = ε+√

log km

, then, by the union bound,

P[∃k : Lγk(h)− Lγk(h) >

1

γkRm(H) +Mεk

]

≤∑

k≥1

exp[− 2m(ε+

√(log k)/m)2

]

≤(∑

k≥1

1/k2)

exp(−2mε2)

=π2

6exp(−2mε2) ≤ 2 exp(−2mε2).

For any γ ∈ (0, 1], there exists k ≥ 1 such that γ ∈ (γk, γk−1) with γk = 1/2k. For sucha k, 1

γk−1≤ 1

γ, γk−1 ≤ γ

2, and

√log(k − 1) =

√log log2(1/γk−1) ≤

√log log2(1/γ).

Since for any h ∈ H , Lγk−1(h) ≤ Lγ(h), we can write

P[L(h)−Lγ(h)>

2

γRm(H)+M

(K(γ) + ε

)]≤ exp(−2mε2),

where K(γ) =

√log log2

m. This concludes the proof.

Corollary 10. Let H be a hypothesis set with pseudo-dimension d = Pdim(H). Then,for any δ > 0 and any γ > 0, with probability at least 1− δ over the choice of a sampleS of size m, the following inequality holds:

L(hγ) ≤ L∗ +2γ + 2

γRm(H) + γM +M

[2

√2d log εm

d

m+ 2

√log 2

δ

2m+

√log log2

m

].

The proof follows the same steps as Theorem 11 and uses the results of Proposi-tion 25. Notice that, by setting γ = 1

m1/4 , we can guarantee the convergence of L(hγ) toL∗. Indeed, with this choice, the bound can be expressed as follows:

L(hγ) ≤ L∗ + (2 +m1/4)Rm(H) +1

m1/4M

+M

[2

√2d log εm

d

m+ 2

√log 2

δ

2m+

√log log2m

1/4

m

].

Furthermore, when H has finite pseudo-dimension, it is known that <m(H) is inO(

1m1/2

). This shows that L(hγ) = L∗ +O

(1

m1/4

).

134

Page 148: cims.nyu.edumunoz/papers/thesis.pdfAcknowledgements First and foremost, I want to thank my advisor Mehryar Mohri who I recall once said to me he considers each student an investment

B.4 Proof of Proposition 13Proof. From the definition of v-function, it is immediate that Vi is differentiable every-where except at the three points n(1)

i = b(2)i , n

(2)i = b

(1)i and n(3)

i = (1 + η)b(1)i . Let r∗

be a minimizer of F . If r∗ 6= n(j)i for every j ∈ 1, 2, 3 and i ∈ 1, . . . ,m, then F

must be differentiable at r∗ and F ′(r∗) = 0. Now, let n∗ = maxn(j)i |n(j)

i < r∗. SinceF is a linear function over the interval (n∗, r∗], we must have F ′(r) = F ′(r∗) = 0 forevery r ∈ (n∗, r∗]. Thus, F reduces to a constant over this interval and continuity of Fimplies that F (n∗) = F (r∗).

We conclude the proof by showing that n∗ is equal to b(1)i for some i. Suppose this

is not the case and let U be an open interval around n∗ satisfying b(1)i /∈ U for all i. It

is not hard to verify that Vi is a concave function over every interval not containing b(1)i .

In particular Vi is concave over U for any i and, as a sum of concave functions, F isconcave too over the interval U . Moreover, by definition, n∗ minimizes F restricted toU . This implies that F is constant over U as a non-constant concave function cannotreach its minimum over an open set. Finally, let b∗ = argmini |n∗ − b(1)

i |. Since U wasan arbitrary open interval, it follows that there exists r arbitrarily close to b∗ such thatF (r) = F (n∗). By the continuity of F , we must then have F (b∗) = F (n∗).

135

Page 149: cims.nyu.edumunoz/papers/thesis.pdfAcknowledgements First and foremost, I want to thank my advisor Mehryar Mohri who I recall once said to me he considers each student an investment

Appendix C

Appendix to Chapter 4

C.1 The Envelope TheoremThe envelope theorem is a well known result in applied mathematics characterizing themaximum of a parametrized family of functions. A general version of this theorem isdue to Milgrom and Segal (2002) and we include its proof here for completeness. Wewill let X be an arbitrary space and we will consider a function f : X × [0, 1]→ R. Wedefine the envelope function V and the set valued function X∗ as

V (t) = supx∈X

f(x, t) and X∗(t) = x ∈ X|f(x, t) = V (t).

We show a plot of the envelope function in figure C.1.

Theorem 18 (Envelope Theorem). Let f be an absolutely continuous function for everyx ∈ X . Suppose also that there exists an integrable function b : [0, 1] → R+ suchthat for every x ∈ X , df

dt(x, t) ≤ b(t) almost everywhere in t. Then V is absolutely

continuous. If in addition f(x, ·) is differentiable for all x ∈ X , X∗(t) 6= ∅ almosteverywhere on [0, 1] and x∗(t) denotes an arbitrary element in X∗(t), then

V (t) = V (0) +

∫ t

0

df

dt(x∗(s), s)ds.

Proof. By definition of V , for any t′, t′′ ∈ [0, 1] we have

|V (t′′)− V (t′)| ≤ supx∈X|f(x, t′′)− f(x, t′)|

= supx∈X

∣∣∣∫ t′′

t′

df

dt(x, s)

∣∣∣ ≤∫ t′′

t′b(t)dt.

This easily implies that V (t) is absolutely continuous. Therefore, V is differentiable

136

Page 150: cims.nyu.edumunoz/papers/thesis.pdfAcknowledgements First and foremost, I want to thank my advisor Mehryar Mohri who I recall once said to me he considers each student an investment

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

f(x1, t)

f(x2, t)

f(x3, t)

V(t)

Figure C.1: Depiction of the envelope function

almost everywhere and V (t) = V (0) +∫ t

0V ′(s)ds. Finally, if f(x, t) is differentiable in

t then we know that V ′(t) = dfdt

(x∗(t), t) for any x∗(t) ∈ X∗(t) whenever V ′(t) existsand the result follows.

C.2 Elementary CalculationsWe present elementary results of Calculus that will be used throughout the rest of thisAppendix.

Lemma 15. The following equality holds for any k ∈ N:

∆Fki =

k

nFk−1i−1 +

ik−2

nk−2O( 1

n2

),

and∆Gk

i = −knGk−1i−1 +O

( 1

n2

).

Proof. The result follows from a straightforward application of Taylor’s theorem to thefunction h(x) = xk. Notice that Fk

i = h(i/n), therefore:

∆Fki = h

(i− 1

n+

1

n

)− h(i− 1

n

)

= h′(i− 1

n

) 1

n+ h′′(ζi)

1

2n2

=k

nFk−1i−1 + h′′(ζi)

1

2n2,

137

Page 151: cims.nyu.edumunoz/papers/thesis.pdfAcknowledgements First and foremost, I want to thank my advisor Mehryar Mohri who I recall once said to me he considers each student an investment

for some ζi ∈ [(i− 1)/n, i/n]. Since h′′(x) = k(k− 1)xk−2, it follows that the last termin the previous expression is in (i/n)k−2O(1/n2). The second equality can be similarlyproved.

Proposition 26. Let a, b ∈ R and N ≥ 1 be an integer, then

N∑

j=0

(N

j

)ajbN−j

j + 1=

(a+ b)N+1 − bN+1

a(N + 1)(C.1)

Proof. The proof relies on the fact that aj

j+1= 1

a

∫ a0tjdt. The left hand side of (C.1) is

then equal to

1

a

∫ a

0

N∑

j=0

(N

j

)tjbN−jdt =

1

a

∫ a

0

(t+ b)Ndt

=(a+ b)N+1 − bN+1

a(N + 1).

Lemma 16. If the sequence ai ≥ 0 satisfies

ai ≤ δ ∀i ≤ r

ai ≤ A+Bi−1∑

j=1

aj ∀i > r.

Then ai ≤ (A+ rδB)(1 +B)i−r−1 ≤ (A+ rδB)eB(i−r−1) ∀i > r.

This lemma is well known in the numerical analysis community and we include theproof here for completeness.

Proof. We proceed by induction on i. The base of our induction is given by i = r + 1and it can be trivially verified. Indeed, by assumption

ar+1 ≤ A+ rδB.

Let us assume that the proposition holds for values less than i and let us try to show it

138

Page 152: cims.nyu.edumunoz/papers/thesis.pdfAcknowledgements First and foremost, I want to thank my advisor Mehryar Mohri who I recall once said to me he considers each student an investment

also holds for i.

ai ≤ A+B

r∑

j=1

aj +B

i−1∑

j=r+1

aj

≤ A+ rBδ +Bi−1∑

j=r+1

(A+ rBδ)(1 +B)j−r−1

= A+ rBδ + (A+ rBδ)Bi−r−2∑

j=0

(1 +B)j

= A+ rBδ + (A+ rBδ)B(1 +B)i−r−1 − 1

B= (A+ rBδ)(1 +B)i−r−1.

Lemma 17. Let W0 : [e,∞) → R denote the main branch of the Lambert function, i.e.W0(x)eW0(x) = x. The following inequality holds for every x ∈ [e,∞).

log(x) ≥ W0(x).

Proof. By definition of W0 we see that W0(e) = 1. Moreover, W0 is an increasingfunction. Therefore for any x ∈ [e,∞)

W0(x) ≥ 1

⇒W0(x)x ≥ x

⇒W0(x)x ≥ W0(x)eW0(x)

⇒x ≥ eW0(x).

The result follows by taking logarithms on both sides of the last inequality.

C.3 Proof of Proposition 18

Here, we derive the linear equation that must be satisfied by the bidding function β. Forthe most part, we adapt the analysis of Gomes and Sweeney (2014) to a discrete setting.

Proposition 17. In a symmetric efficient equilibrium of the discrete GSP, the probabilityzs(v) that an advertiser with valuation v is assigned to slot s is given by

zs(v) =N−s∑

j=0

s−1∑

k=0

(N − 1

j, k,N−1−j−k

)Fji−1G

ki

(N − j − k)nN−1−j−k .

139

Page 153: cims.nyu.edumunoz/papers/thesis.pdfAcknowledgements First and foremost, I want to thank my advisor Mehryar Mohri who I recall once said to me he considers each student an investment

if v = vi and by

zs(v) =

(N − 1

s− 1

)limv′→v−

F (v′)p(1− F (v))s−1 =: z−s (v),

where p = N − s.

Proof. Since advertisers play an efficient equilibrium, these probabilities depend onlyon the advertisers’ valuations. Let Aj,k(s, v) denote the event that j buyers have a valua-tion lower than v, k of them have a valuation higher than v andN−1−j−k a valuationexactly equal to v. Then, the probability of assigning s to an advertiser with value v isgiven by

N−s∑

j=0

s−1∑

k=0

1

N − i− j P(Aj,k(s, v)). (C.2)

The factor 1N−i−j appears due to the fact that the slot is randomly assigned in the case

of a tie. When v = vi, this probability is easily seen to be:

(N − 1

j, k,N−1−j−k

)Fji−1G

ki

nN−1−j−k .

On the other hand, if v ∈ (vi−1, vi) the event Aj,k(s, v) happens with probability zerounless j = N − s and k = s− 1. Therefore, (C.2) simplifies to

(N − 1

s− 1

)F (v)p(1− F (v))s−1 =

(N − 1

s− 1

)limv′→v−

F (v′)p(1− F (v))s−1.

Proposition 27. Let E[P PE(v)] denote the expected payoff of an advertiser with valuev at equilibrium. Then

E[P PE(vi)] =S∑

s=1

cs

[zs(vi)vi −

j=1

z−s (vi)(vi − vi−1)].

Proof. By the revelation principle (Gibbons, 1992), there exists a truth revealing mech-anism with the same expected payoff function as the GSP with bidders playing an equi-librium. For this mechanism, we then must have

v ∈ arg maxv∈[0,1]

S∑

s=1

cszs(v)v − E[P PE(v)].

140

Page 154: cims.nyu.edumunoz/papers/thesis.pdfAcknowledgements First and foremost, I want to thank my advisor Mehryar Mohri who I recall once said to me he considers each student an investment

By the envelope theorem (see Theorem 18), we have

S∑

s=1

cszs(vi)vi − E[P PE(vi)] = −E[P PE(0)] +S∑

s=1

∫ vi

0

zs(t)dt.

Since the expected payoff of an advertiser with valuation 0 should be zero too, we seethat

E[P PE(vi)] = cszs(vi)vi −∫ vi

0

zs(t)dt.

Using the fact that zs(t) ≡ z−s (vi) for t ∈ (vi−1, vi) we obtain the desired expression.

Proposition 18. If the discrete GSP auction admits a symmetric efficient equilibrium,then its bidding function β satisfies β(vi) = βi, where β is the solution of the followinglinear equation:

Mβ = u.

where M =∑S

s=1 csM(s) and

ui =S∑

s=1

(cszs(vi)vi −

i∑

j=1

z−s (vj)∆vj

).

Proof. Let E[P β(vi)] denote the expected payoff of an advertiser with value vi whenall advertisers play the bidding function β. Let A(s, vi, vj) denote the event that anadvertiser with value vi gets assigned slot s and the s-th highest valuation among theremainingN−1 advertisers is vj . If the eventA(s, vi, vj) takes place, then the advertiserhas a expected payoff of csβ(vj). Thus,

E[P β(vi)] =S∑

s=1

cs

i∑

j=1

β(vj)P(A(s, vi, vj)).

In order for event A(s, vi, vj) to occur for i 6= j, N − s advertisers must have valuationsless than or equal to vj with equality holding for at least one advertiser. Also, thevaluation of s− 1 advertisers must be greater than or equal to vi. Keeping in mind thata slot is assigned randomly in the event of a tie, we see that A(s, vi, vj) occurs with

141

Page 155: cims.nyu.edumunoz/papers/thesis.pdfAcknowledgements First and foremost, I want to thank my advisor Mehryar Mohri who I recall once said to me he considers each student an investment

probability

N−s−1∑

l=0

s−1∑

k=0

(N−1

s−1

)(s−1

k

)(N−sl

)Flj−1

nN−s−lGs−1−ki

(k + 1)nk

=

(N−1

s−1

)N−s−1∑

l=0

(N − sl

)Flj−1

nN−s−l

s−1∑

k=0

(s− 1

k

)Gs−1−ki

(k + 1)nk

=

(N−1

s−1

)((Fj−1 +

1

n

)N−s − FN−sj−1

)(n(Gsi−1 −Gs

i

)

s

)

= −(N − 1

s− 1

)n∆Fj∆Gi

s,

where the second equality follows from an application of the binomial theorem andProposition 26. On the other hand if i = j this probability is given by:

N−s−1∑

j=0

s−1∑

k=0

(N − 1

j, k,N − 1− j − k

)Fji−1G

ki

(N − j − k)nN−1−j−k

It is now clear that M(s)i,j = P(A(s, vi, vj)) for i ≥ j. Finally, given that in equilibriumthe equality E[P PE(v)] = E[P β(v)] must hold, by Proposition 27, we see that β mustsatisfy equation (4.10).

We conclude this section with a simpler expression for Mii(s). By adding and sub-tracting the term j = N − s in the expression defining Mii(s) we obtain

Mii(s) = zs(vi)−s−1∑

k=0

(N−1

N−s, k, s−1−k

)Fpi−1G

ki

(s− k)ns−1−k

= zs(vi)−(N − 1

s− 1

) s−1∑

k=1

(s− 1

k

)Fpi−1G

ki

(s− k)ns−1−k

= zs(vi) +

(N − 1

s− 1

)Fpi−1

n∆Gi

s, (C.3)

where again we used Proposition 26 for the last equality.

C.4 High Probability BoundsIn order to improve the readability of our proofs we use a fixed variable C to refer toa universal constant even though this constant may be different in different lines of aproof.

142

Page 156: cims.nyu.edumunoz/papers/thesis.pdfAcknowledgements First and foremost, I want to thank my advisor Mehryar Mohri who I recall once said to me he considers each student an investment

Theorem 19. (Glivenko-Cantelli Theorem) Let v1, . . . , vn be an i.i.d. sample drawnfrom a distribution F . If F denotes the empirical distribution function induced by thissample, then with probability at least 1− δ for all v ∈ R

|F (v)− F (v)| ≤ C

√log(1/δ)

n.

Proposition 28. Let X1, . . . , Xn be an i.i.d sample from a distribution F supportedin [0, 1]. Suppose F admits a density f and assume f(x) > c for all x ∈ [0, 1]. IfX(1), . . . , X(n) denote the order statistics of a sample of size n and we let X(0) = 0,then

P( maxi∈1,...,n

X(i) −X(i−1) > ε) ≤ 3

εe−cεn/2.

In particular, with probability at least 1− δ:

maxi∈1,...,n

X(i) −X(i−1) ≤ 1

nq(n, δ), (C.4)

where q(n, δ) = 2c

log(nc2δ

).

Proof. Divide the interval [0, 1] into k ≤ d2/εe sub-intervals of size ε2. Denote this sub-

intervals by I1, . . . , Ik, with Ij = [aj, bj] . If there exists i such that X(i) −X(i−1) > εthen at least one of these sub-intervals must not contain any samples. Therefore:

P( maxi∈1,...,n

X(i) −X(i−1) > ε) ≤ P(∃ j s.t Xi /∈ Ij ∀i)

≤d2/εe∑

j=1

P(Xi /∈ Ij ∀i).

Using the fact that the sample is i.i.d. and that F (bk)− F (ak) ≥ minx∈[ak,bk] f(x)(bk −ak) ≥ c(bk − ak), we may bound the last term by

(2 + ε

ε

)(1− (F (bk)− F (ak)))

n ≤ 3

ε(1− c(bk − ak))n

≤ 3

εe−cεn/2.

The equation 3εe−cεn/2 = δ implies ε = 2

ncW0(3nc

2δ), where W0 denotes the main branch

of the Lambert function (the inverse of the function x 7→ xex). By Lemma 17, forx ∈ [e,∞) we have

log(x) ≥ W0(x). (C.5)

143

Page 157: cims.nyu.edumunoz/papers/thesis.pdfAcknowledgements First and foremost, I want to thank my advisor Mehryar Mohri who I recall once said to me he considers each student an investment

Therefore, with probability at least 1− δ

maxi∈1,...,n

X(i) −X(i−1) ≤ 2

nclog(3cn

).

The following estimates will be used in the proof of Theorem 14.

Lemma 18. Let p ≥ 1 be an integer. If i >√n , then for any t ∈ [vi−1, vi] the following

inequality is satisfied with probability at least 1− δ:

|F p(v)− Fpi−1| ≤ C

ip−1

np−1

log(2/δ)p−1

2√n

q(n, 2/δ)

Proof. The left hand side of the above inequality may be decomposed as

|F p(v)− Fpi−1| ≤ |F p(v)− F p(vi−1)|+ |F p(vi−1)− Fp

i−1|≤ p|F (ζi)

p−1f(ζi)|(vi − vi−1) + pFp−1i−1 (F (vi−1)− Fi−1)

≤ Cq(n, 2

δ)

nF (ζi)

p−1 + Cip−1

np−1

√log(2/δ)

n,

for some ζi ∈ (vi−1, vi). The second inequality follows from Taylor’s theorem andwe have used Glivenko-Cantelli’s theorem and Proposition 28 for the last inequality.

Moreover, we know F (vi) ≤ Fi +√

log 2/δn≤ C

√log 2/δ(i+

√n)

n. Finally, since i ≥ √n it

follows that

F (ζi)p−1 ≤ F (vi)

p−1 ≤ C( ip−1

np−1log(2/δ)(p−1)/2

).

Replacing this term in our original bound yields the result.

Proposition 29. Let ψ : [0, 1]→ R be a twice continuously differentiable function. Withprobability at least 1− δ the following bound holds for all i >

√n

∣∣∣∫ vi

0

F p(t)dt−i−1∑

j=1

Fpj−1∆vj

∣∣∣ ≤ Cip

nplog(2/δ)p/2√

nq(n, δ/2)2.

and

∣∣∣∫ vi

0

ψ(t)pF p−1(t)f(t)dt−i−2∑

j=1

ψ(vj)∆Fpj

∣∣∣ ≤ Cip

nplog(2/δ)p/2√

nq(n, δ/2)2.

144

Page 158: cims.nyu.edumunoz/papers/thesis.pdfAcknowledgements First and foremost, I want to thank my advisor Mehryar Mohri who I recall once said to me he considers each student an investment

Proof. By splitting the integral along the intervals [vj−1, vj] we obtain

∣∣∣∫ vi

0

F p(t)dt−i−1∑

j=1

Fpj−1∆vj

∣∣∣ ≤∣∣∣i−1∑

j=1

∫ vj

vj−1

F p(t)−Fpj−1dt

∣∣∣+F p(vi)(vi−vi−1) (C.6)

By Lemma 18, for t ∈ [vj−1, vj] we have:

|F p(t)− Fpj−1| ≤ C

jp−1

np−1

log(2/δ)p−1

2√n

q(n, δ/2).

Using the same argument of Lemma 18 we see that for i ≥ √n

F (vi)p ≤ C

(i√

log(2/δ)

n

)p

Therefore we may bound (C.6) by

Cip−1

np−1

log(2/δ)p−1

2√n

(q(n, δ/2)

i−1∑

j=1

vj +i(vi − vi−1)

√log(2/δ)

n

).

We can again use Proposition 28 to bound the sum by inq(n, δ/2) and the result follows.

In order to proof the second bound we first do integration by parts to obtain∫ vi

0

ψ(t)pF p−1f(t)dt = ψ(vi)Fp(vi)−

∫ vi

0

ψ′(t)F p(t)dt.

Similarly

i−2∑

j=1

ψ(vj)∆Fpj = ψ(vi−2)Fp

i−2 −i−2∑

j=1

Fpj

(ψ(vj)− ψ(vj−1)

).

Using the fact that ψ is twice continuously differentiable, we can recover the desiredbound by following similar steps as before.

Proposition 30. With probability at least 1− δ the following inequality holds for all i

∣∣∣(s− 1)G(vi)s−2 − n2 ∆2Gs

i

s

∣∣∣ ≤ C

√log(1/δ)

n. (C.7)

Proof. By Lemma 15 we know that

n2 ∆2Gsi

s= (s− 1)Gs−2

i +O( 1

n

)

145

Page 159: cims.nyu.edumunoz/papers/thesis.pdfAcknowledgements First and foremost, I want to thank my advisor Mehryar Mohri who I recall once said to me he considers each student an investment

Therefore the left hand side of (C.7) can be bounded by

(s− 1)|G(vi)s−2 −Gs−2

i |+C

n.

The result now follows from Glivenko-Cantelli’s theorem.

Proposition 31. With probability at least 1− δ the following bound holds for all i

∣∣∣(N − 1

s− 1

)pG(vi)

s−1F (vi)p − 2nMii(s)

∣∣∣ ≤ Cip−2

np−2

(log(2/δ))p−2

2√n

q(n, δ/2).

Proof. By analyzing the sum defining Mii(s) we see that all terms with exception ofthe term given by j = N − s− 1 and k = s− 1 have a factor of ip−2

np−21n2 . Therefore,

Mii(s) =1

2n

(N−1

s−1

)pFp−1

i−1Gs−1i +

ip−2

np−2O( 1

n2

). (C.8)

Furthermore, by Theorem 19 we have

|Gs−1i −G(vi)

s−1| ≤ C

√log(2/δ)

n. (C.9)

Similarly, by Lemma 18

|Fp−1i−1 − F (vi)

p−1|C ≤ ip−2

np−2

(log(2/δ))p−2

2√n

q(n, δ/2). (C.10)

From equation (C.8) and inequalities (C.9) and (C.10) we can thus infer that∣∣pG(vi)

s−1F (vi)p − 2nMii(s)

∣∣

≤ C(pFp−1

i−1 |G(vi)s−1 −Gs−1

i |+G(vi)s−1p|F (vi)

p−1 − Fp−1i−1 |)

+ Cip−2

np−2

1

n2

≤ Cip−2

np−2

( in

√log(2/δ)

n+

(log(2/δ))p−2

2√n

q(n, δ/2) +1

n2

)

The desired bound follows trivially from the last inequality.

C.5 Solution PropertiesA standard way to solve a Volterra equation of the first kind is to differentiate the equa-tion and transform it into an equation of the second kind. As mentioned before this may

146

Page 160: cims.nyu.edumunoz/papers/thesis.pdfAcknowledgements First and foremost, I want to thank my advisor Mehryar Mohri who I recall once said to me he considers each student an investment

only be done if the kernel defining the equation satisfies K(t, t) ≥ c > 0 for all t. Herewe take the discrete derivative of (4.10) and show that in spite of the fact that the newsystem remains ill-conditioned the solution of this equation has a particular propertythat allows us to show the solution β will be close to the solution β of surrogate linearsystem which, in turn, will also be close to the true bidding function β.

Proposition 32. The solution β of equation (4.10) also satisfies the following equation

dMβ = du (C.11)

where dMij = Mi,j −Mi−1,j and dui = ui − ui−1. Furthermore, for j ≤ i− 2

dMij = −S∑

s=1

cs

(N−1

s−1

)n∆Fj∆

2Gsi

s

and

dui =S∑

i=1

cs(vi(zs(vi)− z−s (vi)

)+ vi−1

(z−s (vi)− zs(vi−1)

)).

Proof. It is clear that the new equation is obtained from (4.10) by subtracting row i− 1from row i. Therefore β must also satisfy this equation. The expression for dMij

follows directly from the definition of Mij . Finally,

zs(vi)vi −i∑

j=1

z−s (vj)(vj − vj−1)−(zs(vi−1)vi−1 −

i−1∑

j=1

z−s (vj)(vj − vj−1))

= vi(zs(vi)− z−s (vi)

)+ z−s (vi)vi − zs(vi−1)vi−1 − z−s (vi)(vi − vi−1).

Simplifying terms and summing over s yields the desired expression for dui.

A straightforward bound on the difference |βi − β(vi)| can be obtain by boundingthe following quantity: difference

i∑

j=1

dMi,j(β(vi)− βi) =i∑

j=1

dMi,jβ(vi)− dui, (C.12)

and by then solving the system of inequalities defining εi = |β(vi) − βi|. In orderto do this, however, it is always assumed that the diagonal terms of the matrix satisfymini ndMii ≥ c > 0 for all n, which in view of (C.8) does not hold in our case.We therefore must resort to a different approach. We will first show that for values ofi ≤ n3/4 the values of βi are close to vi and similarly β(vi) will be close to vi. Thereforefor i ≤ n3/4 we can show that the difference |β(vi) − βi| is small. We will see that the

147

Page 161: cims.nyu.edumunoz/papers/thesis.pdfAcknowledgements First and foremost, I want to thank my advisor Mehryar Mohri who I recall once said to me he considers each student an investment

analysis for i n3/4 is in fact more complicated; yet, by a clever manipulation of thesystem (4.10) we are able to obtain the desired bound.

Proposition 33. If cS > 0 then there exists a constant C > 0 such that:

S∑

s=1

csMii(s) ≥ C( in

)N−S−1 1

2n

Proof. By definition of Mii(s) it is immediate that

csMii(s) ≥cs2n

(N − 1

s− 1

)pFp−1

i−1 (Gi)s−1

=1

2nCs

(i− 1

n

)p−1(1− i

n

)s−1

,

with CS = csp(N−1s−1

). The sum can thus be lower bounded as follows

S∑

s=1

csM(s)ii ≥1

2nmax

(C1

(i− 1

n

)N−2

, CS

(i− 1

n

)N−S−1(1− i

n

)S−1)(C.13)

When C1

(i−1n

)N−2

≥ CS

(i−1n

)N−S−1(1 − i

n

)S−1

, we have K i−1n≥ 1 − i

n, with

K = (C1/CS)1/(S−1). Which holds if and only if i > n+KK+1

. In this case the max term of(C.13) is easily seen to be lower bounded by C1(K/K + 1)N−2. On the other hand, if

i < n+KK+1

then we can lower bound this term by CS(K/K + 1)s−1(in

)N−S−1

. The resultfollows immediately from these observations.

Proposition 34. For all i and s the following inequality holds:

|dMii(s)− dMi,i−1(s)| ≤ Cip−2

np−2

1

n2.

Proof. From equation (C.8) we see that

|dMii(s)− dMi,i−1(s)| = |Mii(s) + Mi−1,i−1(s)−Mi,i−1(s)|

≤∣∣∣Mii(s)−

1

2Mi,i−1(s)

∣∣∣+∣∣∣Mi−1,i−1(s)− 1

2Mi,i−1(s)

∣∣∣

≤(N−1

s−1

)(1

2

∣∣∣pFp−1i−1G

s−1i

n− ∆Fp

i−1n∆Gsi

s

∣∣∣

+1

2

∣∣∣pFp−1i−2G

s−1i−1

n− n∆Fp

i−1∆Gsi

s

∣∣∣)

+ Cip−2

np−2

( 1

n2

),

A repeated application of Lemma 15 yields the desired result.

148

Page 162: cims.nyu.edumunoz/papers/thesis.pdfAcknowledgements First and foremost, I want to thank my advisor Mehryar Mohri who I recall once said to me he considers each student an investment

Lemma 19. The following holds for every s and every i

zs(vi)− z−s (vi) = Mii(s)−(N−1

s−1

)Fpi−1

(n∆Gsi

s+ Gs−1

i−1

)

and

z−s (vi)− zs(vi−1) = M(s)i,i−1 −M(s)i−1,i−1 − n(N−1

s−1

)Fpi−2

∆2Gsi

s

+

(N−1

s−1

)Fpi−1

(Gs−1i−1 + n

∆Gsi

s

).

Proof. From (C.3) we know that

zs(vi)− z−s (vi) = Mii(s)−(N−1

s−1

)nFp

i−1

∆Gsi

s− z−s (vi).

By using the definition of z−s (vi) we can verify that the right hand side of the aboveequation is in fact equal to

Mii(s)−(N−1

s−1

)Fpi−1

(n∆Gsi

s+ Gs−1

i−1

).

The second statement can be similarly proved

z−s (vi)− zs(vi−1) = z−s (vi)−M(s)i−1,i−1

+ n

(N−1

s−1

)Fpi−2

∆Gsi−1

s+ M(s)i,i−1 −M(s)i,i−1. (C.14)

On the other hand we have

n

(N−1

s−1

)Fpi−2

∆Gsi−1

s−M(s)i,i−1

= n

(N−1

s−1

)[Fpi−2

∆Gsi−1

s+

(Fpi−1 − Fp

i−2)∆Gsi

s

]

= n

(N−1

s−1

)[Fpi−1

∆Gsi

s− Fp

i−2

∆2Gsi

s

]

149

Page 163: cims.nyu.edumunoz/papers/thesis.pdfAcknowledgements First and foremost, I want to thank my advisor Mehryar Mohri who I recall once said to me he considers each student an investment

By replacing this expression into (C.14) and by definition of z−s (vi).

z−s (vi)− zs(vi−1) = M(s)i,i−1 −M(s)i−1,i−1 − n(N−1

s−1

)Fpi−2

∆2Gsi

s

+

(N−1

s−1

)Fpi−1

(Gs−1i−1 + n

∆Gsi

s

).

Corollary 11. The following equality holds for all i and s.

dui = vi(zs(vi)− z−s (vi)) + vi−1(z−s (vi)− zs(vi−1))

= vidMii(s) + vi−1dMi,i−1(s) +i−2∑

j=1

dMij(s)vj

−(N−1

s−1

)n∆2Gs

i

s

i−2∑

j=1

Fpj−1∆vj − (vi − vi−1)

(N−1

s−1

)Fpi−1

(Gs−1i−1 +

n∆Gsi

s

)

Proof. From the previous proposition we know

vi(zs(vi)− z−s (vi)) + vi−1(z−s (vi)− zs(vi−1))

= viMii(s) + vi−1(M(s)i,i−1 −M(s)i−1,i−1)− vi−1n

(N−1

s−1

)Fpi−2

∆2Gsi

s

+ (vi−1 − vi)(N−1

s−1

)Fpi−1

(Gs−1i−1 +

n∆Gsi

s

)

= vidMii(s) + vi−1dMi,i−1(s)− vi−1n

(N−1

s−1

)Fpi−2

∆2Gsi

s

− (vi − vi−1)

(N−1

s−1

)Fpi−1

(Gs−1i−1 +

n∆Gsi

s

),

where the last equality follows from the definition of dM. Furthermore, by doing sum-

150

Page 164: cims.nyu.edumunoz/papers/thesis.pdfAcknowledgements First and foremost, I want to thank my advisor Mehryar Mohri who I recall once said to me he considers each student an investment

mation by parts we see that

vi−1n

(N−1

s−1

)Fpi−2

∆2Gsi

s

= vi−2

(N−1

s−1

)Fpi−2

n∆2Gsi

s+ (vi−1 − vi−2)

(N−1

s−1

)Fpi−2

n∆2Gsi

s

=

(N−1

s−1

)n∆2Gs

i

s

( i−2∑

j=1

vj∆Fpj +

i−2∑

j=1

Fpj−1∆vj

)

+ (vi−1 − vi−2)

(N−1

s−1

)Fpi−2

n∆2Gsi

s

= −i−2∑

j=1

dMijvj +

(N−1

s−1

)n∆2Gs

i

s

i−1∑

j=1

Fpj−1∆vj,

where again we used the definition of dM in the last equality. By replacing this expres-sion in the previous chain of equalities we obtain the desired result.

Corollary 12. Let p denote the vector defined by

pi =S∑

s=1

cs

(N−1

s−1

)n∆2Gs

i

s

i−1∑

j=1

Fpj−1∆vj + cs∆vi

(N−1

s−1

)Fpi−1

(Gs−1i−1 +

n∆Gsi

s

).

If ψ = v − β, then ψ solves the following system of equations:

dMψ = p. (C.15)

Proof. It is immediate by replacing the expression for dui from the previous corollaryinto (C.11) and rearranging terms.

We can now present the main result of this section.

Proposition 35. Under Assumption 2, with probability at least 1− δ, the solution ψ ofequation (C.15) satisfies ψi ≤ C i2

n2 q(n, δ).

Proof. By doing forward substitution on equation (C.15) we have:

dMi,i−1ψi−1 + dMiiψi = pi +i−2∑

j=1

dMijψj

= pi +S∑

s=1

csn∆2Gs

i

s

i−2∑

j=1

∆Fpjψj. (C.16)

151

Page 165: cims.nyu.edumunoz/papers/thesis.pdfAcknowledgements First and foremost, I want to thank my advisor Mehryar Mohri who I recall once said to me he considers each student an investment

A repeated application of Lemma 15 shows that

pi ≤ C1

n

iN−S

nN−S

i∑

j=1

∆vj,

which by Proposition 28 we know it is bounded by

pi ≤ C1

n

iN−S−1

nN−S−1

i2

n2q(n, δ).

Similarly for j ≤ i− 2 we have

n∆2Gsi

s∆Fp

j ≤ C1

n

iN−S−1

nN−S−1

1

n.

Finally, Assumption 2 implies that ψ ≥ 0 for all i and since dMi,i−1 > 0, the followinginequality must hold for all i:

dMiiψi ≤ dMi,i−1ψi−1 + dMiiψi

≤ C1

n

iN−S−1

nN−S−1

( i2n2q(n, δ) +

1

n

i−2∑

j=1

ψj

).

In view of Proposition 33 we know that dMii ≥ C 1niN−S−1

nN−S−1 , therefore after dividingboth sides of the inequality by dMii, it follows that

ψi ≤ Ci2

n2q(n, δ) +

1

n

i−2∑

j=1

ψj.

Applying Lemma 16 with A = C i2

n2 , r = 0 and B = Cn

we arrive to the followinginequality:

ψi ≤ Ci2

n2q(n, δ)eC

in ≤ C ′

i2

n2q(n, δ).

We now present an analogous result for the solution β of (4.2). Let CS = cs(N−1s−1

)

and define the functions

Fs(v) = CsFN−s(v) Gs(v) = G(v)s−1.

It is not hard to verify that zs(v) = Fs(v)Gs(v) and that the integral equation (4.2) is

given by
\[
\sum_{s=1}^{S}\int_0^v t\,\big(F_s(t)G_s(t)\big)'\,dt = \sum_{s=1}^{S} G_s(v)\int_0^v \beta(t)F_s'(t)\,dt. \tag{C.17}
\]
After differentiating this equation and rearranging terms we obtain
\[
\begin{aligned}
0 &= (v - \beta(v))\sum_{s=1}^{S} G_s(v)F_s'(v) + \sum_{s=1}^{S}\Big(v\,G_s'(v)F_s(v) - G_s'(v)\int_0^v \beta(t)F_s'(t)\,dt\Big) \\
&= (v - \beta(v))\sum_{s=1}^{S} G_s(v)F_s'(v) + \sum_{s=1}^{S}\Big(G_s'(v)\int_0^v (t - \beta(t))F_s'(t)\,dt + G_s'(v)\int_0^v F_s(t)\,dt\Big),
\end{aligned}
\]
where the last equality follows from integration by parts. Notice that the above equation is the continuous counterpart of equation (C.15). Letting $\psi(v) := v - \beta(v)$, we have
\[
\psi(v) = -\frac{\sum_{s=1}^{S} G_s'(v)\int_0^v F_s(t)\,dt + G_s'(v)\int_0^v \psi(t)F_s'(t)\,dt}{\sum_{s=1}^{S} G_s(v)F_s'(v)}. \tag{C.18}
\]

Since $\lim_{v\to 0} G_s(v) = \lim_{v\to 0} G_s'(v)/f(v) = 1$ and $\lim_{v\to 0} F_s(v) = 0$, it is not hard to see that
\[
\begin{aligned}
\psi(0) &= -\lim_{v\to 0}\frac{\sum_{s=1}^{S} G_s'(v)\int_0^v F_s(t)\,dt + G_s'(v)\int_0^v \psi(t)F_s'(t)\,dt}{\sum_{s=1}^{S} G_s(v)F_s'(v)} \\
&= -\lim_{v\to 0}\frac{f(v)\Big(\sum_{s=1}^{S}\frac{G_s'(v)}{f(v)}\int_0^v F_s(t)\,dt + \frac{G_s'(v)}{f(v)}\int_0^v \psi(t)F_s'(t)\,dt\Big)}{\sum_{s=1}^{S} G_s(v)F_s'(v)} \\
&= -\lim_{v\to 0}\frac{f(v)\Big(\sum_{s=1}^{S}\int_0^v F_s(t)\,dt + \int_0^v \psi(t)F_s'(t)\,dt\Big)}{\sum_{s=1}^{S} F_s'(v)}.
\end{aligned}
\]

Since the smallest power in the definition of $F_s$ is attained at $s = S$, the previous limit is in fact equal to
\[
\begin{aligned}
-\lim_{v\to 0}\frac{f(v)\Big(\int_0^v F_S(t)\,dt + \int_0^v \psi(t)F_S'(t)\,dt\Big)}{F_S'(v)}
&= -\lim_{v\to 0}\frac{f(v)\Big(\int_0^v F^{N-S}(t)\,dt + \int_0^v (N-S)\psi(t)F^{N-S-1}(t)f(t)\,dt\Big)}{(N-S)F^{N-S-1}(v)f(v)} \\
&= -\lim_{v\to 0}\frac{\int_0^v F^{N-S}(t)\,dt + \int_0^v (N-S)\psi(t)F^{N-S-1}(t)f(t)\,dt}{(N-S)F^{N-S-1}(v)}.
\end{aligned}
\]
Using L'Hôpital's rule and simplifying, we arrive at
\[
\psi(0) = -\lim_{v\to 0}\bigg(\frac{F^2(v)}{(N-S)(N-S-1)f(v)} + \frac{\psi(v)F(v)}{N-S-1}\bigg).
\]
Moreover, since $\psi$ is a continuous function, it must be bounded; therefore, the previous limit is equal to $0$. Using the same series of steps, we also see that
\[
\psi'(0) = \lim_{v\to 0}\frac{\psi(v)}{v} = -\lim_{v\to 0}\frac{\int_0^v F^{N-S}(t)\,dt + \int_0^v (N-S)\psi(t)F^{N-S-1}(t)f(t)\,dt}{v\,(N-S)F^{N-S-1}(v)}.
\]
By L'Hôpital's rule again, the previous limit is equal to
\[
-\lim_{v\to 0}\frac{F^{N-S}(v) + (N-S)\psi(v)F^{N-S-1}(v)f(v)}{(N-S)(N-S-1)F^{N-S-2}(v)f(v)\,v + (N-S)F^{N-S-1}(v)}. \tag{C.19}
\]
Furthermore, notice that
\[
\lim_{v\to 0}\frac{F^{N-S}(v) + (N-S)\psi(v)F^{N-S-1}(v)f(v)}{(N-S)(N-S-1)F^{N-S-2}(v)f(v)\,v} = \lim_{v\to 0}\bigg(\frac{F^2(v)}{(N-S)(N-S-1)f(v)\,v} + \frac{\psi(v)F(v)}{(N-S-1)\,v}\bigg) = 0,
\]
where for the last equality we used the fact that $\lim_{v\to 0}\frac{F(v)}{v} = f(0)$ and $\psi(0) = 0$. Similarly, we have
\[
\lim_{v\to 0}\frac{F^{N-S}(v) + (N-S)\psi(v)F^{N-S-1}(v)f(v)}{(N-S)F^{N-S-1}(v)} = \lim_{v\to 0}\bigg(\frac{F(v)}{N-S} + \psi(v)f(v)\bigg) = 0.
\]
Since the terms in the denominator of (C.19) are positive, the two previous limits imply that the limit in (C.19) is in fact $0$, and therefore $\psi'(0) = 0$. Thus, by Taylor's theorem, $|\psi(v)| \le Cv^2$ for some constant $C$.


Corollary 13. The following inequality holds with probability at least $1-\delta$ for all $i \le n^{3/4}$:
\[
|\psi_i - \psi(v_i)| \le C\,\frac{1}{\sqrt{n}}\,q(n,\delta).
\]

Proof. It follows directly from the bound on $\psi(v)$, Proposition 35 and the fact that $\frac{i^2}{n^2} \le \frac{1}{\sqrt{n}}$ for $i \le n^{3/4}$.

Having bounded the magnitude of the error for small values of $i$, one could use the forward substitution technique of Proposition 35 to bound the errors $\varepsilon_i = |\psi_i - \psi(v_i)|$. Nevertheless, a crucial assumption used in Proposition 35 was the fact that $\psi_i \ge 0$; this condition is not necessarily verified by $\varepsilon_i$, so a straightforward forward substitution will not work. Instead, we leverage the fact that $|dM_{i,i-1}\psi_{i-1} - dM_{i,i}\psi_i|$ is in $O\big(\frac{1}{n^2}\big)$ and show that the solution $\overline{\psi}$ of a surrogate linear system is close to both $\psi$ and $\psi(v)$, implying that $\psi_i$ and $\psi(v_i)$ will be close too. Therefore, let $dM'$ denote the lower triangular matrix with $dM'_{i,j} = dM_{i,j}$ for $j \le i-2$, $dM'_{i,i-1} = 0$ and $dM'_{ii} = 2\,dM_{ii}$. Thus, we are effectively removing the term $dM_{i,i-1}$ that is problematic in the forward substitution analysis; the construction is sketched in code below. The next proposition quantifies the effect of approximating the original system with the new matrix $dM'$.
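As a concrete toy illustration of this construction, the sketch below builds $dM'$ from a random lower triangular placeholder (not the actual $dM$ of this appendix) and compares the two solutions.

```python
import numpy as np

# Surrogate system: zero the sub-diagonal and double the diagonal of a lower
# triangular matrix. dM and p are random placeholders for illustration only.
rng = np.random.default_rng(1)
n = 12
dM = np.tril(rng.uniform(0.5, 1.0, size=(n, n)))
p = rng.uniform(size=n)

dM_prime = dM.copy()
rows = np.arange(1, n)
dM_prime[rows, rows - 1] = 0.0          # dM'_{i,i-1} = 0
dM_prime[np.diag_indices(n)] *= 2.0     # dM'_{ii} = 2 dM_{ii}

psi = np.linalg.solve(dM, p)
psi_bar = np.linalg.solve(dM_prime, p)  # solution of the surrogate system
print(np.max(np.abs(psi - psi_bar)))    # the gap bounded in Proposition 36
```

For a generic random matrix this gap is of course not small; the bound of the next proposition relies on the specific structure of $dM$.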

Proposition 36. Let $\overline{\psi}$ be the solution to the system of equations
\[
dM'\,\overline{\psi} = p.
\]
Then, for all $i \in \{1, \ldots, n\}$,
\[
|\psi_i - \overline{\psi}_i| \le \Big(\frac{1}{\sqrt{n}} + \frac{q(n,\delta)}{n^{3/2}}\Big)e^{C}.
\]

Proof. We can show, in the same way as in Proposition 35, that $\overline{\psi}_i \le C\frac{i^2}{n^2}q(n,\delta)$ with probability at least $1-\delta$ for all $i$. In particular, for $i < n^{3/4}$ it is true that
\[
|\psi_i - \overline{\psi}_i| \le C\,\frac{1}{\sqrt{n}}\,q(n,\delta).
\]
On the other hand, by forward substitution we have
\[
dM'_{ii}\overline{\psi}_i = p_i - \sum_{j=1}^{i-1} dM'_{ij}\overline{\psi}_j
\qquad\text{and}\qquad
dM_{ii}\psi_i = p_i - \sum_{j=1}^{i-1} dM_{ij}\psi_j.
\]
By the definition of $dM'$, the above equations hold if and only if
\[
\begin{aligned}
2\,dM_{ii}\overline{\psi}_i &= p_i - \sum_{j=1}^{i-2} dM_{ij}\overline{\psi}_j, \\
2\,dM_{ii}\psi_i &= dM_{ii}\psi_i + p_i - dM_{i,i-1}\psi_{i-1} - \sum_{j=1}^{i-2} dM_{ij}\psi_j.
\end{aligned}
\]
Taking the difference of these two equations yields a recurrence relation for the quantity $e_i = \psi_i - \overline{\psi}_i$:
\[
2\,dM_{ii}e_i = dM_{ii}\psi_i - dM_{i,i-1}\psi_{i-1} - \sum_{j=1}^{i-2} dM_{ij}e_j.
\]
Furthermore, we can bound $dM_{ii}\psi_i - dM_{i,i-1}\psi_{i-1}$ as follows:
\[
|dM_{ii}\psi_i - dM_{i,i-1}\psi_{i-1}| \le |dM_{ii} - dM_{i,i-1}|\,\psi_{i-1} + |\psi_i - \psi_{i-1}|\,dM_{ii}
\le C\,\frac{i^p}{n^p}\,\frac{q(n,\delta)}{n^2} + \frac{C}{\sqrt{n}}\,dM_{ii},
\]
where the last inequality follows from Assumption 2 and Proposition 34, as well as from the fact that $\psi_i \le \frac{i^2}{n^2}q(n,\delta)$. Finally, using the same bound on $dM_{ij}$ as in Proposition 35 gives
\[
|e_i| \le C\Big(\frac{q(n,\delta)}{ni} + \frac{1}{\sqrt{n}} + \frac{1}{n}\sum_{j=1}^{i-2} |e_j|\Big) \le \frac{C}{\sqrt{n}} + \frac{C}{n}\sum_{j=1}^{i-2} |e_j|.
\]
Applying Lemma 16 with $A = \frac{C}{\sqrt{n}}$, $B = \frac{C}{n}$ and $r = n^{3/4}$, we obtain the final bound
\[
|\psi_i - \overline{\psi}_i| \le \Big(\frac{1}{\sqrt{n}} + \frac{q(n,\delta)}{n^{3/2}}\Big)e^{C}.
\]


C.6 Proof of Theorem 14

Proposition 37. Let $\psi(v)$ denote the solution of (C.18) and denote by $\Psi$ the vector defined by $\Psi_i = \psi(v_i)$. Then, with probability at least $1-\delta$,
\[
\max_{i > \sqrt{n}}\; n\,\big|(dM'\Psi)_i - p_i\big| \le C\,\frac{i^{N-S}}{n^{N-S}}\,\frac{\log(2/\delta)^{N/2}}{\sqrt{n}}\,q(n,\delta/2)^3. \tag{C.20}
\]

Proof. By definition of $dM'$ and $p_i$, we can decompose the difference $n\big((dM'\Psi)_i - p_i\big)$ as
\[
\sum_{s=1}^{S} c_s\bigg(I_s(v_i) + \Upsilon_3(s,i) - \big(\Upsilon_1(s,i) + \Upsilon_2(s,i)\big) - n\Delta v_i\binom{N-1}{s-1} F^p_{i-1}\Big(G^{s-1}_{i-1} + \frac{n\,\Delta G^s_i}{s}\Big)\bigg), \tag{C.21}
\]
where
\[
\begin{aligned}
\Upsilon_1(s,i) &= \binom{N-1}{s-1}\frac{n^2\Delta^2 G^s_i}{s}\sum_{j=1}^{i-1} F^p_{j-1}\Delta v_j - \frac{G_s'(v_i)}{f(v_i)}\int_0^{v_i} F_s(t)\,dt, \\
\Upsilon_2(s,i) &= \binom{N-1}{s-1}\frac{n^2\Delta^2 G^s_i}{s}\sum_{j=1}^{i-2}\Delta F^p_j\,\psi(v_j) - \frac{G_s'(v_i)}{f(v_i)}\int_0^{v_i} F_s'(t)\psi(t)\,dt, \\
\Upsilon_3(s,i) &= \Big(2nM_{ii}(s) - \frac{F_s'(v_i)}{f(v_i)}\,G_s(v_i)\Big)\psi(v_i), \quad\text{and} \\
I_s(v_i) &= \frac{1}{f(v_i)}\Big(F_s'(v_i)G_s(v_i)\psi(v_i) + G_s'(v_i)\int_0^{v_i} F_s(t)\,dt + G_s'(v_i)\int_0^{v_i} F_s'(t)\psi(t)\,dt\Big).
\end{aligned}
\]
Using the fact that $\psi$ solves equation (C.18), we see that $\sum_{s=1}^{S} c_s I_s(v_i) = 0$. Furthermore, using Lemma 15 as well as Proposition 28, we have
\[
n\Delta v_i\binom{N-1}{s-1} F^p_{i-1}\Big(G^{s-1}_{i-1} + \frac{n\,\Delta G^s_i}{s}\Big) \le \frac{i^p}{n^p}\,\frac{1}{n}\,q(n,\delta/2) \le \frac{i^{N-S}}{n^{N-S}}\,\frac{1}{n}\,q(n,\delta/2).
\]
Therefore, we only need to bound $\Upsilon_k$ for $k = 1, 2, 3$. After replacing $G_s$ and $F_s$ by their definitions, Proposition 31 and the fact that $\psi(v_i) \le Cv_i^2 \le C\frac{i^2}{n^2}q^2(n,\delta)$ imply that, with probability at least $1-\delta$,
\[
\Upsilon_3(s,i) \le C\,\frac{i^p}{n^p}\,\frac{\log(2/\delta)^{p-2}}{\sqrt{n}}\,q(n,\delta/2)^3.
\]
We proceed to bound the term $\Upsilon_2$; the bound for $\Upsilon_1$ can be derived in a similar manner. By using the definitions of $G_s$ and $F_s$, we see that $\Upsilon_2 = \binom{N-1}{s-1}\big(\Upsilon_2^{(1)} + \Upsilon_2^{(2)}\big)$, where
\[
\begin{aligned}
\Upsilon_2^{(1)}(s,i) &= \Big(\frac{n^2\Delta^2 G^s_i}{s} - (s-1)G(v_i)^{s-2}\Big)\sum_{j=1}^{i-2}\Delta F^p_j\,\psi(v_j), \\
\Upsilon_2^{(2)}(s,i) &= \Big(\sum_{j=1}^{i-2}\Delta F^p_j\,\psi(v_j) - \int_0^{v_i}\psi(t)\,pF^{p-1}(t)f(t)\,dt\Big)(s-1)G(v_i)^{s-2}.
\end{aligned}
\]
It follows from Propositions 29 and 30 that $|\Upsilon_2(s,i)| \le C\,\frac{i^p}{n^p}\,\frac{\log(2/\delta)^{p/2}}{\sqrt{n}}\,q(n,\delta/2)^2$, and the same inequality holds for $\Upsilon_1$. Replacing these bounds in (C.21) and using the fact that $\frac{i^p}{n^p} \le \frac{i^{N-S}}{n^{N-S}}$ yields the desired inequality.

Proposition 38. With probability at least $1-\delta$,
\[
\max_i |\psi(v_i) - \overline{\psi}_i| \le e^{C}\Big(\frac{\log(2/\delta)^{N/2}}{\sqrt{n}}\,q(n,\delta/2)^3 + \frac{C\,q(n,\delta/2)}{n^{3/2}}\Big).
\]

Proof. With the same argument used in Corollary 13, we see that, with probability at least $1-\delta$, for $i \le n^{3/4}$ we have $|\psi(v_i) - \overline{\psi}_i| \le \frac{C}{\sqrt{n}}q(n,\delta)$. On the other hand, since $(dM'\overline{\psi})_i = p_i$, the previous proposition implies that for $i > n^{3/4}$
\[
n\,\big|\big(dM'(\Psi - \overline{\psi})\big)_i\big| \le C\,\frac{i^{N-S}}{n^{N-S}}\,\frac{\log(2/\delta)^{N/2}}{\sqrt{n}}\,q(n,\delta/2)^3.
\]
Letting $\varepsilon_i = |\psi(v_i) - \overline{\psi}_i|$, we see that the previous inequality defines the following recursive inequality:
\[
n\,dM'_{ii}\varepsilon_i \le C\,\frac{i^{N-S}}{n^{N-S}}\,\frac{\log(2/\delta)^{N/2}}{\sqrt{n}}\,q(n,\delta/2)^3 + Cn\sum_{j=1}^{i-2} |dM'_{ij}|\,\varepsilon_j,
\]
where we used the fact that $dM'_{i,i-1} = 0$. Since $dM'_{ii} = 2M_{ii} \ge 2C\,\frac{i^{N-S-1}}{n^{N-S-1}}\,\frac{1}{n}$, after dividing the above inequality by $n\,dM'_{ii}$ we obtain
\[
\varepsilon_i \le C\,\frac{\log(2/\delta)^{N/2}}{\sqrt{n}}\,q(n,\delta/2)^3 + \frac{C}{n}\sum_{j=1}^{i-2}\varepsilon_j.
\]
Using Lemma 16 again, we conclude that
\[
\varepsilon_i \le e^{C}\Big(\frac{\log(2/\delta)^{N/2}}{\sqrt{n}}\,q(n,\delta/2)^3 + \frac{C\,q(n,\delta/2)}{n^{3/2}}\Big).
\]


Theorem 14. If Assumptions 1, 2 and 3 are satisfied, then, with probability at least $1-\delta$ over the draw of a sample of size $n$, the following bound holds for all $i \in [1, n]$:
\[
|\beta_i - \beta(v_i)| \le e^{C}\Big(\frac{\log(2/\delta)^{N/2}}{\sqrt{n}}\,q(n,\delta/2)^3 + \frac{C\,q(n,\delta/2)}{n^{3/2}}\Big),
\]
where $q(n,\delta) = \frac{2}{c}\log\frac{nc}{2\delta}$ with $c$ defined in Assumption 1, and where $C$ is some universal constant.

Proof. Since $\psi_i - \psi(v_i) = \beta(v_i) - \beta_i$, the result follows from the triangle inequality $|\psi_i - \psi(v_i)| \le |\psi_i - \overline{\psi}_i| + |\overline{\psi}_i - \psi(v_i)|$ together with the previous proposition and Proposition 36.


Appendix D

Appendix to Chapter 5

D.1 Appendix to Chapter 5

Lemma 20. The function $g: \gamma \mapsto \frac{\log\frac{1}{\gamma}}{1-\gamma}$ is decreasing over the interval $(0, 1)$.

Proof. This can be straightforwardly established:
\[
g'(\gamma) = \frac{-\frac{1-\gamma}{\gamma} + \log\frac{1}{\gamma}}{(1-\gamma)^2} = \frac{\gamma\log\big(1 - \big[1 - \frac{1}{\gamma}\big]\big) - (1-\gamma)}{\gamma(1-\gamma)^2} < \frac{(1-\gamma) - (1-\gamma)}{\gamma(1-\gamma)^2} = 0,
\]
using the inequality $\log(1-x) < -x$, valid for all $x < 0$.

Lemma 21. Let $a \ge 0$ and let $g: D \subset \mathbb{R} \to [a, \infty)$ be a decreasing and differentiable function. Then, the function $F: \mathbb{R} \to \mathbb{R}$ defined by
\[
F(\gamma) = g(\gamma) - \sqrt{g(\gamma)^2 - b}
\]
is increasing for all values of $b \in [0, a]$.

Proof. We will show that $F'(\gamma) \ge 0$ for all $\gamma \in D$. Since $F' = g'\big[1 - g\,(g^2 - b)^{-1/2}\big]$ and $g' \le 0$ by hypothesis, the previous statement is equivalent to showing that $\sqrt{g^2 - b} \le g$, which is trivially verified since $b \ge 0$.

Theorem 15. Let $1/2 < \gamma < \gamma_0 < 1$ and $r^* = \big\lceil \operatorname{argmin}_{r \ge 1}\, r + \frac{\gamma_0^r T}{(1-\gamma_0)(1-\gamma_0^r)} \big\rceil$. For any $v \in [0, 1]$, if $T > 4$, the regret of $\mathrm{PFS}_{r^*}$ satisfies
\[
\mathrm{Reg}(\mathrm{PFS}_{r^*}, v) \le \big(2v\gamma_0 T_{\gamma_0}\log(cT) + 1 + v\big)(\log_2\log_2 T + 1) + 4T_{\gamma_0},
\]
where $c = 4\log 2$.


Proof. It is not hard to verify that the function $r \mapsto r + \frac{\gamma_0^r T}{(1-\gamma_0)(1-\gamma_0^r)}$ is convex and approaches infinity as $r \to \infty$. Thus, it admits a minimizer $\tilde{r}^*$ whose explicit expression can be found by solving the following equation:
\[
0 = \frac{d}{dr}\Big(r + \frac{\gamma_0^r T}{(1-\gamma_0)(1-\gamma_0^r)}\Big) = 1 + \frac{\gamma_0^r\, T\log\gamma_0}{(1-\gamma_0)(1-\gamma_0^r)^2}.
\]
Solving the corresponding second-degree equation yields
\[
\gamma_0^{\tilde{r}^*} = \frac{2 + \frac{T\log(1/\gamma_0)}{1-\gamma_0} - \sqrt{\Big(2 + \frac{T\log(1/\gamma_0)}{1-\gamma_0}\Big)^2 - 4}}{2} =: F(\gamma_0).
\]
By Lemmas 20 and 21, the function $F$ thereby defined is increasing. Therefore, $\gamma_0^{\tilde{r}^*} \le \lim_{\gamma_0\to 1} F(\gamma_0)$ and
\[
\gamma_0^{\tilde{r}^*} \le \frac{2 + T - \sqrt{(2+T)^2 - 4}}{2} = \frac{4}{2\big(2 + T + \sqrt{(2+T)^2 - 4}\big)} \le \frac{2}{T}. \tag{D.1}
\]
By the same argument, we must have $\gamma_0^{\tilde{r}^*} \ge F(1/2)$, that is,
\[
\gamma_0^{\tilde{r}^*} \ge F(1/2) = \frac{2 + 2T\log 2 - \sqrt{(2 + 2T\log 2)^2 - 4}}{2} = \frac{4}{2\big(2 + 2T\log 2 + \sqrt{(2 + 2T\log 2)^2 - 4}\big)} \ge \frac{2}{4 + 4T\log 2} \ge \frac{1}{4T\log 2}.
\]
Thus,
\[
r^* = \lceil \tilde{r}^* \rceil \le \frac{\log(1/F(1/2))}{\log(1/\gamma_0)} + 1 \le \frac{\log(4T\log 2)}{\log(1/\gamma_0)} + 1. \tag{D.2}
\]
Combining inequalities (D.1) and (D.2) with (5.7) gives
\[
\begin{aligned}
\mathrm{Reg}(\mathrm{PFS}_{r^*}, v) &\le \Big(v\,\frac{\log(4T\log 2)}{\log(1/\gamma_0)} + 1 + v\Big)(\lceil\log_2\log_2 T\rceil + 1) + \frac{(1+\gamma_0)T}{(1-\gamma_0)(T-2)} \\
&\le \big(2v\gamma_0 T_{\gamma_0}\log(cT) + 1 + v\big)(\lceil\log_2\log_2 T\rceil + 1) + 4T_{\gamma_0},
\end{aligned}
\]
using the inequality $\log\frac{1}{\gamma} \ge \frac{1-\gamma}{2\gamma}$, valid for all $\gamma \in (1/2, 1)$.
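The closed form $F(\gamma_0)$ makes $r^*$ easy to compute in practice. The snippet below is only an illustrative check with values of $T$ and $\gamma_0$ of our own choosing: it compares $\lceil\log(1/F(\gamma_0))/\log(1/\gamma_0)\rceil$ with a brute-force search over integer $r$; the two may differ by one, since the integer minimizer need not equal the ceiling of the continuous one.

```python
import math

def objective(r, T, g0):
    # the function minimized in the statement of Theorem 15
    return r + (g0 ** r) * T / ((1 - g0) * (1 - g0 ** r))

def F(g0, T):
    # closed form for gamma_0 ** (r*) from the second-degree equation above
    a = 2 + T * math.log(1 / g0) / (1 - g0)
    return (a - math.sqrt(a * a - 4)) / 2

T = 10_000
for g0 in [0.6, 0.75, 0.9]:
    r_closed = math.ceil(math.log(1 / F(g0, T)) / math.log(1 / g0))
    r_brute = min(range(1, 500), key=lambda r: objective(r, T, g0))
    print(g0, r_closed, r_brute, F(g0, T) <= 2 / T)  # last entry checks (D.1)
```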


D.1.1 Lower bound for monotone algorithms

Lemma 6. Let $(p_t)_{t=1}^{T}$ be a decreasing sequence of prices. Assume that the seller faces a truthful buyer. Then, if $v$ is sampled uniformly at random in the interval $[\frac12, 1]$, the following inequality holds:
\[
\mathbb{E}[\kappa^*] \ge \frac{1}{32\,\mathbb{E}[v - p_{\kappa^*}]}.
\]

Proof. Since the buyer is truthful, $\kappa^*(v) = \kappa$ if and only if $v \in [p_\kappa, p_{\kappa-1}]$. Thus, we can write
\[
\mathbb{E}[v - p_{\kappa^*}] = \sum_{\kappa=2}^{\kappa_{\max}} \mathbb{E}\big[\mathbf{1}_{v\in[p_\kappa, p_{\kappa-1}]}(v - p_\kappa)\big] = \sum_{\kappa=2}^{\kappa_{\max}} \int_{p_\kappa}^{p_{\kappa-1}} (v - p_\kappa)\,dv = \sum_{\kappa=2}^{\kappa_{\max}} \frac{(p_{\kappa-1} - p_\kappa)^2}{2},
\]
where $\kappa_{\max} = \kappa^*(\frac12)$. Thus, by the Cauchy-Schwarz inequality, we can write
\[
\mathbb{E}\Big[\sum_{\kappa=2}^{\kappa^*} p_{\kappa-1} - p_\kappa\Big]
\le \mathbb{E}\Bigg[\sqrt{\kappa^*\sum_{\kappa=2}^{\kappa^*}(p_{\kappa-1} - p_\kappa)^2}\Bigg]
\le \mathbb{E}\Bigg[\sqrt{\kappa^*\sum_{\kappa=2}^{\kappa_{\max}}(p_{\kappa-1} - p_\kappa)^2}\Bigg]
= \mathbb{E}\Big[\sqrt{2\kappa^*\,\mathbb{E}[v - p_{\kappa^*}]}\Big]
\le \sqrt{\mathbb{E}[\kappa^*]}\,\sqrt{2\,\mathbb{E}[v - p_{\kappa^*}]},
\]
where the last step holds by Jensen's inequality. In view of that, since $v \ge p_{\kappa^*}$, it follows that
\[
\frac34 = \mathbb{E}[v] \ge \mathbb{E}[p_{\kappa^*}] = \mathbb{E}\Big[\sum_{\kappa=2}^{\kappa^*} p_\kappa - p_{\kappa-1}\Big] + p_1 \ge -\sqrt{\mathbb{E}[\kappa^*]}\,\sqrt{2\,\mathbb{E}[v - p_{\kappa^*}]} + 1.
\]
Solving for $\mathbb{E}[\kappa^*]$ concludes the proof.
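The inequality of Lemma 6 can also be checked by simulation. The sketch below is an illustration under assumptions of our own: an arbitrary linear price sequence decreasing from $p_1 = 1$ to $1/2$ (the last step of the proof uses $p_1 = 1$) and a truthful buyer who accepts the first price not exceeding her valuation.

```python
import numpy as np

# Monte Carlo check of Lemma 6: E[kappa*] * E[v - p_{kappa*}] >= 1/32.
# The linear price sequence is an arbitrary illustrative choice with p_1 = 1.
rng = np.random.default_rng(0)
T = 200
k = np.arange(T)
p = 1.0 - 0.5 * k / (T - 1)                 # decreasing from 1 to 1/2
v = rng.uniform(0.5, 1.0, size=1_000_000)   # valuations uniform on [1/2, 1]

# first (0-based) accepted round: p[k] <= v  <=>  k >= 2 * (1 - v) * (T - 1)
kappa0 = np.ceil(2.0 * (1.0 - v) * (T - 1)).astype(int)
lhs = (kappa0 + 1).mean() * (v - p[kappa0]).mean()   # kappa* is 1-based
print(lhs, ">=", 1 / 32, lhs >= 1 / 32)
```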

The following lemma characterizes the value of $\kappa^*$ when facing a strategic buyer.

Lemma 22. For any $v \in [0, 1]$, $\kappa^*$ satisfies $v - p_{\kappa^*} \ge C_\gamma^{\kappa^*}(p_{\kappa^*} - p_{\kappa^*+1})$ with $C_\gamma^{\kappa^*} = \frac{\gamma - \gamma^{T-\kappa^*+1}}{1-\gamma}$. Furthermore, when $\kappa^* \le 1 + \sqrt{T_\gamma T}$ and $T \ge T_\gamma + 2\frac{\log(2/\gamma)}{\log(1/\gamma)}$, $C_\gamma^{\kappa^*}$ can be replaced by the universal constant $C_\gamma = \frac{\gamma}{2(1-\gamma)}$.

Proof. Since an optimal strategy is played by the buyer, the surplus obtained by accepting a price at time $\kappa^*$ must be greater than the corresponding surplus obtained when accepting the first price at time $\kappa^* + 1$. It thus follows that
\[
\sum_{t=\kappa^*}^{T} \gamma^{t-1}(v - p_{\kappa^*}) \ge \sum_{t=\kappa^*+1}^{T} \gamma^{t-1}(v - p_{\kappa^*+1})
\;\Rightarrow\;
\gamma^{\kappa^*-1}(v - p_{\kappa^*}) \ge \sum_{t=\kappa^*+1}^{T} \gamma^{t-1}(p_{\kappa^*} - p_{\kappa^*+1}) = \frac{\gamma^{\kappa^*} - \gamma^{T}}{1-\gamma}\,(p_{\kappa^*} - p_{\kappa^*+1}).
\]
Dividing both sides of the inequality by $\gamma^{\kappa^*-1}$ yields the first statement of the lemma. Let us verify the second statement. A straightforward calculation shows that the conditions on $T$ imply $T - \sqrt{TT_\gamma} \ge \frac{\log(2/\gamma)}{\log(1/\gamma)}$; therefore,
\[
C_\gamma^{\kappa^*} \ge \frac{\gamma - \gamma^{T-\sqrt{T_\gamma T}}}{1-\gamma} \ge \frac{\gamma - \gamma^{\frac{\log(2/\gamma)}{\log(1/\gamma)}}}{1-\gamma} = \frac{\gamma - \frac{\gamma}{2}}{1-\gamma} = \frac{\gamma}{2(1-\gamma)}.
\]
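The last chain of inequalities uses the identity $\gamma^{\log(2/\gamma)/\log(1/\gamma)} = \gamma/2$, which can be confirmed numerically; the snippet below is only a sanity check.

```python
import math

# Verify gamma ** (log(2/gamma) / log(1/gamma)) == gamma / 2 for a few gammas.
for gamma in [0.55, 0.7, 0.85, 0.99]:
    exponent = math.log(2 / gamma) / math.log(1 / gamma)
    assert math.isclose(gamma ** exponent, gamma / 2)
print("identity verified")
```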

Proposition 39. For any convex decreasing sequence $(p_t)_{t=1}^{T}$, if $T \ge T_\gamma + 2\frac{\log(2/\gamma)}{\log(1/\gamma)}$, then there exists a valuation $v_0 \in [\frac12, 1]$ for the buyer such that
\[
\mathrm{Reg}(A_m, v_0) \ge \max\Bigg(\frac18\sqrt{T - \sqrt{T}},\;\; \frac12\sqrt{C_\gamma\big(T - \sqrt{T_\gamma T}\big)\Big(\frac12 - \sqrt{\tfrac{C_\gamma}{T}}\Big)}\Bigg) = \Omega\big(\sqrt{T} + \sqrt{C_\gamma T}\big).
\]

Proof. In view of Proposition 19, we only need to verify that there exists $v_0 \in [\frac12, 1]$ such that
\[
\mathrm{Reg}(A_m, v_0) \ge \frac12\sqrt{C_\gamma\big(T - \sqrt{T_\gamma T}\big)\Big(\frac12 - \sqrt{\tfrac{C_\gamma}{T}}\Big)}.
\]
Let $\kappa_{\min} = \kappa^*(1)$ and $\kappa_{\max} = \kappa^*(\frac12)$. If $\kappa_{\min} > 1 + \sqrt{T_\gamma T}$, then $\mathrm{Reg}(A_m, 1) \ge 1 + \sqrt{T_\gamma T}$, from which the statement of the proposition can be derived straightforwardly. Thus, in the following we only consider the case $\kappa_{\min} \le 1 + \sqrt{T_\gamma T}$. Since, by definition, the inequality $\frac12 \ge p_{\kappa_{\max}}$ holds, we can write
\[
\frac12 \ge p_{\kappa_{\max}} = \sum_{\kappa=\kappa_{\min}+1}^{\kappa_{\max}} (p_\kappa - p_{\kappa-1}) + p_{\kappa_{\min}} \ge \kappa_{\max}(p_{\kappa_{\min}+1} - p_{\kappa_{\min}}) + p_{\kappa_{\min}},
\]
where the last inequality holds by the convexity of the sequence and the fact that $p_{\kappa_{\min}} - p_{\kappa_{\min}-1} \le 0$. The inequality is equivalent to $p_{\kappa_{\min}} - p_{\kappa_{\min}+1} \ge \frac{p_{\kappa_{\min}} - \frac12}{\kappa_{\max}}$. Furthermore, by Lemma 22, we have
\[
\begin{aligned}
\max_{v\in[\frac12,1]} \mathrm{Reg}(A_m, v) &\ge \max\Big(\frac{\kappa_{\max}}{2},\; (T - \kappa_{\min})(1 - p_{\kappa_{\min}})\Big) \\
&\ge \max\Big(\frac{\kappa_{\max}}{2},\; C_\gamma(T - \kappa_{\min})(p_{\kappa_{\min}} - p_{\kappa_{\min}+1})\Big) \\
&\ge \max\Bigg(\frac{\kappa_{\max}}{2},\; C_\gamma\,\frac{(T - \kappa_{\min})\big(p_{\kappa_{\min}} - \frac12\big)}{\kappa_{\max}}\Bigg).
\end{aligned}
\]
The right-hand side is minimized for $\kappa_{\max} = \sqrt{2C_\gamma(T - \kappa_{\min})\big(p_{\kappa_{\min}} - \frac12\big)}$. Thus, there exists a valuation $v_0$ for which the following inequality holds:
\[
\mathrm{Reg}(A_m, v_0) \ge \frac12\sqrt{C_\gamma(T - \kappa_{\min})\Big(p_{\kappa_{\min}} - \frac12\Big)} \ge \frac12\sqrt{C_\gamma\big(T - \sqrt{T_\gamma T}\big)\Big(p_{\kappa_{\min}} - \frac12\Big)}.
\]
Furthermore, we can assume that $p_{\kappa_{\min}} \ge 1 - \sqrt{\frac{C_\gamma}{T}}$, since otherwise $\mathrm{Reg}(A_m, 1) \ge (T-1)\sqrt{C_\gamma/T}$, which is easily seen to imply the desired lower bound. Thus, there exists a valuation $v_0$ such that
\[
\mathrm{Reg}(A_m, v_0) \ge \frac12\sqrt{C_\gamma\big(T - \sqrt{T_\gamma T}\big)\Big(\frac12 - \sqrt{\tfrac{C_\gamma}{T}}\Big)},
\]
which concludes the proof.

D.2 Simulations

Here, we present the results of more extensive simulations for $\mathrm{PFS}_r$ and the monotone algorithm. Again, we consider two different scenarios. Figure D.1 shows the experimental results for an agnostic scenario where the value of the parameter $\gamma$ remains unknown to both algorithms and where the parameter $r$ of $\mathrm{PFS}_r$ is set to $\log(T)$. The results reported in Figure D.2 correspond to the second scenario, where the discounting factor $\gamma$ is known to the algorithms and where the parameter $\beta$ of the monotone algorithm is set to $1 - 1/\sqrt{TT_\gamma}$. The scale on the plots is logarithmic in both the number of rounds and the regret.


[Figure: a 4-by-4 grid of log-log regret plots; rows $v = 0.20, 0.40, 0.60, 0.80$; columns $\gamma = 0.50, 0.60, 0.70, 0.80$; each panel compares PFS$_r$ and the monotone algorithm.]

Figure D.1: Regret curves for $\mathrm{PFS}_r$ and monotone for different values of $v$ and $\gamma$. The value of $\gamma$ is not known to the algorithms.


[Figure: a 4-by-4 grid of log-log regret plots; rows $v = 0.20, 0.40, 0.60, 0.80$; columns $\gamma = 0.50, 0.60, 0.70, 0.80$; each panel compares PFS$_r$ and the monotone algorithm.]

Figure D.2: Regret curves for $\mathrm{PFS}_r$ and monotone for different values of $v$ and $\gamma$. The value of $\gamma$ is known to both algorithms.
