
Cédric Villani

Optimal transport, old and new

June 13, 2008

Springer
Berlin Heidelberg New York Hong Kong London Milan Paris Tokyo

Do mo chuisle mo chroí, Aëlle


This is the June 14, 2008 version of my lecture notes for the 2005 Saint-Flour summer school. The changes with respect to the previous version, which I had daringly called "final", are the following:

- a third Appendix to Chapter 14, to clarify certain properties of Jacobi fields and fill a gap (pointed out to me by D. Cordero-Erausquin) in the discussion of distortion coefficients;
- a corrected statement for Corollary 5.23 (stability of the transport map), given to me by B. Schulte;
- Step 5 of the dreadful proof of Theorem 23.13, which used the faulty version of Corollary 5.23, has been corrected; as a result that proof is even dreadfuller now;
- Chapter 23 on concentration has been updated with some recent results by N. Gozlan.

This is the version sent back to the publisher after copyediting.

C. Villani
UMPA, ENS Lyon
46 allée d'Italie
69364 Lyon Cedex 07
FRANCE
Email: [email protected]
Webpage: www.umpa.ens-lyon.fr/~cvillani

Contents

Preface  1
Conventions  7
Introduction  13

1  Couplings and changes of variables  17
2  Three examples of coupling techniques  33
3  The founding fathers of optimal transport  41

Part I  Qualitative description of optimal transport  51

4  Basic properties  55
5  Cyclical monotonicity and Kantorovich duality  63
6  The Wasserstein distances  105
7  Displacement interpolation  125
8  The Monge–Mather shortening principle  175
9  Solution of the Monge problem I: Global approach  217
10  Solution of the Monge problem II: Local approach  227
11  The Jacobian equation  287
12  Smoothness  295
13  Qualitative picture  347

Part II  Optimal transport and Riemannian geometry  367

14  Ricci curvature  371
15  Otto calculus  435
16  Displacement convexity I  449
17  Displacement convexity II  463
18  Volume control  507
19  Density control and local regularity  521
20  Infinitesimal displacement convexity  541
21  Isoperimetric-type inequalities  561
22  Concentration inequalities  583
23  Gradient flows I  645
24  Gradient flows II: Qualitative properties  709
25  Gradient flows III: Functional inequalities  735

Part III  Synthetic treatment of Ricci curvature  747

26  Analytic and synthetic points of view  751
27  Convergence of metric-measure spaces  759
28  Stability of optimal transport  789
29  Weak Ricci curvature bounds I: Definition and Stability  811
30  Weak Ricci curvature bounds II: Geometric and analytic properties  865

Conclusions and open problems  921

References  933
List of short statements  975
List of figures  983
Index  985
Some notable cost functions  989

Preface

When I was first approached for the 2005 edition of the Saint-Flour Probability Summer School, I was intrigued, flattered and scared.1 Apart from the challenge posed by the teaching of a rather analytical subject to a probabilistic audience, there was the danger of producing a remake of my recent book Topics in Optimal Transportation.

However, I gradually realized that I was offered a unique opportunity to rewrite the whole theory from a different perspective, with alternative proofs and different focus, and a more probabilistic presentation; plus the incorporation of recent progress. Among the most striking of these recent advances, there was the rising awareness that John Mather's minimal measures had a lot to do with optimal transport, and that both theories could actually be embedded in a single framework. There was also the discovery that optimal transport could provide a robust synthetic approach to Ricci curvature bounds. These links with dynamical systems on one hand, differential geometry on the other hand, were only briefly alluded to in my first book; here on the contrary they will be at the basis of the presentation. To summarize: more probability, more geometry, and more dynamical systems. Of course there cannot be more of everything, so in some sense there is less analysis and less physics, and also there are fewer digressions.

So the present course is by no means a reduction or an expansion of my previous book, but should be regarded as a complementary reading. Both sources can be read independently, or together, and hopefully the complementarity of points of view will have pedagogical value.

Throughout the book I have tried to optimize the results and the presentation, to provide complete and self-contained proofs of the most important results, and comprehensive bibliographical notes (a dauntingly difficult task in view of the rapid expansion of the literature). Many statements and theorems have been written specifically for this course, and many results appear in rather sharp form for the first time. I also added several appendices, either to present some domains of mathematics to non-experts, or to provide proofs of important auxiliary results. All this has resulted in a rapid growth of the document, which in the end is about six times (!) the size that I had planned initially. So the non-expert reader is advised to skip long proofs at first reading, and concentrate on explanations, statements, examples and sketches of proofs when they are available.

1 Fans of Tom Waits may have identified this quotation.


About terminology: For some reason I decided to switch from "transportation" to "transport", but this really is a matter of taste.

For people who are already familiar with the theory of optimal transport, here are some more serious changes.

Part I is devoted to a qualitative description of optimal transport. The dynamical point of view is given a prominent role from the beginning, with Robert McCann's concept of displacement interpolation. This notion is discussed before any theorem about the solvability of the Monge problem, in an abstract setting of Lagrangian action which generalizes the notion of length space. This provides a unified picture of recent developments dealing with various classes of cost functions, in a smooth or nonsmooth context.

I also wrote down in detail some important estimates by John Mather, well-known in certain circles, and made extensive use of them, in particular to prove the Lipschitz regularity of "intermediate" transport maps (starting from some intermediate time, rather than from initial time). Then the absolute continuity of displacement interpolants comes for free, and this gives a more unified picture of the Mather and Monge–Kantorovich theories. I rewrote in this way the classical theorems of solvability of the Monge problem for quadratic cost in Euclidean space. Finally, this approach allows one to treat change of variables formulas associated with optimal transport by means of changes of variables that are Lipschitz, and not just with bounded variation.

Part II discusses optimal transport in Riemannian geometry, a line of research which started around 2000; I have rewritten all these applications in terms of Ricci curvature, or more precisely curvature-dimension bounds. This part opens with an introduction to Ricci curvature, hopefully readable without any prior knowledge of this notion.

Part III presents a synthetic treatment of Ricci curvature bounds in metric-measure spaces. It starts with a presentation of the theory of Gromov–Hausdorff convergence; all the rest is based on recent research papers mainly due to John Lott, Karl-Theodor Sturm and myself.

In all three parts, noncompact situations will be systematically treated, either by limiting processes, or by restriction arguments (the restriction of an optimal transport is still optimal; this is a simple but powerful principle). The notion of approximate differentiability, introduced in the field by Luigi Ambrosio, appears to be particularly handy in the study of optimal transport in noncompact Riemannian manifolds.


Several parts of the subject are not developed as much as they would deserve. Numerical simulation is not addressed at all, except for a few comments in the concluding part. The regularity theory of optimal transport is described in Chapter 12 (including the remarkable recent works of Xu-Jia Wang, Neil Trudinger and Grégoire Loeper), but without the core proofs and latest developments; this is not only because of the technicality of the subject, but also because smoothness is not needed in the rest of the book. Still another poorly developed subject is the Monge–Mather–Mañé problem arising in dynamical systems, and including as a variant the optimal transport problem when the cost function is a distance. This topic is discussed in several treatises, such as Albert Fathi's monograph, Weak KAM theorem in Lagrangian dynamics; but now it would be desirable to rewrite everything in a framework that also encompasses the optimal transport problem. An important step in this direction was recently performed by Patrick Bernard and Boris Buffoni. In Chapter 8 I shall provide an introduction to Mather's theory, but there would be much more to say.

The treatment of Chapter 22 (concentration of measure) is strongly influenced by Michel Ledoux's book, The Concentration of Measure Phenomenon; while the results of Chapters 23 to 25 owe a lot to the monograph by Luigi Ambrosio, Nicola Gigli and Giuseppe Savaré, Gradient flows in metric spaces and in the space of probability measures. Both references are warmly recommended complementary reading. One can also consult the two-volume treatise by Svetlozar Rachev and Ludger Rüschendorf, Mass Transportation Problems, for many applications of optimal transport to various fields of probability theory.

While writing this text I asked for help from a number of friends and collaborators. Among them, Luigi Ambrosio and John Lott are the ones whom I requested most to contribute; this book owes a lot to their detailed comments and suggestions. Most of Part III, but also significant portions of Parts I and II, are made up with ideas taken from my collaborations with John, which started in 2004 as I was enjoying the hospitality of the Miller Institute in Berkeley. Frequent discussions with Patrick Bernard and Albert Fathi allowed me to get the links between optimal transport and John Mather's theory, which were a key to the presentation in Part I; John himself gave precious hints about the history of the subject. Neil Trudinger and Xu-Jia Wang spent vast amounts of time teaching me the regularity theory of Monge–Ampère equations. Alessio Figalli took up the dreadful challenge to check the entire set of notes from first to last page.

Apart from these people, I got valuable help from Stefano Bianchini, François Bolley, Yann Brenier, Xavier Cabré, Vincent Calvez, José Antonio Carrillo, Dario Cordero-Erausquin, Denis Feyel, Sylvain Gallot, Wilfrid Gangbo, Diogo Aguiar Gomes, Nathaël Gozlan, Arnaud Guillin, Kazuhiro Kuwae, Michel Ledoux, Grégoire Loeper, Francesco Maggi, Robert McCann, Shin-ichi Ohta, Vladimir Oliker, Yann Ollivier, Felix Otto, Ludger Rüschendorf, Giuseppe Savaré, Walter Schachermayer, Benedikt Schulte, Theo Sturm, Josef Teichmann, Anton Thalmaier, Hermann Thorisson, Süleyman Üstünel, Anatoly Vershik, and others.

Short versions of this course were tried on mixed audiences in the Universities of Bonn, Dortmund, Grenoble and Orléans, as well as the Borel seminar in Leysin and the IHÉS in Bures-sur-Yvette. Part of the writing was done during stays at the marvelous MFO Institute in Oberwolfach, the CIRM in Luminy, and the Australian National University in Canberra. All these institutions are warmly thanked.

It is a pleasure to thank Jean Picard for all his organization work on the 2005 Saint-Flour summer school; and the participants for their questions, comments and bug-tracking, in particular Sylvain Arlot (great bug-tracker!), Fabrice Baudoin, Jérôme Demange, Steve Evans (whom I also thank for his beautiful lectures), Christophe Leuridan, Jan Obłój, Erwan Saint Loubert Bié, and others. I extend these thanks to the joyful group of young PhD students and maîtres de conférences with whom I spent such a good time on excursions, restaurants, quantum ping-pong and other activities, making my stay in Saint-Flour truly wonderful (with special thanks to my personal driver, Stéphane Loisel, and my table tennis sparring-partner and adversary, François Simenhaus). I will cherish my visit there in memory as long as I live!

Typing these notes was mostly performed on my (now defunct) faithful laptop Torsten, a gift of the Miller Institute. Support by the Agence Nationale de la Recherche and Institut Universitaire de France is acknowledged. My eternal gratitude goes to those who made fine typesetting accessible to every mathematician, most notably Donald Knuth for TeX, and the developers of LaTeX, BibTeX and XFig. Final thanks to Catriona Byrne and her team for a great editing process.

As usual, I encourage all readers to report mistakes and misprints. I will maintain a list of errata, accessible from my Web page.

Cédric Villani
Lyon, June 2008

Conventions


Axioms. I use the classical axioms of set theory; not the full version of the axiom of choice (only the classical axiom of countable dependent choice).

Sets and structures. Id is the identity mapping, whatever the space. If A is a set then the function 1_A is the indicator function of A: 1_A(x) = 1 if x ∈ A, and 0 otherwise. If F is a formula, then 1_F is the indicator function of the set defined by the formula F.

If f and g are two functions, then (f, g) is the function x ↦ (f(x), g(x)). The composition f ∘ g will often be denoted by f(g).

N is the set of positive integers: N = {1, 2, 3, ...}. A sequence is written (x_k)_{k∈N}, or simply, when no confusion seems possible, (x_k).

R is the set of real numbers. When I write R^n it is implicitly assumed that n is a positive integer. The Euclidean scalar product between two vectors a and b in R^n is denoted interchangeably by a · b or ⟨a, b⟩. The Euclidean norm will be denoted simply by | · |, independently of the dimension n.

M_n(R) is the space of real n × n matrices, and I_n the n × n identity matrix. The trace of a matrix M will be denoted by tr M, its determinant by det M, its adjoint by M*, and its Hilbert–Schmidt norm √(tr(M*M)) by ‖M‖_HS (or just ‖M‖).

Unless otherwise stated, Riemannian manifolds appearing in the text are finite-dimensional, smooth and complete. If a Riemannian manifold M is given, I shall usually denote by n its dimension, by d the geodesic distance on M, and by vol the volume (= n-dimensional Hausdorff) measure on M. The tangent space at x will be denoted by T_x M, and the tangent bundle by TM. The norm on T_x M will most of the time be denoted by | · |, as in R^n, without explicit mention of the point x. (The symbol ‖ · ‖ will be reserved for special norms or functional norms.) If S is a set without smooth structure, the notation T_x S will instead denote the tangent cone to S at x (Definition 10.46).

If Q is a quadratic form defined on R^n, or on the tangent bundle of a manifold, its value on a (tangent) vector v will be denoted by ⟨Qv, v⟩, or simply Q(v).

The open ball of radius r and center x in a metric space X is denoted interchangeably by B(x, r) or B_r(x). If X is a Riemannian manifold, the distance is of course the geodesic distance. The closed ball will be denoted interchangeably by B[x, r] or B_{r]}(x). The diameter of a metric space X will be denoted by diam(X).


The closure of a set A in a metric space will be denoted by A̅ (this is also the set of all limits of sequences with values in A).

A metric space X is said to be locally compact if every point x ∈ X admits a compact neighborhood; and boundedly compact if every closed and bounded subset of X is compact.

A map f between metric spaces (X, d) and (X', d') is said to be C-Lipschitz if d'(f(x), f(y)) ≤ C d(x, y) for all x, y in X. The best admissible constant C is then denoted by ‖f‖_Lip. A map is said to be locally Lipschitz if it is Lipschitz on bounded sets, not necessarily compact (so it makes sense to speak of a locally Lipschitz map defined almost everywhere).

A curve in a space X is a continuous map defined on an interval of R, valued in X. For me the words "curve" and "path" are synonymous. The time-t evaluation map e_t is defined by e_t(γ) = γ_t = γ(t). If γ is a curve defined from an interval of R into a metric space, its length will be denoted by L(γ), and its speed by |γ̇|; definitions are recalled on p. 131. Usually geodesics will be minimizing, constant-speed geodesic curves. If X is a metric space, Γ(X) stands for the space of all geodesics γ : [0, 1] → X.

Being given x0 and x1 in a metric space, I denote by [x0, x1]_t the set of all t-barycenters of x0 and x1, as defined on p. 407. If A0 and A1 are two sets, then [A0, A1]_t stands for the set of all [x0, x1]_t with (x0, x1) ∈ A0 × A1.

Function spaces. C(X) is the space of continuous functions X → R, C_b(X) the space of bounded continuous functions X → R; and C0(X) the space of continuous functions X → R converging to 0 at infinity; all of them are equipped with the norm of uniform convergence ‖u‖_∞ = sup |u|. Then C_b^k(X) is the space of k-times continuously differentiable functions u : X → R, such that all the partial derivatives of u up to order k are bounded; it is equipped with the norm given by the supremum of all norms ‖∂u‖_{C_b}, where ∂u is a partial derivative of order at most k; C_c^k(X) is the space of k-times continuously differentiable functions with compact support; etc. When the target space is not R but some other space Y, the notation is transformed in an obvious way: C(X; Y), etc.

L^p is the Lebesgue space of exponent p; the space and the measure will often be implicit, but clear from the context.


Calculus. The derivative of a function u = u(t), defined on an interval of R and valued in R^n or in a smooth manifold, will be denoted by u', or more often by u̇. The notation d⁺u/dt stands for the upper right-derivative of a real-valued function u: d⁺u/dt = lim sup_{s↓0} [u(t + s) − u(t)]/s.

If u is a function of several variables, the partial derivative with respect to the variable t will be denoted by ∂_t u, or ∂u/∂t. The notation u_t does not stand for ∂_t u, but for u(t).

The gradient operator will be denoted by grad or simply ∇; the divergence operator by div or ∇·; the Laplace operator by Δ; the Hessian operator by Hess or ∇² (so ∇² does not stand for the Laplace operator). The notation is the same in R^n or in a Riemannian manifold. Δ is the divergence of the gradient, so it is typically a nonpositive operator. The value of the gradient of f at point x will be denoted either by ∇_x f or ∇f(x). The notation ∇̃ stands for the approximate gradient, introduced in Definition 10.2.

If T is a map R^n → R^n, ∇T stands for the Jacobian matrix of T, that is the matrix of all partial derivatives ∂T_i/∂x_j (1 ≤ i, j ≤ n).

All these differential operators will be applied to (smooth) functions but also to measures, by duality. For instance, the Laplacian of a measure μ is defined via the identity ∫ ζ d(Δμ) = ∫ (Δζ) dμ (ζ ∈ C_c²). The notation is consistent in the sense that Δ(f vol) = (Δf) vol. Similarly, I shall take the divergence of a vector-valued measure, etc.

f = o(g) means f/g → 0 (in an asymptotic regime that should be clear from the context), while f = O(g) means that f/g is bounded.

log stands for the natural logarithm with base e.

The positive and negative parts of x ∈ R are defined respectively by x₊ = max(x, 0) and x₋ = max(−x, 0); both are nonnegative, and |x| = x₊ + x₋. The notation a ∧ b will sometimes be used for min(a, b). All these notions are extended in the usual way to functions and also to signed measures.

Probability measures. δ_x is the Dirac mass at point x.

All measures considered in the text are Borel measures on Polish spaces, which are complete, separable metric spaces, equipped with their Borel σ-algebra. I shall usually not use the completed σ-algebra, except on some rare occasions (emphasized in the text) in Chapter 5.

A measure is said to be finite if it has finite mass, and locally finite if it attributes finite mass to compact sets.


The space of Borel probability measures on X is denoted by P(X), the space of finite Borel measures by M₊(X), the space of signed finite Borel measures by M(X). The total variation of μ is denoted by ‖μ‖_TV.

The integral of a function f with respect to a probability measure μ will be denoted interchangeably by ∫ f(x) dμ(x) or ∫ f(x) μ(dx) or ∫ f dμ.

If μ is a Borel measure on a topological space X, a set N is said to be μ-negligible if N is included in a Borel set of zero μ-measure. Then μ is said to be concentrated on a set C if X \ C is negligible. (If C itself is Borel measurable, this is of course equivalent to μ[X \ C] = 0.) By abuse of language, I may say that X has full μ-measure if μ is concentrated on X.

If μ is a Borel measure, its support Spt μ is the smallest closed set on which it is concentrated. The same notation Spt will be used for the support of a continuous function.

If μ is a Borel measure on X, and T is a Borel map X → Y, then T#μ stands for the image measure² (or push-forward) of μ by T: It is a Borel measure on Y, defined by (T#μ)[A] = μ[T⁻¹(A)].

The law of a random variable X defined on a probability space (Ω, P) is denoted by law(X); this is the same as X#P.

The weak topology on P(X) (or topology of weak convergence, or narrow topology) is induced by convergence against C_b(X), i.e. bounded continuous test functions. If X is Polish, then the space P(X) itself is Polish. Unless explicitly stated, I do not use the weak-∗ topology of measures (induced by C0(X) or C_c(X)).

When a probability measure is clearly specified by the context, it will sometimes be denoted just by P, and the associated integral, or expectation, will be denoted by E.

If π(dx dy) is a probability measure in two variables x ∈ X and y ∈ Y, its marginal (or projection) on X (resp. Y) is the measure X#π (resp. Y#π), where X(x, y) = x, Y(x, y) = y. If (x, y) is random with law(x, y) = π, then the conditional law of x given y is denoted by π(dx|y); this is a measurable function Y → P(X), obtained by disintegrating π along its y-marginal. The conditional law of y given x will be denoted by π(dy|x).

A measure μ is said to be absolutely continuous with respect to a measure ν if there exists a measurable function f such that μ = f ν.

² Depending on the authors, the measure T#μ is often denoted by T#μ, Tμ, T(μ), ∫ δ_{T(a)} μ(da), μT⁻¹, μ ∘ T⁻¹, or μ[T].


Notation specific to optimal transport and related fields. If μ ∈ P(X) and ν ∈ P(Y) are given, then Π(μ, ν) is the set of all joint probability measures on X × Y whose marginals are μ and ν.

C(μ, ν) is the optimal (total) cost between μ and ν, see p. 92. It implicitly depends on the choice of a cost function c(x, y).

For any p ∈ [1, +∞), W_p is the Wasserstein distance of order p, see Definition 6.1; and P_p(X) is the Wasserstein space of order p, i.e. the set of probability measures with finite moments of order p, equipped with the distance W_p, see Definition 6.4.

P_c(X) is the set of probability measures on X with compact support.

If a reference measure ν on X is specified, then P^ac(X) (resp. P_p^ac(X), P_c^ac(X)) stands for those elements of P(X) (resp. P_p(X), P_c(X)) which are absolutely continuous with respect to ν.

DC_N is the displacement convexity class of order N (N plays the role of a dimension); this is a family of convex functions, defined on p. 457 and in Definition 17.1.

U_ν is a functional defined on P(X); it depends on a convex function U and a reference measure ν on X. This functional will be defined at various levels of generality, first in equation (15.2), then in Definition 29.1 and Theorem 30.4.

U^β_{π,ν} is another functional on P(X), which involves not only a convex function U and a reference measure ν, but also a coupling π and a distortion coefficient β, which is a nonnegative function on X × X: See again Definition 29.1 and Theorem 30.4.

The Γ and Γ₂ operators are quadratic differential operators associated with a diffusion operator; they are defined in (14.47) and (14.48).

β_t^{(K,N)} is the notation for the distortion coefficients that will play a prominent role in these notes; they are defined in (14.61).

CD(K, N) means "curvature-dimension condition (K, N)", which morally means that the Ricci curvature is bounded below by Kg (K a real number, g the Riemannian metric) and the dimension is bounded above by N (a real number which is not less than 1).

If c(x, y) is a cost function then č(y, x) = c(x, y). Similarly, if π(dx dy) is a coupling, then π̌ is the coupling obtained by swapping variables, that is π̌(dy dx) = π(dx dy), or more rigorously, π̌ = S#π, where S(x, y) = (y, x).

Assumptions (Super), (Twist), (Lip), (SC), (locLip), (locSC), (H∞) are defined on p. 246, (STwist) on p. 313, (Cut_{n−1}) on p. 317.

Introduction


To start, I shall recall in Chapter 1 some basic facts about couplings and changes of variables, including definitions, a short list of famous couplings (Knothe–Rosenblatt coupling, Moser coupling, optimal coupling, etc.); and some important basic formulas about change of variables, conservation of mass, and linear diffusion equations.

In Chapter 2 I shall present, without detailed proofs, three applications of optimal coupling techniques, providing a flavor of the kind of applications that will be considered later.

Finally, Chapter 3 is a short historical perspective about the foundations and development of optimal coupling theory.

1 Couplings and changes of variables

Couplings are very well-known in all branches of probability theory, but since they will occur again and again in this course, it might be a good idea to start with some basic reminders and a few more technical issues.

Definition 1.1 (Coupling). Let (X, μ) and (Y, ν) be two probability spaces. Coupling μ and ν means constructing two random variables X and Y on some probability space (Ω, P), such that law(X) = μ, law(Y) = ν. The couple (X, Y) is called a coupling of (μ, ν). By abuse of language, the law of (X, Y) is also called a coupling of (μ, ν).

If μ and ν are the only laws in the problem, then without loss of generality one may choose Ω = X × Y. In a more measure-theoretical formulation, coupling μ and ν means constructing a measure π on X × Y such that π admits μ and ν as marginals on X and Y respectively. The following three statements are equivalent ways to rephrase that marginal condition:

- (proj_X)#π = μ, (proj_Y)#π = ν, where proj_X and proj_Y respectively stand for the projection maps (x, y) ↦ x and (x, y) ↦ y;

- For all measurable sets A ⊂ X, B ⊂ Y, one has π[A × Y] = μ[A], π[X × B] = ν[B];

- For all integrable (resp. nonnegative) measurable functions φ, ψ on X, Y,

\[
\int_{X\times Y} \big[ \varphi(x) + \psi(y) \big]\, d\pi(x, y) = \int_X \varphi\, d\mu + \int_Y \psi\, d\nu.
\]
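To see Definition 1.1 in action, here is a minimal discrete sketch (my own illustration with arbitrary data, not from the text): on finite spaces a coupling is just a nonnegative matrix whose row sums and column sums are the prescribed marginals.

```python
import numpy as np

# Marginals on the finite spaces X = {0, 1, 2} and Y = {0, 1}.
mu = np.array([0.2, 0.5, 0.3])
nu = np.array([0.6, 0.4])

# The coupling in which X and Y are independent: pi = mu (tensor) nu.
pi = np.outer(mu, nu)

# Marginal conditions (proj_X)# pi = mu and (proj_Y)# pi = nu:
assert np.allclose(pi.sum(axis=1), mu)   # pi[A x Y] = mu[A]
assert np.allclose(pi.sum(axis=0), nu)   # pi[X x B] = nu[B]
```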


A first remark about couplings is that they always exist: at least there is the trivial coupling, in which the variables X and Y are independent (so their joint law is the tensor product μ ⊗ ν). This can hardly be called a coupling, since the value of X does not give any information about the value of Y. Another extreme is when all the information about the value of Y is contained in the value of X, in other words Y is just a function of X. This motivates the following definition (in which X and Y do not play symmetric roles).

Definition 1.2 (Deterministic coupling). With the notation of Definition 1.1, a coupling (X, Y) is said to be deterministic if there exists a measurable function T : X → Y such that Y = T(X).

To say that (X, Y) is a deterministic coupling of μ and ν is strictly equivalent to any one of the four statements below:

- (X, Y) is a coupling of μ and ν whose law π is concentrated on the graph of a measurable function T : X → Y;
- X has law μ and Y = T(X), where T#μ = ν;
- X has law μ and Y = T(X), where T is a change of variables from μ to ν: for all ν-integrable (resp. nonnegative measurable) functions φ,

\[
\int_Y \varphi(y)\, d\nu(y) = \int_X \varphi(T(x))\, d\mu(x); \tag{1.1}
\]

- π = (Id, T)#μ.

The map T appearing in all these statements is the same and is uniquely defined μ-almost surely (when the joint law of (X, Y) has been fixed). The converse is true: If T and T̃ coincide μ-almost surely, then T#μ = T̃#μ. It is common to call T the transport map: Informally, one can say that T transports the mass represented by the measure μ, to the mass represented by the measure ν.

Unlike couplings, deterministic couplings do not always exist: Just think of the case when μ is a Dirac mass and ν is not. But there may also be infinitely many deterministic couplings between two given probability measures.
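Continuing the same discrete sketch (again my own illustration, not from the text): a deterministic coupling is a plan with a single nonzero entry per row, concentrated on the graph of a map T, and T#μ is obtained by summing the mass of μ over preimages.

```python
import numpy as np

mu = np.array([0.2, 0.5, 0.3])       # law of X on X = {0, 1, 2}
T = np.array([1, 0, 1])              # a measurable map T : X -> Y = {0, 1}

# Push-forward: (T# mu)[B] = mu[T^{-1}(B)].
nu = np.zeros(2)
np.add.at(nu, T, mu)                 # nu = [0.5, 0.5]

# The deterministic plan pi = (Id, T)# mu, cf. (1.1):
pi = np.zeros((3, 2))
pi[np.arange(3), T] = mu             # concentrated on the graph of T
assert np.allclose(pi.sum(axis=1), mu)
assert np.allclose(pi.sum(axis=0), nu)
```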


Some famous couplings

Here below are some of the most famous couplings used in mathematics; of course the list is far from complete, since everybody has his or her own preferred coupling technique. Each of these couplings comes with its own natural setting; this variety of assumptions reflects the variety of constructions. (This is a good reason to state each of them with some generality.)

1. The measurable isomorphism. Let (X, μ) and (Y, ν) be two Polish (i.e. complete, separable, metric) probability spaces without atom (i.e. no single point carries a positive mass). Then there exists a (nonunique) measurable bijection T : X → Y such that T#μ = ν, (T⁻¹)#ν = μ. In that sense, all atomless Polish probability spaces are isomorphic, and, say, isomorphic to the space Y = [0, 1] equipped with the Lebesgue measure. Powerful as that theorem may seem, in practice the map T is very singular; as a good exercise, the reader might try to construct it "explicitly", in terms of cumulative distribution functions, when X = R and Y = [0, 1] (issues do arise when the density of μ vanishes at some places). Experience shows that it is quite easy to fall into logical traps when working with the measurable isomorphism, and my advice is to never use it.

2. The Moser mapping. Let X be a smooth compact Riemannian manifold with volume vol, and let f, g be Lipschitz continuous positive probability densities on X; then there exists a deterministic coupling of μ = f vol and ν = g vol, constructed by resolution of an elliptic equation. On the positive side, there is a somewhat explicit representation of the transport map T, and it is as smooth as can be: if f, g are C^{k,α}, then T is C^{k+1,α}. The formula is given in the Appendix at the end of this chapter. The same construction works in R^n provided that f and g decay fast enough at infinity; and it is robust enough to accommodate for variants.

3. The increasing rearrangement on R. Let μ, ν be two probability measures on R; define their cumulative distribution functions by

\[
F(x) = \int_{-\infty}^x d\mu, \qquad G(y) = \int_{-\infty}^y d\nu.
\]

Further define their right-continuous inverses by


\[
F^{-1}(t) = \inf \{ x \in \mathbb{R};\ F(x) > t \}; \qquad G^{-1}(t) = \inf \{ y \in \mathbb{R};\ G(y) > t \};
\]

and set T = G⁻¹ ∘ F. If μ does not have atoms, then T#μ = ν. This rearrangement is quite simple, explicit, as smooth as can be, and enjoys good geometric properties.

4. The Knothe–Rosenblatt rearrangement in R^n. Let μ and ν be two probability measures on R^n, such that μ is absolutely continuous with respect to Lebesgue measure. Then define a coupling of μ and ν as follows.

Step 1: Take the marginal on the first variable: this gives probability measures μ1(dx1), ν1(dy1) on R, with μ1 being atomless. Then define y1 = T1(x1) by the formula for the increasing rearrangement of μ1 into ν1.

Step 2: Now take the marginal on the first two variables and disintegrate it with respect to the first variable. This gives probability measures μ2(dx1 dx2) = μ1(dx1) μ2(dx2|x1), ν2(dy1 dy2) = ν1(dy1) ν2(dy2|y1). Then, for each given x1 ∈ R, set y1 = T1(x1), and define y2 = T2(x2; x1) by the formula for the increasing rearrangement of μ2(dx2|x1) into ν2(dy2|y1). (See Figure 1.1.)

Then repeat the construction, adding variables one after the other and defining y3 = T3(x3; x1, x2); etc. After n steps, this produces a map y = T(x) which transports μ to ν, and in practical situations might be computed explicitly with little effort. Moreover, the Jacobian matrix of the change of variables T is (by construction) upper triangular with positive entries on the diagonal; this makes it suitable for various geometric applications. On the negative side, this mapping does not satisfy many interesting intrinsic properties; it is not invariant under isometries of R^n, not even under relabeling of coordinates.
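For concreteness, here is a small numerical sketch of the increasing rearrangement (my own illustration; the function name and the choice of laws are arbitrary). The Knothe–Rosenblatt map just iterates this one-dimensional construction, coordinate by coordinate, on conditional distributions.

```python
import numpy as np

def increasing_rearrangement(mu_samples, nu_samples, x):
    """Evaluate T = G^{-1} o F at the points x, where F and G are the
    empirical cumulative distribution functions of the two samples."""
    xs, ys = np.sort(mu_samples), np.sort(nu_samples)
    t = np.searchsorted(xs, x, side="right") / len(xs)       # t = F(x)
    idx = np.minimum((t * len(ys)).astype(int), len(ys) - 1)
    return ys[idx]                   # right-continuous inverse G^{-1}(t)

rng = np.random.default_rng(0)
mu = rng.normal(size=100_000)        # source measure ~ N(0, 1)
nu = rng.exponential(size=100_000)   # target measure ~ Exp(1)
print(increasing_rearrangement(mu, nu, np.array([-1.0, 0.0, 1.0])))
# three increasing values: the map is monotone and pushes mu to nu
```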


Fig. 1.1. Second step in the construction of the Knothe–Rosenblatt map: After the correspondence x1 ↦ y1 has been determined, the conditional probability of x2 (seen as a one-dimensional probability on a small slice of width dx1) can be transported to the conditional probability of y2 (seen as a one-dimensional probability on a slice of width dy1).

5. The Holley coupling on a lattice. Let μ and ν be two discrete probabilities on a finite lattice Λ, say {0, 1}^N, equipped with the natural partial ordering (x ≤ y if x_n ≤ y_n for all n). Assume that

\[
\forall x, y \in \Lambda, \qquad \mu[\inf(x, y)]\, \nu[\sup(x, y)] \ge \mu[x]\, \nu[y]. \tag{1.2}
\]

Then there exists a coupling (X, Y) of (μ, ν) with X ≤ Y. The situation above appears in a number of problems in statistical mechanics, in connection with the so-called FKG (Fortuin–Kasteleyn–Ginibre) inequalities. Inequality (1.2) intuitively says that ν puts more mass on large values than μ.

6. Probabilistic representation formulas for solutions of partial differential equations. There are hundreds of them (if not thousands), representing solutions of diffusion, transport or jump processes as the laws of various deterministic or stochastic processes. Some of them are recalled later in this chapter.

7. The exact coupling of two stochastic processes, or Markov chains. Two realizations of a stochastic process are started at initial time, and when they happen to be in the same state at some time, they are merged: from that time on, they follow the same path and accordingly, have the same law. For two Markov chains which are started independently, this is called the classical coupling. There


are many variants with important differences which are all intended to make two trajectories close to each other after some time: the Ornstein coupling, the ε-coupling (in which one requires the two variables to be close, rather than to occupy the same state), the shift-coupling (in which one allows an additional time-shift), etc.

8. The optimal coupling or optimal transport. Here one introduces a cost function c(x, y) on X × Y, that can be interpreted as the work needed to move one unit of mass from location x to location y. Then one considers the Monge–Kantorovich minimization problem

\[
\inf\; \mathbb{E}\, c(X, Y),
\]

where the pair (X, Y) runs over all possible couplings of (μ, ν); or equivalently, in terms of measures,

\[
\inf \int_{X\times Y} c(x, y)\, d\pi(x, y),
\]

where the infimum runs over all joint probability measures π on X × Y with marginals μ and ν. Such joint measures are called transference plans (or transport plans, or transportation plans); those achieving the infimum are called optimal transference plans.

Of course, the solution of the Monge–Kantorovich problem depends on the cost function c. The cost function and the probability spaces here can be very general, and some nontrivial results can be obtained as soon as, say, c is lower semicontinuous and X, Y are Polish spaces. Even the apparently trivial choice c(x, y) = 1_{x≠y} appears in the probabilistic interpretation of total variation:

\[
\|\mu - \nu\|_{TV} = 2 \inf \big\{ \mathbb{E}\, 1_{X \neq Y};\quad \mathrm{law}(X) = \mu,\ \mathrm{law}(Y) = \nu \big\}.
\]

Cost functions valued in {0, 1} also occur naturally in Strassen's duality theorem. Under certain assumptions one can guarantee that the optimal coupling really is deterministic; the search of deterministic optimal couplings (or Monge couplings) is called the Monge problem. A solution of the Monge problem yields a plan to transport the mass at minimal cost with a recipe that associates to each point x a single point y. ("No mass shall be split.") To guarantee the existence of solutions to the


Monge problem, two kinds of assumptions are natural: First, c should "vary enough" in some sense (think that the constant cost function will allow for arbitrary minimizers), and secondly, μ should enjoy some regularity property (at least Dirac masses should be ruled out!). Here is a typical result: If c(x, y) = |x − y|² in the Euclidean space, μ is absolutely continuous with respect to Lebesgue measure, and μ, ν have finite moments of order 2, then there is a unique optimal Monge coupling between μ and ν. More general statements will be established in Chapter 10.

Optimal couplings enjoy several nice properties:

(i) They naturally arise in many problems coming from economics, physics, partial differential equations or geometry (by the way, the increasing rearrangement and the Holley coupling can be seen as particular cases of optimal transport);

(ii) They are quite stable with respect to perturbations;

(iii) They encode good geometric information, if the cost function c is defined in terms of the underlying geometry;

(iv) They exist in smooth as well as nonsmooth settings;

(v) They come with a rich structure: an optimal cost functional (the value of the infimum defining the Monge–Kantorovich problem); a dual variational problem; and, under adequate structure conditions, a continuous interpolation.

On the negative side, it is important to be warned that optimal transport is in general not so smooth. There are known counterexamples which put limits on the regularity that one can expect from it, even for very nice cost functions. All these issues will be discussed again and again in the sequel.
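On finite spaces the Monge–Kantorovich problem is a linear program, so optimal couplings can be computed with any LP solver. The following sketch (my own illustration; `kantorovich` is a hypothetical helper and the data are arbitrary) uses scipy, and checks the total variation identity above for the cost c(x, y) = 1_{x≠y}.

```python
import numpy as np
from scipy.optimize import linprog

def kantorovich(mu, nu, cost):
    """Minimize sum(cost * pi) over couplings pi of (mu, nu)."""
    m, n = cost.shape
    A_eq = np.zeros((m + n, m * n))          # marginal constraints
    for i in range(m):
        A_eq[i, i * n:(i + 1) * n] = 1.0     # row sums = mu
    for j in range(n):
        A_eq[m + j, j::n] = 1.0              # column sums = nu
    res = linprog(cost.ravel(), A_eq=A_eq, b_eq=np.concatenate([mu, nu]),
                  bounds=(0, None), method="highs")
    return res.fun, res.x.reshape(m, n)

mu = np.array([0.2, 0.5, 0.3])
nu = np.array([0.6, 0.4, 0.0])
val, plan = kantorovich(mu, nu, 1.0 - np.eye(3))    # c(x, y) = 1_{x != y}
assert np.isclose(val, 0.5 * np.abs(mu - nu).sum()) # = ||mu - nu||_TV / 2
```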

The rest of this chapter is devoted to some basic technical tools.

Gluing

If Z is a function of Y and Y is a function of X, then of course Z is a function of X. Something of this still remains true in the setting of nondeterministic couplings, under quite general assumptions.

Gluing lemma. Let (X_i, μ_i), i = 1, 2, 3, be Polish probability spaces. If (X1, X2) is a coupling of (μ1, μ2) and (Y2, Y3) is a coupling of (μ2, μ3),


then one can construct a triple of random variables (Z1, Z2, Z3) such that (Z1, Z2) has the same law as (X1, X2) and (Z2, Z3) has the same law as (Y2, Y3).

It is simple to understand why this is called "gluing lemma": if μ12 stands for the law of (X1, X2) on X1 × X2 and μ23 stands for the law of (Y2, Y3) on X2 × X3, then to construct the joint law μ123 of (Z1, Z2, Z3) one just has to glue μ12 and μ23 along their common marginal μ2. Expressed in a slightly informal way: Disintegrate μ12 and μ23 as

\[
\mu_{12}(dx_1\, dx_2) = \mu_{12}(dx_1 | x_2)\, \mu_2(dx_2),
\qquad
\mu_{23}(dx_2\, dx_3) = \mu_{23}(dx_3 | x_2)\, \mu_2(dx_2),
\]

and then reconstruct μ123 as

\[
\mu_{123}(dx_1\, dx_2\, dx_3) = \mu_{12}(dx_1 | x_2)\, \mu_2(dx_2)\, \mu_{23}(dx_3 | x_2).
\]
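In the finite setting the gluing recipe takes two lines; here is a sketch of my own, assuming for simplicity that the common marginal μ2 has full support.

```python
import numpy as np

def glue(gamma12, gamma23):
    """Glue two couplings along their common marginal mu2:
    gamma123[i, j, k] = gamma12[i, j] * gamma23[j, k] / mu2[j]."""
    mu2 = gamma12.sum(axis=0)
    assert np.allclose(mu2, gamma23.sum(axis=1))   # common marginal
    return np.einsum('ij,jk->ijk', gamma12, gamma23 / mu2[:, None])

g12 = np.array([[0.3, 0.1], [0.2, 0.4]])    # coupling of (mu1, mu2)
g23 = np.array([[0.25, 0.25], [0.3, 0.2]])  # coupling of (mu2, mu3)
g123 = glue(g12, g23)
assert np.allclose(g123.sum(axis=2), g12)   # (Z1, Z2) ~ law of (X1, X2)
assert np.allclose(g123.sum(axis=0), g23)   # (Z2, Z3) ~ law of (Y2, Y3)
```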

Change of variables formula

When one writes the formula for change of variables, say in R^n or on a Riemannian manifold, a Jacobian term appears, and one has to be careful about two things: the change of variables should be injective (otherwise, reduce to a subset where it is injective, or take the multiplicity into account); and it should be somewhat smooth. It is classical to write these formulas when the change of variables is continuously differentiable, or at least Lipschitz:

Change of variables formula. Let M be an n-dimensional Riemannian manifold with a C¹ metric, let μ0, μ1 be two probability measures on M, and let T : M → M be a measurable function such that T#μ0 = μ1. Let ν be a reference measure, of the form ν(dx) = e^{−V(x)} vol(dx), where V is continuous and vol is the volume (or n-dimensional Hausdorff) measure. Further assume that

(i) μ0(dx) = ρ0(x) ν(dx) and μ1(dy) = ρ1(y) ν(dy);
(ii) T is injective;
(iii) T is locally Lipschitz.

Then, μ0-almost surely,


\[
\rho_0(x) = \rho_1(T(x))\, J_T(x), \tag{1.3}
\]

where J_T(x) is the Jacobian determinant of T at x, defined by

\[
J_T(x) := \lim_{\varepsilon \to 0} \frac{\nu\big[ T(B_\varepsilon(x)) \big]}{\nu\big[ B_\varepsilon(x) \big]}. \tag{1.4}
\]

The same holds true if T is only defined on the complement of a μ0-negligible set, and satisfies properties (ii) and (iii) on its domain of definition.

Remark 1.3. When ν is just the volume measure, J_T coincides with the usual Jacobian determinant, which in the case M = R^n is the absolute value of the determinant of the Jacobian matrix ∇T. Since V is continuous, it is almost immediate to deduce the statement with an arbitrary V from the statement with V = 0 (this amounts to multiplying ρ0(x) by e^{V(x)}, ρ1(y) by e^{V(y)}, J_T(x) by e^{V(x) − V(T(x))}).

Remark 1.4. There is a more general framework beyond differentiability, namely the property of approximate differentiability. A function T on an n-dimensional Riemannian manifold is said to be approximately differentiable at x if there exists a function T̃, differentiable at x, such that the set {T ≠ T̃} has zero density at x, i.e.

\[
\lim_{r \to 0} \frac{\mathrm{vol}\big[ \{ y \in B_r(x);\ T(y) \neq \tilde T(y) \} \big]}{\mathrm{vol}\big[ B_r(x) \big]} = 0.
\]

It turns out that, roughly speaking, an approximately differentiable map can be replaced, up to neglecting a small set, by a Lipschitz map (this is a kind of differentiable version of Lusin's theorem). So one can prove the Jacobian formula for an approximately differentiable map by approximating it with a sequence of Lipschitz maps. Approximate differentiability is obviously a local property; it holds true if the distributional derivative of T is a locally integrable function, or even a locally finite measure. So it is useful to know that the change of variables formula still holds true if Assumption (iii) above is replaced by

(iii') T is approximately differentiable.
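As a one-dimensional sanity check of (1.3)–(1.4) (my own illustration, not from the text): for a smooth injective map on R with ν the Lebesgue measure, the limit in (1.4) is just |T'(x)|, which one can verify numerically.

```python
import numpy as np

T = lambda x: x**3 + x               # smooth and injective on R
T_prime = lambda x: 3 * x**2 + 1

def jacobian_from_balls(x, eps=1e-6):
    """Approximate J_T(x) = lim vol[T(B_eps(x))] / vol[B_eps(x)],
    as in (1.4), with nu the Lebesgue measure on R."""
    image = T(np.array([x - eps, x + eps]))
    return (image.max() - image.min()) / (2 * eps)

for x in [-1.0, 0.0, 2.0]:
    assert np.isclose(jacobian_from_balls(x), T_prime(x), rtol=1e-4)
```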


Conservation of mass formula

The single most important theorem of change of variables arising in continuum physics might be the one resulting from the conservation of mass formula,

\[
\frac{\partial \rho}{\partial t} + \nabla \cdot (\rho\, \xi) = 0. \tag{1.5}
\]

Here ρ = ρ(t, x) stands for the density of a system of particles at time t and position x; ξ = ξ(t, x) for the velocity field at time t and position x; and ∇· stands for the divergence operator. Once again, the natural setting for this equation is a Riemannian manifold M.

It will be useful to work with particle densities μ_t(dx) (that are not necessarily absolutely continuous) and rewrite (1.5) as

\[
\frac{\partial \mu}{\partial t} + \nabla \cdot (\mu\, \xi) = 0,
\]

where the divergence operator is defined by duality against continuously differentiable functions φ with compact support:

\[
\int_M \varphi\, \nabla \cdot (\mu\, \xi) = - \int_M (\xi \cdot \nabla \varphi)\, d\mu.
\]

The formula of conservation of mass is an Eulerian description of the physical world, which means that the unknowns are fields. The next theorem links it with the Lagrangian description, in which everything is expressed in terms of particle trajectories, that are integral curves of the velocity field:

\[
\frac{d}{dt}\, T_t(x) = \xi\big( t, T_t(x) \big). \tag{1.6}
\]

If ξ is (locally) Lipschitz continuous, then the Cauchy–Lipschitz theorem guarantees the existence of a flow T_t locally defined on a maximal time interval, and itself locally Lipschitz in both arguments t and x. Then, for each t the map T_t is a local diffeomorphism onto its image. But the formula of conservation of mass also holds true without any regularity assumption on ξ; one should only keep in mind that if ξ is not Lipschitz, then a solution of (1.6) is not uniquely determined by its value at time 0, so x ↦ T_t(x) is not necessarily uniquely defined. Still it makes sense to consider random solutions of (1.6).

Mass conservation formula. Let M be a C¹ manifold, T ∈ (0, +∞] and let ξ = ξ(t, x) be a (measurable) velocity field on [0, T) × M. Let


(μ_t)_{0≤t<T} be [...]

Recall that a set P ⊂ P(X) is tight if for any ε > 0 there is a compact set K_ε such that μ[X \ K_ε] ≤ ε for all μ ∈ P.

Lemma 4.3 (Lower semicontinuity of the cost functional). Let X and Y be two Polish spaces, and c : X × Y → R ∪ {+∞} a lower


semicontinuous cost function. Let h : X × Y → R ∪ {−∞} be an upper semicontinuous function such that c ≥ h. Let (π_k)_{k∈N} be a sequence of probability measures on X × Y, converging weakly to some π ∈ P(X × Y), in such a way that h ∈ L¹(π_k), h ∈ L¹(π), and

\[
\int_{X\times Y} h\, d\pi_k \xrightarrow[k\to\infty]{} \int_{X\times Y} h\, d\pi.
\]

Then

\[
\int_{X\times Y} c\, d\pi \;\le\; \liminf_{k\to\infty} \int_{X\times Y} c\, d\pi_k.
\]

In particular, if c is nonnegative, then F : π ↦ ∫ c dπ is lower semicontinuous on P(X × Y), equipped with the topology of weak convergence.

Lemma 4.4 (Tightness of transference plans). Let X and Y be two Polish spaces. Let P ⊂ P(X) and Q ⊂ P(Y) be tight subsets of P(X) and P(Y) respectively. Then the set Π(P, Q) of all transference plans whose marginals lie in P and Q respectively, is itself tight in P(X × Y).

Proof of Lemma 4.3. Replacing c by c − h, we may assume that c is a nonnegative lower semicontinuous function. Then c can be written as the pointwise limit of a nondecreasing family (c_ℓ)_{ℓ∈N} of continuous real-valued functions. By monotone convergence,

\[
\int c\, d\pi = \lim_{\ell\to\infty} \int c_\ell\, d\pi = \lim_{\ell\to\infty} \lim_{k\to\infty} \int c_\ell\, d\pi_k \;\le\; \liminf_{k\to\infty} \int c\, d\pi_k. \qquad \square
\]

Proof of Lemma 4.4. Let μ ∈ P, ν ∈ Q, and π ∈ Π(μ, ν). By assumption, for any ε > 0 there is a compact set K_ε ⊂ X, independent of the choice of μ in P, such that μ[X \ K_ε] ≤ ε; and similarly there is a compact set L_ε ⊂ Y, independent of the choice of ν in Q, such that ν[Y \ L_ε] ≤ ε. Then for any coupling (X, Y) of (μ, ν),

\[
P\big[ (X, Y) \notin K_\varepsilon \times L_\varepsilon \big] \le P[X \notin K_\varepsilon] + P[Y \notin L_\varepsilon] \le 2\varepsilon.
\]

The desired result follows since this bound is independent of the coupling, and K_ε × L_ε is compact in X × Y. □

Proof of Theorem 4.1. Since X is Polish, {μ} is tight in P(X); similarly, {ν} is tight in P(Y). By Lemma 4.4, Π(μ, ν) is tight in P(X × Y), and


by Prokhorov's theorem this set has a compact closure. By passing to the limit in the equation for marginals, we see that Π(μ, ν) is closed, so it is in fact compact.

Then let (π_k)_{k∈N} be a sequence of probability measures on X × Y, such that ∫ c dπ_k converges to the infimum transport cost. Extracting a subsequence if necessary, we may assume that π_k converges to some π ∈ Π(μ, ν). The function h : (x, y) ↦ a(x) + b(y) lies in L¹(π_k) and in L¹(π), and c ≥ h by assumption; moreover, ∫ h dπ_k = ∫ h dπ = ∫ a dμ + ∫ b dν; so Lemma 4.3 implies

\[
\int c\, d\pi \;\le\; \liminf_{k\to\infty} \int c\, d\pi_k.
\]

Thus π is minimizing. □

Remark 4.5. This existence theorem does not imply that the optimal cost is finite. It might be that all transport plans lead to an infinite total cost, i.e. ∫ c dπ = +∞ for all π ∈ Π(μ, ν). A simple condition to rule out this annoying possibility is

\[
\int c(x, y)\, d\mu(x)\, d\nu(y) < +\infty,
\]

which guarantees that at least the independent coupling has finite total cost. In the sequel, I shall sometimes make the stronger assumption

\[
c(x, y) \le c_X(x) + c_Y(y), \qquad (c_X, c_Y) \in L^1(\mu) \times L^1(\nu),
\]

which implies that any coupling has finite total cost, and has other nice consequences (see e.g. Theorem 5.10).

Restriction property

The second good thing about optimal couplings is that any sub-coupling is still optimal. In words: If you have an optimal transport plan, then any induced sub-plan (transferring part of the initial mass to part of the final mass) has to be optimal too; otherwise you would be able to lower the cost of the sub-plan, and as a consequence the cost of the whole plan. This is the content of the next theorem.


Theorem 4.6 (Optimality is inherited by restriction). Let (X, μ) and (Y, ν) be two Polish spaces, a ∈ L¹(μ), b ∈ L¹(ν); let c : X × Y → R ∪ {+∞} be a measurable cost function such that c(x, y) ≥ a(x) + b(y) for all x, y; and let C(μ, ν) be the optimal transport cost from μ to ν. Assume that C(μ, ν) < +∞ and let π ∈ Π(μ, ν) be an optimal transport plan. Let π̃ be a nonnegative measure on X × Y, such that π̃ ≤ π and π̃[X × Y] > 0. Then the probability measure

\[
\pi' := \frac{\tilde\pi}{\tilde\pi[X \times Y]}
\]

is an optimal transference plan between its marginals μ' and ν'. Moreover, if π is the unique optimal transference plan between μ and ν, then also π' is the unique optimal transference plan between μ' and ν'.

Example 4.7. If (X, Y) is an optimal coupling of (μ, ν), and Z ⊂ X × Y is such that P[(X, Y) ∈ Z] > 0, then the pair (X, Y), conditioned to lie in Z, is an optimal coupling of (μ', ν'), where μ' is the law of X conditioned by the event "(X, Y) ∈ Z", and ν' is the law of Y conditioned by the same event.

Proof of Theorem 4.6. Assume that π' is not optimal; then there exists π'' such that

\[
(\mathrm{proj}_X)_\# \pi'' = (\mathrm{proj}_X)_\# \pi' = \mu', \qquad (\mathrm{proj}_Y)_\# \pi'' = (\mathrm{proj}_Y)_\# \pi' = \nu', \tag{4.1}
\]

yet

\[
\int c(x, y)\, d\pi''(x, y) < \int c(x, y)\, d\pi'(x, y). \tag{4.2}
\]

Then consider

\[
\hat\pi := (\pi - \tilde\pi) + Z\, \pi'', \tag{4.3}
\]

where Z = π̃[X × Y] > 0. Clearly, π̂ is a nonnegative measure. On the other hand, it can be written as

\[
\hat\pi = \pi + Z\, (\pi'' - \pi');
\]

then (4.1) shows that π̂ has the same marginals as π, while (4.2) implies that it has a lower transport cost than π. (Here I use the fact that the total cost is finite.) This contradicts the optimality of π. The conclusion is that π' is in fact optimal.


It remains to prove the last statement of Theorem 4.6. Assume that π is the unique optimal transference plan between μ and ν; and let π'' be any optimal transference plan between μ' and ν'. Define again π̂ by (4.3). Then π̂ has the same cost as π, so π̂ = π, which implies that π̃ = Z π'', i.e. π'' = π'. □
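Theorem 4.6 is easy to test numerically; the sketch below (my own illustration with arbitrary data) reuses the hypothetical `kantorovich` LP helper sketched in Chapter 1.

```python
import numpy as np
# reuses the kantorovich(mu, nu, cost) helper sketched in Chapter 1

rng = np.random.default_rng(1)
mu = np.full(4, 0.25)
nu = np.full(4, 0.25)
cost = rng.random((4, 4))
_, plan = kantorovich(mu, nu, cost)

# Restrict the optimal plan to Z = {(x, y) : x < 2}, then renormalize:
sub = plan.copy()
sub[2:, :] = 0.0
sub /= sub.sum()                        # pi' = pi~ / pi~[X x Y]
mu_sub, nu_sub = sub.sum(axis=1), sub.sum(axis=0)
val_sub, _ = kantorovich(mu_sub, nu_sub, cost)
assert np.isclose((cost * sub).sum(), val_sub)   # the sub-plan is optimal
```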

Convexity properties

The following estimates are of constant use:

Theorem 4.8 (Convexity of the optimal cost). Let X and Y be two Polish spaces, let c : X × Y → R ∪ {+∞} be a lower semicontinuous function, and let C be the associated optimal transport cost functional on P(X) × P(Y). Let (Θ, λ) be a probability space, and let θ ↦ μ_θ, θ ↦ ν_θ be two measurable functions defined on Θ, with values in P(X) and P(Y) respectively. Assume that c(x, y) ≤ a(x) + b(y), where a ∈ L¹(dμ_θ dλ(θ)), b ∈ L¹(dν_θ dλ(θ)). Then

\[
C\!\left( \int_\Theta \mu_\theta\, \lambda(d\theta),\ \int_\Theta \nu_\theta\, \lambda(d\theta) \right) \;\le\; \int_\Theta C(\mu_\theta, \nu_\theta)\, \lambda(d\theta).
\]

Proof of Theorem 4.8. First notice that a ∈ L¹(μ_θ), b ∈ L¹(ν_θ) for almost all values of θ. For each such θ, Theorem 4.1 guarantees the existence of an optimal transport plan π_θ ∈ Π(μ_θ, ν_θ), for the cost c. Then π := ∫ π_θ λ(dθ) has marginals μ := ∫ μ_θ λ(dθ) and ν := ∫ ν_θ λ(dθ). Admitting temporarily Corollary 5.22, we may assume that π_θ is a measurable function of θ. So

\[
C(\mu, \nu) \le \int_{X\times Y} c(x, y)\, \pi(dx\, dy)
= \int_{X\times Y} c(x, y) \left( \int_\Theta \pi_\theta\, \lambda(d\theta) \right)(dx\, dy)
= \int_\Theta \left( \int_{X\times Y} c(x, y)\, \pi_\theta(dx\, dy) \right) \lambda(d\theta)
= \int_\Theta C(\mu_\theta, \nu_\theta)\, \lambda(d\theta),
\]

and the conclusion follows. □
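Theorem 4.8 can likewise be checked numerically for a two-point mixture (λ a combination of two Dirac masses), again with the hypothetical `kantorovich` helper from Chapter 1; the data below are arbitrary.

```python
import numpy as np
# reuses the kantorovich(mu, nu, cost) helper sketched in Chapter 1

rng = np.random.default_rng(2)
cost = rng.random((3, 3))
mu1, nu1 = np.array([0.5, 0.3, 0.2]), np.array([0.1, 0.6, 0.3])
mu2, nu2 = np.array([0.2, 0.2, 0.6]), np.array([0.4, 0.4, 0.2])

c1 = kantorovich(mu1, nu1, cost)[0]
c2 = kantorovich(mu2, nu2, cost)[0]
for t in np.linspace(0.0, 1.0, 5):
    mixed = kantorovich(t*mu1 + (1-t)*mu2, t*nu1 + (1-t)*nu2, cost)[0]
    assert mixed <= t*c1 + (1-t)*c2 + 1e-9   # C(mixture) <= mixture of C
```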


Description of optimal plans

Obtaining more precise information about minimizers will be much more of a sport. Here is a short list of questions that one might ask:

- Is the optimal coupling unique? smooth in some sense?
- Is there a Monge coupling, i.e. a deterministic optimal coupling?
- Is there a geometrical way to characterize optimal couplings?
- Can one check in practice that a certain coupling is optimal?

About the second question: Why don't we try to apply the same reasoning as in the proof of Theorem 4.1? The problem is that the set of deterministic couplings is in general not compact; in fact, this set is often dense in the larger space of all couplings! So we may expect that the value of the infimum in the Monge problem coincides with the value of the minimum in the Kantorovich problem; but there is no a priori reason to expect the existence of a Monge minimizer.

Example 4.9. Let X = Y = R², let c(x, y) = |x − y|², let μ be H¹ restricted to {0} × [−1, 1], and let ν be (1/2) H¹ restricted to {−1, 1} × [−1, 1], where H¹ is the one-dimensional Hausdorff measure. Then there is a unique optimal transport, which for each point (0, a) sends one half of the mass at (0, a) to (−1, a), and the other half to (1, a). This is not a Monge transport, but it is easy to approximate it by (nonoptimal) deterministic transports (see Figure 4.1).

Fig. 4.1. The optimal plan, represented in the left image, consists in splitting the mass in the center into two halves and transporting mass horizontally. On the right the filled regions represent the lines of transport for a deterministic (without splitting of mass) approximation of the optimum.


Bibliographical notes

Theorem 4.1 has probably been known from time immemorial; it is usually stated for nonnegative cost functions. Prokhorov's theorem is a most classical result that can be found e.g. in [120, Theorems 6.1 and 6.2], or in my own course on integration [819, Section VII-5].

Theorems of the form "infimum cost in the Monge problem = minimum cost in the Kantorovich problem" have been established by Gangbo [396, Appendix A], Ambrosio [20, Theorem 2.1], and Pratelli [687, Theorem B]. The most general results to this date are those which appear in Pratelli's work: Equality holds true if the source space (X, μ) is Polish without atoms, and the cost is continuous X × Y → R ∪ {+∞}, with the value +∞ allowed. (In [687] the cost c is bounded below, but it is sufficient that c(x, y) ≥ a(x) + b(y), where a ∈ L¹(μ) and b ∈ L¹(ν) are continuous.)

5 Cyclical monotonicity and Kantorovich duality

To go on, we should become acquainted with two basic concepts in the theory of optimal transport. The first one is a geometric property called cyclical monotonicity; the second one is the Kantorovich dual problem, which is another face of the original Monge–Kantorovich problem. The main result in this chapter is Theorem 5.10.

Definitions and heuristics

I shall start by explaining the concepts of cyclical monotonicity and Kantorovich duality in an informal way, sticking to the bakery analogy of Chapter 3. Assume you have been hired by a large consortium of bakeries and cafés, to be in charge of the distribution of bread from production units (bakeries) to consumption units (cafés). The locations of the bakeries and cafés, their respective production and consumption rates, are all determined in advance. You have written a transference plan, which says, for each bakery (located at) x_i and each café y_j, how much bread should go each morning from x_i to y_j.

As there are complaints that the transport cost associated with your plan is actually too high, you try to reduce it. For that purpose you choose a bakery x1 that sends part of its production to a distant café y1, and decide that one basket of bread will be rerouted to another café y2, that is closer to x1; thus you will gain c(x1, y1) − c(x1, y2). Of course, now this results in an excess of bread in y2, so one basket of bread arriving to y2 (say, from bakery x2) should in turn be rerouted to yet another café, say y3. The process goes on and on until finally you


redirect a basket from some bakery x_N to y1, at which point you can stop since you have a new admissible transference plan (see Figure 5.1).

Fig. 5.1. An attempt to improve the cost by a cycle; solid arrows indicate the mass transport in the original plan, dashed arrows the paths along which a bit of mass is rerouted.

The new plan is (strictly) better than the older one if and only if

\[
c(x_1, y_2) + c(x_2, y_3) + \dots + c(x_N, y_1) < c(x_1, y_1) + c(x_2, y_2) + \dots + c(x_N, y_N).
\]

Thus, if you can find such cycles (x1, y1), ..., (xN, yN) in your transference plan, certainly the latter is not optimal. Conversely, if you do not find them, then your plan cannot be improved (at least by the procedure described above) and it is likely to be optimal. This motivates the following definitions.

Definition 5.1 (Cyclical monotonicity). Let X, Y be arbitrary sets, and c : X × Y → (−∞, +∞] be a function. A subset Γ ⊂ X × Y is said to be c-cyclically monotone if, for any N ∈ N, and any family (x1, y1), ..., (xN, yN) of points in Γ, holds the inequality

\[
\sum_{i=1}^N c(x_i, y_i) \;\le\; \sum_{i=1}^N c(x_i, y_{i+1}) \tag{5.1}
\]

(with the convention y_{N+1} = y_1). A transference plan is said to be c-cyclically monotone if it is concentrated on a c-cyclically monotone set.
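On a finite support, (5.1) can be checked by brute force over all families of bounded size; here is a sketch of my own, with arbitrary names and a quadratic cost.

```python
from itertools import combinations, permutations

def is_c_cyclically_monotone(support, c, max_len=4):
    """Check inequality (5.1) for all families of up to max_len points
    drawn from the finite set `support` of pairs (x, y)."""
    for N in range(2, max_len + 1):
        for pts in combinations(support, N):
            for cycle in permutations(pts):
                lhs = sum(c(x, y) for x, y in cycle)
                rhs = sum(c(cycle[i][0], cycle[(i + 1) % N][1])
                          for i in range(N))
                if lhs > rhs + 1e-12:
                    return False
    return True

c = lambda x, y: (x - y) ** 2
print(is_c_cyclically_monotone([(0, 0), (1, 1), (2, 2)], c))  # True
print(is_c_cyclically_monotone([(0, 1), (1, 0)], c))          # False
```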


Informally, a c-cyclically monotone plan is a plan that cannot be improved: it is impossible to perturb it (in the sense considered before, by rerouting mass along some cycle) and get something more economical. One can think of it as a kind of local minimizer. It is intuitively obvious that an optimal plan should be c-cyclically monotone; the converse property is much less obvious (maybe it is possible to get something better by radically changing the plan), but we shall soon see that it holds true under mild conditions.

The next key concept is the dual Kantorovich problem. While the central notion in the original Monge–Kantorovich problem is cost, in the dual problem it is price. Imagine that a company offers to take care of all your transportation problem, buying bread at the bakeries and selling them to the cafés; what happens in between is not your problem (and maybe they have tricks to do the transport at a lower price than you). Let ψ(x) be the price at which a basket of bread is bought at bakery x, and φ(y) the price at which it is sold at café y. On the whole, the price which the consortium bakery + café pays for the transport is φ(y) − ψ(x), instead of the original cost c(x, y). This of course is for each unit of bread: if there is a mass μ(dx) at x, then the total price of the bread shipment from there will be ψ(x) μ(dx).

So as to be competitive, the company needs to set up prices in such a way that

\[
\forall (x, y), \qquad \varphi(y) - \psi(x) \le c(x, y). \tag{5.2}
\]

When you were handling the transportation yourself, your problem was to minimize the cost. Now that the company takes up the transportation charge, their problem is to maximize the profits. This naturally leads to the dual Kantorovich problem:

\[
\sup \left\{ \int_Y \varphi(y)\, d\nu(y) - \int_X \psi(x)\, d\mu(x);\quad \varphi(y) - \psi(x) \le c(x, y) \right\}. \tag{5.3}
\]

From a mathematical point of view, it will be imposed that the functions and appearing in (5.3) be integrable: L1 (X , ); L1 (Y, ). With the intervention of the company, the shipment of each unit of bread does not cost more than it used to when you were handling it yourself; so it is obvious that the supremum in (5.3) is no more than the optimal transport cost:


\[
\sup_{\varphi - \psi \le c} \left[ \int_{\mathcal{Y}} \varphi(y)\, d\nu(y) - \int_{\mathcal{X}} \psi(x)\, d\mu(x) \right] \ \le\ \inf_{\pi \in \Pi(\mu, \nu)} \int_{\mathcal{X} \times \mathcal{Y}} c(x, y)\, d\pi(x, y). \tag{5.4}
\]

Clearly, if we can find a pair (ψ, φ) and a transference plan π for which there is equality, then (ψ, φ) is optimal in the left-hand side and π is also optimal in the right-hand side.

A pair of price functions (ψ, φ) will informally be said to be competitive if it satisfies (5.2). For a given y, it is of course in the interest of the company to set the highest possible competitive price φ(y), i.e. the highest lower bound for (i.e. the infimum of) ψ(x) + c(x, y), among all bakeries x. Similarly, for a given x, the price ψ(x) should be the supremum of all φ(y) − c(x, y). Thus it makes sense to describe a pair of prices (ψ, φ) as tight if

\[
\varphi(y) = \inf_{x} \bigl[ \psi(x) + c(x, y) \bigr], \qquad \psi(x) = \sup_{y} \bigl[ \varphi(y) - c(x, y) \bigr]. \tag{5.5}
\]

In words, prices are tight if it is impossible for the company to raise the selling price, or lower the buying price, without losing its competitivity.

Consider an arbitrary pair of competitive prices (ψ, φ). We can always improve φ by replacing it by φ_1(y) = inf_x [ψ(x) + c(x, y)]. Then we can also improve ψ by replacing it by ψ_1(x) = sup_y [φ_1(y) − c(x, y)]; then replacing φ_1 by φ_2(y) = inf_x [ψ_1(x) + c(x, y)], and so on. It turns out that this process is stationary: as an easy exercise, the reader can check that φ_2 = φ_1, ψ_2 = ψ_1, which means that after just one iteration one obtains a pair of tight prices. Thus, when we consider the dual Kantorovich problem (5.3), it makes sense to restrict our attention to tight pairs, in the sense of equation (5.5). From that equation we can reconstruct φ in terms of ψ, so we can just take ψ as the only unknown in our problem.

That unknown cannot be just any function: if you take a general function ψ, and compute φ by the first formula in (5.5), there is no chance that the second formula will be satisfied. In fact this second formula will hold true if and only if ψ is c-convex, in the sense of the next definition (illustrated by Figure 5.2).

Definition 5.2 (c-convexity). Let X, Y be sets, and c : X × Y → (−∞, +∞]. A function ψ : X → ℝ ∪ {+∞} is said to be c-convex if it is not identically +∞, and there exists ζ : Y → ℝ ∪ {±∞} such that


\[
\forall x \in \mathcal{X}, \qquad \psi(x) = \sup_{y \in \mathcal{Y}} \bigl[ \zeta(y) - c(x, y) \bigr]. \tag{5.6}
\]

Then its c-transform is the function ψ^c defined by

\[
\forall y \in \mathcal{Y}, \qquad \psi^c(y) = \inf_{x \in \mathcal{X}} \bigl[ \psi(x) + c(x, y) \bigr], \tag{5.7}
\]

and its c-subdifferential is the c-cyclically monotone set defined by

\[
\partial_c \psi := \bigl\{ (x, y) \in \mathcal{X} \times \mathcal{Y}; \ \ \psi^c(y) - \psi(x) = c(x, y) \bigr\}.
\]

The functions ψ and ψ^c are said to be c-conjugate.

Moreover, the c-subdifferential of ψ at point x is

\[
\partial_c \psi(x) = \bigl\{ y \in \mathcal{Y}; \ \ (x, y) \in \partial_c \psi \bigr\},
\]

or equivalently

\[
\forall z \in \mathcal{X}, \qquad \psi(x) + c(x, y) \ \le\ \psi(z) + c(z, y). \tag{5.8}
\]


Fig. 5.2. A c-convex function is a function whose graph you can entirely caress from below with a tool whose shape is the negative of the cost function (this shape might vary with the point y). In the picture y_i ∈ ∂_c ψ(x_i).
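The transforms (5.6)–(5.7), as well as the tightening procedure described before Definition 5.2, are easy to experiment with when X and Y are finite. The sketch below is mine (the names and data are invented for illustration); it verifies numerically the "easy exercise" stated above, namely that one round of transforms already yields a tight pair.

```python
# A minimal sketch, assuming finite spaces: the transforms of (5.6)-(5.7)
# reduce to componentwise min/max over a cost matrix C[i, j] = c(x_i, y_j).
import numpy as np

def phi_from_psi(psi, C):
    return (psi[:, None] + C).min(axis=0)   # phi(y) = inf_x [psi(x) + c(x, y)]

def psi_from_phi(phi, C):
    return (phi[None, :] - C).max(axis=1)   # psi(x) = sup_y [phi(y) - c(x, y)]

rng = np.random.default_rng(0)
C = rng.random((5, 7))                      # arbitrary finite cost matrix
psi0 = rng.random(5)                        # arbitrary starting prices

phi1 = phi_from_psi(psi0, C)                # first improvement of phi
psi1 = psi_from_phi(phi1, C)                # first improvement of psi
phi2 = phi_from_psi(psi1, C)
psi2 = psi_from_phi(phi2, C)

assert np.all(psi1 <= psi0 + 1e-12)         # the buying price only goes down
assert np.allclose(phi2, phi1) and np.allclose(psi2, psi1)  # tight in one round
```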

Particular Case 5.3. If c(x, y) = −x · y on ℝⁿ × ℝⁿ, then the c-transform coincides with the usual Legendre transform, and c-convexity is just plain convexity on ℝⁿ. (Actually, this is a slight oversimplification: c-convexity is equivalent to plain convexity plus lower semicontinuity! A convex function is automatically continuous on the largest open set Ω where it is finite, but lower semicontinuity might fail at the boundary of Ω.) One can think of the cost function c(x, y) = −x · y as basically the same as c(x, y) = |x − y|²/2, since the interaction between the positions x and y is the same for both costs.

Particular Case 5.4. If c = d is a distance on some metric space X, then a c-convex function is just a 1-Lipschitz function, and it is its own c-transform. Indeed, if ψ is c-convex it is obviously 1-Lipschitz; conversely, if ψ is 1-Lipschitz, then ψ(x) ≤ ψ(y) + d(x, y), so ψ(x) = inf_y [ψ(y) + d(x, y)] = ψ^c(x). As an even more particular case, if c(x, y) = 1_{x≠y}, then ψ is c-convex if and only if sup ψ − inf ψ ≤ 1, and then again ψ^c = ψ. (More generally, if c satisfies the triangle inequality c(x, z) ≤ c(x, y) + c(y, z), then ψ is c-convex if and only if ψ(y) − ψ(x) ≤ c(x, y) for all x, y; and then ψ = ψ^c.)

Remark 5.5. There is no measure theory in Definition 5.2, so no assumption of measurability is made, and the supremum in (5.6) is a true supremum, not just an essential supremum; the same for the infimum in (5.7). If c is continuous, then a c-convex function is automatically lower semicontinuous, and its subdifferential is closed; but if c is not continuous the measurability of ψ and ψ^c is not a priori guaranteed.

Remark 5.6. I excluded the case when ψ ≡ +∞ so as to avoid trivial situations; what I called a c-convex function might more properly (!) be called a proper c-convex function. This automatically implies that ζ in (5.6) does not take the value +∞ at all if c is real-valued. If c does achieve infinite values, then the correct convention in (5.6) is (+∞) − (+∞) = −∞.

If ψ is a function on X, then its c-transform is a function on Y. Conversely, given a function on Y, one may define its c-transform as a function on X. It will be convenient in the sequel to define the latter concept by an infimum rather than a supremum. This convention has the drawback of breaking the symmetry between the roles of X and Y, but has other advantages that will be apparent later on.

Definition 5.7 (c-concavity). With the same notation as in Definition 5.2, a function φ : Y → ℝ ∪ {−∞} is said to be c-concave if it is not identically −∞, and there exists ψ : X → ℝ ∪ {±∞} such that φ = ψ^c. Then its c-transform is the function φ^c defined by

\[
\forall x \in \mathcal{X}, \qquad \varphi^c(x) = \sup_{y \in \mathcal{Y}} \bigl[ \varphi(y) - c(x, y) \bigr];
\]


and its c-superdifferential is the c-cyclically monotone set defined by

\[
\partial^c \varphi := \bigl\{ (x, y) \in \mathcal{X} \times \mathcal{Y}; \ \ \varphi(y) - \varphi^c(x) = c(x, y) \bigr\}.
\]

In spite of its short and elementary proof, the next crucial result is one of the main justifications of the concept of c-convexity.

Proposition 5.8 (Alternative characterization of c-convexity). For any function ψ : X → ℝ ∪ {+∞}, let its c-convexification be defined by ψ^{cc} = (ψ^c)^c. More explicitly,

\[
\psi^{cc}(x) = \sup_{y \in \mathcal{Y}} \ \inf_{\widetilde{x} \in \mathcal{X}} \ \bigl[ \psi(\widetilde{x}) + c(\widetilde{x}, y) - c(x, y) \bigr].
\]

Then ψ is c-convex if and only if ψ^{cc} = ψ.

Proof of Proposition 5.8. As a general fact, for any function φ : Y → ℝ ∪ {−∞} (not necessarily c-convex), one has the identity φ^{ccc} = φ^c. Indeed,

\[
\varphi^{ccc}(x) = \sup_{y} \ \inf_{\widetilde{x}} \ \sup_{\widetilde{y}} \ \bigl[ \varphi(\widetilde{y}) - c(\widetilde{x}, \widetilde{y}) + c(\widetilde{x}, y) - c(x, y) \bigr];
\]

then the choice x̃ = x shows that φ^{ccc}(x) ≤ φ^c(x); while the choice ỹ = y shows that φ^{ccc}(x) ≥ φ^c(x).

If ψ is c-convex, then there is ζ such that ψ = ζ^c, so ψ^{cc} = ζ^{ccc} = ζ^c = ψ. The converse is obvious: If ψ^{cc} = ψ, then ψ is c-convex, as the c-transform of ψ^c.

Remark 5.9. Proposition 5.8 is a generalized version of the Legendre duality in convex analysis (to recover the usual Legendre duality, take c(x, y) = −x · y in ℝⁿ × ℝⁿ).
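In the setting of Remark 5.9 the c-convexification can be visualized numerically. The sketch below is my own, on an invented grid: it computes ψ^{cc} for c(x, y) = −x · y and checks the two sides of Proposition 5.8, namely that ψ^{cc} never exceeds ψ, and that a function which is already c-convex is left unchanged.

```python
# Discrete Legendre-type transforms on a 1-d grid, for c(x, y) = -x * y.
# With slopes y confined to the grid, psi^{cc} is a convex minorant of psi
# (the classical convex envelope once the range of slopes is large enough).
import numpy as np

x = np.linspace(-1.0, 1.0, 201)             # grid playing the role of X = Y
C = -np.outer(x, x)                         # C[i, j] = c(x_i, y_j) = -x_i * y_j

def transform_X_to_Y(psi):                  # (5.7): inf_x [psi(x) + c(x, y)]
    return (psi[:, None] + C).min(axis=0)

def transform_Y_to_X(phi):                  # sup_y [phi(y) - c(x, y)]
    return (phi[None, :] - C).max(axis=1)

psi = np.cos(4 * x)                         # a non-convex function
psi_cc = transform_Y_to_X(transform_X_to_Y(psi))
assert np.all(psi_cc <= psi + 1e-9)         # convexification only goes down

convex = np.abs(x)                          # c-convex here (slopes +-1 in grid)
convex_cc = transform_Y_to_X(transform_X_to_Y(convex))
assert np.allclose(convex_cc, convex, atol=1e-9)   # psi^{cc} = psi, as in 5.8
```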

Kantorovich duality

We are now ready to state and prove the main result in this chapter.


Theorem 5.10 (Kantorovich duality). Let (X, μ) and (Y, ν) be two Polish probability spaces and let c : X × Y → ℝ ∪ {+∞} be a lower semicontinuous cost function, such that

\[
\forall (x, y) \in \mathcal{X} \times \mathcal{Y}, \qquad c(x, y) \ \ge\ a(x) + b(y)
\]

for some real-valued upper semicontinuous functions a ∈ L¹(μ) and b ∈ L¹(ν). Then:

(i) There is duality:

\[
\begin{aligned}
\min_{\pi \in \Pi(\mu, \nu)} \int_{\mathcal{X} \times \mathcal{Y}} c(x, y)\, d\pi(x, y)
&= \sup_{(\psi, \varphi) \in C_b(\mathcal{X}) \times C_b(\mathcal{Y}); \ \varphi - \psi \le c} \left[ \int_{\mathcal{Y}} \varphi(y)\, d\nu(y) - \int_{\mathcal{X}} \psi(x)\, d\mu(x) \right] \\
&= \sup_{(\psi, \varphi) \in L^1(\mu) \times L^1(\nu); \ \varphi - \psi \le c} \left[ \int_{\mathcal{Y}} \varphi(y)\, d\nu(y) - \int_{\mathcal{X}} \psi(x)\, d\mu(x) \right] \\
&= \sup_{\psi \in L^1(\mu)} \left[ \int_{\mathcal{Y}} \psi^c(y)\, d\nu(y) - \int_{\mathcal{X}} \psi(x)\, d\mu(x) \right] \\
&= \sup_{\varphi \in L^1(\nu)} \left[ \int_{\mathcal{Y}} \varphi(y)\, d\nu(y) - \int_{\mathcal{X}} \varphi^c(x)\, d\mu(x) \right],
\end{aligned}
\]

and in the above suprema one might as well impose that ψ be c-convex and φ c-concave.

(ii) If c is real-valued and the optimal cost C(μ, ν) = inf_{π∈Π(μ,ν)} ∫ c dπ is finite, then there is a measurable c-cyclically monotone set Γ ⊂ X × Y (closed if a, b, c are continuous) such that for any π ∈ Π(μ, ν) the following five statements are equivalent:

(a) π is optimal;
(b) π is c-cyclically monotone;
(c) There is a c-convex ψ such that, π-almost surely, ψ^c(y) − ψ(x) = c(x, y);
(d) There exist ψ : X → ℝ ∪ {+∞} and φ : Y → ℝ ∪ {−∞}, such that φ(y) − ψ(x) ≤ c(x, y) for all (x, y), with equality π-almost surely;
(e) π is concentrated on Γ.

(iii) If c is real-valued, C(μ, ν) < +∞, and one has the pointwise upper bound

\[
c(x, y) \ \le\ c_{\mathcal{X}}(x) + c_{\mathcal{Y}}(y), \qquad (c_{\mathcal{X}}, c_{\mathcal{Y}}) \in L^1(\mu) \times L^1(\nu), \tag{5.9}
\]


then both the primal and dual Kantorovich problems have solutions, so

\[
\begin{aligned}
\min_{\pi \in \Pi(\mu, \nu)} \int_{\mathcal{X} \times \mathcal{Y}} c(x, y)\, d\pi(x, y)
&= \max_{(\psi, \varphi) \in L^1(\mu) \times L^1(\nu); \ \varphi - \psi \le c} \left[ \int_{\mathcal{Y}} \varphi(y)\, d\nu(y) - \int_{\mathcal{X}} \psi(x)\, d\mu(x) \right] \\
&= \max_{\psi \in L^1(\mu)} \left[ \int_{\mathcal{Y}} \psi^c(y)\, d\nu(y) - \int_{\mathcal{X}} \psi(x)\, d\mu(x) \right],
\end{aligned}
\]

and in the latter expressions one might as well impose that ψ be c-convex and φ = ψ^c. If in addition a, b and c are continuous, then there is a closed c-cyclically monotone set Γ ⊂ X × Y, such that for any π ∈ Π(μ, ν) and for any c-convex ψ ∈ L¹(μ):

π is optimal in the Kantorovich problem if and only if π[Γ] = 1;

ψ is optimal in the dual Kantorovich problem if and only if Γ ⊂ ∂_c ψ.

Remark 5.11. When the cost c is continuous, then the support of π is c-cyclically monotone; but for a discontinuous cost function it might a priori be that π is concentrated on a (nonclosed) c-cyclically monotone set, while the support of π is not c-cyclically monotone. So, in the sequel, the words "concentrated on" are not exchangeable with "supported in". There is another subtlety for discontinuous cost functions: It is not clear that the functions ψ and ψ^c appearing in statements (ii) and (iii) are Borel measurable; it will only be proven that they coincide with measurable functions outside of a negligible set.

Remark 5.12. Note the difference between statements (b) and (e): The set Γ appearing in (ii)(e) is the same for all optimal π's; it depends only on μ and ν. This set is in general not unique. If c is continuous and Γ is imposed to be closed, then one can define a smallest Γ, which is the closure of the union of all the supports of the optimal π's. There is also a largest Γ, which is the intersection of all the c-subdifferentials ∂_c ψ, where ψ is such that there exists an optimal π supported in ∂_c ψ. (Since the cost function is assumed to be continuous, the c-subdifferentials are closed, and so is their intersection.)

Remark 5.13. Here is a useful practical consequence of Theorem 5.10: Given a transference plan π, if you can cook up a pair of competitive prices (ψ, φ) such that φ(y) − ψ(x) = c(x, y) throughout the support of π, then you know that π is optimal. This theorem also shows that


optimal transference plans satisfy very special conditions: if you fix an optimal pair (ψ, φ), then mass arriving at y can come from x only if c(x, y) = φ(y) − ψ(x) = ψ^c(y) − ψ(x), which means that

\[
x \ \in\ \operatorname{Arg\,min}_{x' \in \mathcal{X}} \ \bigl[ \psi(x') + c(x', y) \bigr].
\]

In terms of my bakery analogy this can be restated as follows: A café accepts bread from a bakery only if the combined cost of buying the bread there and transporting it here is lowest among all possible bakeries.

Similarly, given a pair of competitive prices (ψ, φ), if you can cook up a transference plan π such that φ(y) − ψ(x) = c(x, y) throughout the support of π, then you know that (ψ, φ) is a solution to the dual Kantorovich problem.

Remark 5.14. The assumption c ≤ c_X + c_Y in (iii) can be weakened into

\[
\int_{\mathcal{X} \times \mathcal{Y}} c(x, y)\, d\mu(x)\, d\nu(y) \ <\ +\infty,
\]

or even

\[
\mu \left[ \left\{ x; \ \int_{\mathcal{Y}} c(x, y)\, d\nu(y) < +\infty \right\} \right] > 0; \qquad
\nu \left[ \left\{ y; \ \int_{\mathcal{X}} c(x, y)\, d\mu(x) < +\infty \right\} \right] > 0. \tag{5.10}
\]
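On finite spaces, Theorem 5.10 (i) reduces to the duality theorem of linear programming and can be checked directly. The following sketch is mine (random data, scipy assumed available); it solves the primal and dual Kantorovich problems as linear programs and verifies that the two optimal values coincide.

```python
# Finite Kantorovich duality as a pair of linear programs (illustrative sketch).
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
m, n = 4, 5
mu = rng.random(m); mu /= mu.sum()            # marginal on the x-side
nu = rng.random(n); nu /= nu.sum()            # marginal on the y-side
C = rng.random((m, n))                        # C[i, j] = c(x_i, y_j)

# Primal: minimize <C, pi> over pi >= 0 with marginals mu and nu.
A_eq = np.zeros((m + n, m * n))
for i in range(m):
    A_eq[i, i * n:(i + 1) * n] = 1            # sum_j pi_ij = mu_i
for j in range(n):
    A_eq[m + j, j::n] = 1                     # sum_i pi_ij = nu_j
primal = linprog(C.ravel(), A_eq=A_eq, b_eq=np.concatenate([mu, nu]),
                 bounds=(0, None), method="highs")

# Dual: maximize <nu, phi> - <mu, psi> subject to phi_j - psi_i <= C_ij.
A_ub = np.zeros((m * n, m + n))
for i in range(m):
    for j in range(n):
        A_ub[i * n + j, i] = -1.0             # coefficient of psi_i
        A_ub[i * n + j, m + j] = 1.0          # coefficient of phi_j
dual = linprog(np.concatenate([mu, -nu]),     # linprog minimizes <mu,psi>-<nu,phi>
               A_ub=A_ub, b_ub=C.ravel(), bounds=(None, None), method="highs")

assert abs(primal.fun - (-dual.fun)) < 1e-8   # equal optimal values, as in (i)
```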

Remark 5.15. If the variables x and y are swapped, then (μ, ν) should be replaced by (ν, μ) and (ψ, φ) by (−φ, −ψ).

Particular Case 5.16. Particular Case 5.4 leads to the following variant of Theorem 5.10. When c(x, y) = d(x, y) is a distance on a Polish space X, and μ, ν belong to P₁(X), then

\[
\inf \ \mathbb{E}\, d(X, Y) \ =\ \sup_{\|\psi\|_{\mathrm{Lip}} \le 1} \mathbb{E} \bigl[ \psi(X) - \psi(Y) \bigr] \ =\ \sup_{\|\psi\|_{\mathrm{Lip}} \le 1} \left[ \int_{\mathcal{X}} \psi \, d\mu - \int_{\mathcal{X}} \psi \, d\nu \right], \tag{5.11}
\]

where the infimum on the left is over all couplings (X, Y) of (μ, ν), and the supremum on the right is over all 1-Lipschitz functions ψ. This is the Kantorovich–Rubinstein formula; it holds true as soon as the infimum on the left-hand side is finite, and it is very useful.

Particular Case 5.17. Now consider c(x, y) = −x · y in ℝⁿ × ℝⁿ. This cost is not nonnegative, but we have the lower bound c(x, y) ≥ −(|x|² + |y|²)/2. So if x ↦ |x|² ∈ L¹(μ) and y ↦ |y|² ∈ L¹(ν), then one can invoke Particular Case 5.3 to deduce from Theorem 5.10 that

\[
\sup \ \mathbb{E}\, (X \cdot Y) \ =\ \inf \ \mathbb{E} \bigl[ \varphi(X) + \varphi^*(Y) \bigr] \ =\ \inf \left[ \int \varphi \, d\mu + \int \varphi^* \, d\nu \right], \tag{5.12}
\]

where the supremum on the left is over all couplings (X, Y) of (μ, ν), the infimum on the right is over all (lower semicontinuous) convex functions φ on ℝⁿ, and φ* stands for the usual Legendre transform of φ. In formula (5.12), the signs have been changed with respect to the statement of Theorem 5.10, so the problem is to maximize the correlation of the random variables X and Y.

Before proving Theorem 5.10, I shall first informally explain the construction. At first reading, one might be content with these informal explanations and skip the rigorous proof.

Idea of proof of Theorem 5.10. Take an optimal π (which exists from Theorem 4.1), and let (ψ, φ) be two competitive prices. Of course, as in (5.4),

\[
\int c(x, y)\, d\pi(x, y) \ \ge\ \int \varphi \, d\nu - \int \psi \, d\mu \ =\ \int \bigl[ \varphi(y) - \psi(x) \bigr] \, d\pi(x, y).
\]

So if both quantities are equal, then

\[
\int \bigl[ c(x, y) - \varphi(y) + \psi(x) \bigr] \, d\pi(x, y) = 0,
\]

and since the integrand is nonnegative, necessarily φ(y) − ψ(x) = c(x, y), π(dx dy)-almost surely.

Intuitively speaking, whenever there is some transfer of goods from x to y, the prices should be adjusted exactly to the transport cost. Now let (x_i)_{0≤i≤m} and (y_i)_{0≤i≤m} be such that (x_i, y_i) belongs to the support of π, so there is indeed some transfer from x_i to y_i. Then we hope that

\[
\begin{cases}
\varphi(y_0) - \psi(x_0) = c(x_0, y_0) \\
\varphi(y_1) - \psi(x_1) = c(x_1, y_1) \\
\quad \vdots \\
\varphi(y_m) - \psi(x_m) = c(x_m, y_m).
\end{cases}
\]

On the other hand, if x is an arbitrary point,

\[
\begin{cases}
\varphi(y_0) - \psi(x_1) \le c(x_1, y_0) \\
\varphi(y_1) - \psi(x_2) \le c(x_2, y_1) \\
\quad \vdots \\
\varphi(y_m) - \psi(x) \le c(x, y_m).
\end{cases}
\]

By subtracting these inequalities from the previous equalities and adding up everything, one obtains

\[
\psi(x) \ \ge\ \psi(x_0) + \bigl[ c(x_0, y_0) - c(x_1, y_0) \bigr] + \cdots + \bigl[ c(x_m, y_m) - c(x, y_m) \bigr].
\]

Of course, one can add an arbitrary constant to ψ, provided that one subtracts the same constant from φ; so it is possible to decide that ψ(x_0) = 0, where (x_0, y_0) is arbitrarily chosen in the support of π. Then

\[
\psi(x) \ \ge\ \bigl[ c(x_0, y_0) - c(x_1, y_0) \bigr] + \cdots + \bigl[ c(x_m, y_m) - c(x, y_m) \bigr], \tag{5.13}
\]

and this should be true for all choices of (x_i, y_i) (1 ≤ i ≤ m) in the support of π, and for all m ≥ 1. So it becomes natural to define ψ as the supremum of all the functions (of the variable x) appearing in the right-hand side of (5.13). It will turn out that this ψ satisfies the equation

\[
\psi^c(y) - \psi(x) = c(x, y) \qquad \pi(dx\, dy)\text{-almost surely}.
\]

Then, if ψ and ψ^c are integrable, one can write

\[
\int c \, d\pi \ =\ \int \bigl[ \psi^c(y) - \psi(x) \bigr] \, d\pi \ =\ \int \psi^c \, d\nu - \int \psi \, d\mu.
\]

This shows at the same time that π is optimal in the Kantorovich problem, and that the pair (ψ, ψ^c) is optimal in the dual Kantorovich problem.

Rigorous proof of Theorem 5.10, Part (i). First I claim that it is sufficient to treat the case when c is nonnegative. Indeed, let

\[
\widetilde{c}(x, y) := c(x, y) - a(x) - b(y) \ \ge\ 0, \qquad \Lambda := \int a \, d\mu + \int b \, d\nu \ \in\ \mathbb{R}.
\]

Whenever ψ : X → ℝ ∪ {+∞} and φ : Y → ℝ ∪ {−∞} are two functions, define

\[
\widetilde{\psi}(x) := \psi(x) + a(x), \qquad \widetilde{\varphi}(y) := \varphi(y) - b(y).
\]


Then the following properties are readily checked:

\[
\begin{aligned}
&\widetilde{c} \ \text{real-valued} \iff c \ \text{real-valued}; \\
&\widetilde{c} \ \text{lower semicontinuous} \iff c \ \text{lower semicontinuous}; \\
&\widetilde{\psi} \in L^1(\mu) \iff \psi \in L^1(\mu); \qquad \widetilde{\varphi} \in L^1(\nu) \iff \varphi \in L^1(\nu); \\
&\forall \pi \in \Pi(\mu, \nu), \quad \int \widetilde{c} \, d\pi = \int c \, d\pi - \Lambda; \\
&\int \widetilde{\varphi} \, d\nu - \int \widetilde{\psi} \, d\mu = \int \varphi \, d\nu - \int \psi \, d\mu - \Lambda; \\
&\widetilde{\varphi} - \widetilde{\psi} \le \widetilde{c} \iff \varphi - \psi \le c; \\
&\widetilde{\psi} \ \text{is } \widetilde{c}\text{-convex} \iff \psi \ \text{is } c\text{-convex}; \qquad \widetilde{\varphi} \ \text{is } \widetilde{c}\text{-concave} \iff \varphi \ \text{is } c\text{-concave}; \\
&(\widetilde{\psi}, \widetilde{\varphi}) \ \text{are } \widetilde{c}\text{-conjugate} \iff (\psi, \varphi) \ \text{are } c\text{-conjugate}; \\
&\Gamma \ \text{is } \widetilde{c}\text{-cyclically monotone} \iff \Gamma \ \text{is } c\text{-cyclically monotone}.
\end{aligned}
\]

Thanks to these formulas, it is equivalent to establish Theorem 5.10 for the cost c or for the nonnegative cost c̃. So in the sequel, I shall assume, without further comment, that c is nonnegative.

The rest of the proof is divided into five steps.

Step 1: If μ = (1/n) Σ_{i=1}^n δ_{x_i}, ν = (1/n) Σ_{j=1}^n δ_{y_j}, where the costs c(x_i, y_j) are finite, then there is at least one cyclically monotone transference plan.

Indeed, in that particular case, a transference plan between μ and ν can be identified with a bistochastic n × n array of real numbers a_{ij} ∈ [0, 1]: each a_{ij} tells what proportion of the 1/n mass carried by point x_i will go to destination y_j. So the Monge–Kantorovich problem becomes

\[
\inf_{(a_{ij})} \ \sum_{ij} a_{ij} \, c(x_i, y_j)
\]

where the infimum is over all arrays (a_{ij}) satisfying

\[
\forall j, \ \sum_{i} a_{ij} = 1; \qquad \forall i, \ \sum_{j} a_{ij} = 1. \tag{5.14}
\]

Here we are minimizing a linear function on the compact set [0, 1]^{n×n}, so obviously there exists a minimizer; the corresponding transference plan can be written as


\[
\pi \ =\ \frac{1}{n} \sum_{ij} a_{ij} \, \delta_{(x_i, y_j)},
\]

and its support S is the set of all couples (x_i, y_j) such that a_{ij} > 0. Assume that S is not cyclically monotone: Then there exist N ∈ ℕ and (x_{i_1}, y_{j_1}), ..., (x_{i_N}, y_{j_N}) in S such that

\[
c(x_{i_1}, y_{j_2}) + c(x_{i_2}, y_{j_3}) + \cdots + c(x_{i_N}, y_{j_1}) \ <\ c(x_{i_1}, y_{j_1}) + \cdots + c(x_{i_N}, y_{j_N}). \tag{5.15}
\]

Let a := min(a_{i_1 j_1}, ..., a_{i_N j_N}) > 0. Define a new transference plan π̃ by the formula

\[
\widetilde{\pi} \ =\ \pi + \frac{a}{nN} \sum_{\ell=1}^{N} \Bigl( \delta_{(x_{i_\ell}, \, y_{j_{\ell+1}})} - \delta_{(x_{i_\ell}, \, y_{j_\ell})} \Bigr)
\]

(with the convention j_{N+1} = j_1).

It is easy to check that π̃ has the correct marginals, and by (5.15) the cost associated with π̃ is strictly less than the cost associated with π. This is a contradiction, so S is indeed c-cyclically monotone!

Step 2: If c is continuous, then there is a cyclically monotone transference plan.

To prove this, consider sequences of independent random variables x_i ∈ X, y_j ∈ Y, with respective laws μ, ν. According to the law of large numbers for empirical measures (sometimes called the fundamental theorem of statistics, or Varadarajan's theorem), one has, with probability 1,

\[
\mu_n := \frac{1}{n} \sum_{i=1}^{n} \delta_{x_i} \ \longrightarrow\ \mu, \qquad \nu_n := \frac{1}{n} \sum_{j=1}^{n} \delta_{y_j} \ \longrightarrow\ \nu \tag{5.16}
\]

as n → ∞, in the sense of weak convergence of measures. In particular, by Prokhorov's theorem, (μ_n) and (ν_n) are tight sequences.

For each n, let π_n be a cyclically monotone transference plan between μ_n and ν_n. By Lemma 4.4, {π_n}_{n∈ℕ} is tight. By Prokhorov's theorem, there is a subsequence, still denoted (π_n), which converges weakly to some probability measure π, i.e.

\[
\int h(x, y) \, d\pi_n(x, y) \ \longrightarrow\ \int h(x, y) \, d\pi(x, y)
\]

for all bounded continuous functions h on X × Y. By applying the previous identity with h(x, y) = f(x) and h(x, y) = g(y), we see that


π has marginals μ and ν, so this is an admissible transference plan between μ and ν.

For each n, the cyclical monotonicity of π_n implies that for all N ∈ ℕ and π_n^{⊗N}-almost all (x_1, y_1), ..., (x_N, y_N), the inequality (5.1) is satisfied; in other words, π_n^{⊗N} is concentrated on the set C(N) of all ((x_1, y_1), ..., (x_N, y_N)) ∈ (X × Y)^N satisfying (5.1). Since c is continuous, C(N) is a closed set, so the weak limit π^{⊗N} of π_n^{⊗N} is also concentrated on C(N). Let Γ = Spt π (Spt stands for support), then Γ^N = (Spt π)^N = Spt(π^{⊗N}) ⊂ C(N), and since this holds true for all N, Γ is c-cyclically monotone.

Step 3: If c is continuous real-valued and π is c-cyclically monotone, then there is a c-convex ψ such that ∂_c ψ contains the support of π.

Indeed, let again Γ denote the support of π (this is a closed set). Pick any (x_0, y_0) ∈ Γ, and define

\[
\psi(x) := \sup_{m \in \mathbb{N}} \ \sup \Bigl\{ \bigl[ c(x_0, y_0) - c(x_1, y_0) \bigr] + \bigl[ c(x_1, y_1) - c(x_2, y_1) \bigr] + \cdots + \bigl[ c(x_m, y_m) - c(x, y_m) \bigr]; \ \ (x_1, y_1), \ldots, (x_m, y_m) \in \Gamma \Bigr\}. \tag{5.17}
\]
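When Γ is a finite set, the supremum (5.17) can be computed exactly: chains of pairs are relaxed in the style of the Bellman–Ford algorithm, and the iteration terminates because c-cyclical monotonicity forbids cycles of positive total gain. The following sketch is mine, not the book's.

```python
# Illustrative computation of (5.17) for a finite cyclically monotone support.
import numpy as np

def psi_from_support(gamma, c):
    K = len(gamma)
    # gain[s, t] = c(x_s, y_s) - c(x_t, y_s): gain of appending pair t after s
    gain = np.array([[c(gamma[s][0], gamma[s][1]) - c(gamma[t][0], gamma[s][1])
                      for t in range(K)] for s in range(K)])
    V = np.full(K, -np.inf)
    V[0] = 0.0                                # every chain starts at (x_0, y_0)
    for _ in range(K):                        # Bellman-Ford style relaxation
        V = np.maximum(V, (V[:, None] + gain).max(axis=0))
    def psi(x):                               # the supremum in (5.17)
        return max(V[t] + c(gamma[t][0], gamma[t][1]) - c(x, gamma[t][1])
                   for t in range(K))
    return psi

quad = lambda x, y: 0.5 * (x - y) ** 2
psi = psi_from_support([(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)], quad)
print(psi(0.0))                               # 0.0: psi(x_0) = 0, as shown below
```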

By applying the definition with m = 1 and (x_1, y_1) = (x_0, y_0), one immediately sees that ψ(x_0) ≥ 0. On the other hand, ψ(x_0) is the supremum of all the quantities [c(x_0, y_0) − c(x_1, y_0)] + ... + [c(x_m, y_m) − c(x_0, y_m)], which by cyclical monotonicity are all nonpositive. So actually ψ(x_0) = 0. (In fact this is the only place in this step where c-cyclical monotonicity will be used!)

By renaming y_m as y, obviously

\[
\psi(x) = \sup_{y \in \mathcal{Y}} \ \sup_{m \in \mathbb{N}} \ \sup_{(x_1, y_1), \ldots, (x_{m-1}, y_{m-1}), x_m} \Bigl\{ \bigl[ c(x_0, y_0) - c(x_1, y_0) \bigr] + \bigl[ c(x_1, y_1) - c(x_2, y_1) \bigr] + \cdots + \bigl[ c(x_m, y) - c(x, y) \bigr]; \ \ (x_1, y_1), \ldots, (x_m, y) \in \Gamma \Bigr\}. \tag{5.18}
\]

So ψ(x) = sup_y [φ(y) − c(x, y)], if φ is defined by

\[
\varphi(y) = \sup \Bigl\{ \bigl[ c(x_0, y_0) - c(x_1, y_0) \bigr] + \bigl[ c(x_1, y_1) - c(x_2, y_1) \bigr] + \cdots + c(x_m, y); \ \ m \in \mathbb{N}, \ (x_1, y_1), \ldots, (x_m, y) \in \Gamma \Bigr\} \tag{5.19}
\]


(with the convention that φ = −∞ out of proj_Y(Γ)). Thus ψ is a c-convex function.

Now let (x̄, ȳ) ∈ Γ. By choosing x_m = x̄, y_m = ȳ in the definition of ψ,

\[
\psi(x) \ \ge\ \sup_{m} \ \sup_{(x_1, y_1), \ldots, (x_{m-1}, y_{m-1})} \Bigl\{ \bigl[ c(x_0, y_0) - c(x_1, y_0) \bigr] + \cdots + \bigl[ c(x_{m-1}, y_{m-1}) - c(\bar{x}, y_{m-1}) \bigr] + \bigl[ c(\bar{x}, \bar{y}) - c(x, \bar{y}) \bigr] \Bigr\}.
\]

In the definition of ψ, it does not matter whether one takes the supremum over m − 1 or over m variables, since one also takes the supremum over m. So the previous inequality can be recast as

\[
\psi(x) \ \ge\ \psi(\bar{x}) + c(\bar{x}, \bar{y}) - c(x, \bar{y}).
\]

In particular, ψ(x) + c(x, ȳ) ≥ ψ(x̄) + c(x̄, ȳ). Taking the infimum over x ∈ X in the left-hand side, we deduce that ψ^c(ȳ) ≥ ψ(x̄) + c(x̄, ȳ). Since the reverse inequality is always satisfied, actually

\[
\psi^c(\bar{y}) = \psi(\bar{x}) + c(\bar{x}, \bar{y}),
\]

and this means precisely that (x̄, ȳ) ∈ ∂_c ψ. So Γ does lie in the c-subdifferential of ψ.

Step 4: If c is continuous and bounded, then there is duality.

Let ‖c‖ := sup c(x, y). By Steps 2 and 3, there exists a transference plan π whose support is included in ∂_c ψ for some c-convex ψ, and which was constructed explicitly in Step 3. Let φ = ψ^c. From (5.17), ψ = sup ψ_m, where each ψ_m is a supremum of continuous functions, and therefore lower semicontinuous. In particular, ψ is measurable.¹ The same is true of φ.

Next we check that ψ, φ are bounded. Let (x_0, y_0) ∈ ∂_c ψ be such that ψ(x_0) < +∞; then necessarily φ(y_0) > −∞. So, for any x ∈ X,

\[
\psi(x) = \sup_{y} \bigl[ \varphi(y) - c(x, y) \bigr] \ \ge\ \varphi(y_0) - c(x, y_0) \ \ge\ \varphi(y_0) - \|c\|;
\]

¹ A lower semicontinuous function on a Polish space is always measurable, even if it is obtained as a supremum of uncountably many continuous functions; in fact it can always be written as a supremum of countably many continuous functions!


\[
\varphi(y) = \inf_{x} \bigl[ \psi(x) + c(x, y) \bigr] \ \le\ \psi(x_0) + c(x_0, y) \ \le\ \psi(x_0) + \|c\|.
\]

Re-injecting these bounds into the identities φ = ψ^c, ψ = φ^c, we get

\[
\psi(x) \ \le\ \sup_{y} \varphi(y) \ \le\ \psi(x_0) + \|c\|; \qquad
\varphi(y) \ \ge\ \inf_{x} \psi(x) \ \ge\ \varphi(y_0) - \|c\|.
\]

So both ψ and φ are bounded from above and below. Thus we can integrate φ, ψ against ν, μ respectively, and, by the marginal condition,

\[
\int \varphi(y) \, d\nu(y) - \int \psi(x) \, d\mu(x) \ =\ \int \bigl[ \varphi(y) - \psi(x) \bigr] \, d\pi(x, y).
\]

Since φ(y) − ψ(x) = c(x, y) on the support of π, the latter quantity equals ∫ c(x, y) dπ(x, y). It follows that (5.4) is actually an equality, which proves the duality.

Step 5: If c is lower semicontinuous, then there is duality.

Since c is nonnegative lower semicontinuous, we can write

\[
c(x, y) = \lim_{k \to \infty} c_k(x, y),
\]

where (c_k)_{k∈ℕ} is a nondecreasing sequence of bounded, uniformly continuous functions. To see this, just choose

\[
c_k(x, y) = \inf_{(x', y')} \Bigl\{ \min \bigl( c(x', y'), \, k \bigr) + k \, \bigl[ d(x, x') + d(y, y') \bigr] \Bigr\};
\]

note that c_k is k-Lipschitz, nondecreasing in k, and further satisfies 0 ≤ c_k(x, y) ≤ min(c(x, y), k).²

By Step 4, for each k we can find π_k, ψ_k, φ_k such that ψ_k is bounded and c_k-convex, φ_k = (ψ_k)^{c_k}, and

\[
\int c_k(x, y) \, d\pi_k(x, y) \ =\ \int \varphi_k(y) \, d\nu(y) - \int \psi_k(x) \, d\mu(x).
\]
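The regularized costs c_k can be computed explicitly in simple discrete situations. Here is a small sketch of mine (the grid and the example cost are invented) evaluating the infimum defining c_k by brute force and checking the properties just listed.

```python
# Brute-force evaluation of c_k(x, y) = inf_{x', y'} { min(c(x', y'), k)
#   + k * (d(x, x') + d(y, y')) } on a grid, with d the usual distance on R.
import numpy as np

x = np.linspace(0.0, 1.0, 40)
y = np.linspace(0.0, 1.0, 40)
c = np.where(np.add.outer(x, y) > 1.0, 1.0, 0.0)   # lsc indicator of {x + y > 1}

def c_k(k):
    t = np.minimum(c, k)                            # min(c(x', y'), k)
    dx = np.abs(x[:, None] - x[None, :])            # d(x, x')
    dy = np.abs(y[:, None] - y[None, :])            # d(y, y')
    # the joint infimum over (x', y') splits into two successive minimizations:
    A = (t[None, :, :] + k * dx[:, :, None]).min(axis=1)     # inf over x'
    return (A[:, None, :] + k * dy[None, :, :]).min(axis=2)  # inf over y'

c1, c4 = c_k(1.0), c_k(4.0)
assert np.all(c1 <= c4 + 1e-12)                     # nondecreasing in k
assert np.all(c4 <= np.minimum(c, 4.0) + 1e-12)     # 0 <= c_k <= min(c, k)
```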

Since c_k is no greater than c, the constraint φ_k(y) − ψ_k(x) ≤ c_k(x, y) implies φ_k(y) − ψ_k(x) ≤ c(x, y); so all (ψ_k, φ_k) are admissible in the

² It is instructive to understand exactly where the lower semicontinuity assumption is used to show c = lim c_k.


dual problem with cost c. Moreover, for each k the functions ψ_k and φ_k are uniformly continuous because c_k itself is uniformly continuous. By Lemma 4.4, Π(μ, ν) is weakly sequentially compact. Thus, up to extraction of a subsequence, we can assume that π_k converges weakly to some π̃ ∈ Π(μ, ν). For all indices ℓ ≤ k, we have c_ℓ ≤ c_k, so

\[
\int c_\ell \, d\widetilde{\pi} \ =\ \lim_{k \to \infty} \int c_\ell \, d\pi_k \ \le\ \limsup_{k \to \infty} \int c_k \, d\pi_k \ =\ \limsup_{k \to \infty} \left[ \int \varphi_k(y) \, d\nu(y) - \int \psi_k(x) \, d\mu(x) \right].
\]

On the other hand, by monotone convergence,

\[
\int c \, d\widetilde{\pi} \ =\ \lim_{\ell \to \infty} \int c_\ell \, d\widetilde{\pi}.
\]

So

\[
\inf_{\pi \in \Pi(\mu, \nu)} \int c \, d\pi \ \le\ \int c \, d\widetilde{\pi} \ \le\ \limsup_{k \to \infty} \left[ \int \varphi_k(y) \, d\nu(y) - \int \psi_k(x) \, d\mu(x) \right] \ \le\ \inf_{\pi \in \Pi(\mu, \nu)} \int c \, d\pi.
\]

Moreover,

\[
\int \varphi_k(y) \, d\nu(y) - \int \psi_k(x) \, d\mu(x) \ \xrightarrow[k \to \infty]{} \ \inf_{\pi \in \Pi(\mu, \nu)} \int c \, d\pi. \tag{5.20}
\]

Since each pair (ψ_k, φ_k) lies in C_b(X) × C_b(Y), the duality also holds with bounded continuous (and even Lipschitz) test functions, as claimed in Theorem 5.10(i).

Proof of Theorem 5.10, Part (ii). From now on, I shall assume that the optimal transport cost C(μ, ν) is finite, and that c is real-valued. As in the proof of Part (i) I shall assume that c is nonnegative, since the general case can always be reduced to that particular case.

Part (ii) will be established in the following way: (a) ⇒ (b) ⇒ (c) ⇒ (d) ⇒ (a), and then (a) ⇒ (e) ⇒ (b). There seems to be some redundancy in this chain of implications, but this is because the implication (a) ⇒ (c) will be used to construct the set Γ appearing in (e).


(a) ⇒ (b): Let π be an optimal plan, and let (ψ_k, φ_k)_{k∈ℕ} be as in Step 5 of the proof of Part (i). Since the optimal transport cost is finite by assumption, the cost function c belongs to L¹(π). From (5.20) and the marginal property of π,

\[
\int \bigl[ c(x, y) - \varphi_k(y) + \psi_k(x) \bigr] \, d\pi(x, y) \ \xrightarrow[k \to \infty]{} \ 0,
\]

so c(x, y) − φ_k(y) + ψ_k(x) converges to 0 in L¹(π) as k → ∞. Up to choosing a subsequence, we can assume that the convergence is almost sure; then φ_k(y_i) − ψ_k(x_i) converges to c(x_i, y_i), π(dx_i dy_i)-almost surely, as k → ∞. By passing to the limit in the inequality

\[
\sum_{i=1}^{N} c(x_i, y_{i+1}) \ \ge\ \sum_{i=1}^{N} \bigl[ \varphi_k(y_{i+1}) - \psi_k(x_i) \bigr] \ =\ \sum_{i=1}^{N} \bigl[ \varphi_k(y_i) - \psi_k(x_i) \bigr]
\]

(where by convention y_{N+1} = y_1) we see that, π^{⊗N}-almost surely,

\[
\sum_{i=1}^{N} c(x_i, y_{i+1}) \ \ge\ \sum_{i=1}^{N} c(x_i, y_i). \tag{5.21}
\]

At this point we know that π^{⊗N} is concentrated on some set Γ̃_N ⊂ (X × Y)^N, such that Γ̃_N consists of N-tuples ((x_1, y_1), ..., (x_N, y_N)) satisfying (5.21). Let proj_k((x_i, y_i)_{1≤i≤N}) := (x_k, y_k) be the projection on the kth factor of (X × Y)^N. It is not difficult to check that Γ := ∩_{1≤k≤N} proj_k(Γ̃_N) is a c-cyclically monotone set which has full π-measure; so π is indeed c-cyclically monotone.

(b) ⇒ (c): Let π be a cyclically monotone transference plan. The function ψ can be constructed just as in Step 3 of the proof of Part (i), only with some differences. First, Γ is not necessarily closed; it is just a Borel set such that π[Γ] = 1. (If Γ is not Borel, make it Borel by modifying it on a negligible set.) With this in mind, define, as in Step 3 of Part (i),

\[
\psi(x) := \sup_{m \in \mathbb{N}} \ \sup \Bigl\{ \bigl[ c(x_0, y_0) - c(x_1, y_0) \bigr] + \bigl[ c(x_1, y_1) - c(x_2, y_1) \bigr] + \cdots + \bigl[ c(x_m, y_m) - c(x, y_m) \bigr]; \ \ (x_1, y_1), \ldots, (x_m, y_m) \in \Gamma \Bigr\}. \tag{5.22}
\]

From its definition, for any x ∈ X,


\[
\psi(x) \ \ge\ c(x_0, y_0) - c(x, y_0) \ >\ -\infty.
\]

(Here the assumption of c being real-valued is useful.) Then there is no difficulty in proving, as in Step 3, that ψ(x_0) = 0, that ψ is c-convex, and that π is concentrated on ∂_c ψ.

The rest of this step will be devoted to the measurability of ψ, ψ^c and ∂_c ψ. These are surprisingly subtle issues, which do not arise if c is continuous; so the reader who only cares for a continuous cost function might go directly to the next step.

First, the measurability of ψ is not clear at all from formula (5.22): This is typically an uncountable supremum of upper semicontinuous functions, and there is no a priori reason for this to be Borel measurable. Since c is nonnegative lower semicontinuous, there is a nondecreasing sequence (c_ℓ)_{ℓ∈ℕ} of continuous nonnegative functions, such that c_ℓ(x, y) converges to c(x, y) as ℓ → ∞, for all (x, y). By Egorov's theorem, for each k ∈ ℕ there is a Borel set E_k with π[E_k] ≤ 1/k, such that the convergence of c_ℓ to c is uniform on Γ \ E_k. Since π (just as any probability measure on a Polish space) is regular, we can find a compact set Γ_k ⊂ Γ \ E_k, such that π[Γ_k] ≥ 1 − 2/k. There is no loss of generality in assuming that the sets Γ_k are increasing in k.

On each Γ_k, the sequence (c_ℓ) converges uniformly and monotonically to c; in particular c is continuous on Γ_k. Furthermore, since π is obviously concentrated on the union of all Γ_k, there is no loss of generality in assuming that Γ = ∪ Γ_k. We may also assume that (x_0, y_0) ∈ Γ_1.

Now, let x̄ be given in X, and for each k, ℓ, m, let

\[
F_{m, k, \ell}\bigl( x_0, y_0, \ldots, x_m, y_m \bigr) := \bigl[ c(x_0, y_0) - c_\ell(x_1, y_0) \bigr] + \bigl[ c(x_1, y_1) - c_\ell(x_2, y_1) \bigr] + \cdots + \bigl[ c(x_m, y_m) - c_\ell(\bar{x}, y_m) \bigr],
\]

for (x_0, y_0), ..., (x_m, y_m) ∈ Γ_k. It is clear that F_{m,k,ℓ} is a continuous function (because c_ℓ is continuous on X × Y, and c is continuous on Γ_k). It is defined on the compact set Γ_k^{m+1}, and it is nonincreasing as a function of ℓ, wi