
A DISTANCE FOR MULTISTAGE STOCHASTIC OPTIMIZATION MODELS

GEORG CH. PFLUG†‡ AND ALOIS PICHLER†§

Abstract. We describe multistage stochastic programs in a purely in-distribution setting, i.e. without any reference to a concrete probability space. The concept is based on the notion of nested distributions, which encompass in one mathematical object the scenario values as well as the information structure under which decisions have to be made.

The nested distance between these distributions is introduced, which turns out to be a generalization of the Wasserstein distance for stochastic two-stage problems. We give characterizations of this distance and show its usefulness in examples. The main result states that the difference of the optimal values of two multistage stochastic programs, which are Lipschitz and differ only in the nested distribution of the stochastic parameters, can be bounded by the nested distance of these distributions. This theorem generalizes the well-known Kantorovich-Rubinstein Theorem, which is applicable only in two-stage situations, to multistage. Moreover, a dual characterization for the nested distance is established.

The setup is applicable both for general stochastic processes and for finite scenario trees. In particular, the nested distance between general processes and scenario trees is well defined and becomes the important tool for judging the quality of the scenario tree generation. Minimizing – at least heuristically – this distance is what good scenario tree generation is all about.

Key words. Stochastic Optimization, Quantitative Stability, Transportation Distance, Scenario Approximation

AMS subject classifications. 90C15, 90C31, 90C08

1. Introduction. Multistage stochastic programming models have been successfully developed for the financial sector (banking [9], insurance [5], pension fund management [18]), the energy sector (electricity production and trading of electricity [15] and gas [1]), the transportation [6] and communication sector [10] and airline revenue management [22], among others. In general, the observable data for a multistage stochastic optimization problem are modeled as a stochastic process ξ = (ξ0, . . . , ξT) (the scenario process) and the decisions may depend on its observed values, making the problem an optimization problem in function spaces. The general problem is only in rare cases solvable in an analytic way, and for numerical solution the stochastic process is replaced by a finite valued stochastic scenario process ξ̃ = (ξ̃0, . . . , ξ̃T). By this discretization, the decisions become high dimensional vectors, i.e. are themselves discretizations of the general decision functions. An extension function is then needed to transform optimal solutions of the approximate problem to feasible solutions of the basic underlying problem.

There are several results about the approximation of the discretized problem to the original problem, for instance [24, 19, 21, 14]. All these authors assume that both processes, the original ξ and the approximate ξ̃, are defined on the same probability space. This assumption is quite unnatural, since the approximate processes are finite trees which do not have any relation to the original stochastic processes. In this paper, we demonstrate how to define a new distance between the (nested) distributions of the two stochastic processes and how this distance relates to the solutions of multistage stochastic optimization problems.

†University of Vienna, Austria. Department of Statistics and Operations Research
‡International Institute for Applied Systems Analysis (IIASA), Laxenburg, Austria
§[email protected]


[Figure 1.1: a diagram relating four items. The original problem (minimize f(x) : x ∈ X) is discretized into the approximate problem (minimize f̃(x̃) : x̃ ∈ X̃); the approximate solution x̃∗ ∈ argmin f̃ ⊂ X̃ is extended via x+ = πX(x̃∗) ∈ X to a feasible solution of the original problem, whereas a direct solution would be x∗ ∈ argmin f ⊂ X.]

Fig. 1.1. The approximation error of the optimization problem is f(x+) − f(x∗).

Designing approximations to multistage stochastic decision models leads to a dilemma. The approximation should be coarse enough to allow an efficient numerical solution but also fine enough to make the approximation error small. It is therefore of fundamental interest to understand the relation between model complexity and model quality. In Figure 1.1, f denotes the objective function of the basic problem and πX is the extension of the optimal solution of the approximate problem to a feasible solution of the original problem. Instead of the direct solution (the dashed arrow), one has to go in an indirect way (the solid arrows).

Other concepts of distances for multistage stochastic programming (see [31] for a comprehensive introduction to stochastic programming) use notions of distances of filtrations, as introduced in [3], see also [20] (cf. [12, 14, 13]). The essential progress in this paper is given by the fact that the nested, multistage distance established here naturally incorporates the information, which is gradually increasing in time, in a single notion of distance. So a separate concept of a filtration-distance is not needed any longer.

This paper is organized as follows: Section 2 presents a framework for multistage stochastic optimization and develops the terms necessary for a multistage framework. Section 3 introduces a general notion of a tree as probability space to carry increasing information in a multistage situation. The key concept of this paper is the nested or multistage distance, which is introduced and described in Sections 4 and 5, where its basic features are elaborated as well. Section 6 relates the distance to multistage stochastic optimization and contains a main result which states that the new distance is adapted to multistage stochastic optimization in a natural way. Indeed, it turns out that the multistage optimal value is continuous with respect to the nested distance, and the nested distance turns out to be the best distance available in the context presented.

As the distance investigated results from a measure which is obtained by an optimization procedure, there is a dual characterization as well. We have dedicated Section 7 to this topic, generalizing the Kantorovich-Rubinstein duality theorem to the multistage situation.

Some selected and illustrating examples complete the paper.


2. Definitions and Problem Description. The stochastic structure of two-stage stochastic programs is simple: In the first stage, all decision relevant parameters are deterministic, and in the second stage the uncertain parameters follow a known distribution, but no more information is available.

In multistage situations the notion of information is much more crucial: The initially unknown, uncertain parameters are revealed gradually stage-by-stage, and this increasing amount of information is the basis for the decisions at later stages.

The following objects are the basic constituents of multistage stochastic optimization problems:

• Stages: Let T = {0, 1, . . . , T} be an index set. An element t ∈ T is called a stage and is associated with time. T is the final stage.

• The information process. A stochastic process ηt, t ∈ T, describes the observable information at all stages t ∈ T. We assume that the first value η0 is deterministic, i.e. does not contain probabilistic information. Since information cannot be lost, the information available at time t is the history process νt = (η0, . . . , ηt).

• Filtration. Let Ft be the sigma-algebra generated by νt (in symbols Ft = σ(νt)). Notice that F0 is the trivial sigma-algebra, as ν0 is trivial. The sequence F = (Ft)_{t=0}^T of increasing sigma-algebras¹ is called a filtration. We shall write νt ◁ Ft to express that the function νt is Ft measurable and – following [27] – summarize by writing

ν ◁ F

that νt ◁ Ft for all t ∈ T.

• The value process. The process describing the decision relevant quantities is the value process ξ = (ξ0, . . . , ξT). The process ξ is measurable with respect to the filtration F,

ξ ◁ F.

Therefore ξt can be viewed as a function of νt, i.e. ξt = ξt(νt) (cf. [32, Theorem II.4.3]), and ξ0 again is trivial. The values of the process ξt lie in a space endowed with a metric dt. In many situations this is just the linear metric space (R^{nt}, dt), where the metric dt is any metric, not necessarily the Euclidean one.

• The decision space. At each stage t a decision xt has to be made; its value is required to lie in a feasible set Xt, which is a linear vector space. The total decision space is X := (Xt)_{t=0}^T.

• Non-Anticipativity. The decision xt must be based on the information available at each time t ∈ T, therefore it must satisfy

x ◁ F.

This measurability condition is frequently referred to as non-anticipativity in the literature.

• The Loss Function is H(ξ, x). In the sequel the cost function may be associated with loss, which is intended to be minimized by choosing an optimal decision x.

¹ Ft1 ⊆ Ft2 whenever t1 ≤ t2.


Multistage stochastic optimization problems with expectation maximization can be framed as

min { E H(ξ0, x0, . . . , ξT, xT) : xt ∈ Xt, xt ◁ Ft, t ∈ T }
   = min { E H(ξ, x) : x ∈ X, x ◁ F }.   (2.1)

In applications, problem (2.1) is formulated without reference to a specific probability space, i.e. just the distributions of the stochastic processes are given. Notice that the observable information process ν is determined only up to bijective transformations, since for any ν̃t which is a bijective image of νt, the generated filtration F is the same. This invariance with respect to bijective transformations becomes particularly evident if the problem is formulated in form of a scenario tree. Here, typically, the names of the nodes are irrelevant; only the tree's topology and the scenario values ξ sitting on the nodes are of importance.

For this reason we reformulate the setting in a purely in-distribution manner, using the notion of trees in a probabilistic sense as outlined in the next section.

3. Trees as the basic probability space. Keeping in mind that the process ν typically represents the history of the information process, we may start with directly defining the process ν as a tree process.

• Tree Processes. A stochastic process (νt), t ∈ T, with state spaces Nt, t ∈ T, is called a tree process if σ(νt) = σ(ν0, . . . , νt) for all t. A tree process can be equivalently characterized by the fact that the conditional distribution of (ν0, . . . , νt−1) given νt is degenerate (i.e. sits on just one value). We denote a typical element of NT by ω and its predecessor in Nt (which is almost surely determined) by ωt, in symbols ωt = predt(ω).

• Trees. The tree process induces a probability distribution P on NT and we may introduce NT as the basic probability space, i.e. we set Ω := NT. Ω is a tree (of depth T) if there are projections predt, t ∈ T, such that

pred_s ∘ pred_t = pred_s   (s ≤ t).   (3.1)

Typically a tree is rooted, that is, pred0 is one single value. Notice that in this definition a tree does not have to be finite or countable. Property (3.1) implies that the sigma algebras Ft := σ(predt) form a filtration, which is denoted by F(pred) := σ(predt : t ∈ T). Without loss of generality we may choose in the following (Ω, F(pred), P) as our basic filtered probability space.

• Value-and-information-structures. As the value process ξt is a function of the tree process νt, we may view it as a stochastic process on Ω adapted to the filtration F(pred), i.e. ξt(ω) = ξt(ωt) with ωt = predt(ω). We call the structure (Ω, F(pred), P, ξ) the value-and-information-structure.

It is the purpose of this paper to assign a distance to different value-and-information-structures on the basis of distributional properties only. To this end we introduce the distribution of such a value-and-information-structure as the nested distribution. This concept is explained in the Appendix.

The nested distributions are defined in a purely distributional way. The relation between the nested distribution P and the value-and-information-structure (Ω, F(pred), P, ξ) is comparable to the relation between a probability measure P on R^d and an R^d-valued random variable ξ with distribution P.
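The predecessor projections above are easy to realize for a finite tree. The following sketch (the parent-array encoding and the function name `pred` are our own illustration, not notation from the paper) checks the compatibility property (3.1) on a small tree of depth 2:

```python
# Hypothetical encoding of a finite rooted tree of depth T = 2:
# parent[t][k] is the stage-(t-1) predecessor of node k at stage t.
parent = [None, [0, 0], [0, 0, 1]]   # 1 root, 2 inner nodes, 3 leaves

def pred(t, s, node):
    """Predecessor projection from stage t down to stage s (s <= t)."""
    while t > s:
        node, t = parent[t][node], t - 1
    return node

# Property (3.1): projecting a leaf directly to stage s agrees with first
# projecting to an intermediate stage t and then down to stage s.
T = 2
for leaf in range(3):
    for t in range(T + 1):
        for s in range(t + 1):
            assert pred(T, s, leaf) == pred(t, s, pred(T, t, leaf))
print("property (3.1) holds for this tree")
```

The filtration F(pred) of this toy tree is generated by the level sets of the maps pred(T, t, ·), which coarsen as t decreases.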


Bearing this in mind, we may alternatively consider either the nested distribution P or its realization

(Ω, F(pred), P, (ξt)_{t∈T})

with Ω being a tree, F the information and ξ the value process. We symbolize this fact by

(Ω, F(pred), P, ξ) ∼ P.

In the next section the nested distance between two nested distributions P and P̃ is defined. Due to the properties just mentioned one may – without loss of generality – assume that they are represented by two probability distributions P on (Ω, F(pred)) and P̃ on (Ω̃, F̃(pred)), together with the value processes ξt(ωt) and ξ̃t(ω̃t).

4. The Transportation Distance. The distance of two value-and-information-structures, as defined here, is in line with the concept of transportation distances, which have been studied intensively in the recent past. In order to thoroughly introduce the concept, recall the usual Wasserstein or Kantorovich distance for distributions.

Kantorovich Distance and Wasserstein Distance. Transportation distances intend to minimize the effort or total costs that have to be taken into account when passing from a given distribution to a desired one. Initial works on the subject include the original work by Monge [23] as well as the seminal work by Kantorovich [17]; a compelling treatment of the topic can be found in Villani's books [34] and [35], as well as in [28]; the Wasserstein distance has been discussed in [30] as well for stochastic two-stage problems.

The cumulative cost is the sum of all respective distances arising from transporting a particle from ω to ω̃ over the distance d(ω, ω̃). The optimal value to accomplish this is called the Wasserstein or Kantorovich distance of order r (r ≥ 1) and denoted d_r(P, P̃):²

d_r(P, P̃) = inf ( ∫ d(ω, ω̃)^r π[dω, dω̃] )^{1/r},   (4.1)

where the infimum is over all bivariate probability measures (also called transportation measures) π on the product sigma algebra

FT ⊗ F̃T := σ( A × B : A ∈ FT, B ∈ F̃T )

having the initial distribution P and final distribution P̃ as marginals, that is

π[A × Ω̃] = P[A]  and  π[Ω × B] = P̃[B]   (4.2)

for all measurable sets A ∈ FT and B ∈ F̃T. The infimum in (4.1) is attained, i.e. the optimal transportation measure π exists.

The solution to the linear problem (4.1) is well investigated and understood, and in many situations (cf. [29] and [4]) allows a particular representation as a transport plan: That is to say, there is a function (a transport map) τ : Ω → Ω̃ such that π is a simple push-forward or image measure, π = P^{id×τ} = P ∘ (id × τ)^{−1}.
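For finitely supported distributions, problem (4.1) with the marginal constraints (4.2) is a linear program and can be solved directly. The following sketch (our own illustration; the helper `wasserstein` and the two-atom example are not from the paper) computes d_r with `scipy.optimize.linprog`:

```python
import numpy as np
from scipy.optimize import linprog

def wasserstein(p, q, d, r=1):
    """Order-r Wasserstein distance (4.1) between discrete distributions:
    p, q are probability vectors, d[i, j] the distance between atoms i and j."""
    n, m = len(p), len(q)
    cost = (d ** r).reshape(-1)            # objective sum_ij d(i,j)^r * pi_ij
    A_eq = np.zeros((n + m, n * m))        # marginal constraints (4.2)
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0   # sum_j pi[i, j] = p[i]
    for j in range(m):
        A_eq[n + j, j::m] = 1.0            # sum_i pi[i, j] = q[j]
    res = linprog(cost, A_eq=A_eq, b_eq=np.concatenate([p, q]),
                  bounds=(0, None))
    return res.fun ** (1.0 / r)

# atoms 0, 1 versus atoms 0, 2 on the real line, d(x, y) = |x - y|
x, y = np.array([0.0, 1.0]), np.array([0.0, 2.0])
p = q = np.array([0.5, 0.5])
d = np.abs(x[:, None] - y[None, :])
print(wasserstein(p, q, d))   # 0.5: keep the atom at 0, move mass 0.5 from 1 to 2
```

The two equality-constraint blocks are exactly the discrete form of (4.2); one of the n + m constraints is redundant, which the solver tolerates.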

² In case of costs which are not proportional to the distance, a convex cost function c can be employed instead of d, which generalizes the concept to some extent.


5. The Multistage Distance. In light of the introduction, the problem (4.1) just describes the situation where the filtration consists of a single sigma algebra. To generalize for the multistage situation, let two nested distributions P and P̃ and a particular realization

(Ω, F, P, ξ) ∼ P  and  (Ω̃, F̃, P̃, ξ̃) ∼ P̃

be given. We intend to minimize the effort or costs that have to be taken into account when passing from one value-and-information-structure to another.

For this purpose a distance on the original sample space Ω × Ω̃ – as in (4.1) – is needed. This is accomplished by the function

d(ω, ω̃) := ∑_{t=0}^{T} dt( ξt(ωt), ξ̃t(ω̃t) ),   (5.1)

where dt is the distance available in the state space of the processes ξ and ξ̃, and ωt = predt(ω) (ω̃t = predt(ω̃), resp.).³

Most importantly, one needs to take care of the gradually increasing information provided by the filtrations. In the presence of filtrations the entire, complete information is available only at the very final stage T, via FT and F̃T. So the optimal measure π for (4.1) in general is not adapted to the situations of lacking information, which are described by the previous σ-algebras Ft and F̃t, t < T.

This is respected by the following definition. The new distance – the multistage distance – then is influenced by both the probability measure P and the entire sequence of increasing information F, so that the resulting quantity depends on the entire P.

Definition 5.1 (The multistage distance). The multistage distance of order r ≥ 0⁴ of two value-and-information-structures P and P̃ is the optimal value of the optimization problem

minimize (in π)   ( ∫ d(ω, ω̃)^r π[dω, dω̃] )^{1/r}
subject to   π[A × Ω̃ | Ft ⊗ F̃t](ω, ω̃) = P[A | Ft](ω)   (A ∈ FT, t ∈ T),
             π[Ω × B | Ft ⊗ F̃t](ω, ω̃) = P̃[B | F̃t](ω̃)   (B ∈ F̃T, t ∈ T),   (5.2)

where the infimum in (5.2) is among all bivariate probability measures π ∈ P(Ω × Ω̃) defined on FT ⊗ F̃T. Its optimal value – the nested distance – is denoted by

dl_r(P, P̃).   (5.3)

Remark 1. It is an essential observation that the conditional measures in (5.2) depend on two variables (ω, ω̃), although the conditions imposed force them to effectively depend just on a single variable ω (ω̃, resp.). To ease the notation we introduce

³ Alternatively one may use the equivalent distance functions

d(ω, ω̃) = ( ∑_{t=0}^{T} dt( ξt(ωt), ξ̃t(ω̃t) )^p )^{1/p}   or   d(ω, ω̃) = max_{t=0,...,T} dt( ξt(ωt), ξ̃t(ω̃t) )

instead.

⁴ This turns out to be a distance only for r ≥ 1.


the natural projections i : Ω × Ω̃ → Ω and ĩ : Ω × Ω̃ → Ω̃; moreover, f i (g ĩ, resp.) is just shorthand for the composition f ∘ i (g ∘ ĩ, resp.). The symbol i was chosen to account for the identity onto the respective subspace.

Remark 2. Problem (5.2) may be generalized by replacing d^r by a general (convex) cost function c. The theory developed below will not be affected by the particular choice d^r; it can be repeated for a general cost function c.

Remark 3. In the Appendix we show that dl_r can be seen as the usual Kantorovich distance on the complex Polish space of nested distributions. Hence dl_r satisfies the triangle inequality.
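For finite trees, the constrained problem (5.2) decomposes into a backward recursion of small transportation problems: the distance of two nodes at a stage is the optimal transport cost between their conditional successor distributions. The sketch below is our own illustration for trees of depth 2 (the tree encoding and function names are hypothetical, not the paper's notation or algorithm):

```python
import numpy as np
from scipy.optimize import linprog

def transport(p, q, cost):
    """Optimal value of min sum_ij cost[i,j]*pi[i,j] over pi with marginals p, q."""
    n, m = len(p), len(q)
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0
    for j in range(m):
        A_eq[n + j, j::m] = 1.0
    return linprog(cost.reshape(-1), A_eq=A_eq, b_eq=np.concatenate([p, q]),
                   bounds=(0, None)).fun

# Depth-2 tree: (root value, [(child prob, child value, leaf probs, leaf values), ...])
def nested_distance(tree, tree_t, r=1):
    v0, kids = tree
    v0t, kids_t = tree_t
    # stage-1 node pairs: transport the conditional leaf distributions; the cost
    # of a scenario pair is (|v0-v0t| + |v1-v1t| + |v2-v2t|)^r, built as in (5.1)
    D = np.zeros((len(kids), len(kids_t)))
    for i, (_, v1, lp, lv) in enumerate(kids):
        for j, (_, v1t, lpt, lvt) in enumerate(kids_t):
            c = (abs(v0 - v0t) + abs(v1 - v1t)
                 + np.abs(np.array(lv)[:, None] - np.array(lvt)[None, :])) ** r
            D[i, j] = transport(np.array(lp), np.array(lpt), c)
    # root: transport the stage-1 probabilities against the conditional distances
    p = np.array([k[0] for k in kids])
    q = np.array([k[0] for k in kids_t])
    return transport(p, q, D) ** (1.0 / r)

A = (0.0, [(0.5, 1.0, [1.0], [2.0]), (0.5, -1.0, [1.0], [-2.0])])
B = (0.0, [(0.5, 1.0, [1.0], [3.0]), (0.5, -1.0, [1.0], [-3.0])])
print(nested_distance(A, B))   # 1.0: each leaf is shifted by one unit
```

The inner transport problems enforce the conditional marginal constraints of (5.2) at stage 1; the outer one enforces them at the root.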

The definition of the multistage distance builds on conditional probabilities. This comes quite naturally, as it is built on conditional information. The marginal conditions (5.2) intuitively state that the observation P[A | Ft], at some previous stage Ft, has to be reflected by π[A × Ω̃ | Ft ⊗ F̃t], irrespective of the current status of the second process P̃ and irrespective of the previous outcome in F̃t, which represents the information available to P̃ at the same earlier stage t ∈ T.

As for the notion of conditional probabilities involved in Definition 5.1, we recall the basic features.

Conditional Expectation. For g a measurable function,

σ(g) := { g^{−1}(S) : S measurable }   (5.4)

is a sigma algebra. By the Radon-Nikodym Theorem (cf. [36]) there is a random variable, denoted E[X | g], on the image set of g, such that

∫_{g^{−1}(S)} X(ω) P[dω] = ∫_S E[X | g](s) P[g^{−1}(ds)] = ∫_{g^{−1}(S)} E[X | g](g(ω)) P[dω]

for any measurable S; the relation to conditional expectation with respect to the sigma algebra σ(g) thus is

E[X | g] ∘ g = E[X | σ(g)].
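On a finite probability space the identity E[X | g] ∘ g = E[X | σ(g)] can be checked by elementary averaging. A small sketch (our own toy example, not from the paper):

```python
import numpy as np

# Omega = {0,...,5} with the uniform measure, X = omega^2, g = parity of omega
omega = np.arange(6)
P = np.full(6, 1.0 / 6.0)
X = omega.astype(float) ** 2
g = omega % 2

# E[X | g] lives on the image set {0, 1} of g: average X over each level set
EXg = {s: (P[g == s] * X[g == s]).sum() / P[g == s].sum() for s in (0, 1)}

# E[X | sigma(g)] lives on Omega and is constant on every atom {g = s};
# it is exactly E[X | g] composed with g
EXsg = np.array([EXg[g[w]] for w in omega])

# partial averaging: integrals over atoms of sigma(g) agree with those of X
for s in (0, 1):
    B = (g == s)
    assert np.isclose((P[B] * EXsg[B]).sum(), (P[B] * X[B]).sum())
```

Here the even atoms {0, 2, 4} average to 20/3 and the odd atoms {1, 3, 5} to 35/3, and the partial averaging property that characterizes conditional expectation holds atom by atom.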

Conditional Probabilities. Conditional probabilities are defined via conditional expectation, P[A | Ft] := E_P[1_A | Ft], where A ∈ FT and Ft ⊆ FT. The conditional probability is a function

P[· | Ft](·) : FT × Ω → [0, 1]

with the characterizing property

∫_B P[A | Ft](ω) P[dω] = P[A ∩ B]   (A ∈ FT, B ∈ Ft).   (5.5)

Remark 4 (The constraints in (5.2) are redundant at the final stage t = T). Note that for A ∈ Ft, P[A | Ft] = 1_A and π[A × Ω̃ | Ft ⊗ F̃t] = 1_{A×Ω̃}. As obviously

1_{A×Ω̃} = 1_A ∘ i

always holds true (A ∈ Ft), it follows that

π[A × Ω̃ | Ft ⊗ F̃t] = P[A | Ft] ∘ i;


by analogous reasoning,

π[Ω × B | Ft ⊗ F̃t] = P̃[B | F̃t] ∘ ĩ

certainly holds for B ∈ F̃t. Whence the marginal conditions in (5.2) are certainly satisfied by every measure on FT ⊗ F̃T at the stage t = T, and thus are redundant for the final stage t = T.

Remark 5. It follows from the previous remark that the multistage distance and the Wasserstein distance coincide, i.e. dl_r(P, P̃) = d_r(P, P̃), for the filtrations F = (F0, FT, . . . , FT) and F̃ = (F̃0, F̃T, . . . , F̃T). The same, however, holds true for the more general situation of filtrations F = (F0, . . . , F0, FT, . . . , FT) and F̃ = (F̃0, . . . , F̃0, F̃T, . . . , F̃T), where the information available increases all of a sudden for both at the same stage.

Lemma 5.2. The multistage distance (5.3) is well defined; the product measure π := P ⊗ P̃ is feasible for all conditions in (5.2).

Proof. The product measure satisfies all above conditions:

∫_{C×D̃} (P[A | Ft] ∘ i) · (P̃[B | F̃t] ∘ ĩ) dπ = ∫_{C×D̃} (P[A | Ft] ∘ i) · (P̃[B | F̃t] ∘ ĩ) d(P ⊗ P̃)
   = ∫_C P[A | Ft] dP · ∫_{D̃} P̃[B | F̃t] dP̃
   = P[A ∩ C] · P̃[B ∩ D̃] = π[(A ∩ C) × (B ∩ D̃)]
   = π[(A × B) ∩ (C × D̃)] = ∫_{C×D̃} π[A × B | Ft ⊗ F̃t] dπ.

As these equations hold true for any sets C ∈ Ft and D̃ ∈ F̃t, and as moreover both (P[A | Ft] ∘ i) · (P̃[B | F̃t] ∘ ĩ) and π[A × B | Ft ⊗ F̃t] are Ft ⊗ F̃t measurable, it follows that they are just versions of each other, so they coincide,

(P[A | Ft] ∘ i) · (P̃[B | F̃t] ∘ ĩ) = π[A × B | Ft ⊗ F̃t]

π-almost everywhere. For the particular choices A = Ω or B = Ω̃ we finally get the conditions in the primal problem (5.2).

It is an immediate consequence of Hölder's inequality that continuity with respect to order 1 immediately implies continuity for other orders as well:

Lemma 5.3 (Hölder inequality). Suppose that 0 < r1 ≤ r2; then

dl_{r1}(P, P̃) ≤ dl_{r2}(P, P̃).

Proof. Observe that 1 / (r2/(r2−r1)) + 1 / (r2/r1) = 1. By the generalized Hölder inequality⁵ thus

∫ d^{r1} dπ = ∫ 1 · d^{r1} dπ ≤ ( ∫ 1^{r2/(r2−r1)} dπ )^{(r2−r1)/r2} · ( ∫ d^{r1 · r2/r1} dπ )^{r1/r2} = ( ∫ d^{r2} dπ )^{r1/r2}.

Taking the infimum over all feasible probability measures reveals the assertion.

As π = P ⊗ P̃ is feasible, we may further conclude that dl_r(P, P̃)^r ≤ E_{P⊗P̃} d^r for any filtrations.

⁵ The generalized Hölder inequality applies for indices smaller than 1 as well, cf. [37] or [11].


Theorem 5.4. For P and P̃ probability measures, and irrespective of the filtrations F and F̃, there are uniform lower and upper bounds

d_r(P, P̃)^r ≤ dl_r(P, P̃)^r ≤ E_{P⊗P̃} d^r.

Proof. The upper bound was established in Lemma 5.2. As for the lower bound, notice first that F0 ⊗ F̃0 = {∅, Ω × Ω̃} is the trivial sigma algebra on Ω × Ω̃.

For the trivial sigma algebra the conditional probabilities are constant functions, thus

P[A] ≡ P[A | F0] ∘ i = π[A × Ω̃ | F0 ⊗ F̃0] ≡ π[A × Ω̃],

so the first marginal conditions hold. As

P̃[B] ≡ P̃[B | F̃0] ∘ ĩ = π[Ω × B | F0 ⊗ F̃0] ≡ π[Ω × B],

the second marginal conditions hold as well. Together they are just the marginal conditions for the Wasserstein distance d_r in (4.2), and since this constraint is contained in the constraints (5.2), it is obvious that d_r(P, P̃) ≤ dl_r(P, P̃).
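The chain of bounds in Theorem 5.4 can be observed on a pair of tiny trees whose information structures differ: one process splits at stage 1, the other carries the same four scenario values but splits only at stage 2. The numbers below are our own illustration (r = 1, two-atom marginals, so every transport problem can be solved by enumeration rather than a solver):

```python
import itertools

# scenario paths (v0, v1, v2) and probabilities; P splits at stage 1,
# P~ keeps the stage-1 value constant and splits only at stage 2
paths   = [(0.0, 1.0, 1.0), (0.0, -1.0, -1.0)]   # process P
paths_t = [(0.0, 0.0, 1.0), (0.0, 0.0, -1.0)]    # process P~
prob = [0.5, 0.5]

def d(w, wt):   # distance (5.1): sum of stage-wise distances
    return sum(abs(a - b) for a, b in zip(w, wt))

c = [[d(w, wt) for wt in paths_t] for w in paths]

# Wasserstein d_1: with equal two-atom marginals the extreme transport plans
# are the two permutation matchings (Birkhoff), so take the cheaper one
d1 = min(0.5 * (c[0][0] + c[1][1]), 0.5 * (c[0][1] + c[1][0]))

# nested dl_1: each stage-1 node of P holds a single scenario, while the single
# stage-1 node of P~ still carries both leaves; the conditional constraints in
# (5.2) therefore force the product plan within every stage-1 node pair
dl1 = sum(prob[i] * sum(0.5 * c[i][j] for j in range(2)) for i in range(2))

# upper bound: expectation under the independent product measure P x P~
upper = sum(prob[i] * 0.5 * c[i][j]
            for i, j in itertools.product(range(2), range(2)))

print(d1, dl1, upper)   # 1.0 2.0 2.0 -- and indeed d1 <= dl1 <= upper
```

Ignoring the filtrations, the two processes look close (d_1 = 1), but the nested distance detects that information arrives at different times (dl_1 = 2), here meeting the product-measure upper bound.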

It is important to note that the conditions (5.2) in Definition 5.1 can be relaxed: The equations do not have to hold for all sets A ∈ FT and B ∈ F̃T; it is sufficient to require that those conditions just hold for sets taken from the next stage. The precise statement will be of importance in the sequel and reads as follows:

Lemma 5.5 (Tower property). In (5.2), the conditions

π[A × Ω̃ | Ft ⊗ F̃t] = P[A | Ft] ∘ i   (A ∈ FT)   (5.6)

may be replaced by

π[A × Ω̃ | Ft ⊗ F̃t] = P[A | Ft] ∘ i   (A ∈ Ft+1).   (5.7)

Proof. To verify this, observe first that for A ∈ FT

E_π[1_A ∘ i | Ft ⊗ F̃t] = E_π[1_{A×Ω̃} | Ft ⊗ F̃t] = π[A × Ω̃ | Ft ⊗ F̃t] = P[A | Ft] ∘ i = E_P[1_A | Ft] ∘ i,

and by linearity thus

E_π[λ ∘ i | Ft ⊗ F̃t] = E_P[λ | Ft] ∘ i

for every integrable λ ◁ FT.

Assume now that (5.7) holds true and let A ∈ FT. The assertion follows from the tower property of conditional expectation, for

π[A × Ω̃ | Ft ⊗ F̃t] = E_π[1_A ∘ i | Ft ⊗ F̃t]
   = E_π[ E_π[1_A ∘ i | F_{T−1} ⊗ F̃_{T−1}] | Ft ⊗ F̃t ]
   = E_π[ E_P[1_A | F_{T−1}] ∘ i | Ft ⊗ F̃t ].

As E_P[1_A | F_{T−1}] ◁ F_{T−1}, the steps above may be repeated to give

π[A × Ω̃ | Ft ⊗ F̃t] = E_π[ E_P[1_A | F_{T−1}] ∘ i | Ft ⊗ F̃t ]
   = E_π[ E_π[ E_P[1_A | F_{T−1}] ∘ i | F_{T−2} ⊗ F̃_{T−2} ] | Ft ⊗ F̃t ]
   = E_π[ E_P[ E_P[1_A | F_{T−1}] | F_{T−2} ] ∘ i | Ft ⊗ F̃t ]
   = E_π[ E_P[1_A | F_{T−2}] ∘ i | Ft ⊗ F̃t ],


and a repeated application gives further

π[A × Ω̃ | Ft ⊗ F̃t] = E_π[ E_P[1_A | F_{T−2}] ∘ i | Ft ⊗ F̃t ]
   = E_π[ E_P[1_A | F_{T−3}] ∘ i | Ft ⊗ F̃t ]
   = . . .
   = E_π[ E_P[1_A | Ft] ∘ i | Ft ⊗ F̃t ].

Finally, as E_P[1_A | Ft] ∘ i is Ft ⊗ F̃t measurable,

π[A × Ω̃ | Ft ⊗ F̃t] = E_π[ E_P[1_A | Ft] ∘ i | Ft ⊗ F̃t ] = E_P[ E_P[1_A | Ft] | Ft ] ∘ i = E_P[1_A | Ft] ∘ i = P[A | Ft] ∘ i,

which is the general condition (5.6).

6. Relation to Multistage Stochastic Optimization. As already addressed in the introduction, the multistage distance is a suitable distance for multistage stochastic optimization problems. To elaborate the relation, consider the value function v(P) of the stochastic optimization problem

v(P) = inf { E_P H(ξ, x) : x ∈ X, x ◁ F }   (6.1)
     = inf { ∫ H(ξ, x) dP : x ∈ X, x ◁ F }

of the expectation-maximization type.

The following theorem is the main theorem to bound stochastic optimization problems by the nested distance; it links smoothness properties of the loss function H with smoothness of the value function v with respect to the multistage distance.

Theorem 6.1 (Lipschitz property of the value function). Let P, P̃ be two nested distributions. Assume that X is convex, and the loss function H is convex in x for any ξ fixed,

H(ξ, (1 − λ)x0 + λx1) ≤ (1 − λ) H(ξ, x0) + λ H(ξ, x1).

Moreover, let H be uniformly Hölder continuous (β ≤ 1) with constant Lβ, that is,

| H(ξ, x) − H(ξ̃, x) | ≤ Lβ · ( ∑_{t∈T} dt(ξt, ξ̃t) )^β   for all x ∈ X.

Then the value function v (6.1) inherits the Hölder constant with respect to the multistage distance, that is,

| v(P) − v(P̃) | ≤ Lβ · dl_r(P, P̃)^β

for any r ≥ 1.⁶ In addition we have the following corollary.

Corollary 1 (Best possible bound). Assuming that the distance may be represented by a norm, the Lipschitz constant for the situation β = 1 cannot be improved.

Proof. (Proof of Theorem 6.1) Let x ◁ F be a decision vector for problem (6.1) and nested distribution P, and let π be a bivariate probability measure on FT ⊗ F̃T which satisfies the conditions (5.2), i.e. which is an optimal transportation measure.

⁶ For β = 1, Hölder continuity is just Lipschitz continuity.


Note that x is a vector of non-anticipative decisions for any t ∈ T, and whence

E_P H(ξ, x) = ∫ H(ξ(ω), x(ω)) P[dω] = ∫ H(ξ(ω), x(ω)) π[dω, dω̃] = E_π H(ξ ∘ i, x ∘ i).   (6.2)

Define next the new decision function

x̃ := E_π[x ∘ i | ĩ],

which has the desired measurability, that is, x̃ ◁ F̃, due to its definition⁷ and the conditions (5.2) imposed on π. With this choice, by convexity,

H(ξ̃(ω̃), x̃(ω̃)) = H( ξ̃(ω̃), E_π[x ∘ i | ĩ](ω̃) ) ≤ E_π[ H(ξ̃(ω̃), x ∘ i) | ĩ ](ω̃)

by Jensen's inequality. Again, Jensen's inequality applies for all t ∈ T jointly due to the joint restrictions (5.2) on π. Integrating with respect to P̃ one obtains

E_P̃ H(ξ̃, x̃) = ∫_Ω̃ H(ξ̃(ω̃), x̃(ω̃)) P̃[dω̃] ≤ ∫_Ω̃ E_π[ H(ξ̃ ∘ ĩ, x ∘ i) | ĩ ] dP̃
   = E_π[ E_π[ H(ξ̃ ∘ ĩ, x ∘ i) | ĩ ] ∘ ĩ ] = E_π H(ξ̃ ∘ ĩ, x ∘ i),

and together with (6.2) it follows that

E_P̃ H(ξ̃, x̃) − E_P H(ξ, x) ≤ E_π[ H(ξ̃ ∘ ĩ, x ∘ i) ] − E_π[ H(ξ ∘ i, x ∘ i) ]
   = E_π[ H(ξ̃ ∘ ĩ, x ∘ i) − H(ξ ∘ i, x ∘ i) ] ≤ Lβ · E_π d(i, ĩ)^β.

Now let x be an ε-optimal decision for v(P), that is,

∫ H(ξ, x) dP < v(P) + ε.

It follows that

E_P̃ H(ξ̃, x̃) − v(P) < E_P̃ H(ξ̃, x̃) − E_P H(ξ, x) + ε ≤ ε + Lβ · E_π d^β,

and whence

v(P̃) − v(P) < ε + Lβ · E_π d^β.

Letting ε → 0, taking the infimum over all π and interchanging the roles of P and P̃ gives that

| v(P) − v(P̃) | ≤ Lβ · dl_β(P, P̃)^β.

The statement for the general index r ≥ 1 ≥ β finally follows by applying Lemma 5.3.
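The bound of Theorem 6.1 can be illustrated on two tiny trees that carry the same scenario values but reveal them at different times: with the loss H(ξ, x) = |ξ2 − x1| (Lipschitz with Lβ = 1, β = 1, in ξ), a decision maker who learns the branch at stage 1 achieves value 0, while one who learns it only at stage 2 cannot do better than 1. The trees, the loss and the hand-computed value dl_1 = 2 below are our own illustration, not data from the paper:

```python
# Depth-2 trees: (root value, [(child prob, child value, [(leaf prob, leaf value), ...]), ...])
P  = (0.0, [(0.5, 1.0, [(1.0, 1.0)]), (0.5, -1.0, [(1.0, -1.0)])])   # splits at stage 1
Pt = (0.0, [(1.0, 0.0, [(0.5, 1.0), (0.5, -1.0)])])                  # splits at stage 2

def v(tree):
    """Value of min E |xi_2 - x_1| over stage-1 decisions x_1 that may depend
    only on the information revealed up to stage 1 (one decision per node).
    An optimal x_1 is a conditional median, attained at some leaf value."""
    _, kids = tree
    total = 0.0
    for p, _, leaves in kids:
        best = min(sum(q * abs(y - x) for q, y in leaves) for _, x in leaves)
        total += p * best
    return total

# H is Lipschitz-1 in xi, and dl_1(P, Pt) = 2 for these trees (each stage-1
# node of P must split its single successor over both +-1 leaves of Pt, at an
# average stage-wise cost of 2), so Theorem 6.1 guarantees |v(P) - v(Pt)| <= 2.
print(v(P), v(Pt))   # 0.0 1.0
```

The gap |v(P) − v(P̃)| = 1 is driven entirely by the information structure; a distance that ignores filtrations could not certify such a bound in general.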

⁷ x̃ may be defined alternatively for any component separately as x̃t(ω̃) := ∫_Ω xt(ω) π[dω | ω̃]; the conditional probabilities are available by disintegration (cf. [7], [8] or [2]) and they satisfy π[A × B] = ∫_B π[A | ω̃] π[ĩ^{−1}(dω̃)] = ∫_B π[A | ω̃] P̃(dω̃).


Proof. (Proof of the Corollary) Let X be the convex set of Markov kernels x[B | ω, ω′] (ω, ω′ ∈ Ω, B ∈ FT) such that

π[A × B] := ∫_A x[B | ω, ω′] P[dω′] = ∫_{A×B} x[dω′′ | ω, ω′] P[dω′]   (6.3)

satisfies the marginal conditions (5.2) for P and all ω, and abbreviate x[B | ω′] := ∫ x[B | ω, ω′] P[dω]. Define

H(ω, x) := ∫ ‖ω − ω′′‖ x[dω′′ | ω′] P[dω′]

and observe that

H(ω, x) = ∫ ‖ω − ω′′‖ x[dω′′ | ω, ω′] P[dω′] = ∫ ‖ω − ω′′‖ x[dω′′ | ω, ω′] P[dω′] P[dω];

moreover H is Lipschitz-1, as

H (ω1, π)−H (ω2, π) =∫‖ω1 − ω′′‖ − ‖ω2 − ω′′‖x [dω′′|ω′]P [dω′]

≤∫‖ω1 − ω2‖x [dω′′|ω′]P [ω′] = ‖ω1 − ω2‖

by the reverse triangle inequality and the fact that the role of ω1 and ω2 may beinterchanged. Then consider the value function

Then consider the value function

v(P) := inf_{x∈X} E_P H(·, x) = inf_{x∈X} ∫ H(ω, x) P[dω],

where the infimum is among all Markov kernels x ∈ X with marginal condition π₂ = P̄ as above. With this choice

v(P) = inf_{x∈X} ∫∫ ‖ω − ω″‖ x[dω″ | ω, ω′] P̄[dω′] P[dω] = inf_{x∈X} ∫∫ ‖ω − ω″‖ x[dω″ | ω] P[dω] = inf_π ∫∫ ‖ω − ω″‖ π[dω″, dω] = dl₁(P, P̄)

by the marginal conditions imposed on X.

Moreover,

v(P̄) = inf_{x∈X} E_P̄ H(·, x) = inf_{x∈X} ∫∫ ‖ω − ω″‖ x[dω″ | ω, ω′] P̄[dω′] P̄[dω].

Employing the Dirac measure x[A | ω, ω′] := 1_A(ω) we find that

∫ ‖ω − ω″‖ x[dω″ | ω′, ω] = 0,

and thus v(P̄) = 0. Hence

v(P) − v(P̄) = dl₁(P, P̄),

and the Lipschitz constant cannot be improved. As for a general Lipschitz constant L other than 1, consider the function L·H instead, which completes the proof.


7. The Dual Representation of the Multistage Distance. As the problem (4.2) to compute the Kantorovich or Wasserstein distance is formulated as an optimization problem, it has a dual formulation. Its dual is well known and well investigated and may be stated as follows (cf. [2]):

Theorem 7.1 (Duality formula). The minimum of the Kantorovich problem (4.2) equals

sup_{µ, µ̄} E_P µ + E_P̄ µ̄,

where the supremum runs among all pairs (µ, µ̄) such that

µ(ω) + µ̄(ω̄) ≤ d(ω, ω̄)^r.

For the optimal measure π, in addition,

µ(ω) + µ̄(ω̄) = d(ω, ω̄)^r   π-a.e. in Ω × Ω̄.   (7.1)

The duality formula in Theorem 7.1 can be given as the optimal value of the alternative representation

maximize (in M₀, λ, λ̄)   M₀
subject to   M₀ + λ(ω) + λ̄(ω̄) ≤ d(ω, ω̄)^r,
             E_P λ = 0 and E_P̄ λ̄ = 0;

to accept the latter statement just shift the dual functions by their respective means and choose λ := µ − E_P µ, λ̄ := µ̄ − E_P̄ µ̄ and M₀ := E_P µ + E_P̄ µ̄.
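For finitely supported measures the supremum in Theorem 7.1 can be approximated directly: fix one potential on a grid and obtain the other by the conjugate construction µ̄(ω̄) = min_ω d(ω, ω̄)^r − µ(ω), which is feasible by definition. A minimal sketch for two two-atom measures on the line with d(x, y) = |x − y| and r = 1 (the example measures and the grid search are our own illustration):

```python
# Numerical check of the duality formula (Theorem 7.1) for two two-atom
# measures on the line with d(x, y) = |x - y| and r = 1.
P  = [(0.0, 0.5), (1.0, 0.5)]   # atoms as (value, probability)
Pb = [(0.0, 0.5), (2.0, 0.5)]

def dual_value(mu):
    """Dual objective for a potential mu on the atoms of P; the conjugate
    mu_bar(y) = min_x (|x - y| - mu(x)) is feasible by construction."""
    mu_bar = {y: min(abs(x - y) - mu[x] for x, _ in P) for y, _ in Pb}
    return (sum(p * mu[x] for x, p in P)
            + sum(p * mu_bar[y] for y, p in Pb))

# one potential value may be fixed (shift invariance); scan the other one
best = max(dual_value({0.0: 0.0, 1.0: t / 100}) for t in range(-300, 301))
# best approaches the primal optimum, the Wasserstein distance W1 = 0.5
```

The dual objective is piecewise linear and concave in the free potential value, so the coarse grid already attains the optimum here; for larger supports an LP solver would replace the scan.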

It is natural to ask for the dual of the problem of the multistage distance, that is to say, the dual of problem (5.2). The dual allows a characterization as well, which is the content of the next theorem; its formulation is in line with the preceding remark.

Denote the set of F_T ⊗ F̄_T measurable functions by L⁰(F_T ⊗ F̄_T) and define the projections

Pr_t : L⁰(F_T ⊗ F̄_T) → L⁰(F_t ⊗ F̄_T),   µ∘i · µ̄∘ī ↦ E_P(µ | F_t)∘i · µ̄∘ī,

and

P̄r_t : L⁰(F_T ⊗ F̄_T) → L⁰(F_T ⊗ F̄_t),   µ∘i · µ̄∘ī ↦ µ∘i · E_P̄(µ̄ | F̄_t)∘ī;

as the functions 1_A∘i · 1_B∘ī form a basis of L⁰, the projections Pr_t and P̄r_t are well defined.

Theorem 7.2 (Duality for the multistage distance). The minimum of the multistage problem (5.2) equals the supremum of all numbers M₀ such that

M_T(ω, ω̄) ≤ d(ω, ω̄)^r,   (ω, ω̄) ∈ Ω × Ω̄,

where M_t is an ℝ-valued process on Ω × Ω̄ of the form

M_t := M₀ + Σ_{s=1}^{t} (λ_s + λ̄_s)


and the measurable functions λ_t ◁ F_t ⊗ F̄_{t−1} and λ̄_t ◁ F_{t−1} ⊗ F̄_t are chosen such that Pr_{t−1} λ_t = 0 and P̄r_{t−1} λ̄_t = 0.

For the optimal measure π of the primal problem and the optimal process M_t of the dual problem, in addition,

M_t = Eπ[d^r | F_t ⊗ F̄_t]   π-a.e. in Ω × Ω̄,

and M_t is a π-martingale for the filtration (F_t ⊗ F̄_t)_{t∈T} such that

dl_r(P, P̄)^r = M₀ = Eπ M_t

for all stages t ∈ T.

Remark 6. It should be noted that equality is attained π-a.e. particularly at the final stage T, i.e.

M_T = d^r   π-a.e. in Ω × Ω̄.

This is in line with (7.1) for the two-stage problem, T = 1. In order to prove the latter theorem let us start with the following observation.

Proposition 7.3 (Encoding). The measure π satisfies the conditions

π[A × Ω̄ | F_t ⊗ F̄_t] = P[A | F_t]∘i   (7.2)

for all A ∈ F_T if and only if

Eπ λ = Eπ Pr_t λ   (7.3)

holds for all integrable functions λ ◁ F_T ⊗ F̄_t.

Proof. To prove assertion (7.3) note first that it is enough to prove the claim just for functions λ = µ∘i · µ̄∘ī, as these functions form a basis of all F_T ⊗ F̄_t-integrable functions, whenever µ ◁ F_T and µ̄ ◁ F̄_t.

For the function 1_A (A ∈ F_T),

E_P[1_A | F_t]∘i = P[A | F_t]∘i = π[A × Ω̄ | F_t ⊗ F̄_t] = Eπ[1_{A×Ω̄} | F_t ⊗ F̄_t] = Eπ[1_A∘i | F_t ⊗ F̄_t],

and it follows by linearity that

E_P[µ | F_t]∘i = Eπ[µ∘i | F_t ⊗ F̄_t]   π-a.e.

for any integrable µ ◁ F_T. The left hand side, multiplied by µ̄∘ī, reads

E_P[µ | F_t]∘i · µ̄∘ī;

the right hand side, multiplied by the same quantity, gives

µ̄∘ī · Eπ[µ∘i | F_t ⊗ F̄_t] = Eπ[µ∘i · µ̄∘ī | F_t ⊗ F̄_t],

as µ̄ is F̄_t-measurable and thus µ̄∘ī ◁ F_t ⊗ F̄_t; that is,

E_P[µ | F_t]∘i · µ̄∘ī = Eπ[µ∘i · µ̄∘ī | F_t ⊗ F̄_t].   (7.4)


Taking expectations with respect to π on both sides then gives

Eπ Pr_t(µ∘i · µ̄∘ī) = Eπ E_P[µ | F_t]∘i · µ̄∘ī = Eπ Eπ[µ∘i · µ̄∘ī | F_t ⊗ F̄_t] = Eπ[µ∘i · µ̄∘ī],

which is the desired assertion.

To prove the converse we need to show (7.2), i.e. that

π[A × Ω̄ | F_t ⊗ F̄_t] = P[A | F_t]∘i

holds true for every set A ∈ F_T. As both the left hand side and the right hand side are F_t ⊗ F̄_t-measurable densities (random variables), it is sufficient to show that

∫_{C×D} π[A × Ω̄ | F_t ⊗ F̄_t] dπ = ∫_{C×D} P[A | F_t]∘i dπ   (7.5)

for all sets C ∈ F_t and D ∈ F̄_t. To this end observe first that

∫_{C×D} π[A × Ω̄ | F_t ⊗ F̄_t] dπ = Eπ[1_{C×D} · Eπ[1_{A×Ω̄} | F_t ⊗ F̄_t]] = Eπ[Eπ[1_{C×D} · 1_{A×Ω̄} | F_t ⊗ F̄_t]] = Eπ[1_{(C×D)∩(A×Ω̄)}] = π[(A ∩ C) × D];

secondly,

∫_{C×D} P[A | F_t]∘i dπ = Eπ[1_{C×D} · E_P[1_A | F_t]∘i] = Eπ[1_C∘i · 1_D∘ī · E_P[1_A | F_t]∘i] = Eπ[E_P[1_C · 1_A | F_t]∘i · 1_D∘ī] = Eπ[E_P[1_{C∩A} | F_t]∘i · 1_D∘ī] = Eπ Pr_t(1_{C∩A}∘i · 1_D∘ī),

and by the assertion (7.3) thus

∫_{C×D} P[A | F_t]∘i dπ = Eπ[1_{A∩C}∘i · 1_D∘ī] = Eπ[1_{(A∩C)×D}] = π[(A ∩ C) × D].

The quantities in (7.5) thus coincide and we conclude that (7.2) holds true, which is the desired assertion.

The proof of the dual characterization can be arranged as follows.

Proof. (Proof of the Duality Theorem 7.2) Using the observation from Lemma 5.5 and the latter Proposition 7.3 to encode the primal conditions (5.2) in the Lagrangian, the primal problem (5.2) rewrites as

inf_{π≥0} sup_{M₀, µ_t, µ̄_t} ∫ d^r dπ + M₀ · (1 − Eπ 1) − Σ_{s=0}^{T−1} (Eπ µ_{s+1} − Eπ Pr_s µ_{s+1}) − Σ_{s=0}^{T−1} (Eπ µ̄_{s+1} − Eπ P̄r_s µ̄_{s+1}),

where the inf is among all positive measures π ≥ 0 (so not only probability measures) and the sup among numbers M₀ and functions µ_t ◁ F_t ⊗ F̄_{t−1} and µ̄_t ◁ F_{t−1} ⊗ F̄_t.


According to Sion's minimax theorem (cf. [33]; the Lagrangian is linear in π and linear in µ_s and µ̄_s, thus convex and concave, respectively) this saddle point has the same objective value as

sup_{M₀; µ_t, µ̄_t} inf_{π≥0} M₀ + Eπ[d^r − M₀ · 1 − Σ_{s=0}^{T−1} (µ_{s+1} − Pr_s µ_{s+1}) − Σ_{s=0}^{T−1} (µ̄_{s+1} − P̄r_s µ̄_{s+1})].   (7.6)

Now notice that the inf_{π≥0} certainly is −∞ unless the integrand is positive for every measure π ≥ 0, which means that

M₀ + Σ_{s=0}^{T−1} (µ_{s+1} − Pr_s µ_{s+1}) + Σ_{s=0}^{T−1} (µ̄_{s+1} − P̄r_s µ̄_{s+1}) ≤ d^r

has to hold. For a positive integrand, however, the inf_{π≥0} over all expectations Eπ in (7.6) is 0. We thus get rid of the primal variable π and the problem to be solved reads

maximize (in M₀, µ_t, µ̄_t)   M₀
subject to   M₀ + Σ_{s=1}^{T} ((µ_s − Pr_{s−1} µ_s) + (µ̄_s − P̄r_{s−1} µ̄_s)) ≤ d^r,
             µ_t ◁ F_t ⊗ F̄_{t−1},   µ̄_t ◁ F_{t−1} ⊗ F̄_t.

Now just define λ_s := µ_s − Pr_{s−1} µ_s and λ̄_s := µ̄_s − P̄r_{s−1} µ̄_s, so that the latter problem rewrites as

maximize   M₀
subject to   M₀ + Σ_{s=1}^{T} (λ_s + λ̄_s) ≤ d^r,
             λ_t ◁ F_t ⊗ F̄_{t−1},   λ̄_t ◁ F_{t−1} ⊗ F̄_t,
             Pr_{t−1} λ_t = 0,   P̄r_{t−1} λ̄_t = 0,

which is the desired assertion of the theorem.

To prove the martingale assertion for the process

M_t = M₀ + Σ_{s=1}^{t} (λ_s + λ̄_s)

observe that

Eπ[M_T | F_t ⊗ F̄_t] = Eπ[M₀ + Σ_{s=1}^{T} (λ_s + λ̄_s) | F_t ⊗ F̄_t] = M₀ + Σ_{s=1}^{t} (λ_s + λ̄_s),

because for s > t, by (7.4) and using again the fact that the functions µ_s∘i · µ̄_{s−1}∘ī form a basis of L⁰(F_s ⊗ F̄_{s−1}),

Eπ[µ_s∘i · µ̄_{s−1}∘ī | F_t ⊗ F̄_t] = Eπ[E_P[µ_s | F_t]∘i · µ̄_{s−1}∘ī | F_t ⊗ F̄_t] = Eπ[E_P[E_P[µ_s | F_{s−1}] | F_t]∘i · µ̄_{s−1}∘ī | F_t ⊗ F̄_t] = Eπ[E_P[0 | F_t]∘i · µ̄_{s−1}∘ī | F_t ⊗ F̄_t] = 0;


similarly,

Eπ[µ_{s−1}∘i · µ̄_s∘ī | F_t ⊗ F̄_t] = 0.

Moreover we find that

M_t = Eπ[M_T | F_t ⊗ F̄_t] ≤ Eπ[d^r | F_t ⊗ F̄_t].

However, as

M₀ = Eπ d^r

because of the vanishing duality gap, we may finally conclude that

M_t = Eπ[d^r | F_t ⊗ F̄_t]   π-a.e.,

completing the proof.

8. Implementation for Finite Trees. Numerical experiments have been performed with the concept of the nested distance as elaborated above for two processes with finitely many outcomes.

In order to compute the multistage distance let us start as above and compute the Wasserstein distance for two measures P = Σ_i P_i δ_{ω_i} and P̄ = Σ_j P̄_j δ_{ω̄_j} first; this is the two-stage situation, T = 0, 1. The linear program corresponding to (4.1) is

minimize (in π)   Σ_{i,j} π_{i,j} d^r_{i,j}
subject to   Σ_j π_{i,j} = P_i,
             Σ_i π_{i,j} = P̄_j,
             π_{i,j} ≥ 0,   (8.1)

where the matrix (d_{i,j}) carries the distances d_{i,j} = d(ω_i, ω̄_j).

For multistage distances let the process be represented by a tree process sitting on some nodes. Any such node has a given depth described by the function depth(n) ∈ T (the depth of the root node is 0, all terminal nodes (or leaves) have depth T). Any node (m, say) may have some successor nodes (the direct children), which are collected in the set m+. The binary relation m ≤ ω indicates that m = pred_t ω for some t ∈ T. The conditional transition from a given node m to a successor m′ ∈ m+ is described by a given conditional probability p_{m:m′}; for a tree it is enough to give the conditional probabilities for the immediate successors.
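The two-stage program (8.1) can be handed to any LP solver; for the frequent special case of scenario values on the real line with d(x, y) = |x − y| and r = 1, its optimal value has a well-known closed form, namely the area between the two distribution functions. A minimal sketch (the function name and the example are our own illustration):

```python
# Closed-form solution of the transport LP (8.1) in the special case of
# one-dimensional scenario values, cost d(x, y) = |x - y| and r = 1:
# W1 equals the area between the two cumulative distribution functions.
def wasserstein_1d(atoms_p, atoms_q):
    """W1 between two discrete measures on the line, each given as a list
    of (value, probability) pairs with probabilities summing to one."""
    xs = sorted({x for x, _ in atoms_p} | {x for x, _ in atoms_q})
    cdf = lambda atoms, x: sum(p for v, p in atoms if v <= x)
    return sum(abs(cdf(atoms_p, a) - cdf(atoms_q, a)) * (b - a)
               for a, b in zip(xs, xs[1:]))

print(wasserstein_1d([(0, 0.5), (1, 0.5)], [(0, 0.5), (2, 0.5)]))  # 0.5
```

For higher-dimensional scenario values no such shortcut exists and the full LP (8.1) has to be solved.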

Combining these observations one may formulate the multistage distance (5.2) as a linear program, just in line with (8.1), as

minimize (in π)   Σ_{i,j} π_{i,j} d^r_{i,j}
subject to   Σ_{i,j} π_{i,j} = 1,
             p_{m:ω} = (Σ_{j≥n} π_{ω,j}) / (Σ_{i≥m, j≥n} π_{i,j})   (m ≤ ω, depth(n) = depth(m)),
             p̄_{n:ω̄′} = (Σ_{i≥m} π_{i,ω̄′}) / (Σ_{i≥m, j≥n} π_{i,j})   (n ≤ ω̄′, depth(n) = depth(m)),
             π_{i,j} ≥ 0,   ω ∈ {ω_i : i},   ω̄′ ∈ {ω̄_j : j},   (8.2)


[Figure: a tree with root 0; inner nodes 1, 2 and 3 with p_{0:1} = 0.5, p_{0:2} = 0.3, p_{0:3} = 0.2; edges 1 → 4 (0.4), 1 → 5 (0.6), 2 → 6 (1), 3 → 7 (0.4), 3 → 8 (0.2), 3 → 9 (0.4); leaves 4, …, 9 with P(ω₁) = p_{0:4} = 0.2, P(ω₂) = p_{0:5} = 0.3, P(ω₃) = p_{0:6} = 0.3, P(ω₄) = p_{0:7} = 0.08, P(ω₅) = p_{0:8} = 0.04, P(ω₆) = p_{0:9} = 0.08.]

Fig. 8.1. An example of a finite tree process ν = (ν₀, ν₁, ν₂) with 10 nodes and 6 leaves.

where p_{m:ω} is the probability to finally reach ω, conditional on the node m. For a concrete numerical implementation it should be noted that Σ_{ω: m≤ω} p_{m:ω} = 1. So one of the #{ω : m ≤ ω} conditions in (8.2) for p_{m:ω} (p̄_{n:ω̄′}, resp.) can and should be dropped: these conditions turn out to be linearly dependent, and the redundancy hampers the numerical solver in finding a solution.

Example 1 (cf. [14] for a similar example). The trees in Figure 8.2 represent three 2-stage processes. They have been chosen similar, but they differ significantly in their information structure (think of ε as a number between zero and one). The optimal transport measure π for the multistage distance does not depend on the particular distance function chosen; it is

• π = ( p·p′          p·(1−p′)
        (1−p)·p′      (1−p)·(1−p′) )   for the first and the second tree in the display,

• similar (just replace p′ by p″) for the distance of the first and the third tree, and

• for the distance between the second and the third tree the optimal transport measure is

π = ( p′ ∧ p″          (p″ − p′) ∨ 0
      (p′ − p″) ∨ 0    1 − (p′ ∨ p″) ).

All these trees differ, in the nested distance, by a number of almost one. Indeed, for the particular choice p = p′ = p″ = 1/2 and employing the distance (5.1) for the different paths,

• the nested distance of the first and the second tree is 1 + ε: note that, neglecting the information structure, the distance of these trees would just be ε, so the nested distance is able to distinguish the different information available at stage one;
• for the first and the third tree it is 2; and
• the distance of the second and the third tree is 1 − ε; again, for ε → 1, the nested distance approaches 0, just in line with the information available.


[Figure: three two-stage trees, each with root value 2. In the first tree the root is followed by the single node 2, which branches into the terminal values 3 (probability p) and 1 (probability 1 − p). In the second tree the root branches into 2 + ε (probability p′) and 2 − ε (probability 1 − p′), which lead deterministically to the terminal values 3 and 1, respectively. In the third tree the root branches into 3 (probability p″) and 1 (probability 1 − p″), and each of these values is repeated at the terminal stage.]

Fig. 8.2. The states of three tree processes to illustrate three different flows of information.
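A backward recursion makes the nested distance of such small trees computable directly: for every pair of stage-1 nodes, take the stage-1 value distance plus the Wasserstein distance of the conditional leaf distributions as pair cost, then solve one more transport problem at the root. The following sketch (our own illustration; the tree encoding and function names are not from the text) reproduces the three distances of Example 1 for r = 1, exploiting that every transport problem here has marginals with at most two atoms, so its plans form a one-parameter family and the linear cost is minimized at an endpoint:

```python
# Backward-recursive evaluation of the nested distance (r = 1) for the
# depth-2 trees of Example 1 with p = p' = p'' = 1/2 and eps = 0.25.
def transport(p, q, c):
    """Exact transport cost for marginals p, q with at most two atoms each;
    c[i][j] is the cost of moving atom i of p onto atom j of q."""
    p = list(p) + [0.0] * (2 - len(p))
    q = list(q) + [0.0] * (2 - len(q))
    c = [row + [0.0] * (2 - len(row)) for row in c]
    c += [[0.0, 0.0]] * (2 - len(c))
    lo, hi = max(0.0, p[0] + q[0] - 1.0), min(p[0], q[0])
    cost = lambda t: (t * c[0][0] + (p[0] - t) * c[0][1]
                      + (q[0] - t) * c[1][0]
                      + (1 - p[0] - q[0] + t) * c[1][1])
    return min(cost(lo), cost(hi))   # the cost is linear in the parameter

def nested_distance(a, b):
    """Trees are (root_value, [(prob, value, [(prob, leaf_value), ...]), ...])."""
    (ra, na), (rb, nb) = a, b
    # stage-1 pair costs: value distance plus the Wasserstein distance of
    # the conditional leaf distributions (the inner transport problems)
    cost = [[abs(va - vb)
             + transport([pl for pl, _ in la], [ql for ql, _ in lb],
                         [[abs(x - y) for _, y in lb] for _, x in la])
             for _, vb, lb in nb] for _, va, la in na]
    return abs(ra - rb) + transport([p for p, _, _ in na],
                                    [q for q, _, _ in nb], cost)

eps = 0.25
t1 = (2, [(1.0, 2, [(0.5, 3), (0.5, 1)])])
t2 = (2, [(0.5, 2 + eps, [(1.0, 3)]), (0.5, 2 - eps, [(1.0, 1)])])
t3 = (2, [(0.5, 3, [(1.0, 3)]), (0.5, 1, [(1.0, 1)])])
print(nested_distance(t1, t2), nested_distance(t1, t3),
      nested_distance(t2, t3))   # 1 + eps, 2 and 1 - eps
```

For trees of general size the endpoint trick no longer applies and each transport subproblem becomes a small LP, but the recursive structure of the computation stays the same.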

9. Summary and Outlook. In this paper we construct a space which is rich enough to carry the outcomes of a process as well as the evolving information. On top of that we elaborate a notion of distance, the nested or multistage distance, which is a refined and enhanced concept based on the initial work of the first author in [25].

The crucial additional observation here is that the nested distance is adapted to problems arising in stochastic optimization. For this reason the nested distance can be used to study stochastic optimization problems quantitatively and qualitatively from a very new perspective, because now a general notion of distance respecting the flow of information is available.

Future Research.

Computational Efficiency. A speedy computation of the nested distance is of interest. For this reason different implementations have to be considered and compared. Moreover, computational efficiency has to be built on the dual, reducing the number of comparisons of sub-trees to a minimal extent. For example, we did not address a backwards recursive computation of the nested distance here, although this is available.

Risk Measures. We have elaborated in this paper the importance of the nested distance for stochastic optimization with the aim of minimizing the expected loss for a given value function. However, in this context risk is not incorporated at all. To incorporate risk measures in the stochastic programming framework seems to be a task worth the effort; this has been addressed in [26] too, but an extension for nested distances is necessary.

Scenario Reduction. An obvious application is provided by the fact that the nested distance can be used to reduce the number of scenarios in a given scenario tree: the tree can be reduced in such a way that the objective of the reduced problem stays close in the sense desired. Reducing the scenario tree is of crucial interest; it may even turn an intractable problem computable.

Scenario Generation. As regards scenario generation, the nested distance can be employed to decide where to add scenarios in order to better approximate the situation.

Applications. Stochastic programming offers a huge field of applications; we just pick out energy (cf. [16]) as representative for all of them. The nested distance provides a first bound at hand to relate a numerical result to the precise, although numerically intractable, objective value.

REFERENCES

[1] E. Allevi, M. Bertocchi, M.T. Vespucci, and M. Innorta, A stochastic optimization model for a gas sale company, IMA J. Management Math., 19 (2008), pp. 403–416.
[2] Luigi Ambrosio, Nicola Gigli, and Giuseppe Savaré, Gradient Flows in Metric Spaces and in the Space of Probability Measures, Birkhäuser, Basel, Switzerland, 2005.
[3] Edward S. Boylan, Epiconvergence of martingales, Ann. Math. Statist., 42 (1971), pp. 552–559.
[4] Yann Brenier, Polar factorization and monotone rearrangement of vector-valued functions, Comm. Pure Appl. Math., 44 (1991), pp. 375–417.
[5] Abraham I. Brodt, Min-mad life: A multi-period optimization model for life insurance company investment decisions, Insurance: Mathematics and Economics, 2 (1983), pp. 91–102.
[6] Raymond K. Cheung and Warren B. Powell, An algorithm for multistage dynamic networks with random arc capacities, with an application to dynamic fleet management, Operations Research, 44 (1996), pp. 951–963.
[7] Claude Dellacherie and Paul-André Meyer, Probabilities and Potential, North-Holland Publishing Co., Amsterdam, The Netherlands, 1988.
[8] Richard A. Durrett, Probability. Theory and Examples, Duxbury Press, Belmont, CA, second ed., 2004.
[9] N. C. P. Edirisinghe, Multiperiod portfolio optimization with terminal liability: Bounds for the convex case, Computational Optimization and Applications, 32 (2005), pp. 29–59.
[10] Alexei A. Gaivoronski, Stochastic optimization problems in telecommunications, in Applications of Stochastic Programming, Stein W. Wallace and William T. Ziemba, eds., vol. 5, MPS-SIAM Series in Optimization, 2005, ch. 32.
[11] Richard J. Gardner, The Brunn–Minkowski inequality, Bulletin of the American Mathematical Society, 39 (2002), pp. 355–405.
[12] Holger Heitsch and Werner Römisch, Scenario tree modeling for multistage stochastic programs, Math. Program. Ser. A, 118 (2009), pp. 371–406.
[13] Holger Heitsch and Werner Römisch, Scenario tree reduction for multistage stochastic programs, Computational Management Science, 2 (2009), pp. 117–133.
[14] Holger Heitsch, Werner Römisch, and Cyrille Strugarek, Stability of multistage stochastic programs, SIAM J. Optimization, 17 (2006), pp. 511–525.
[15] Ronald Hochreiter, Georg Ch. Pflug, and David Wozabal, Multi-stage stochastic electricity portfolio optimization in liberalized energy markets, in System Modeling and Optimization, vol. 199, Springer, New York, 2006, pp. 219–226.
[16] Ronald Hochreiter and David Wozabal, A multi-stage stochastic programming model for managing risk-optimal electricity portfolios, Handbook Power Systems II, 4 (2010), pp. 383–404.
[17] Leonid Kantorovich, On the translocation of masses, C.R. Acad. Sci. URSS, 37 (1942), pp. 199–201.
[18] Willem K. Klein-Haneveld, Matthijs H. Streutker, and Maarten H. van der Vlerk, Indexation of Dutch pension rights in multistage recourse ALM models, IMA Journal of Management Mathematics, 21 (2010), pp. 131–148.
[19] Lisa A. Korf and Roger J.-B. Wets, Random LSC functions: An ergodic theorem, Mathematics of Operations Research, 26 (2001), pp. 421–445.
[20] Hirokichi Kudo, A note on the strong convergence of σ-algebras, Ann. Probability, 2 (1974), pp. 76–83.
[21] Petr Lachout, Eckhard Liebscher, and Silvia Vogel, Strong convergence of estimators as εn-minimisers of optimisation problems, Ann. Inst. Statist. Math., 57 (2005), pp. 291–313.
[22] Andris Möller, Werner Römisch, and Klaus Weber, Airline network revenue management by multistage stochastic programming, Computational Management Science, 5 (2008), pp. 355–377.
[23] Gaspard Monge, Mémoire sur la théorie des déblais et des remblais, Histoire de l'Académie Royale des Sciences de Paris, avec les Mémoires de Mathématique et de Physique pour la même année, (1781), pp. 666–704.
[24] Paul Olsen, Discretizations of multistage stochastic programming problems, Math. Programming Stud., 6 (1976), pp. 111–124.
[25] Georg Ch. Pflug, Version-independence and nested distributions in multistage stochastic optimization, SIAM Journal on Optimization, 20 (2009), pp. 1406–1420.
[26] Georg Ch. Pflug and Alois Pichler, Approximations for probability distributions and stochastic optimization problems, vol. 163 of International Series in Operations Research & Management Science, Springer, New York, 2011, ch. 15, pp. 343–387.
[27] Georg Ch. Pflug and Werner Römisch, Modeling, Measuring and Managing Risk, World Scientific, River Edge, NJ, 2007.
[28] Svetlozar T. Rachev, Probability Metrics and the Stability of Stochastic Models, John Wiley and Sons Ltd., West Sussex PO19 1UD, England, 1991.
[29] Svetlozar T. Rachev and Ludger Rüschendorf, Mass Transportation Problems. Vol. I: Theory, Vol. II: Applications, vol. XXV of Probability and Its Applications, Springer, New York, 1998.
[30] Werner Römisch and Rüdiger Schultz, Stability analysis for stochastic programs, Annals of Operations Research, 30 (1991), pp. 241–266.
[31] Alexander Shapiro, Darinka Dentcheva, and Andrzej Ruszczyński, Lectures on Stochastic Programming, MPS-SIAM Series on Optimization 9, 2009.
[32] Albert Nikolayevich Shiryaev, Probability, Springer, New York, 1996.
[33] Maurice Sion, On general minimax theorems, Pacific Journal of Mathematics, 8 (1958), pp. 171–176.
[34] Cédric Villani, Topics in Optimal Transportation, vol. 58 of Graduate Studies in Mathematics, American Mathematical Society, 2003.
[35] Cédric Villani, Optimal Transport, Old and New, vol. 338 of Grundlehren der Mathematischen Wissenschaften, Springer, New York, 2009.
[36] David Williams, Probability with Martingales, Cambridge University Press, Cambridge, 1991.
[37] Przemysław Wojtaszczyk, Banach Spaces for Analysts, Cambridge University Press, Cambridge, 1991.

Appendix A. In this Appendix we demonstrate how the multistage distance for r = 1 can be seen as the usual transportation distance in the Polish space of nested distributions.

Recall the following well-known facts. For a given Polish space (Ξ, d) let P₁(Ξ, d) be the family of Borel probabilities on (Ξ, d) such that y ↦ d(y, y₀) is integrable for some y₀ ∈ Ξ.

For two Borel probabilities P and P̄ in P₁(Ξ, d), the Wasserstein distance d(P, P̄) is given by

d(P, P̄) = inf { E d(X, X̄) : X ∼ P, X̄ ∼ P̄ }
         = sup { ∫ h(y) P(dy) − ∫ h(y) P̄(dy) : |h(y) − h(z)| ≤ d(y, z) }.

Here X ∼ P means that the (marginal) distribution of X is P (cf. (4.1)).

Notice that we denote the original metric and the induced Kantorovich metric with the same symbol. This is justified, since (Ξ, d) is isometrically embedded in P₁(Ξ, d) by

d(δ_y, δ_z) = d(y, z),

where δ_y are Dirac probability measures. P₁ is a complete separable metric space under d. If (Ξ₁, d₁) and (Ξ₂, d₂) are Polish spaces, then so is the Cartesian product (Ξ₁ × Ξ₂, d₁,₂) with metric

d₁,₂((y₁, y₂), (z₁, z₂)) := d₁(y₁, z₁) + d₂(y₂, z₂).


[Figure: the components of a nested distribution, a value ξ₀ with attached distribution µ₁, whose realizations ξ₁ in turn carry attached distributions µ₂.]

Fig. A.1. Nested Distribution.

We define the following Polish metric spaces in a (backward) recursive way:

(Ξ_T, dl_T) = (ℝ^{n_T}, d_T),
(Ξ_{T−1}, dl_{T−1}) = (ℝ^{n_{T−1}} × P₁(Ξ_{T:T}, dl_T), d_{T−1}(·,·) + dl_T(·,·)),
(Ξ_{T−2}, dl_{T−2}) = (ℝ^{n_{T−2}} × P₁(Ξ_{T−1:T}, dl_{T−1}), d_{T−2}(·,·) + dl_{T−1}(·,·)),
...
(Ξ₀, dl₀) = (ℝ^{n₀} × P₁(Ξ_{1:T}, dl₁), d₀(·,·) + dl₁(·,·)).

For simplicity, we write (Ξ, dl) instead of (Ξ₀, dl₀).

Definition A.1. A Borel probability distribution P on (Ξ, dl) is called a nested distribution (of depth T).

We illustrate the situation for depth T = 3. A nested distribution on Ξ = Ξ₀ has components ξ₀ (a value in ℝ^{n₀}) and µ₁ (a nested distribution on Ξ₁), which in turn has components ξ₁ (an ℝ^{n₁}-valued random variable) and µ₂ (a random distribution on ℝ^{n₂}). One may visualize the situation as in Figure A.1.

It has been shown in [25] how the nested distribution P is related to the value-and-information structure (Ω, F(pred), P, ξ). We summarize here the basic facts.

• If (Ω, F(pred), P, ξ) is a value-and-information structure with ω_t = pred_t(ω), it induces a nested distribution. For T = 1, the nested distribution is L(ξ₀ × L(ξ₁|ω₀)), which is the joint law of ξ₀ and the conditional law of ξ₁ given ω₀ (the root of the tree)⁸. For general T, the induced nested distribution is

L(ξ₀ × L(ξ₁ × … L(ξ_{T−2} × L(ξ_{T−1} × L(ξ_T | ν_{T−1}) | ν_{T−2}) | ν_{T−3}) … | ν₁)).

• Conversely, to every nested distribution one may associate a standard tree process ω_t and a value process ξ_t = ξ_t(ω_t) such that its nested distribution is P.

Theorem A.2. Let P resp. P̄ be two nested distributions in Ξ and let (Ω, F(pred), P, ξ) resp. (Ω̄, F̄(pred), P̄, ξ̄) be the pertaining value-and-information structures. Then the distance dl(P, P̄) equals the solution of the optimization problem (5.2).

Remark 7. The theorem is formulated for the situation r = 1 only, but can be generalized with obvious modifications to arbitrary r.

⁸L(ξ) is the probability law of ξ.

Proof. The proof goes in two steps. First, we show that we may assume that the two scenario processes ξ and ξ̄ are final, i.e. that nonzero values appear only in the last stage. To this end, define for each nested distribution the final nested distribution P^f in the following way:

ξ^f_t(ω) := 0 for t = 0, ..., T − 1,
ξ^f_T(ω) := (ξ_T(ω), ξ_{T−1}(pred_{T−1}(ω)), ..., ξ₁(pred₁(ω)), ξ₀).

Notice that ξ^f_T takes values in the metric space

(ℝ^{n₀} × ℝ^{n₁} × … × ℝ^{n_T}, d₀ + d₁ + … + d_T).

We show that

dl(P, P̄) = dl(P^f, P̄^f).
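The construction of the final process can be written down directly; the path encoding below (the histories of the first tree of Figure 8.2 with p = 1/2) is our own illustration, not notation from the text:

```python
# Sketch of the final process xi^f of Appendix A: all values are pushed to
# the last stage, each leaf carrying the whole history; earlier stages are 0.
T = 2
paths = {(2, 2, 3): 0.5, (2, 2, 1): 0.5}   # history (xi_0, xi_1, xi_2) -> prob

def make_final(history):
    # xi^f_t = 0 for t < T, and xi^f_T = (xi_T, xi_{T-1}, ..., xi_0)
    return (0,) * T + (tuple(reversed(history)),)

final = {make_final(h): p for h, p in paths.items()}
print(final)   # both scenarios are now constant 0 until the last stage
```

The transformed process carries the same information as the original one, only encoded entirely in the terminal values, which is what allows the induction in the proof below to treat the distances dl_t as plain Wasserstein distances.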

The proof of this assertion goes by induction. It is trivial for T = 1. Suppose it has been proved for depth T. By the recursive character of the nested distance, to show the assertion for T + 1 it suffices to show it for T = 2. The nested distance of two distributions with depth 2 equals d₀(ξ₀, ξ̄₀) + d₁(L(ξ₁), L(ξ̄₁)). If d(P, P̄) is the Wasserstein distance of two probability measures with respect to the distance d and we define a new "distance" d_a(v₁, v₂) := d(v₁, v₂) + a, then d_a(P, P̄) = d(P, P̄) + a. Thus, with a = d₀(ξ₀, ξ̄₀),

d₀(ξ₀, ξ̄₀) + d₁(L(ξ₁), L(ξ̄₁)) = (d₁)_a(L(ξ₁), L(ξ̄₁)) = d₁(L(ξ₀, ξ₁), L(ξ̄₀, ξ̄₁)),

which is the asserted equality.

For a final nested distribution, the distances dl_t are given by the Wasserstein distances in P₁(Ξ_{t+1}), implying that Ξ = P₁(P₁(… (P₁(Ξ_T)) …)). For two final nested distributions a correct transportation measure π must only respect the nested structure of the conditional distributions. At each level one should find an optimal transportation plan for every pair of sub-trees. When combining them to an overall transportation plan π, the new plan should satisfy the conditions

π[A × Ω̄ | F_t ⊗ F̄_t](ω, ω̄) = P[A | F_t](ω)   (A ∈ F_T, t ∈ T),
π[Ω × B | F_t ⊗ F̄_t](ω, ω̄) = P̄[B | F̄_t](ω̄)   (B ∈ F̄_T, t ∈ T),

which is identical to (5.2) in Definition 5.1.