

High-Confidence Data-Driven Ambiguity Sets for Time-Varying Linear Systems*

Dimitris Boskos†, Jorge Cortés‡, and Sonia Martínez‡

Abstract. This paper builds Wasserstein ambiguity sets for the unknown probability distribution of dynamic random variables leveraging noisy partial-state observations. The constructed ambiguity sets contain the true distribution of the data with quantifiable probability and can be exploited to formulate robust stochastic optimization problems with out-of-sample guarantees. We assume the random variable evolves in discrete time under uncertain initial conditions and dynamics, and that noisy partial measurements are available. All random elements have unknown probability distributions and we make inferences about the distribution of the state vector using several output samples from multiple realizations of the process. To this end, we leverage an observer to estimate the state of each independent realization and exploit the outcome to construct the ambiguity sets. We illustrate our results in an economic dispatch problem involving distributed energy resources over which the scheduler has no direct control.

Key words. Distributional uncertainty, Wasserstein ambiguity sets, stochastic systems, state estimation

AMS subject classifications. 62M20, 62G35, 90C15, 93C05, 93E10

1. Introduction. Decisions under uncertainty are ubiquitous in a wide range of engineering applications. Faced with complex systems that include components with probabilistic models, such decisions seek to provide rigorous solutions with quantifiable guarantees in hedging against uncertainty. In practice, the designer makes inferences about uncertain elements based on collected data and exploits them to formulate data-driven stochastic optimization problems. This decision-making paradigm has found applications in finance, communications, control, medicine, and machine learning. Recent research focuses on how to retain high-confidence guarantees for the optimization problems under plausible variations of the data. To this end, distributionally robust optimization (DRO) formulations evaluate the optimal worst-case performance over an ambiguity set of probability distributions that contains the true one with high confidence. Such ambiguity sets are typically constructed under the assumption that data are generated from a static distribution and can be measured in a direct manner. In this paper we significantly expand the class of scenarios for which reliable ambiguity sets can be constructed. We consider scenarios where the random variable is dynamic and partial measurements, corrupted by noise, are progressively collected from its evolving distribution. In our analysis, we exploit the underlying dynamics and study how the probabilistic properties of the noise affect the ambiguity set size while maintaining the same guarantees.

Literature review: Optimal decision problems in the face of uncertainty, like expected-cost minimization and chance-constrained optimization, are the cornerstones of stochastic programming [39]. Distributionally robust versions of stochastic optimization problems [2, 5, 38] carry out a worst-case optimization over all possibilities from an ambiguity set of probability

*This work was supported by the DARPA Lagrange program through award N66001-18-2-4027. A preliminary version appeared as [7] at the American Control Conference.
†Delft Center for Systems and Control, Delft University of Technology ([email protected]).
‡Department of Mechanical and Aerospace Engineering, University of California, San Diego (cortes,[email protected]).

This manuscript is for review purposes only.

distributions. This is of particular importance in data-driven scenarios where the unknown distributions of the random variables are inferred in an approximate manner using a finite amount of data [3]. To hedge this uncertainty, optimal transport ambiguity sets have emerged as a promising tool. These sets typically group all distributions up to some distance from the empirical approximation in the Wasserstein metric [42]. There are several reasons that make this metric a popular choice among the distances between probability distributions, particularly for data-driven problems. Most notably, the Wasserstein metric penalizes horizontal dislocations between distributions, provides ambiguity sets that have finite-sample guarantees of containing the true distribution, and leads to tractable optimization problems. This has rendered the convergence of empirical measures in the Wasserstein distance an ongoing active research area [15, 16, 18, 24, 43, 44]. Towards the exploitation of Wasserstein ambiguity sets for DRO problems, the work [17] introduces tractable reformulations with finite-sample guarantees, further exploited in [11, 23] to deal with distributionally robust chance-constrained programs. The work [13] develops distributed optimization algorithms using Wasserstein balls, while optimal transport ambiguity sets have recently been connected to regularization for machine learning [4, 19, 36]. The paper [28] exploits Wasserstein balls to robustify data-driven online optimization algorithms, and [37] leverages them for the design of distributionally robust Kalman filters. Further applications of Wasserstein ambiguity sets include the synthesis of robust control policies for Markov decision processes [46] and their data-driven extensions [47], and regularization for stochastic predictive control algorithms [14]. Several recent works have also devoted attention to distributionally robust problems in power systems control, including optimal power flow [22, 25] and economic dispatch [31, 35, 45]. Time-varying aspects of Wasserstein ambiguity sets are considered in our previous work: in [26] for dynamic traffic models, in [27] for online learning of unknown dynamical environments, in [8], which constructs ambiguity balls using progressively assimilated dynamic data for processes with random initial conditions that evolve under deterministic dynamics, and in [10], which studies the propagation of ambiguity bands under hyperbolic PDE dynamics. In contrast, in the present work, the state distribution does not evolve deterministically due to the presence of random disturbances, which, together with output measurements that are corrupted by noise, generate additional stochastic elements that make the quantification of the ambiguity set guarantees challenging.

Statement of contributions: Our contributions revolve around building Wasserstein ambiguity sets with probabilistic guarantees for dynamic random variables when we have no knowledge of the probability distributions of their initial condition, the disturbances in their dynamics, and the measurement noise. To this end, our first contribution estimates the states of several process realizations from output samples and exploits these estimates to build a suitable empirical distribution as the center of an ambiguity ball. Our second contribution is the exploitation of concentration of measure results to quantify the radius of this ambiguity ball so that it provably contains the true state distribution with high probability. To achieve this, we break the radius into nominal and noise components. The nominal component captures the deviation between the true distribution and the empirical distribution formed by the state realizations. The noise component captures the deviation between the empirical distribution and the center of our ambiguity ball. To quantify the latter, we carefully evaluate the impact of the estimation error, which, due to the measurement noise, does not have a compactly supported distribution like the internal uncertainty and requires a separate analysis. Our third contribution is the extension of these results to obtain simultaneous guarantees about ambiguity sets that are built along finite time horizons, instead of at isolated time instances. The fourth contribution is to generalize a concentration inequality around the mean of sufficiently light-tailed independent random variables, which enables us to obtain tighter results when analyzing the effect of the estimation error. Our last contribution is the validation of the results in simulation for a distributionally robust economic dispatch problem, for which we further provide a tractable reformulation. We stress that our general objective revolves around the robust uncertainty quantification (i.e., distributional inference) problem at hand, without having DRO as a necessary end-goal. Further, our approach is fundamentally different from classical Kalman filtering, where the initial state and dynamics noise distributions are known and Gaussian, and hence, the state distribution over time is also a known Gaussian random variable. Here, instead, we are interested in inferring the unknown state distribution from data collected from multiple realizations of the dynamics. For each such realization, we use an observer, since we have no concrete knowledge of the state and noise random models to directly invoke optimal filtering techniques. In the online version [9] of this manuscript, we provide explicit constants for several of the presented concentration of measure inequalities which, to the best of our knowledge, have not been delineated in the literature. These results are not essential for keeping the theoretical presentation self-contained and are omitted due to space constraints.

2. Preliminaries. Here we present general notation and concepts from probability theory used throughout the paper.

Notation: We denote by ‖·‖_p the p-th norm in R^n, p ∈ [1, ∞], using also the notation ‖·‖ ≡ ‖·‖_2 for the Euclidean norm. The inner product of two vectors a, b ∈ R^n is denoted by ⟨a, b⟩, and the Khatri-Rao product [32] of a ≡ (a_1, . . . , a_d) ∈ R^d and b ≡ (b_1, . . . , b_d) ∈ R^{dn}, with each b_i belonging to R^n, is a ∗ b := (a_1 b_1, . . . , a_d b_d) ∈ R^{dn}. We use the notation B^n_p(ρ) for the ball of center zero and radius ρ in R^n with the p-th norm, and [n_1 : n_2] for the set of integers {n_1, n_1 + 1, . . . , n_2} ⊂ N ∪ {0} =: N_0. The interpretation of a vector in R^n as an n × 1 matrix should be clear from the context (this avoids writing double transposes). The diameter of a set S ⊂ R^n with the p-th norm is defined as diam_p(S) := sup{‖x − y‖_p | x, y ∈ S}, and for z ∈ R^n, S + z := {x + z | x ∈ S}. We denote the induced Euclidean norm of a matrix A ∈ R^{m×n} by ‖A‖ := max_{‖x‖=1} ‖Ax‖. Given B ⊂ Ω, 1_B is the indicator function of B on Ω, with 1_B(x) = 1 for x ∈ B and 1_B(x) = 0 for x ∉ B.
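For concreteness, the Khatri-Rao product as defined above (each scalar a_i scaling the corresponding block b_i of b) can be sketched in a few lines of Python; the function name and the data are illustrative, not part of the paper:

```python
import numpy as np

def khatri_rao(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Khatri-Rao product a * b of a in R^d and b in R^{dn}:
    each scalar a_i multiplies the i-th block b_i in R^n."""
    d = a.shape[0]
    assert b.shape[0] % d == 0, "b must consist of d equally sized blocks"
    n = b.shape[0] // d
    # Scale block i of b by a_i and flatten back to R^{dn}.
    return (a[:, None] * b.reshape(d, n)).reshape(-1)

a = np.array([2.0, -1.0])           # d = 2
b = np.array([1.0, 0.0, 3.0, 4.0])  # blocks b_1 = (1, 0), b_2 = (3, 4)
print(khatri_rao(a, b))             # -> [ 2.  0. -3. -4.]
```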

Probability Theory: We denote by B(R^d) the Borel σ-algebra on R^d, and by P(R^d) the probability measures on (R^d, B(R^d)). For any p ≥ 1, P_p(R^d) := {µ ∈ P(R^d) | ∫_{R^d} ‖x‖^p dµ < ∞} is the set of probability measures in P(R^d) with finite p-th moment. The Wasserstein distance between µ, ν ∈ P_p(R^d) is

    W_p(µ, ν) := ( inf_{π ∈ H(µ,ν)} ∫_{R^d × R^d} ‖x − y‖^p π(dx, dy) )^{1/p},

where H(µ, ν) is the set of all couplings between µ and ν, i.e., probability measures on R^d × R^d with marginals µ and ν, respectively. For any µ ∈ P(R^d), its support is the closed set supp(µ) := {x ∈ R^d | µ(U) > 0 for each neighborhood U of x}, or equivalently, the smallest closed set with measure one. For a random variable X with distribution µ we also denote supp(X) ≡ supp(µ). We denote the product of the distributions µ in R^d and ν in R^r by the distribution µ ⊗ ν in R^d × R^r. The convolution µ ⋆ ν of the distributions µ and ν on R^d is the image of the measure µ ⊗ ν on R^d × R^d under the mapping (x, y) ↦ x + y; equivalently, µ ⋆ ν(B) = ∫_{R^d × R^d} 1_B(x + y) µ(dx) ν(dy) for any B ∈ B(R^d) (cf. [6, Pages 207, 208]). Given a measurable space (Ω, F), an exponent p ≥ 1, the convex function R ∋ x ↦ ψ_p(x) := exp(x^p) − 1, and the linear space of scalar random variables L_{ψ_p} := {X | E[ψ_p(|X|/t)] < ∞ for some t > 0} on (Ω, F), the ψ_p-Orlicz norm (cf. [41, Section 2.7.1]) of X ∈ L_{ψ_p} is

    ‖X‖_{ψ_p} := inf{t > 0 | E[ψ_p(|X|/t)] ≤ 1}.

When p = 1 and p = 2, each random variable in L_{ψ_p} is sub-exponential and sub-Gaussian, respectively. We also denote by ‖X‖_p ≡ (E[|X|^p])^{1/p} the norm of a scalar random variable with finite p-th moment, i.e., the classical norm in L_p(Ω) ≡ L_p(Ω; P_X), where P_X is the distribution of X. The interpretation of ‖·‖_p as the p-th norm of a vector in R^n or a random variable in L_p should be clear from the context throughout the paper. Given a set {X_i}_{i∈I} of random variables, we denote by σ({X_i}_{i∈I}) the σ-algebra generated by them. We conclude with a useful technical result, which follows from Fubini's theorem [1, Theorem 2.6.5].

Lemma 2.1. (Expectation inequality). Consider the independent random vectors X and Y, taking values in R^{n_1} and R^{n_2}, respectively, and let (x, y) ↦ g(x, y) be integrable. Assume that E[g(x, Y)] ≥ k(x) for some integrable function k and all x ∈ K with supp(X) ⊂ K ⊂ R^{n_1}. Then, E[g(X, Y)] ≥ E[k(X)].
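As a quick numeric illustration of the Wasserstein distance defined above: on the real line, the optimal coupling between two empirical distributions with equally many, equally weighted atoms matches sorted samples (the monotone rearrangement), which yields an exact closed form. The helper below is an illustrative Python sketch, not part of the paper:

```python
import numpy as np

def wasserstein_1d(x, y, p=2):
    """Exact W_p between two empirical distributions on R with equally
    many, equally weighted atoms: the optimal coupling in one dimension
    pairs sorted samples (monotone rearrangement)."""
    x, y = np.sort(np.asarray(x, float)), np.sort(np.asarray(y, float))
    assert x.shape == y.shape
    return np.mean(np.abs(x - y) ** p) ** (1.0 / p)

# Translating every atom by c moves the distribution by exactly c in any W_p.
x = np.array([0.0, 1.0, 3.0])
print(wasserstein_1d(x, x + 0.5, p=1))  # -> 0.5
print(wasserstein_1d(x, x + 0.5, p=2))  # -> 0.5
```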

3. Problem formulation. Consider a stochastic optimization problem where the objective function x ↦ f(x, ξ) depends on a random variable ξ with an unknown distribution P_ξ. To hedge this uncertainty, rather than using the empirical distribution

    P^N_ξ := (1/N) ∑_{i=1}^{N} δ_{ξ_i},    (3.1)

formed by N i.i.d. samples ξ_1, . . . , ξ_N of P_ξ to optimize a sample average approximation of the expected value of f, one can instead consider the DRO problem

    inf_{x ∈ X} sup_{P ∈ P^N} E_P[f(x, ξ)],    (3.2)

of evaluating the worst-case expectation over some ambiguity set P^N of probability measures. This helps the designer robustify the decision against plausible variations of the data, which can play a significant role when the number of samples is limited. Different approaches exist to construct the ambiguity set P^N so that it contains the true distribution P_ξ with high confidence. We are interested in approaches that employ data, and in particular the empirical distribution P^N_ξ, to construct them. In the present setup, the data is generated by a dynamical system subject to disturbances, and we only collect partial (instead of full) measurements that are distorted by noise. Therefore, it is no longer obvious how to build a candidate state distribution as in (3.1) from the collected samples. Further, we seek to address this in a distributionally robust way, i.e., finding a suitable replacement P̂^N_ξ for (3.1), together with an associated ambiguity set, by exploiting the dynamics of the underlying process.
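As an illustration of (3.2): when the ambiguity set is a Wasserstein-1 ball of radius ε around the empirical distribution (Euclidean ground metric, support all of R^d) and the objective is linear in ξ, the inner supremum admits the closed form E_{P^N}[⟨a, ξ⟩] + ε‖a‖_2, a consequence of duality results of the type developed in [17]. The sketch below uses illustrative data and checks that translating every sample ε units along a/‖a‖ attains the bound:

```python
import numpy as np

def worst_case_linear(a, samples, eps):
    """Worst-case expectation of <a, xi> over the W1 ball of radius eps
    around the empirical distribution of `samples` (Euclidean metric,
    unbounded support): empirical mean plus eps * ||a||_2."""
    a = np.asarray(a, float)
    return np.mean(samples @ a) + eps * np.linalg.norm(a)

rng = np.random.default_rng(0)
a = np.array([1.0, -2.0])
samples = rng.normal(size=(100, 2))
eps = 0.3
wc = worst_case_linear(a, samples, eps)

# Translating every sample eps units along a/||a|| has W1 distance exactly
# eps from the empirical distribution and attains the worst-case value.
shifted = samples + eps * a / np.linalg.norm(a)
print(np.isclose(np.mean(shifted @ a), wc))  # -> True
```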

To make things precise, consider data generated by a discrete-time system

    ξ_{k+1} = A_k ξ_k + G_k w_k,    ξ_k ∈ R^d, w_k ∈ R^q,    (3.3a)

with linear output

    ζ_k = H_k ξ_k + v_k,    ζ_k ∈ R^r.    (3.3b)

The initial condition ξ_0 and the noises w_k and v_k, k ∈ N_0, in the dynamics and the measurements, respectively, are random variables with an unknown distribution. We seek to build an ambiguity set for the state distribution at a certain time ℓ ∈ N, by collecting data up to time ℓ from multiple independent realizations of the process, denoted by ξ^i, i ∈ [1 : N]. This can occur, for instance, when the same process is executed repeatedly, or in multi-agent scenarios where identical entities are subject to the same dynamics, see e.g. [48]. The time-dependent matrices in the dynamics (3.3) widen the applicability of the results, since they can capture the linearization of nonlinear systems along trajectories or the sampled-data analogues of continuous-time systems under irregular sampling, even if the latter are linear and time invariant. To formally describe the problem, we consider a probability space (Ω, F, P) containing all random elements from these realizations, and make the following sampling assumption.
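The data-generation mechanism (3.3) can be sketched as follows; the matrices, noise scales, horizon, and number of realizations are illustrative placeholders chosen only so that ξ_0 and w_k have compact supports and the output is a noisy scalar measurement:

```python
import numpy as np

rng = np.random.default_rng(1)
d, q, r, ell, N = 2, 1, 1, 5, 3   # state/noise/output dims, horizon, realizations

# Time-varying system data (illustrative choices, not from the paper).
A = [np.array([[1.0, 0.1 * k], [0.0, 0.9]]) for k in range(ell)]
G = [np.array([[0.0], [1.0]]) for k in range(ell)]
H = [np.array([[1.0, 0.0]]) for k in range(ell + 1)]

def realize():
    """One realization of (3.3): xi_{k+1} = A_k xi_k + G_k w_k, with
    noisy partial outputs zeta_k = H_k xi_k + v_k."""
    xi = rng.uniform(-1, 1, size=d)             # compactly supported xi_0
    outputs = []
    for k in range(ell + 1):
        v = rng.normal(scale=0.05, size=r)      # measurement noise
        outputs.append(H[k] @ xi + v)
        if k < ell:
            w = rng.uniform(-0.1, 0.1, size=q)  # compactly supported w_k
            xi = A[k] @ xi + G[k] @ w
    return xi, outputs

states_at_ell = [realize()[0] for _ in range(N)]  # samples of xi_ell
print(len(states_at_ell), states_at_ell[0].shape)  # -> 3 (2,)
```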

Assumption 3.1. (Sampling schedule). For each realization i of system (3.3), output samples ζ^i_0, . . . , ζ^i_ℓ are collected over the discrete time instants of the sampling horizon [0 : ℓ].

According to this assumption, the measurements of all realizations are collected over the same time window [0 : ℓ]. To obtain quantifiable characterizations of the ambiguity sets, we require some further hypotheses on the classes of the distributions P_{ξ_0} of the initial condition, P_{w_k} of the dynamics noise, and P_{v_k} of the measurement errors (cf. Figure 1). These assumptions are made for individual realizations and allow us to consider non-identical observation error distributions; in this way, we allow for the case where each realization is measured by a non-identical sensor of variable precision.

Assumption 3.2. (Distribution classes). Consider a finite sequence of realizations ξ^i, i ∈ [1 : N], of (3.3a) with associated outputs given by (3.3b) and noise elements w^i_k, v^i_k, k ∈ N_0. We assume the following:

H1: The distributions P_{ξ^i_0}, i ∈ [1 : N], are identical; further, the distributions P_{w^i_k}, i ∈ [1 : N], are identical for all k ∈ N_0.

H2: The sigma fields σ({ξ^i_0} ∪ {w^i_k}_{k ∈ N_0}), σ({v^i_k}_{k ∈ N_0}), i ∈ [1 : N], are independent.

H3: The supports of the distributions P_{ξ^i_0} and P_{w^i_k}, k ∈ N_0, are compact, centered at the origin, and have diameters 2ρ_{ξ_0} and 2ρ_w, respectively, for all i.

H4: The components of the random vectors v^i_k have uniformly bounded L_p and ψ_p-Orlicz norms, as follows:

    0 < m_v ≤ ‖v^i_{k,l}‖_p ≤ M_v,    ‖v^i_{k,l}‖_{ψ_p} ≤ C_v,

for all k ∈ N_0, i ∈ [1 : N], and l ∈ [1 : r], where p ≥ 1.


Remark 3.3. (Bounded ψ_p-Orlicz/L_p-norm ratio). By definition, ψ_p-Orlicz norms can become significantly larger than L_p norms for random variables with heavier tails. Thus, over an infinite sequence of random variables X_k, the ratio ‖X_k‖_{ψ_p}/‖X_k‖_p may grow unbounded. We exclude this by assuming that C_v and m_v are either simultaneously positive or simultaneously zero; in the latter case, we set C_v/m_v := 0.
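Both norms in the remark can be computed in closed form for simple distributions. For a Rademacher variable X with P[X = ±1] = 1/2 and ψ_2(x) = exp(x²) − 1, we have E[ψ_2(|X|/t)] = exp(1/t²) − 1 ≤ 1 exactly when t ≥ 1/√(ln 2), so ‖X‖_{ψ_2} = 1/√(ln 2) ≈ 1.201 while ‖X‖_2 = 1. The Python sketch below (illustrative, not part of the paper) recovers this value numerically from the definition by bisection:

```python
import math

# Closed-form psi_2-Orlicz norm of a Rademacher variable.
orlicz_psi2 = 1.0 / math.sqrt(math.log(2.0))

def orlicz_psi2_rademacher_numeric(tol=1e-12):
    """Bisection for inf{t > 0 : E[psi_2(|X|/t)] <= 1}, where for a
    Rademacher X the expectation is exp(1/t^2) - 1 exactly."""
    lo, hi = 1e-6, 10.0
    while hi - lo > tol:
        t = 0.5 * (lo + hi)
        if math.exp(1.0 / t**2) - 1.0 <= 1.0:
            hi = t   # constraint satisfied: try a smaller t
        else:
            lo = t
    return hi

print(abs(orlicz_psi2_rademacher_numeric() - orlicz_psi2) < 1e-9)  # -> True
```

Here ‖X‖_2 = 1 while ‖X‖_{ψ_2} > 1, a minimal instance of the ratio the remark keeps bounded.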

Figure 1. Illustration of the probabilistic models for the random variables in the dynamics and observations according to Assumption 3.2.

A direct approach to build the ambiguity set using the measurements of the trajectories at time ℓ would be severely limited, since the output map is in general not invertible. In such a case, the inverse image of each measurement is the translation of a subspace, whose location is further obscured by the measurement noise. As a consequence, candidate states for a generated output sample may lie at an arbitrary distance apart, which could only be bounded by making additional hypotheses about the support of the state distribution. Instead, despite the lack of full-state information, we aim to leverage the system dynamics to estimate the state from the whole assimilated output trajectory. To guarantee some boundedness notion for the state estimation errors over arbitrary evolution horizons, we make the following assumption.

Assumption 3.4. (Detectability/uniform observability). System (3.3) satisfies one of the following properties:

(i) It is time invariant and the pair (A, H) (with A ≡ A_k and H ≡ H_k) is detectable.

(ii) It is uniformly observable, i.e., for some t ∈ N, the observability Gramian

    O_{k+t,k} := ∑_{i=k}^{k+t} Φ_{i,k}^⊤ H_i^⊤ H_i Φ_{i,k}

satisfies O_{k+t,k} ⪰ bI for certain b > 0 and all k ∈ N_0, where we denote Φ_{k+s,k} := A_{k+s−1} · · · A_{k+1} A_k. Further, all system matrices are uniformly bounded, and the singular values of A_k and the norms ‖H_k‖ are uniformly bounded below.
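Uniform observability in (ii) can be checked numerically by forming the Gramian and bounding its smallest eigenvalue away from zero. The sketch below uses illustrative system data (not from the paper):

```python
import numpy as np

def gramian(A_seq, H_seq, k, t):
    """O_{k+t,k} = sum_{i=k}^{k+t} Phi_{i,k}^T H_i^T H_i Phi_{i,k},
    with Phi_{k,k} = I and Phi_{i,k} = A_{i-1} ... A_k."""
    d = A_seq[0].shape[0]
    Phi = np.eye(d)
    O = np.zeros((d, d))
    for i in range(k, k + t + 1):
        O += Phi.T @ H_seq[i].T @ H_seq[i] @ Phi
        if i < k + t:
            Phi = A_seq[i] @ Phi   # Phi_{i+1,k} = A_i Phi_{i,k}
    return O

# A single-output system observed over two steps (illustrative data).
A = [np.array([[1.0, 1.0], [0.0, 1.0]])] * 3
H = [np.array([[1.0, 0.0]])] * 4
O = gramian(A, H, k=0, t=1)
b = min(np.linalg.eigvalsh(O))
print(b > 0)   # -> True: O_{1,0} >= b I for some b > 0
```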

Problem statement: Under Assumptions 3.1 and 3.2 on the measurements and distributions of N realizations of the system (3.3), we seek to construct an estimator ξ̂^i_ℓ(ζ^i_0, . . . , ζ^i_ℓ) for the state of each realization and build an ambiguity set for the state distribution at time ℓ with probabilistic guarantees. Further, under Assumption 3.4 on the system's detectability/uniform observability properties, we aim to characterize the effect of the estimation precision on the accuracy of the ambiguity sets.

We proceed to address the problem in Section 4 by exploiting a Luenberger observer to estimate the states of the collected data and using them to replace the classical empirical distribution (3.1) in the construction of the ambiguity set. To obtain the probabilistic guarantees, we leverage concentration inequalities to bound the distance between the updated empirical distribution and the true state distribution with high confidence. To this end, we further quantify the increase of the ambiguity radius due to the noise. We also study the beneficial effect on the ambiguity radius of detectability/uniform observability for arbitrarily long evolution horizons in Section 5.

4. State-estimator based ambiguity sets. We address here the question of how to construct an ambiguity set at a certain time instant ℓ, when samples are collected from (3.3) according to Assumption 3.1. If we had access to N independent full-state samples ξ^1_ℓ, . . . , ξ^N_ℓ from the distribution of ξ at ℓ, we could construct an ambiguity ball in the Wasserstein metric W_p centered at the empirical distribution (3.1), with ξ_i ≡ ξ^i_ℓ, and containing the true distribution with high confidence. In particular, for any confidence 1 − β > 0, it is possible, cf. [17, Theorem 3.5], to specify an ambiguity ball radius ε_N(β) so that the true distribution of ξ_ℓ is in this ball with confidence 1 − β, i.e.,

    P(W_p(P^N_{ξ_ℓ}, P_{ξ_ℓ}) ≤ ε_N(β)) ≥ 1 − β.

Instead, since we can only collect noisy partial measurements of the state, we use a Luenberger observer to estimate ξ at time ℓ. The dynamics of the observer, initialized at zero, is given by

    ξ̂_{k+1} = A_k ξ̂_k + K_k(H_k ξ̂_k − ζ_k),    ξ̂_0 = 0,    (4.1)

where each K_k is a nonzero gain matrix. Using the corresponding estimates from system (4.1) for the independent realizations of (3.3a), we define the (dynamic) estimator-based empirical distribution

    P̂^N_{ξ̂_k} := (1/N) ∑_{i=1}^{N} δ_{ξ̂^i_k}.    (4.2)

Denoting by e_k := ξ_k − ξ̂_k the error between (3.3a) and the observer (4.1), the error dynamics is e_{k+1} = F_k e_k + G_k w_k + K_k v_k, e_0 = ξ_0, where F_k := A_k + K_k H_k and ξ_0 is the initial condition of (3.3a). In particular,

    e_k = Ψ_k ξ_0 + ∑_{κ=1}^{k} (Ψ_{k,k−κ+1} G_{k−κ} w_{k−κ} + Ψ_{k,k−κ+1} K_{k−κ} v_{k−κ})    (4.3)

for all k ≥ 1, where Ψ_{k+s,k} := F_{k+s−1} · · · F_{k+1} F_k, Ψ_{k,k} := I, and Ψ_k := Ψ_{k,0}. To build the ambiguity set at time ℓ, we set its center at the estimator-based empirical distribution P̂^N_{ξ̂_ℓ} given by (4.2). In what follows, we leverage concentration of measure results to identify an ambiguity radius ψ_N(β) so that the resulting Wasserstein ball contains the true distribution with a given confidence 1 − β. Note that even if a distributionally robust framework is not employed, replacing the empirical distribution by the estimator empirical distribution in (4.2) no longer guarantees consistency, in the sense that the estimator empirical distribution does not necessarily converge (weakly) to the true distribution. Hence, there is also no indication that the solution to the associated estimator Sample Average Approximation (SAA) problem, i.e., to

    inf_{x ∈ X} (1/N) ∑_{i=1}^{N} f(x, ξ^i_ℓ)

with ξ^i_ℓ replaced by ξ̂^i_ℓ, will be a consistent estimator of the solution to the nominal stochastic optimization problem. This is a fundamental limitation, justified by the fact that, in general, the estimation error depends on the state realization, i.e., it has a variable distribution when conditioned on the state and the internal noise, and so its effect cannot be easily reversed (this may only be possible in rather degenerate cases, e.g., when one has access to full-state samples and the measurement noise is known).
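The observer (4.1) and the error recursion e_{k+1} = F_k e_k + G_k w_k + K_k v_k can be validated numerically. In the sketch below, the gains K_k and all system data are illustrative placeholders (not tuned for stability); the check is only that the recursion reproduces ξ_k − ξ̂_k exactly:

```python
import numpy as np

rng = np.random.default_rng(2)
d, q, r, ell = 2, 1, 1, 6

# Illustrative system and observer data (not from the paper).
A = [np.array([[1.0, 0.1], [0.0, 0.95]])] * ell
G = [np.array([[0.0], [1.0]])] * ell
H = [np.array([[1.0, 0.0]])] * (ell + 1)
K = [np.array([[-0.8], [-0.3]])] * ell   # illustrative observer gains

xi = rng.uniform(-1, 1, size=d)   # xi_0
xi_hat = np.zeros(d)              # observer initialized at zero, cf. (4.1)
e = xi - xi_hat                   # e_0 = xi_0
for k in range(ell):
    w = rng.uniform(-0.1, 0.1, size=q)
    v = rng.normal(scale=0.05, size=r)
    zeta = H[k] @ xi + v                                         # (3.3b)
    xi_next = A[k] @ xi + G[k] @ w                               # (3.3a)
    xi_hat_next = A[k] @ xi_hat + K[k] @ (H[k] @ xi_hat - zeta)  # (4.1)
    F = A[k] + K[k] @ H[k]
    e = F @ e + G[k] @ w + K[k] @ v          # error recursion
    xi, xi_hat = xi_next, xi_hat_next

print(np.allclose(e, xi - xi_hat))  # -> True: recursion matches xi_k - xi_hat_k
```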

Note that the random variable ξ^i_k of a system realization at time k is a function ξ^i_k(ξ^i_0, w^i_k) of the random initial condition ξ^i_0 and the dynamics noise w^i_k ≡ (w^i_0, . . . , w^i_{k−1}). Analogously, the estimated state ξ̂^i_k of each observer realization is a stochastic variable ξ̂^i_k(ξ^i_0, w^i_k, v^i_k) with additional randomness induced by the output noise v^i_k ≡ (v^i_0, . . . , v^i_{k−1}). Using the compact notation ξ_0 ≡ (ξ^1_0, . . . , ξ^N_0), w_k ≡ (w^1_k, . . . , w^N_k), and v_k ≡ (v^1_k, . . . , v^N_k) for the corresponding initial conditions, dynamics noise, and output noise of all realizations, respectively, we can denote the empirical and estimator-based-empirical distributions at time ℓ as P^N_{ξ_ℓ}(ξ_0, w_ℓ) and P̂^N_{ξ̂_ℓ}(ξ_0, w_ℓ, v_ℓ). If we view the initial conditions and the corresponding internal noise of the realizations ξ^i over the whole time horizon as deterministic quantities, we use the alternative notation P^N_{ξ_ℓ}(z, ω) and P̂^N_{ξ̂_ℓ}(z, ω, v_ℓ) for the corresponding distributions, where z = (z^1, . . . , z^N), z^1 ≡ ξ^1_0, . . . , z^N ≡ ξ^N_0, and ω = (ω^1, . . . , ω^N), ω^1 ≡ w^1_ℓ, . . . , ω^N ≡ w^N_ℓ. We also denote by P_{ξ_ℓ} the true distribution of the data at discrete time ℓ, where, from (3.3a),

    ξ_ℓ = Φ_ℓ ξ_0 + ∑_{k=1}^{ℓ} Φ_{ℓ,ℓ−k+1} G_{ℓ−k} w_{ℓ−k},    (4.4)

where Φ_ℓ := Φ_{ℓ,0} and Φ_{ℓ,ℓ} := I (and with Φ_{k+s,k} defined in Assumption 3.4). Then, it follows from H1 and H2 in Assumption 3.2 that the random states ξ^i_ℓ of the system realizations are independent and identically distributed. Leveraging this, our goal is to associate to each confidence 1 − β an ambiguity radius ψ_N(β) so that

    P(W_p(P̂^N_{ξ̂_ℓ}, P_{ξ_ℓ}) ≤ ψ_N(β)) ≥ 1 − β.    (4.5)

To achieve this, we decompose the confidence as the product of two factors:

    1 − β = (1 − β_nom)(1 − β_ns).    (4.6)

The first factor (the nominal component "nom") is exploited to control the Wasserstein distance between the empirical distribution and the true state distribution P_{ξ_ℓ}. The purpose of the second factor (the noise component "ns") is to bound the Wasserstein distance between the true- and the estimator-based-empirical distributions, which is affected by the measurement noise. Using this decomposition, our strategy to get (4.5) builds on further breaking the ambiguity radius as

    ψ_N(β) := ε_N(β_nom) + ε_N(β_ns).    (4.7)

We exploit what is known [8] for the no-noise case to bound the nominal ambiguity radius ε_N(β_nom) with confidence 1 − β_nom. Moreover, we bound the noise ambiguity radius ε_N(β_ns) with confidence 1 − β_ns. This latter radius corresponds to the impact on distributional uncertainty of the internal and measurement noise. In the next sections we present the precise individual bounds for these terms and then combine them to obtain the overall ambiguity radius.
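The decomposition (4.6) leaves one degree of freedom in how the overall confidence is split. A minimal sketch (illustrative values) computes β_nom from β and a chosen β_ns:

```python
def split_confidence(beta, beta_ns):
    """Given overall violation level beta and a chosen noise level beta_ns,
    return beta_nom so that (1 - beta) = (1 - beta_nom)(1 - beta_ns),
    cf. (4.6)."""
    assert 0 < beta_ns < beta < 1
    return 1.0 - (1.0 - beta) / (1.0 - beta_ns)

beta, beta_ns = 0.05, 0.02
beta_nom = split_confidence(beta, beta_ns)
print(abs((1 - beta_nom) * (1 - beta_ns) - (1 - beta)) < 1e-12)  # -> True
```

With the two radii ε_N(β_nom) and ε_N(β_ns) in hand, the overall radius then follows from (4.7).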

4.1. Nominal ambiguity radius. According to Assumption 3.2, the initial condition and internal noise distributions are compactly supported, and hence the same holds for the state distribution along time. We will therefore use the following result, which focuses on compactly supported distributions and bounds the distance between the true and empirical distributions for any fixed confidence level.

Proposition 4.1 (Nominal ambiguity radius [8, Corollary 3.3]). Consider a sequence $(X_i)_{i\in\mathbb N}$ of i.i.d. $\mathbb R^d$-valued random variables with a compactly supported distribution $\mu$. Then for any $p \ge 1$, $N \ge 1$, and confidence $1-\beta$ with $\beta \in (0,1)$, we have $\mathbb P(W_p(\mu^N,\mu) \le \varepsilon_N(\beta,\rho)) \ge 1-\beta$, where
$$\varepsilon_N(\beta,\rho) := \begin{cases} \Big(\dfrac{\ln(C\beta^{-1})}{c}\Big)^{\frac{1}{2p}} \dfrac{\rho}{N^{\frac{1}{2p}}}, & \text{if } p > d/2,\\[1ex] h^{-1}\Big(\dfrac{\ln(C\beta^{-1})}{cN}\Big)^{\frac1p} \rho, & \text{if } p = d/2,\\[1ex] \Big(\dfrac{\ln(C\beta^{-1})}{c}\Big)^{\frac1d} \dfrac{\rho}{N^{\frac1d}}, & \text{if } p < d/2, \end{cases} \tag{4.8}$$
$\mu^N := \frac1N\sum_{i=1}^N \delta_{X_i}$, $\rho := \frac12\operatorname{diam}_\infty(\operatorname{supp}(\mu))$, $h(x) := \frac{x^2}{(\ln(2+1/x))^2}$, $x > 0$, and the constants $C$ and $c$ depend only on $p$ and $d$.

This result shows how the nominal ambiguity radius depends on the size of the distribution's support, the confidence level, and the number of samples, and is based on recent concentration of measure inequalities from [18]. Determining the constants $C$ and $c$ in (4.8) across the whole spectrum of data dimensions $d$ and Wasserstein exponents $p$ is a particularly cumbersome task. Nevertheless, in the online version [9, Section 8.2], we provide alternative concentration of measure results and use them to obtain explicit formulas for these constants when $d > 2p$.
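As an illustration, the piecewise formula (4.8) can be evaluated numerically once the constants are fixed. The following Python sketch uses placeholder values for $C$ and $c$ (the paper only asserts that they depend on $p$ and $d$) and inverts $h$ by bisection for the $p = d/2$ branch; it is a minimal sketch, not the paper's implementation.

```python
import math

def h(x):
    # h(x) = x^2 / (ln(2 + 1/x))^2, used in the p = d/2 branch of (4.8)
    return x**2 / math.log(2.0 + 1.0 / x)**2

def h_inv(y, lo=1e-12, hi=1e12):
    # h is strictly increasing on (0, inf): invert it by geometric bisection
    for _ in range(200):
        mid = math.sqrt(lo * hi)
        if h(mid) < y:
            lo = mid
        else:
            hi = mid
    return math.sqrt(lo * hi)

def eps_nominal(N, beta, rho, p, d, C=1.0, c=1.0):
    # Nominal ambiguity radius eps_N(beta, rho) from (4.8);
    # C and c are placeholder constants depending only on p and d.
    L = math.log(C / beta)
    if p > d / 2:
        return (L / c) ** (1 / (2 * p)) * rho / N ** (1 / (2 * p))
    elif p == d / 2:
        return h_inv(L / (c * N)) ** (1 / p) * rho
    else:
        return (L / c) ** (1 / d) * rho / N ** (1 / d)
```

For instance, with $p = 2$ and $d = 2$ the first branch applies and the radius decays as $N^{-1/4}$, illustrating the slow decay for larger Wasserstein exponents.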

4.2. Noise ambiguity radius. In this section, we quantify the noise ambiguity radius $\bar\varepsilon_N(\beta_{\rm ns})$ for any prescribed confidence $1-\beta_{\rm ns}$. We first give a result that uniformly bounds the distance between the true- and estimator-based empirical distributions with prescribed confidence for all values of the initial condition and the internal noise from the set $B_\infty^{Nd}(\rho_{\xi_0}) \times B_\infty^{N\ell q}(\rho_w)$, which contains the support of their joint distribution (and hence all their possible realizations). For the results of this section, the initial condition and the internal noise are interpreted as deterministic quantities, as discussed above.

Lemma 4.2 (Distance between true- & estimator-based empirical distributions). Let $(z,\omega) \in B_\infty^{Nd}(\rho_{\xi_0}) \times B_\infty^{N\ell q}(\rho_w)$ and consider the discrete distribution $P^N_{\xi_\ell} \equiv P^N_{\xi_\ell}(z,\omega)$ and the empirical distribution $\widehat P^N_{\xi_\ell} \equiv \widehat P^N_{\xi_\ell}(z,\omega,v_\ell)$, where $v_\ell$ is the measurement noise of the realizations. Then,
$$W_p(P^N_{\xi_\ell}, \widehat P^N_{\xi_\ell}) \le 2^{\frac{p-1}{p}} M_w + 2^{\frac{p-1}{p}} \Big(\frac1N \sum_{i=1}^N (E^i)^p\Big)^{\frac1p}, \quad\text{where} \tag{4.9a}$$
$$M_w := \sqrt d\,\|\Psi_\ell\|\rho_{\xi_0} + \sqrt q \sum_{k=1}^{\ell} \|\Psi_{\ell,\ell-k+1} G_{\ell-k}\| \rho_w, \tag{4.9b}$$
$$E^i \equiv E(v^i) := \sum_{k=1}^{\ell} \|\Psi_{\ell,\ell-k+1} K_{\ell-k}\| \|v^i_{\ell-k}\|_1. \tag{4.9c}$$

The next result gives bounds for the norms of the random variables $E^i$ in Lemma 4.2.

Lemma 4.3 (Orlicz- & $L^p$-norm bounds for $E^i$). The random variables $E^i$ in (4.9c) satisfy
$$\|E^i\|_p \le M_v := \overline M_v\, r \sum_{k=1}^{\ell} \|\Psi_{\ell,\ell-k+1} K_{\ell-k}\|, \tag{4.10a}$$
$$\|E^i\|_{\psi_p} \le C_v := \overline C_v\, r \sum_{k=1}^{\ell} \|\Psi_{\ell,\ell-k+1} K_{\ell-k}\|, \tag{4.10b}$$
$$\|E^i\|_p \ge m_v := \overline m_v\, r^{\frac1p} \Big(\sum_{k=1}^{\ell} \|\Psi_{\ell,\ell-k+1} K_{\ell-k}\|^p\Big)^{\frac1p}, \tag{4.10c}$$
with $\overline m_v$, $\overline M_v$, and $\overline C_v$ as given in H4.

The proofs of both results above are given in the Appendix. To bound the term $\big(\frac1N\sum_{i=1}^N (E^i)^p\big)^{\frac1p}$ and control the Wasserstein distance between the true- and the estimator-based empirical distribution, we further rely on the following concentration of measure result around the mean of nonnegative independent random variables, whose proof is also in the Appendix.

Proposition 4.4 (Concentration around $p$th mean). Let $X_1,\dots,X_N$ be scalar, nonnegative, independent random variables with finite $\psi_p$ norm and $\mathbb E[X_i^p] = 1$. Then,
$$\mathbb P\Big(\Big(\frac1N\sum_{i=1}^N X_i^p\Big)^{\frac1p} - 1 \ge t\Big) \le 2\exp\Big(-\frac{c' N}{R^2}\alpha_p(t)\Big), \tag{4.11}$$
for every $t \ge 0$, with $c' = 1/10$, $R := \max_{i\in[1:N]}\|X_i\|_{\psi_p} + 1/\ln 2$, and
$$\alpha_p(s) := \begin{cases} s^2, & \text{if } s \in [0,1],\\ s^p, & \text{if } s \in (1,\infty). \end{cases} \tag{4.12}$$
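To get a feel for the deviation event in (4.11), one can probe it by simulation. The sketch below uses $p = 2$ and $X_i = |Z_i|$ with $Z_i$ standard normal (so $\mathbb E[X_i^2] = 1$), and simply estimates the probability of the deviation event for a moderate threshold $t$; the constants $c'$ and $R$ of the bound are not estimated here, and all numerical choices are illustrative.

```python
import random

def deviation_frequency(N=50, p=2, t=0.5, trials=2000, seed=1):
    # Empirical frequency of the event ((1/N) sum X_i^p)^(1/p) - 1 >= t
    # for X_i = |Z_i|, Z_i ~ N(0,1), which satisfies E[X_i^p] = 1 for p = 2.
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        m = sum(abs(rng.gauss(0.0, 1.0)) ** p for _ in range(N)) / N
        if m ** (1.0 / p) - 1.0 >= t:
            hits += 1
    return hits / trials
```

With these defaults the event requires the sample mean of the $X_i^2$ (mean $1$, variance $2$) to exceed $2.25$, which is several standard deviations of the sample mean above its expectation, so the observed frequency is essentially zero, consistent with the exponential decay in (4.11).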

Combining the results above, we obtain the main result of this section regarding the ambiguity center difference.

Proposition 4.5 (Distance guarantee between true- & estimator-based empirical distributions). Consider a confidence $1-\beta_{\rm ns}$ and let
$$\bar\varepsilon_N(\beta_{\rm ns}) := 2^{\frac{p-1}{p}}\Big(M_w + M_v + M_v\,\alpha_p^{-1}\Big(\frac{R^2}{c' N}\ln\frac{2}{\beta_{\rm ns}}\Big)\Big), \tag{4.13}$$
with $M_w$, $M_v$ given by (4.9b), (4.10a),
$$R := C_v/m_v + 1/\ln 2, \tag{4.14}$$
and $C_v$, $m_v$ as in (4.10b), (4.10c). Then, for all $(z,\omega) \in B_\infty^{Nd}(\rho_{\xi_0}) \times B_\infty^{N\ell q}(\rho_w)$, we have
$$\mathbb P\big(W_p(\widehat P^N_{\xi_\ell}(z,\omega,v_\ell), P^N_{\xi_\ell}(z,\omega)) \le \bar\varepsilon_N(\beta_{\rm ns})\big) \ge 1-\beta_{\rm ns}. \tag{4.15}$$

Proof. For each $i$, the random variable $X_i := E^i/\|E^i\|_p$ satisfies $\|X_i\|_p = 1$. Thus, we obtain from Proposition 4.4 that
$$\mathbb P\Big(\Big(\frac1N\sum_{i=1}^N \Big(\frac{E^i}{\|E^i\|_p}\Big)^p\Big)^{\frac1p} - 1 \ge t\Big) \le 2\exp\Big(-\frac{c' N}{\widehat R^2}\alpha_p(t)\Big),$$
where $\widehat R := \max_{i\in[1:N]} \big\|E^i/\|E^i\|_p\big\|_{\psi_p} + 1/\ln 2$. From (4.10b), (4.10c), and (4.14), we deduce $R \ge \widehat R$, and thus,
$$\mathbb P\Big(\Big(\frac1N\sum_{i=1}^N \Big(\frac{E^i}{\|E^i\|_p}\Big)^p\Big)^{\frac1p} - 1 \ge t\Big) \le 2\exp\Big(-\frac{c' N}{R^2}\alpha_p(t)\Big).$$
Now, it follows from (4.10a) that
$$M_v\Big(\frac1N\sum_{i=1}^N \Big(\frac{E^i}{\|E^i\|_p}\Big)^p\Big)^{\frac1p} - M_v \ge \Big(\frac1N\sum_{i=1}^N (E^i)^p\Big)^{\frac1p} - M_v.$$
Thus, we deduce
$$\mathbb P\Big(\Big(\frac1N\sum_{i=1}^N (E^i)^p\Big)^{\frac1p} - M_v \ge M_v t\Big) \le 2\exp\Big(-\frac{c' N}{R^2}\alpha_p(t)\Big),$$
or, equivalently, that
$$\mathbb P\Big(\Big(\frac1N\sum_{i=1}^N (E^i)^p\Big)^{\frac1p} \ge M_v + s\Big) \le 2\exp\Big(-\frac{c' N}{R^2}\alpha_p\Big(\frac{s}{M_v}\Big)\Big). \tag{4.16}$$
To establish (4.15), it suffices by Lemma 4.2 to show that
$$\mathbb P\Big(2^{\frac{p-1}{p}} M_w + 2^{\frac{p-1}{p}}\Big(\frac1N\sum_{i=1}^N (E^i)^p\Big)^{\frac1p} \le \bar\varepsilon_N(\beta_{\rm ns})\Big) \ge 1-\beta_{\rm ns}.$$
By the definition of $\bar\varepsilon_N$ and exploiting that it is strictly decreasing in $\beta_{\rm ns}$, it suffices to prove
$$\mathbb P\Big(\Big(\frac1N\sum_{i=1}^N (E^i)^p\Big)^{\frac1p} < M_v + M_v\,\alpha_p^{-1}\Big(\frac{R^2}{c' N}\ln\frac{2}{\beta_{\rm ns}}\Big)\Big) \ge 1-\beta_{\rm ns}.$$
Setting $\tau = \alpha_p^{-1}\big(\frac{R^2}{c' N}\ln\frac{2}{\beta_{\rm ns}}\big)$, we equivalently need to show
$$\mathbb P\Big(\Big(\frac1N\sum_{i=1}^N (E^i)^p\Big)^{\frac1p} \ge M_v + \tau M_v\Big) \le \beta_{\rm ns},$$
which follows from (4.16) with $s = \tau M_v$.
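The noise radius (4.13) is straightforward to evaluate once the constants $M_w$, $M_v$, $C_v$, $m_v$ are available. Below is a minimal Python sketch, using $\alpha_p^{-1}(u) = \sqrt u$ for $u \le 1$ and $u^{1/p}$ otherwise, and $c' = 1/10$ from Proposition 4.4; the constant values passed in are illustrative placeholders, not values derived in the paper.

```python
import math

def alpha_p_inv(u, p):
    # Inverse of alpha_p in (4.12): alpha_p(s) = s^2 on [0,1], s^p on (1, inf)
    return math.sqrt(u) if u <= 1.0 else u ** (1.0 / p)

def eps_noise(N, beta_ns, p, M_w, M_v, C_v, m_v, c_prime=0.1):
    # Noise ambiguity radius from (4.13)-(4.14); c' = 1/10 as in Proposition 4.4
    R = C_v / m_v + 1.0 / math.log(2.0)
    tau = alpha_p_inv(R**2 / (c_prime * N) * math.log(2.0 / beta_ns), p)
    return 2 ** ((p - 1) / p) * (M_w + M_v + M_v * tau)
```

Note that as $N \to \infty$ the returned value decreases to the strictly positive limit $2^{(p-1)/p}(M_w + M_v)$, in line with Remark 4.7 below.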

4.3. Overall ambiguity set. Here we combine the results from Sections 4.1 and 4.2 to obtain the ambiguity set of the state distribution in the following result.

Theorem 4.6 (Ambiguity set under noisy dynamics & observations). Consider data collected from $N$ realizations of system (3.3) in accordance with Assumptions 3.1 and 3.2, a confidence $1-\beta$, and let $\beta_{\rm nom}, \beta_{\rm ns} \in (0,1)$ satisfy (4.6). Then, the guarantee (4.5) holds, where $\psi_N(\beta)$ is given in (4.7) and its components $\varepsilon_N(\beta_{\rm nom}) \equiv \varepsilon_N(\beta_{\rm nom}, \rho_{\xi_\ell})$ and $\bar\varepsilon_N(\beta_{\rm ns})$ are given by (4.8) and (4.13), respectively, with
$$\rho_{\xi_\ell} := \sqrt d\,\|\Phi_\ell\|\rho_{\xi_0} + \sqrt q \sum_{k=1}^{\ell} \|\Phi_{\ell,\ell-k+1} G_{\ell-k}\| \rho_w. \tag{4.17}$$

Proof. Due to (4.7) and the triangle inequality for $W_p$,
$$\big\{W_p(\widehat P^N_{\xi_\ell}, P_{\xi_\ell}) \le \psi_N(\beta)\big\} \supset \big\{W_p(\widehat P^N_{\xi_\ell}, P^N_{\xi_\ell}) \le \bar\varepsilon_N(\beta_{\rm ns})\big\} \cap \big\{W_p(P^N_{\xi_\ell}, P_{\xi_\ell}) \le \varepsilon_N(\beta_{\rm nom}, \rho_{\xi_\ell})\big\}.$$

Thus, to show (4.5), it suffices to show that
$$\mathbb E\big[\mathbf 1_{\{W_p(\widehat P^N_{\xi_\ell}, P^N_{\xi_\ell}) - \bar\varepsilon_N(\beta_{\rm ns}) \le 0\}} \times \mathbf 1_{\{W_p(P^N_{\xi_\ell}, P_{\xi_\ell}) - \varepsilon_N(\beta_{\rm nom},\rho_{\xi_\ell}) \le 0\}}\big] \ge 1-\beta. \tag{4.18}$$
We therefore exploit Lemma 2.1 with the random variable $X \equiv (\xi_0, w_\ell)$, taking values in the compact set $K \equiv B_\infty^{Nd}(\rho_{\xi_0}) \times B_\infty^{N\ell q}(\rho_w)$, the random variable $Y \equiv v_\ell \in \mathbb R^{N\ell r}$, and $g(X,Y) \equiv g(\xi_0, w_\ell, v_\ell)$, where
$$g(\xi_0, w_\ell, v_\ell) := \mathbf 1_{\{W_p(P^N_{\xi_\ell}(\xi_0,w_\ell), P_{\xi_\ell}) - \varepsilon_N(\beta_{\rm nom},\rho_{\xi_\ell}) \le 0\}} \times \mathbf 1_{\{W_p(\widehat P^N_{\xi_\ell}(\xi_0,w_\ell,v_\ell), P^N_{\xi_\ell}(\xi_0,w_\ell)) - \bar\varepsilon_N(\beta_{\rm ns}) \le 0\}}.$$
Due to (4.15), $\mathbb E\big[\mathbf 1_{\{W_p(\widehat P^N_{\xi_\ell}(z,\omega,v_\ell), P^N_{\xi_\ell}(z,\omega)) - \bar\varepsilon_N(\beta_{\rm ns}) \le 0\}}\big] \ge 1-\beta_{\rm ns}$ for any $x = (z,\omega) \in K$, and thus $\mathbb E[g(x,Y)] \ge \mathbf 1_{\{W_p(P^N_{\xi_\ell}(x), P_{\xi_\ell}) - \varepsilon_N(\beta_{\rm nom},\rho_{\xi_\ell}) \le 0\}} \times (1-\beta_{\rm ns}) =: k(x)$, for all $x \in K$. Hence, since $X \equiv (\xi_0,w_\ell)$ and $Y \equiv v_\ell$ are independent by H2, we deduce from Lemma 2.1 that
$$\mathbb E[g(X,Y)] \ge \mathbb E\big[\mathbf 1_{\{W_p(P^N_{\xi_\ell}(\xi_0,w_\ell), P_{\xi_\ell}) - \varepsilon_N(\beta_{\rm nom},\rho_{\xi_\ell}) \le 0\}}(1-\beta_{\rm ns})\big] = (1-\beta_{\rm ns})\,\mathbb P\big(W_p(P^N_{\xi_\ell}(\xi_0,w_\ell), P_{\xi_\ell}) \le \varepsilon_N(\beta_{\rm nom}, \rho_{\xi_\ell})\big).$$
From (4.4) and H3 in Assumption 3.2, it follows that $P_{\xi_\ell}$ is supported on the compact set $B_\infty^d(\rho_{\xi_\ell})$ with $\operatorname{diam}_\infty(B_\infty^d(\rho_{\xi_\ell})) = 2\rho_{\xi_\ell}$ and $\rho_{\xi_\ell}$ given in (4.17). In addition, due to H1 and H2 in Assumption 3.2, the random states $\xi_\ell^i$ in the empirical distribution $P^N_{\xi_\ell}(\xi_0,w_\ell) = \frac1N\sum_{i=1}^N \delta_{\xi_\ell^i}$ are i.i.d. Thus, we get from Proposition 4.1 that $\mathbb P\big(W_p(P^N_{\xi_\ell}(\xi_0,w_\ell), P_{\xi_\ell}) \le \varepsilon_N(\beta_{\rm nom},\rho_{\xi_\ell})\big) \ge 1-\beta_{\rm nom}$, which implies $\mathbb E[g(X,Y)] \ge (1-\beta_{\rm ns})(1-\beta_{\rm nom}) = 1-\beta$. Finally, (4.18) follows from this and the definition of $g$.

With this result at hand, we deduce from the expressions (4.8) and (4.13) for the components of the ambiguity radius that it decreases as we exploit a larger number $N$ of independent trajectories and relax our confidence choices, i.e., reduce $1-\beta_{\rm nom}$ and $1-\beta_{\rm ns}$. Notice further that, no matter how many trajectories we use, the noise ambiguity radius decreases only to a strictly positive value. It is also worth observing that $\psi_N$ generalizes the nominal ambiguity radius $\varepsilon_N$ in the DRO literature (even when dynamic random variables are considered [8]) and reduces to $\varepsilon_N$ in the noise-free case, where $\bar\varepsilon_N = 0$.

Drawing conclusions about how the ambiguity radius behaves as we simultaneously allow the horizon $[0:\ell]$ and the number $N$ of sampled trajectories to increase is a more delicate matter. The value of the nominal component depends essentially on $N$ and the support of the distribution at $\ell$, with the latter in turn depending on the system's stability properties and the support of the initial condition and internal noise distributions. On the other hand, the noise component depends on $N$ and the quality of the estimation error. In the next section we quantify how the latter guarantees uniform boundedness of the noise radius under detectability-type assumptions.

Remark 4.7 (Positive lower bound of the noise radius). The positive lower bound $2^{\frac{p-1}{p}}(M_w + M_v)$ on the noise radius in (4.13) represents in general a fundamental limitation for the ambiguity set accuracy, which is independent of the number $N$ of estimated state samples. This is because the bound is related to the size of the state estimation error, which persists in the presence of noise and may further grow in time if the system is not detectable.

Remark 4.8 (Optimal radius selection). Once a desired confidence level $1-\beta$ and the number of independent trajectories $N$ are fixed, we can optimally select the ambiguity radius by minimizing the function
$$\beta_{\rm nom} \mapsto \psi_N(\beta_{\rm nom}) \equiv \varepsilon_N(\beta_{\rm nom}) + \bar\varepsilon_N\big((\beta-\beta_{\rm nom})/(1-\beta_{\rm nom})\big),$$
where we have taken into account the constraint (4.6) between the nominal and the noise confidence. This function is non-convex but one-dimensional, and its minimizer lies in the interior of the interval $(0,\beta)$, so its optimal value can be approximated with high accuracy.
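Since the function above is one-dimensional, a simple grid search already approximates the minimizer well. The sketch below combines the $p > d/2$ branch of the nominal radius (4.8) with the noise radius (4.13) under the constraint (4.6); all constants ($C$, $c$, $c'$, and the $M_w$, $M_v$, $C_v$, $m_v$ values) are illustrative placeholders.

```python
import math

def psi_total(beta_nom, beta, N, rho, p, M_w, M_v, C_v, m_v,
              C=1.0, c=1.0, c_prime=0.1):
    # psi_N(beta_nom) = eps_N(beta_nom) + bar-eps_N(beta_ns), with
    # beta_ns = (beta - beta_nom)/(1 - beta_nom) from the constraint (4.6).
    beta_ns = (beta - beta_nom) / (1.0 - beta_nom)
    # nominal component: p > d/2 branch of (4.8)
    eps_nom = (math.log(C / beta_nom) / c) ** (1 / (2 * p)) * rho / N ** (1 / (2 * p))
    # noise component (4.13)-(4.14)
    R = C_v / m_v + 1.0 / math.log(2.0)
    u = R**2 / (c_prime * N) * math.log(2.0 / beta_ns)
    tau = math.sqrt(u) if u <= 1 else u ** (1.0 / p)
    eps_ns = 2 ** ((p - 1) / p) * (M_w + M_v + M_v * tau)
    return eps_nom + eps_ns

def best_beta_nom(beta, grid=999, **kw):
    # one-dimensional grid search over the interior of (0, beta)
    cands = [beta * k / (grid + 1) for k in range(1, grid + 1)]
    return min(cands, key=lambda b: psi_total(b, beta, **kw))
```

At the endpoints the objective blows up (the nominal term as $\beta_{\rm nom} \to 0$, the noise term as $\beta_{\rm nom} \to \beta$), so the grid search locates an interior minimizer, consistent with the remark.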

4.4. Uncertainty quantification over bounded time horizons. In this section we discuss how the guarantees can be extended to scenarios where an ambiguity set is built over a finite-time horizon instead of a single instant $\ell$. In this case we assume that samples are collected over the time window $[0:\ell_2]$ and we seek to build an ambiguity set for the state distribution along $[\ell_1:\ell_2]$, with $0 \le \ell_1 \le \ell_2$. We distinguish between two ambiguity set descriptions, depending on the way the associated probabilistic guarantees are obtained. In the first, we directly build an ambiguity set for the probability distribution of the random vector $\xi_{\boldsymbol\ell} := (\xi_{\ell_1},\dots,\xi_{\ell_2}) \in \mathbb R^{\tilde\ell d}$, with $\boldsymbol\ell := (\ell_1,\dots,\ell_2)$ and $\tilde\ell = \ell_2-\ell_1+1$, comprising all states over the interval of interest, and using the concentration of measure result of Proposition 4.1 for $\tilde\ell d$-dimensional random variables. This has the drawback that the ambiguity radius decays slowly with the number of trajectories due to the high dimension of $\xi_{\boldsymbol\ell}$. The other description derives an ambiguity set for the state distribution $P_{\xi_{\ell_1}}$ at time $\ell_1$ with prescribed confidence and propagates it under the dynamics, while taking into account the possible values of the internal noise. We also present sharper results for the case when the distribution of the internal noise is known. The first ambiguity set description is provided by the following analogue of Theorem 4.6.

Theorem 4.9 (Ambiguity set over a bounded time horizon). Consider output data collected from $N$ realizations of system (3.3) over the interval $[0:\ell_2]$ and let Assumption 3.1 hold. Pick a confidence $1-\beta$, let $\beta_{\rm nom}, \beta_{\rm ns} \in (0,1)$ satisfy (4.6), and consider the bounded-horizon estimator empirical distribution
$$\widehat P^N_{\xi_{\boldsymbol\ell}} := \frac1N \sum_{i=1}^N \delta_{\widehat\xi^i_{\boldsymbol\ell}}$$
over the horizon $[\ell_1:\ell_2]$, where $\widehat\xi^i_{\boldsymbol\ell} := (\widehat\xi^i_{\ell_1},\dots,\widehat\xi^i_{\ell_2}) \in \mathbb R^{\tilde\ell d}$ and each $\widehat\xi^i_\ell$ is given by the observer (4.1). Then
$$\mathbb P\big(W_p(\widehat P^N_{\xi_{\boldsymbol\ell}}, P_{\xi_{\boldsymbol\ell}}) \le \psi_N(\beta)\big) \ge 1-\beta \tag{4.19}$$
holds, where $\xi_{\boldsymbol\ell} := (\xi_{\ell_1},\dots,\xi_{\ell_2})$ and $\psi_N(\beta)$ is given in (4.7). The nominal component $\varepsilon_N(\beta_{\rm nom}) \equiv \varepsilon_N(\beta_{\rm nom}, \rho_{\xi_{\boldsymbol\ell}})$ is given by (4.8) (with $d$ in the expression substituted by $\tilde\ell d$), where
$$\rho_{\xi_{\boldsymbol\ell}} := \max_{\ell\in[\ell_1:\ell_2]} \sqrt d\,\|\Phi_\ell\|\rho_{\xi_0} + \sqrt q \sum_{k=1}^{\ell} \|\Phi_{\ell,\ell-k+1} G_{\ell-k}\|\rho_w, \tag{4.20}$$
whereas $\bar\varepsilon_N(\beta_{\rm ns})$ is given as
$$\bar\varepsilon_N(\beta_{\rm ns}) := 2^{\frac{p-1}{p}}\Big(M_w + M_v + M_v\,\alpha_p^{-1}\Big(\frac{R^2}{c'N}\ln\frac{2}{\beta_{\rm ns}}\Big)\Big), \quad\text{with}$$
$$M_w := \sum_{\ell=\ell_1}^{\ell_2} M_w(\ell), \quad M_v := \sum_{\ell=\ell_1}^{\ell_2} M_v(\ell), \quad R := \frac{C_v}{m_v} + \frac{1}{\ln 2}, \quad C_v := \sum_{\ell=\ell_1}^{\ell_2} C_v(\ell), \quad m_v := \sum_{\ell=\ell_1}^{\ell_2} m_v(\ell),$$
and $M_w(\ell) \equiv M_w$, $M_v(\ell) \equiv M_v$, $C_v(\ell) \equiv C_v$, and $m_v(\ell) \equiv m_v$, as given by (4.9b), (4.10a), (4.10b), and (4.10c), respectively.

The proof of this result follows the argumentation employed for the proof of Theorem 4.6 (a sketch can be found in the online version [9]). For the second ambiguity set description we use a pointwise-in-time approach. To this end, we build a family of ambiguity balls so that, under the same confidence level, the state distribution at each time instant of the horizon lies in the associated ball, i.e.,
$$\mathbb P\big(P_{\xi_\ell} \in B_{\psi_{N,\ell}}(\widehat P^N_{\xi_\ell})\ \forall \ell \in [\ell_1:\ell_2]\big) \ge 1-\beta, \tag{4.21}$$
where $B_{\psi_{N,\ell}}(\widehat P^N_{\xi_\ell}) := \{P \in \mathcal P_p(\mathbb R^d) \mid W_p(P, \widehat P^N_{\xi_\ell}) \le \psi_{N,\ell}\}$ and $\widehat P^N_{\xi_\ell}$ is the center of the ball. This is well suited for stochastic optimization problems that have a separable structure with respect to the stochastic argument across different time instants, i.e., problems of the form
$$\inf_{x\in X} \mathbb E\big[f_1(x,\xi_{\ell_1}) + \cdots + f_{\tilde\ell}(x,\xi_{\ell_2})\big].$$
The pointwise ambiguity sets are quantified in the following result.

Theorem 4.10 (Pointwise ambiguity sets over a bounded time horizon). Let the assumptions of Theorem 4.9 hold, assume that the internal noise sequence $(w_\ell)$ is independent (also of the initial state) and $P_{w_\ell} \in \mathcal P_p(\mathbb R^d)$ for $\ell \in [\ell_1:\ell_2]$, i.e., it is not necessarily compactly supported, and consider either of the following two cases for its distribution when $\ell \in [\ell_1:\ell_2]$:
(i) $P_{w_\ell}$ is not known and $\mathbb E[\|w_\ell\|^p]^{\frac1p} \le q_w$.
(ii) $P_{w_\ell}$ is known.
Then, for any confidence $1-\beta$ and $\beta_{\rm nom}, \beta_{\rm ns} \in (0,1)$ satisfying (4.6), (4.21) holds, with $\widehat P^N_{\xi_{\ell_1}}$ and $\psi_{N,\ell_1}$ as given by Theorem 4.6 (for $\ell \equiv \ell_1$), and $\widehat P^N_{\xi_\ell}$, $\psi_{N,\ell}$, $\ell \in [\ell_1+1:\ell_2]$, defined as follows in the respective two cases above:
(i) The ambiguity set center is $\widehat P^N_{\xi_\ell} := \frac1N\sum_{i=1}^N \delta_{\widehat\xi^i_\ell}$ with $\widehat\xi^i_\ell := \Phi_{\ell,\ell_1}\widehat\xi^i_{\ell_1}$, and the radius is given recursively by $\psi_{N,\ell} := \|A_{\ell-1}\|\psi_{N,\ell-1} + q_w$.
(ii) The ambiguity set center is $\widehat P^N_{\xi_\ell} := \big((A_{\ell-1})_\# \widehat P^N_{\xi_{\ell-1}}\big) \star P_{w_{\ell-1}}$, and the radius is $\psi_{N,\ell} := \|A_{\ell-1}\|\cdots\|A_{\ell_1}\|\psi_{N,\ell_1}$.

In our technical approach, we use the next result, whose proof is given in the Appendix. The result examines what happens to the Wasserstein distance between the distributions of two random variables when other random variables are added.

Lemma 4.11 (Wasserstein distance under convolution). Given $p \ge 1$ and distributions $P_1, P_2, Q \in \mathcal P_p(\mathbb R^d)$, it holds that $W_p(P_1 \star Q, P_2 \star Q) \le W_p(P_1, P_2)$. Also, if it holds that $\big(\int_{\mathbb R^d} \|x\|^p Q(dx)\big)^{\frac1p} \le q$, then $W_p(P_1, P_2 \star Q) \le W_p(P_1, P_2) + q$.

Proof of Theorem 4.10. The proof is carried out by induction on $\ell \in [\ell_1:\ell_2]$. In particular, it suffices to establish that
$$W_p\big(P_{\xi_{\ell_1}}, \widehat P^N_{\xi_{\ell_1}}\big) \le \psi_{N,\ell_1} \implies W_p\big(P_{\xi_{\ell'}}, \widehat P^N_{\xi_{\ell'}}\big) \le \psi_{N,\ell'}\ \ \forall \ell' \in [\ell_1:\ell]. \tag{4.22}$$
Note that from Theorem 4.6, $\mathbb P\big(P_{\xi_{\ell_1}} \in B_{\psi_{N,\ell_1}}(\widehat P^N_{\xi_{\ell_1}})\big) \ge 1-\beta$. From (4.22), this also implies that $\mathbb P\big(P_{\xi_{\ell'}} \in B_{\psi_{N,\ell'}}(\widehat P^N_{\xi_{\ell'}})\ \forall \ell' \in [\ell_1:\ell]\big) \ge 1-\beta$, establishing validity of the result for $\ell \equiv \ell_2$.

For $\ell \equiv \ell_1$, the induction hypothesis (4.22) is a tautology. Next, assuming that it is true for a certain $\ell \in [\ell_1:\ell_2-1]$, we show that it also holds for $\ell+1$. Hence it suffices to show that if $W_p\big(P_{\xi_\ell}, \widehat P^N_{\xi_\ell}\big) \le \psi_{N,\ell}$, then also $W_p\big(P_{\xi_{\ell+1}}, \widehat P^N_{\xi_{\ell+1}}\big) \le \psi_{N,\ell+1}$ in both cases (i) and (ii). Consider first (i) and note that then the ambiguity set center at $\ell+1$ satisfies

$\widehat P^N_{\xi_{\ell+1}} = \frac1N\sum_{i=1}^N \delta_{\widehat\xi^i_{\ell+1}} = \frac1N\sum_{i=1}^N \delta_{A_\ell \widehat\xi^i_\ell} = (A_\ell)_\# \widehat P^N_{\xi_\ell}$, where we have exploited that $\widehat\xi^i_k \equiv \Phi_{k,\ell_1}\widehat\xi^i_{\ell_1}$ and the definition of $\Phi_{k,\ell_1}$ (for $k = \ell, \ell+1$) to derive the second equality. Using also the fact that $P_{\xi_{\ell+1}} = \big((A_\ell)_\# P_{\xi_\ell}\big) \star P_{w_\ell}$, we get from the second result of Lemma 4.11 that
$$W_p(P_{\xi_{\ell+1}}, \widehat P^N_{\xi_{\ell+1}}) = W_p\big(((A_\ell)_\# P_{\xi_\ell}) \star P_{w_\ell},\ (A_\ell)_\# \widehat P^N_{\xi_\ell}\big) \le W_p\big((A_\ell)_\# P_{\xi_\ell}, (A_\ell)_\# \widehat P^N_{\xi_\ell}\big) + q_w \le \|A_\ell\| W_p\big(P_{\xi_\ell}, \widehat P^N_{\xi_\ell}\big) + q_w \le \|A_\ell\|\psi_{N,\ell} + q_w = \psi_{N,\ell+1}.$$
Here we also used the fact that $W_p(f_\# P, f_\# Q) \le L\, W_p(P,Q)$ for any globally Lipschitz function $f: \mathbb R^d \to \mathbb R^r$ with Lipschitz constant $L$ in the second-to-last inequality (see, e.g., [42, Proposition 7.16]).

Next, we prove the induction hypothesis for (ii). Using Lemma 4.11 and the definition of the ambiguity set center and radius,
$$W_p(P_{\xi_{\ell+1}}, \widehat P^N_{\xi_{\ell+1}}) = W_p\big(((A_\ell)_\# P_{\xi_\ell}) \star P_{w_\ell},\ ((A_\ell)_\# \widehat P^N_{\xi_\ell}) \star P_{w_\ell}\big) \le W_p\big((A_\ell)_\# P_{\xi_\ell}, (A_\ell)_\# \widehat P^N_{\xi_\ell}\big) \le \|A_\ell\| W_p\big(P_{\xi_\ell}, \widehat P^N_{\xi_\ell}\big) \le \|A_\ell\|\psi_{N,\ell} = \|A_\ell\|\|A_{\ell-1}\|\cdots\|A_{\ell_1}\|\psi_{N,\ell_1} = \psi_{N,\ell+1},$$
completing the proof.
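The radius recursions of Theorem 4.10 are simple to implement. The sketch below propagates the radius for both cases given a sequence of matrices $A_\ell$, with the initial radius $\psi_{N,\ell_1}$ and the moment bound $q_w$ as inputs; the example matrices in the test are illustrative.

```python
import numpy as np

def propagate_radii(A_seq, psi_l1, q_w=None):
    # Case (i) of Theorem 4.10 when q_w is given:
    #   psi_{N,l+1} = ||A_l|| psi_{N,l} + q_w
    # Case (ii) (known noise distribution) when q_w is None:
    #   psi_{N,l+1} = ||A_l|| psi_{N,l}
    radii = [psi_l1]
    for A in A_seq:
        gain = np.linalg.norm(A, 2)  # spectral norm ||A_l||
        if q_w is None:
            radii.append(gain * radii[-1])
        else:
            radii.append(gain * radii[-1] + q_w)
    return radii
```

For a Schur-stable time-invariant system (all $\|A_\ell\| < 1$), the case (ii) radius contracts geometrically, while in case (i) the radius converges to the fixed point $q_w/(1 - \|A\|)$ of the affine recursion.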

5. Sufficient conditions for uniformly bounded noise ambiguity radii. In this section we leverage Assumption 3.4 to establish that the noise ambiguity radius remains uniformly bounded as the sampling horizon increases. We first provide uniform bounds for the matrices involved in the system and observer error dynamics.

Proposition 5.1 (Bounds on system/observer matrices). Under Assumption 3.4, the gain matrices $K_k$ can be selected so that the following properties hold:
(i) There exist $K_\star, K^\star, G^\star > 0$ and $\Psi^\star_s > 0$, $s \in \mathbb N_0$, so that $\|G_k\| \le G^\star$, $K_\star \le \|K_k\| \le K^\star$, and $\|\Psi_{k+s,k}\| \le \Psi^\star_s$ for all $k \in \mathbb N_0$.
(ii) There exists $s_0 \in \mathbb N$ so that $\|\Psi_{k+s,k}\| \le \frac12$ for all $k \in \mathbb N_0$ and $s \ge s_0$.

Proof. Note that we only need to verify part (i) for the time-varying case. Since all $G_k$ are uniformly bounded, we directly obtain the bound $G^\star$. Let
$$K_k := -A_k \Phi_{k,k-t-1} O^{-1}_{k,k-t-1} \Phi^\top_{k,k-t-1} H^\top_k$$
(for $k > t+1$), as selected in [33, Page 574] (but with a minus sign in front, to get the plus sign in $F_k = A_k + K_k H_k$), and with the observability Gramian $O_{k,k-t-1}$ as defined in Assumption 3.4(ii). Then, the upper bound $K^\star$ follows from the fact that the system matrices are uniformly bounded, combined with the uniform observability property of Assumption 3.4, which implies that all $O^{-1}_{k,k-t-1}$ are also uniformly bounded. On the other hand, the lower bound $K_\star$ follows from the assumption that the system matrices are uniformly bounded, which imposes a uniform lower bound on the smallest singular value of $O^{-1}_{k,k-t-1}$, the uniform lower bound on the smallest singular value of $A_k$, hence, also on those of $\Phi_{k,k-t-1}$ and $\Phi^\top_{k,k-t-1}$, and the uniform lower bound on $\|H_k\|$ (all found in Assumption 3.4). Finally, the bounds $\Psi^\star_s$ follow from the uniform bounds for all $A_k$ and $H_k$ and the derived bound $K^\star$ for all $K_k$.

To show part (ii), assume first that Assumption 3.4(i) holds, i.e., the system is time-invariant and $(A,H)$ is detectable. Then, we can choose a nonzero gain matrix $K$ so that $F = A + KH$ is convergent (cf. [40, Theorem 31]), namely $\lim_{s\to\infty} \|F^s\| = 0$. Consequently, there is $s_0 \in \mathbb N$ with $\|F^s\| \le \frac12$ for all $s \ge s_0$, and the result follows by taking into account that $\Psi_{k+s,k} = F^s$. In case Assumption 3.4(ii) holds, let
$$e_{k+1} = F_k e_k \tag{5.1}$$
be the recursive noise-free version of the error equation (4.3). Then, from [33, Page 577], there exists a quadratic time-varying Lyapunov function $V(k,e) := e^\top Q_k e$, with each $Q_k$ positive definite, $a_1, a_2 > 0$, $a_3 \in (0,1)$, and $m \in \mathbb N$, so that
$$a_1 \le \lambda_{\min}(Q_k) \le \lambda_{\max}(Q_k) \le a_2, \tag{5.2a}$$
$$V(k+m, e_{k+m}) - V(k, e_k) \le -a_3 V(k, e_k), \tag{5.2b}$$
for any $k$ and any solution of (5.1) with state $e_k$ at time $k$. Thus, $\Psi^\top_{k+m,k} Q_{k+m} \Psi_{k+m,k} \preceq (1-a_3) Q_k$, and hence, by induction, $\Psi^\top_{k+\nu m,k} Q_{k+\nu m} \Psi_{k+\nu m,k} \preceq (1-a_3)^\nu Q_k$, since
$$\Psi^\top_{k+(\nu+1)m,k} Q_{k+(\nu+1)m} \Psi_{k+(\nu+1)m,k} = \Psi^\top_{k+m,k} \Psi^\top_{k+(\nu+1)m,k+m} Q_{k+(\nu+1)m} \Psi_{k+(\nu+1)m,k+m} \Psi_{k+m,k} \preceq (1-a_3)^\nu \Psi^\top_{k+m,k} Q_{k+m} \Psi_{k+m,k} \preceq (1-a_3)^{\nu+1} Q_k.$$
Next, pick $e$ with $\|e\| = 1$ and $\|\Psi_{k+\nu m,k} e\| = \|\Psi_{k+\nu m,k}\|$. Taking into account that $e^\top \Psi^\top_{k+\nu m,k} Q_{k+\nu m} \Psi_{k+\nu m,k} e \le (1-a_3)^\nu e^\top Q_k e$, we get $\lambda_{\min}(Q_{k+\nu m})\|\Psi_{k+\nu m,k} e\|^2 \le (1-a_3)^\nu \lambda_{\max}(Q_k)$. Using (5.2a),
$$\|\Psi_{k+\nu m,k}\| \le (1-a_3)^{\frac\nu2}\Big(\frac{a_2}{a_1}\Big)^{\frac12}. \tag{5.3}$$
Now, select $\nu$ so that $(1-a_3)^{\frac{\nu'}2}(a_2/a_1)^{\frac12} \le 1/(2\max_{s\in[1:m]}\Psi^\star_s)$ for all $\nu' \ge \nu$. Let $s_0 := \nu m$ and pick $s \ge s_0$. Then, $s = s'_0 + m'$ for some $s'_0 = \nu' m$, $\nu' \ge \nu$, and $m' \in [0:m-1]$, and we get from (5.3), part (i), and the selection of $\nu$ that
$$\|\Psi_{k+s,k}\| = \|\Psi_{k+s'_0+m',k+s'_0}\Psi_{k+s'_0,k}\| \le \|\Psi_{k+s'_0+m',k+s'_0}\|\|\Psi_{k+\nu' m,k}\| \le \Psi^\star_{m'}\,\frac{1}{2\max_{s\in[1:m]}\Psi^\star_s} \le \frac12,$$
which establishes the result.
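For the time-invariant case of Proposition 5.1(ii), the horizon $s_0$ can be found numerically once a stabilizing gain is chosen. The sketch below uses a hand-picked detectable pair $(A,H)$ and gain $K$ (both illustrative, not from the paper) and scans matrix powers of $F = A + KH$ for the first index after which the spectral norm stays below $1/2$.

```python
import numpy as np

def smallest_s0(F, bound=0.5, s_max=100):
    # Smallest s0 with ||F^s|| <= bound for all s in [s0, s_max]
    norms = [np.linalg.norm(np.linalg.matrix_power(F, s), 2)
             for s in range(1, s_max + 1)]
    for s0 in range(1, s_max + 1):
        if all(n <= bound for n in norms[s0 - 1:]):
            return s0
    return None

# Illustrative detectable pair (A, H) and a gain K making F = A + K H Schur-stable
A = np.array([[1.0, 0.5], [0.0, 0.5]])
H = np.array([[1.0, 0.0]])
K = np.array([[-0.5], [0.0]])
F = A + K @ H  # = [[0.5, 0.5], [0.0, 0.5]], spectral radius 0.5
```

Here $\|F\|$ and $\|F^2\|$ still exceed $1/2$ because of the Jordan-block coupling, and the norm drops below $1/2$ only from the third power on, illustrating that $s_0$ can be larger than $1$ even for a strongly stable spectral radius.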

Based on this result and Assumption 3.4 about the system's detectability/uniform observability properties, we proceed to provide a uniform bound on the size of the noise radius for arbitrarily long evolution horizons.

Proposition 5.2 (Uniform bounds for noise ambiguity radius). Consider data collected from $N$ realizations of system (3.3), a confidence $1-\beta$ as in (4.6), and let Assumptions 3.1, 3.2, and 3.4 hold. Then, there exist observer gain matrices $K_k$ so that the noise ambiguity radius $\bar\varepsilon_N$ in (4.13) is uniformly bounded with respect to the sampling horizon size. In particular, there exists $\ell_0 \in \mathbb N$ so that, for each $\ell \ge \ell_0$, $M_w \equiv M_w(\ell)$, $M_v \equiv M_v(\ell)$, and $R \equiv R(\ell)$, given by (4.9b), (4.10a), and (4.14), are uniformly upper bounded as
$$M_w \le \frac12\sqrt d\,\rho_{\xi_0} + 3\sqrt q \sum_{j=0}^{\ell_0-1} \Psi^\star_j G^\star \rho_w, \qquad M_v \le 3\overline M_v\, r \sum_{j=0}^{\ell_0-1} \Psi^\star_j K^\star, \qquad R \le \frac{3\overline C_v}{\overline m_v}\, r^{\frac{p-1}{p}}\, \frac{\sum_{j=0}^{\ell_0-1} \Psi^\star_j K^\star}{K_\star}.$$

629

Proof. Consider gain matrices Kk and the time s0 as given in Proposition 5.1, and let630

`0 := s0. Then, for any ` ≥ `0, ` = n`0 + r′ with 0 ≤ r′ < `0 and we have631

∑k=1

‖Ψ`,`−k+1G`−k‖ ≤∑k=1

‖Ψ`,`−k+1‖G? =

( r′∑k=1

‖Ψ`,`−k+1‖+∑

k=r′+1

‖Ψ`,`−k+1‖)G?632

≤( r′−1∑

s=0

Ψ?s +

n`0+r′∑k=r′+1

‖Ψn`0+r′,n`0+r′−k+1‖)G? (k 7→ (ν − 1)`0 + j + r′)633

=

( r′−1∑s=0

Ψ?s +

n∑ν=1

`0∑j=1

‖Ψn`0+r′,(n−ν)`0+r′+`0−j+1‖)G? (`0 + 1− j 7→ j)634

≤( r′−1∑

s=0

Ψ?s +

n∑ν=1

`0∑j=1

‖Ψn`0+r′,(n−ν+1)`0+r′‖‖Ψ(n−ν)`0+r′+`0,(n−ν)`0+r′+j‖)G?635

=

( r′−1∑s=0

Ψ?s +

n∑ν=1

‖Ψn`0+r′,(n−ν+1)`0+r′‖`0∑j=1

‖Ψ(n−ν)`0+r′+`0,(n−ν)`0+r′+j‖)G?636

≤( r′−1∑

s=0

Ψ?s +

n∑ν=1

( ν−1∏κ=1

‖Ψ(n+1−κ)`0+r′,(n−κ)`0+r′‖) `0∑j=1

Ψ?`0−j

)G?637

≤( `0−1∑

s=0

Ψ?s +

n∑ν=1

(1

2

)ν−1`0−1∑j=0

Ψ?j

)G? ≤ 3

`0−1∑j=0

Ψ?jG

?,638

639

where we have used∑−1

κ=0 ≡∑0

κ=1 ≡ 0 and∏0κ=1 ≡ 1. From this and the fact that from640

Proposition 5.1, ‖Ψ`‖ ≤ 12 for all ` ≥ `0, we get the upper bound for Mw. The one for Mv is641

obtained analogously. Finally, for R, we obtain the same type of upper bound for Cv as for Mw,642

and exploit Proposition 5.1(i) to get the lower bound mv = mvr1p(∑`

k=1 ‖Ψ`,`−k+1K`−k‖p) 1p ≥643

mvr1p ‖Ψ`,`K`−1‖ ≥ mvr

1pK?, which is independent of `.644

Remark 5.3 (Noise ambiguity radius for time-invariant systems). For time-invariant systems, it is possible to improve the bounds of Proposition 5.2 for $M_w$, $M_v$, and $R$ by exploiting the fact that the system and observer gain matrices are constant. The precise bounds in this case (see also [7, Proposition 5.5]) are
$$M_w \le \frac12\sqrt d\,\rho_{\xi_0} + 2\sqrt q \sum_{k=0}^{\ell_0-1} \|\Psi_k G\|\rho_w, \qquad M_v \le 2\overline M_v\, r \sum_{k=0}^{\ell_0-1} \|\Psi_k K\|, \qquad R \le \frac{2\overline C_v}{\overline m_v}\, r^{\frac{p-1}{p}}\, \frac{\sum_{k=0}^{\ell_0-1} \|\Psi_k K\|}{\big(\sum_{k=0}^{\ell_0-1} \|\Psi_k K\|^p\big)^{\frac1p}},$$
with $\ell_0$ as in the time-invariant case of Proposition 5.2, and where $G$ and $K$ denote the constant values of the internal noise and observer gain matrices, respectively. The superiority of these bounds can be checked using the matrix bounds in Proposition 5.1(i); their derivation is based on a simplified version of the arguments employed for the proof of Proposition 5.2.

6. Application to economic dispatch with distributed energy resources. In this section, we take advantage of the ambiguity sets constructed from noisy partial measurements, cf. Theorem 4.6, to hedge against the uncertainty in an optimal economic dispatch problem. Uncertainty naturally arises in this problem due to (dynamic) energy resources that the scheduler has no direct access to control or measure, such as storage or renewable energy elements. The financial implications of the associated decisions are of utmost importance for the electricity market and justify the use of a reliable decision framework that accounts for the variability of the uncertain factors.

6.1. Network model and optimization objective. Consider a network with distributed energy resources [12] comprising $n_1$ generator units and $n_2$ storage (battery) units. The network needs to operate as close as possible to a prescribed power demand $D$ at the end of the time horizon $[0:\ell]$, corresponding to a uniform discretization with step size $\delta t$ of the continuous-time domain. To this end, each generator and storage unit supplies the network with positive power $P^j$ and $S^\iota$, respectively, at time $\ell$. We assume we can control the power of the generators, which additionally needs to remain within the lower and upper thresholds $P^j_{\min}$ and $P^j_{\max}$, respectively. Each battery is modeled as an uncertain dynamic element with an unknown initial state distribution, and we can decide whether it is connected ($\eta^\iota = 1$) or not ($\eta^\iota = 0$) to the network at time $\ell$. Our goal is to minimize the energy cost while remaining as close as possible to the prescribed power demand. Thus, we minimize the overall cost
$$C(P,\eta) := \sum_{j=1}^{n_1} g_j(P^j) + \sum_{\iota=1}^{n_2} \eta^\iota h_\iota(S^\iota) + c\Big(\sum_{j=1}^{n_1} P^j + \sum_{\iota=1}^{n_2} \eta^\iota S^\iota - D\Big)^2, \tag{6.1}$$
where $P := (P^1,\dots,P^{n_1})$, $\eta := (\eta^1,\dots,\eta^{n_2})$, and $g_j$ and $h_\iota$ are cost functions for the power provided by generator $j$ and storage unit $\iota$, respectively. We treat the deviation of the injected power from its prescribed demand as a soft constraint by assigning it a quadratic cost with weight $c$ and augmenting the overall cost function (6.1). Due to the uncertainty about the batteries' state and their injected powers $S^\iota$, the minimization of (6.1) is a stochastic problem.

6.2. Battery dynamics and observation model. Each battery is modeled as a single-cell dynamic element, and we consider its current $I^\iota$, discharging over the operation interval (if connected to the network), as a fixed and a priori known function of time. Its dynamics is conveniently approximated by the equivalent circuit in Figure 2(a) (see, e.g., [29, 30]), where $z^\iota$ is the state of charge (SoC) of the cell and $O_{\rm cv}(z^\iota)$ is its corresponding open-circuit voltage, which we approximate by the affine function $\alpha^\iota z^\iota + \beta^\iota$ in Figure 2(b). The associated discrete-time cell model is
$$\chi^\iota_{k+1} \equiv \begin{pmatrix} I^{\iota,2}_{k+1} \\ z^\iota_{k+1} \end{pmatrix} = \begin{pmatrix} a^\iota & 0 \\ 0 & 1 \end{pmatrix}\begin{pmatrix} I^{\iota,2}_k \\ z^\iota_k \end{pmatrix} + \begin{pmatrix} 1-a^\iota \\ -\delta t/Q^\iota \end{pmatrix} I^\iota_k,$$
$$\theta^\iota_k \equiv V^\iota_k = \alpha^\iota z^\iota_k + \beta^\iota - I^\iota_k R^{\iota,1} - I^{\iota,2}_k R^{\iota,2},$$

Figure 2. (a) shows the equivalent circuit model of a lithium-ion battery cell in discharging mode (cf. [30, Figure 2], [29, Figure 1]). (b) is taken from [29, Figure 3] and shows the nonlinear dependence of the open-circuit voltage on the state of charge and its affine approximation.

where $a^\iota := e^{-\delta t/(R^{\iota,2} C^\iota)}$, $\delta t$ is the time discretization step, and $Q^\iota$ is the cell capacity. Here, we assume that for all $k \in [0:\ell]$ the cell is neither fully charged nor discharged (by, e.g., requiring that $0 < z_0 - \sum_{k=0}^{\ell-1} \delta t\, I^\iota_k/Q^\iota < 1$ for all $k$ and any candidate initial conditions and input currents), so that the evolution of its voltage is accurately represented by the above difference equation. The initial condition, comprising the SoC $z^\iota_0$ and the current $I^{\iota,2}_0$ through $R^{\iota,2}$, is random with an unknown probability distribution. We also consider additive measurement noise with an unknown distribution; namely, we measure
$$\theta^\iota_k = \alpha^\iota z^\iota_k + \beta^\iota - I^\iota_k R^{\iota,1} - I^{\iota,2}_k R^{\iota,2} + v_k.$$
To track the evolution of each random element through a linear system of the form (3.3), we consider for each battery a nominal state trajectory $\chi^{\iota,\star}_k = (I^{\iota,2,\star}_k, z^{\iota,\star}_k)$, initiated from the center of the support of its initial-state distribution. Setting $\xi^\iota_k = \chi^\iota_k - \chi^{\iota,\star}_k$ and $\zeta^\iota_k = \theta_k(\chi^\iota_k) - \theta_k(\chi^{\iota,\star}_k)$,
$$\xi^\iota_{k+1} = A^\iota_k \xi^\iota_k, \qquad \zeta^\iota_k = H^\iota_k \xi^\iota_k + v_k,$$
where $A^\iota_k := \operatorname{diag}(a^\iota, 1)$ and $H^\iota_k := (-R^{\iota,2}, \alpha^\iota)$. Denoting $\xi := (\xi^1,\dots,\xi^{n_2})$ and $\zeta := (\zeta^1,\dots,\zeta^{n_2})$, we obtain a system of the form (3.3) for the dynamic random variable $\xi$. Although the state distribution of the batteries across time is unknown, we assume access to output data from $N$ independent realizations of their dynamics over the horizon $[0:\ell]$. Using these samples, we exploit the results of the paper to build an ambiguity ball $\mathcal P^N$ of radius $\psi_N$ in the 2-Wasserstein distance (i.e., with $p = 2$) that contains the batteries' state distribution $P_{\xi_\ell}$ at time $\ell$ with prescribed probability $1-\beta$. In particular, we take the samples from each realization $i \in [1:N]$ and use an observer to estimate its state $\widehat\xi^i_\ell$ at time $\ell$. The ambiguity set is centered at the estimator-based empirical distribution $\widehat P^N_{\xi_\ell} = \frac1N\sum_{i=1}^N \delta_{\widehat\xi^i_\ell}$, and its radius can be determined using Theorem 4.6 and Proposition 4.5.
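To make the observation model concrete, the following sketch simulates one battery cell under the discrete-time model above and produces noisy voltage outputs. All parameter values (resistances, capacitance, capacity, OCV coefficients, noise level) are illustrative placeholders, not values from the paper.

```python
import math
import random

def simulate_cell(ell, z0, I2_0, I_profile, dt=1.0,
                  R1=0.05, R2=0.02, C2=2000.0, Q=3600.0,
                  alpha=1.5, beta=2.5, v_std=0.005, seed=0):
    # Discrete-time cell model with state chi_k = (I2_k, z_k):
    #   I2_{k+1} = a I2_k + (1 - a) I_k,   z_{k+1} = z_k - dt I_k / Q,
    # noisy output theta_k = alpha z_k + beta - I_k R1 - I2_k R2 + v_k.
    rng = random.Random(seed)
    a = math.exp(-dt / (R2 * C2))
    I2, z = I2_0, z0
    outputs = []
    for k in range(ell):
        I = I_profile[k]
        outputs.append(alpha * z + beta - I * R1 - I2 * R2
                       + rng.gauss(0.0, v_std))
        I2 = a * I2 + (1.0 - a) * I
        z = z - dt * I / Q
    return outputs
```

Running this for $N$ independent random initial conditions yields exactly the kind of output samples from which the observer-based empirical distribution above is formed.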


6.3. Decision problem as a distributionally robust optimization (DRO) problem. To solve the decision problem regarding whether or not to connect the batteries for economic dispatch, we formulate a distributionally robust optimization problem for the cost (6.1) using the ambiguity set $\mathcal P^N$. To do this, we derive an explicit expression of how the cost function $C$ depends on the stochastic argument $\xi_\ell$. Notice first that the power injected by each battery at time $\ell$ is
$$S^\iota = I^\iota_\ell V^\iota_\ell = I^\iota_\ell\big(\alpha^\iota z^\iota_\ell + \beta^\iota - I^\iota_\ell R^{\iota,1} - I^{\iota,2}_\ell R^{\iota,2}\big) = \langle(-I^\iota_\ell R^{\iota,2}, \alpha^\iota I^\iota_\ell), \chi^\iota_\ell\rangle + \beta^\iota I^\iota_\ell - (I^\iota_\ell)^2 R^{\iota,1} = \langle\boldsymbol\alpha^\iota, \xi^\iota_\ell\rangle + \bar\beta^\iota \equiv (\boldsymbol\alpha^\iota)^\top \xi^\iota_\ell + \bar\beta^\iota,$$
with $\boldsymbol\alpha^\iota := (-I^\iota_\ell R^{\iota,2}, \alpha^\iota I^\iota_\ell)$ and
$$\bar\beta^\iota := \langle\boldsymbol\alpha^\iota, \chi^{\iota,\star}_\ell\rangle + I^\iota_\ell\beta^\iota - (I^\iota_\ell)^2 R^{\iota,1} = -I^\iota_\ell I^{\iota,2,\star}_\ell R^{\iota,2} + \alpha^\iota I^\iota_\ell z^{\iota,\star}_\ell + I^\iota_\ell\beta^\iota - (I^\iota_\ell)^2 R^{\iota,1}.$$

Considering further affine costs $h_\iota(S) := \widetilde\alpha_\iota S + \widetilde\beta_\iota$ for the power provided by the batteries, the overall cost $C$ becomes
$$C(P,\eta) = g(P) + (\eta * \widehat{\boldsymbol\alpha})^\top \xi_\ell + \eta^\top\widehat{\boldsymbol\beta} + c\big(\mathbf 1^\top P + (\eta * \boldsymbol\alpha)^\top\xi_\ell + \eta^\top\bar{\boldsymbol\beta} - D\big)^2, \tag{6.2}$$
where $*$ denotes the Khatri-Rao product (cf. Section 2) and
$$g(P) := \sum_{j=1}^{n_1} g_j(P^j), \quad \boldsymbol\alpha := (\boldsymbol\alpha^1,\dots,\boldsymbol\alpha^{n_2}), \quad \bar{\boldsymbol\beta} := (\bar\beta^1,\dots,\bar\beta^{n_2}),$$
$$\widehat{\boldsymbol\alpha} := (\widetilde\alpha_1\boldsymbol\alpha^1,\dots,\widetilde\alpha_{n_2}\boldsymbol\alpha^{n_2}), \quad \widehat{\boldsymbol\beta} := (\widetilde\alpha_1\bar\beta^1 + \widetilde\beta_1,\dots,\widetilde\alpha_{n_2}\bar\beta^{n_2} + \widetilde\beta_{n_2}).$$

Using the equivalent description (6.2) for $C$ and recalling the lower and upper bounds $P^j_{\min}$ and $P^j_{\max}$ for the generators' power, we formulate the DRO power dispatch problem
$$\inf_{\eta, P}\ f_\eta(P) + \sup_{P_{\xi_\ell} \in \mathcal P^N} \mathbb E_{P_{\xi_\ell}}\big[h_\eta(P, \xi_\ell)\big], \tag{6.3a}$$
$$\text{s.t.}\quad P^j_{\min} \le P^j \le P^j_{\max}\ \ \forall j \in [1:n_1], \tag{6.3b}$$
with the ambiguity set $\mathcal P^N$ introduced above and
$$f_\eta(P) := g(P) + c P^\top\mathbf 1\mathbf 1^\top P + 2c(\eta^\top\bar{\boldsymbol\beta} - D)\mathbf 1^\top P + c(\eta^\top\bar{\boldsymbol\beta} - D)^2 + \eta^\top\widehat{\boldsymbol\beta},$$
$$h_\eta(P,\xi_\ell) := c\,\xi_\ell^\top(\eta * \boldsymbol\alpha)(\eta * \boldsymbol\alpha)^\top\xi_\ell + \big(2c(\mathbf 1^\top P + \eta^\top\bar{\boldsymbol\beta} - D)(\eta * \boldsymbol\alpha)^\top + (\eta * \widehat{\boldsymbol\alpha})^\top\big)\xi_\ell.$$
This formulation aims to minimize the worst-case expected cost with respect to the plausible distributions of $\xi$ at time $\ell$.

6.4. Tractable reformulation of the DRO problem. Our next goal is to obtain a tractable reformulation of the optimization problem (6.3). To this end, we first provide an equivalent description for the inner maximization in (6.3), which is carried out over a space of probability measures. Exploiting strong duality (see [20, Corollary 2(i)] or [5, Remark 1]) and recalling that our ambiguity set $\mathcal{P}^N$ is based on the 2-Wasserstein distance, we equivalently write the inner maximization problem $\sup_{P_{\xi_\ell}\in\mathcal{P}^N}\mathbb{E}_{P_{\xi_\ell}}[h_\eta(P,\xi_\ell)]$ as
$$\inf_{\lambda\ge 0}\ \lambda\psi_N^2 + \frac{1}{N}\sum_{i=1}^N\sup_{\xi_\ell\in\Xi}\big(h_\eta(P,\xi_\ell) - \lambda\|\xi_\ell - \widehat\xi^i_\ell\|^2\big), \qquad (6.4)$$

where $\psi_N\equiv\psi_N(\beta)$ is the radius of the ambiguity ball, $\Xi\subset\mathbb{R}^{2n_2}$ is the support of the batteries' unknown state distribution, and the $\widehat\xi^i_\ell$ are the estimated states of their realizations. We slightly relax the problem by allowing the ambiguity ball to contain all distributions with distance $\psi_N$ from $\widehat P^N_{\xi_\ell}$ that are supported on $\mathbb{R}^{2n_2}$ and not necessarily on $\Xi$. Thus, we first look to solve, for each estimated state $\widehat\xi^i_\ell$, the optimization problem
$$\sup_{\xi_\ell\in\mathbb{R}^{2n_2}}\ h_\eta(P,\xi_\ell) - \lambda\|\xi_\ell - \widehat\xi^i_\ell\|^2,$$

which is written as
$$\sup_{\xi_\ell\in\mathbb{R}^{2n_2}}\ \xi_\ell^\top A\xi_\ell + \big(2c(\mathbf{1}^\top P + \eta^\top\overline\beta - D)(\eta\ast\overline\alpha)^\top + (\eta\ast\widetilde\alpha)^\top\big)\xi_\ell - \lambda(\xi_\ell - \widehat\xi^i_\ell)^\top(\xi_\ell - \widehat\xi^i_\ell)$$
$$= -\lambda(\widehat\xi^i_\ell)^\top\widehat\xi^i_\ell + \sup_{\xi_\ell\in\mathbb{R}^{2n_2}}\ \xi_\ell^\top(A - \lambda I_{2n_2})\xi_\ell + \big(2c(\mathbf{1}^\top P + \eta^\top\overline\beta - D)(\eta\ast\overline\alpha)^\top + (\eta\ast\widetilde\alpha)^\top + 2\lambda(\widehat\xi^i_\ell)^\top\big)\xi_\ell$$
$$= -\lambda(\widehat\xi^i_\ell)^\top\widehat\xi^i_\ell + \sup_{\xi_\ell\in\mathbb{R}^{2n_2}}\ \xi_\ell^\top(A - \lambda I_{2n_2})\xi_\ell + (r^i)^\top\xi_\ell,$$
where $r^i \equiv r^i_\eta(P,\lambda) := 2c(\mathbf{1}^\top P + \eta^\top\overline\beta - D)(\eta\ast\overline\alpha) + \eta\ast\widetilde\alpha + 2\lambda\widehat\xi^i_\ell$ and $A\equiv A_\eta := c(\eta\ast\overline\alpha)(\eta\ast\overline\alpha)^\top$ is a symmetric positive semi-definite matrix with diagonalization $A = Q^\top\mathbf{D}Q$, where the eigenvalues decrease along the diagonal. Hence, we get

$$\sup_{\xi_\ell\in\mathbb{R}^{2n_2}}\ \xi_\ell^\top(A-\lambda I_{2n_2})\xi_\ell + (r^i)^\top\xi_\ell = \sup_{\xi_\ell\in\mathbb{R}^{2n_2}}\ \xi_\ell^\top(Q^\top\mathbf{D}Q - Q^\top\lambda I_{2n_2}Q)\xi_\ell + (r^i)^\top\xi_\ell = \sup_{\xi\in\mathbb{R}^{2n_2}}\ \xi^\top(\mathbf{D}-\lambda I_{2n_2})\xi + (\widetilde r^i)^\top\xi,$$
with $\widetilde r^i := Q r^i$. Denoting by $\lambda_{\max}(A)$ the maximum eigenvalue of $A$, we have
$$\sup_{\xi\in\mathbb{R}^{2n_2}}\ \xi^\top(\mathbf{D}-\lambda I_{2n_2})\xi + (\widetilde r^i)^\top\xi = \begin{cases}\infty, & 0\le\lambda<\lambda_{\max}(A),\\ \tfrac14(\widetilde r^i)^\top(\lambda I_{2n_2}-\mathbf{D})^{-1}\widetilde r^i, & \lambda>\lambda_{\max}(A).\end{cases} \qquad (6.5)$$
To obtain this we exploited that $\mathcal{Q}(\xi) := \xi^\top(\mathbf{D}-\lambda I_{2n_2})\xi + (\widetilde r^i)^\top\xi$ is maximized when
$$\nabla\mathcal{Q}(\xi^\star) = 0 \iff 2(\mathbf{D}-\lambda I_{2n_2})\xi^\star + \widetilde r^i = 0 \iff \xi^\star = \tfrac12(\lambda I_{2n_2}-\mathbf{D})^{-1}\widetilde r^i,$$


which gives the optimal value $\mathcal{Q}(\xi^\star) = \tfrac14(\widetilde r^i)^\top(\lambda I_{2n_2}-\mathbf{D})^{-1}\widetilde r^i$. Note that we do not need to specify the value of the expression in (6.5) for $\lambda = \lambda_{\max}(A)$. In particular, since the function we minimize in (6.4) is convex in $\lambda$, the inner part of the DRO problem is equivalently written as
$$\inf_{\lambda>\lambda_{\max}(A)}\ \lambda\Big(\psi_N^2 - \frac{1}{N}\sum_{i=1}^N(\widehat\xi^i_\ell)^\top\widehat\xi^i_\ell\Big) + \frac{1}{4N}\sum_{i=1}^N \widetilde r^i_\eta(P,\lambda)^\top(\lambda I_{2n_2}-\mathbf{D})^{-1}\widetilde r^i_\eta(P,\lambda).$$
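The closed form (6.5) underlying this step is easy to verify numerically. The following sketch is ours; the eigenvalues $d$ and the vector $r$ are made up for illustration.

```python
import random

# Numeric sanity check (ours) of the closed form (6.5): for diagonal D and
# lambda > lambda_max(A), the concave quadratic q(xi) = xi'(D - lambda I)xi + r'xi
# is maximized at xi* = (lambda I - D)^{-1} r / 2, with optimal value
# r'(lambda I - D)^{-1} r / 4. The values of d and r below are made up.
d = [2.0, 1.0, 0.5]
r = [0.3, -1.2, 0.7]
lam = 2.5  # strictly larger than lambda_max(A) = 2.0

def q(xi):
    return sum((d[j] - lam) * xi[j] ** 2 + r[j] * xi[j] for j in range(3))

xi_star = [r[j] / (2.0 * (lam - d[j])) for j in range(3)]
val_star = sum(r[j] ** 2 / (4.0 * (lam - d[j])) for j in range(3))
assert abs(q(xi_star) - val_star) < 1e-12

# No random perturbation of xi* improves the objective (strict concavity).
random.seed(1)
for _ in range(200):
    xi = [xi_star[j] + random.uniform(-1, 1) for j in range(3)]
    assert q(xi) <= val_star + 1e-12
```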

Taking further into account that
$$(\lambda I_{2n_2}-\mathbf{D})^{-1} = \operatorname{diag}\Big(\frac{1}{\lambda-\lambda_{\max}(A)},\ldots,\frac{1}{\lambda-\lambda_{\min}(A)}\Big),$$
as well as the constraints (6.3b) on the decision variable $P$, the overall DRO problem is reformulated as
$$\min_\eta\ \inf_{P,\lambda}\ f_\eta(P) + \lambda\Big(\psi_N^2 - \frac{1}{N}\sum_{i=1}^N(\widehat\xi^i_\ell)^\top\widehat\xi^i_\ell\Big) + \frac{1}{4N}\sum_{i=1}^N \widetilde r^i_\eta(P,\lambda)^\top\operatorname{diag}\Big(\frac{1}{\lambda-\lambda_{\max}(A)},\ldots,\frac{1}{\lambda-\lambda_{\min}(A)}\Big)\widetilde r^i_\eta(P,\lambda) \qquad (6.6a)$$
$$\text{subject to}\quad P^j_{\min}\le P^j\le P^j_{\max}\quad\forall j\in[1:n_1],\qquad \lambda>\lambda_{\max}(A). \qquad (6.6b)$$
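To illustrate the structure of the dual reformulation in a stripped-down setting, the following one-dimensional sketch (ours, not the paper's implementation; the quadratic coefficients and the samples are invented) evaluates the inner worst-case expectation by scanning the dual variable $\lambda$ and checks that it dominates the sample average and shrinks with the radius.

```python
# One-dimensional illustration (ours, not the paper's implementation) of the
# dual reformulation (6.4): the worst-case expectation of h(xi) = a*xi^2 + b*xi
# over a 2-Wasserstein ball of radius psi around the empirical distribution.
# The coefficients a, b and the samples are invented.
a, b = 0.8, -0.5
samples = [0.2, 0.5, 0.55, 0.9]
N = len(samples)

def h(x):
    return a * x * x + b * x

def dual_value(psi):
    # inf over lambda > a of
    #   lambda*psi^2 + (1/N) sum_i sup_xi { h(xi) - lambda*(xi - s_i)^2 },
    # using the closed form sup = -lambda*s^2 + (b + 2*lambda*s)^2/(4*(lambda - a)),
    # evaluated on a grid of lambda values.
    best = float("inf")
    for k in range(1, 2001):
        lam = a + 1e-3 + 0.01 * k
        val = lam * psi ** 2 + sum(
            -lam * s * s + (b + 2 * lam * s) ** 2 / (4 * (lam - a))
            for s in samples) / N
        best = min(best, val)
    return best

saa = sum(h(s) for s in samples) / N
# The worst case dominates the SAA value and shrinks with the radius psi.
assert dual_value(0.2) >= dual_value(0.05) >= saa - 1e-9
```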

6.5. Simulation results. For the simulations we consider $n_1 = 4$ generators and $n_2 = 3$ batteries with the same characteristics. We assume that the distributions of each initial SoC $z^\iota_0$ and current $I^{\iota,2}_0$ are known to be supported on the intervals $[0.45, 0.9]$ and $[1.5, 1.7]$, respectively. The true SoC distribution for batteries 2 and 3 at time zero is $P_{z^2_0} = P_{z^3_0} = \mathcal{U}[0.45, 0.65]$ ($\mathcal{U}$ denotes the uniform distribution). On the other hand, the provider of battery 1 has access to the distinct batteries 1A and 1B and randomly selects one of them with probabilities 0.9 and 0.1, respectively. The SoC distribution of battery 1A at time zero is $P_{z^{1A}_0} = \mathcal{U}[0.45, 0.65]$, whereas that of battery 1B is $P_{z^{1B}_0} = \mathcal{U}[0.84, 0.86]$. Thus, we get the bimodal distribution $P_{z^1_0} = 0.9\,\mathcal{U}[0.45, 0.65] + 0.1\,\mathcal{U}[0.84, 0.86]$, which is responsible for non-negligible empirical distribution variations, since for small sample sizes the relative percentage of samples from 1B can fairly frequently deviate significantly from its expected value. On the other hand, we assume that the true initial currents $I^{\iota,2}_0$ of all batteries are fixed to 1.6308, namely, $P_{I^{1,2}_0} = P_{I^{2,2}_0} = P_{I^{3,2}_0} = \delta_{1.6308}$. For the measurements, we consider the Gaussian mixture noise model $P_{v_k} = 0.5\,\mathcal{N}(0.01, 0.01^2) + 0.5\,\mathcal{N}(-0.01, 0.01^2)$, with $\mathcal{N}(\mu,\sigma^2)$ denoting the normal distribution with mean $\mu$ and variance $\sigma^2$.

To compute the ambiguity radius for the reformulated DRO problem (6.6), we specify its nominal and noise components $\varepsilon_N(\beta_{\rm nom},\rho_{\xi_\ell})$ and $\varepsilon_N(\beta_{\rm ns})$, where, due to Proposition 4.1, $\rho_{\xi_\ell}$ can be selected as half the diameter of any set containing the support of $P_{\xi_\ell}$ in the infinity norm. It follows directly from the specific dynamics of the batteries that $\rho_{\xi_\ell}$ does not exceed half the diameter of the initial conditions' distribution support, which is isometric to $[0.45, 0.9]^3\times[1.5, 1.7]^3\subset\mathbb{R}^6$. Hence, using Proposition 19 in the online version [9] with $p = 2$, $d = 6$, and $\rho_{\xi_\ell} = 0.225$, we obtain
$$\varepsilon_N(\beta_{\rm nom},\rho_{\xi_\ell}) = 4.02\,N^{-\frac16} + 1.31\,(\ln\beta_{\rm nom}^{-1})^{\frac14}N^{-\frac14}.$$


Figure 3. Results from 100 realizations of the power dispatch problem with $N = 10$ independent samples used for each realization. We compute the optimizers of the SAA and DRO problems, plot their corresponding optimal values (termed "SAA cost" and "DRO cost"), and also evaluate their performance with respect to the true distribution ("true cost with SAA optimizer" and "true cost with DRO optimizer"). With the exception of two realizations (whose DRO cost and true cost with the DRO optimizer are framed inside black boxes), the DRO cost is above the true cost of the DRO optimizer; that is, this happens with high probability. The plot also makes clear that the SAA solution tends to over-promise, since its value is most frequently below the true cost of the SAA optimizer.

To determine the noise radius, we first compute lower and upper bounds $m_v$ and $M_v$ for the $L^2$ norm of the Gaussian mixture noise $v_k$ and an upper bound $C_v$ for its $\psi_2$ norm. Denoting by $\mathbb{E}_P$ the integral with respect to the distribution $P$, we have for $P_{v_k} = 0.5\,\mathcal{N}(\mu_1,\sigma_1^2) + 0.5\,\mathcal{N}(\mu_2,\sigma_2^2)$ that $\|v_k\|_{L^2}^2 = \mathbb{E}_{\frac12(P_1+P_2)}[v_k^2] = \frac12(\mu_1^2+\sigma_1^2+\mu_2^2+\sigma_2^2)$, where $P_1 = \mathcal{N}(\mu_1,\sigma_1^2)$, $P_2 = \mathcal{N}(\mu_2,\sigma_2^2)$, and we used the fact that $\mathbb{E}_{P_i}[v_k^2] = \mu_i^2 + \mathbb{E}_{P_i}[(v_k-\mu_i)^2] = \mu_i^2+\sigma_i^2$. Hence, in our case, where $\mu_i = \sigma_i = 0.01$, we can pick $m_v = M_v = 0.01\sqrt2$. Further, using Proposition 21 from the online version [9], we can select $C_v = 0.01(\sqrt{8/3}+\sqrt{\ln 2})$. To perform the state estimation from the output samples we used a Kalman filter. Its initial condition covariance matrix corresponds to independent Gaussian distributions for each SoC $z^\iota_0$ and current $I^{\iota,2}_0$ with a standard deviation of the order of their assumed support. To model the measurement noise of the Kalman filter, we also select the same covariance as in the components of the Gaussian mixture noise. Using the dynamics of the filter and the values of $m_v$, $M_v$, and $C_v$ above, we obtain from (4.9b), (4.10a)-(4.10c), and (4.14) the constants $M_w = 0.325$, $M_v = 0.008$, and $R = 2.72$ for the expression of the noise radius. In particular, we have from Proposition 4.5 that $\varepsilon_N(\beta_{\rm ns}) = 0.47 + 0.0113\sqrt{(74.98/N)\ln(2/\beta_{\rm ns})}$ and the overall radius is
$$\psi_N(\beta) = 0.47 + 4.02\,N^{-\frac16} + 1.31\,(\ln\beta_{\rm nom}^{-1})^{\frac14}N^{-\frac14} + 0.0973\,(\ln(2\beta_{\rm ns}^{-1}))^{\frac12}N^{-\frac12}. \qquad (6.7)$$

Figure 4. Analogous results to those of Figure 3, from 100 realizations with (a) $N = 40$ and (b) $N = 160$ independent samples, and the ambiguity radius tuned so that the same confidence level is preserved. In both cases, the DRO cost is above the true cost of the DRO optimizer with high probability (in fact, always). Furthermore, the cost of the DRO optimizer (red star) is strictly better than the true cost of the SAA one (green circle) for a considerable number of realizations (highlighted in the illustrated boxes).

We assume that the energy cost of the generators is lower than that of the batteries and select the quadratic power generation cost $g(P) = 0.25\sum_{j=1}^4(P^j - 0.1)^2$ and the same lower/upper power thresholds $P^j_{\min} = 0.2$ and $P^j_{\max} = 0.5$ for all generators. For the batteries, we pick the same resistances $R_{\iota,1} = 0.34$ and $R_{\iota,2} = 0.17$, and we take $\alpha_\iota = 0.945$ and $I^\iota_k = 8$ for all times. We nevertheless use different linear costs $h_\iota(S) = \alpha^\iota S$ for their injected powers, with $\alpha^1 = 1$ and $\alpha^2 = \alpha^3 = 1.3$, since battery 1 is less reliable due to the large SoC fluctuation between its two modes.

We solve 100 independent realizations of the overall economic dispatch problem. For each of them, we generate independent samples from the batteries' initial condition distributions and solve the associated sample average approximation (SAA) and DRO problems for $N = 10$, $N = 40$, and $N = 160$ samples, respectively, using CVX [21]. It is worth noting that the radius $\psi_N$ given by (6.7) is rather conservative. The main reasons for this are 1) conservativeness of the concentration of measure results used for the derivation of the nominal radius, 2) lack of homogeneity of the distribution's support (the a priori known support of the $I^{\iota,2}_0$ components is much smaller than that of the $z^\iota_0$ ones), 3) independence of the batteries' individual distributions, which we have not exploited, and 4) conservative upper bounds for the estimation error. Although there is room to sharpen all these aspects, doing so requires multiple additional contributions and lies beyond the scope of the paper. Nevertheless, the formula (6.7) gives qualitative intuition about the decay rates of the ambiguity radius. In particular, it indicates that under the same confidence level and for small sample sizes, an ambiguity radius proportional to $N^{-\frac14}$ is a reasonable choice. Based on this, we selected the ambiguity radii 0.05, 0.0354, and 0.025 for $N = 10$, $N = 40$, and $N = 160$. The associated simulation results are shown in Figures 3, 4(a), and 4(b), respectively. We plot there the optimal values of the SAA and DRO problems (termed "SAA cost" and "DRO cost") and provide the expected performance of their respective decisions with respect to the true distribution ("true cost with SAA optimizer" and "true cost with DRO optimizer"). We observe that in all three cases, the DRO value is above the true cost of the DRO optimizer for nearly all realizations (and for all when $N$ is 40 or 160), which verifies the out-of-sample guarantees that we seek in DRO formulations [17, Theorem 3.5]. In addition, when solving the problem for 40 or 160 samples, we witness a clear superiority of the DRO decision compared to that of the non-robust SAA, since it considerably improves the true cost for a significant number of realizations (cf. Figure 4).

6.6. Discussion. The SAA solution tends to consistently promise a better outcome than what the true distribution reveals for the same decision (e.g., the magenta circle usually lies under the green circle in all figures). This rarely happens for the DRO solution, and when it does, it is only by a small margin. This makes the DRO approach preferable to the SAA one in the context of power systems operations, where honoring commitments at a much higher cost than anticipated may result in significant losses, and not fulfilling commitments may lead to penalties from the system operator.

7. Conclusions. We have constructed high-confidence ambiguity sets for dynamic random variables using partial-state measurements from independent realizations of their evolution. In our model, both the dynamics and the measurements are subject to disturbances with unknown probability distributions. The ambiguity sets are built using an observer to estimate the full state of each realization and leveraging concentration of measure inequalities. For systems that are either time-invariant and detectable, or uniformly observable, we have established uniform boundedness of the ambiguity radius. To aid the associated probabilistic guarantees, we have also provided auxiliary concentration of measure results. Future research will include the consideration of robust state estimation criteria to mitigate the noise effect on the ambiguity radius, the extension of the results to nonlinear dynamics, and the construction of ambiguity sets with information about the moments.

8. Appendix. Here we give proofs of various results of the paper.

Proof of Lemma 2.1. Since $X$ and $Y$ are independent, their joint distribution $P_{(X,Y)}$ is the product measure $P_X\otimes P_Y$ of the individual distributions $P_X$ and $P_Y$. Thus, from the Fubini theorem [1, Theorem 2.6.5] and integrability of $g$, $k$ we get
$$\mathbb{E}[g(X,Y)] = \int_{\mathbb{R}^{n_1}\times\mathbb{R}^{n_2}} g(x,y)\,dP_{(X,Y)} = \int_{\mathbb{R}^{n_1}}\int_{\mathbb{R}^{n_2}} g(x,y)\,dP_Y\,dP_X = \int_{\mathbb{R}^{n_1}}\mathbb{E}[g(x,Y)]\,dP_X$$
$$= \int_K \mathbb{E}[g(x,Y)]\,dP_X \ge \int_K k(x)\,dP_X = \int_{\mathbb{R}^{n_1}} k(x)\,dP_X = \mathbb{E}[k(X)],$$


which concludes the proof.

Proof of Lemma 4.2. Using [8, Lemma A.2] to bound the Wasserstein distance between two discrete distributions, we get $W_p(\widehat P^N_{\xi_\ell}, P^N_{\xi_\ell}) \le \big(\frac{1}{N}\sum_{i=1}^N\|\xi^i_\ell - \widehat\xi^i_\ell\|^p\big)^{\frac1p} = \big(\frac{1}{N}\sum_{i=1}^N\|e^i_\ell\|^p\big)^{\frac1p}$. From (4.3), we have
$$\|e^i_\ell\| = \Big\|\Psi_\ell z^i + \sum_{k=1}^\ell\big(\Psi_{\ell,\ell-k+1}G_{\ell-k}\omega^i_{\ell-k} + \Psi_{\ell,\ell-k+1}K_{\ell-k}v^i_{\ell-k}\big)\Big\|$$
$$\le \|\Psi_\ell\|\|z^i\| + \sum_{k=1}^\ell\|\Psi_{\ell,\ell-k+1}G_{\ell-k}\|\|\omega^i_{\ell-k}\| + \sum_{k=1}^\ell\|\Psi_{\ell,\ell-k+1}K_{\ell-k}\|\|v^i_{\ell-k}\|_1 =: M(z^i,\omega^i) + E(v^i),$$

with $E(v^i)\equiv E^i$ given in the statement. Since $(a+b)^p\le 2^{p-1}(a^p+b^p)$ for $a,b\ge 0$ and $p\ge 1$,
$$W_p(\widehat P^N_{\xi_\ell}, P^N_{\xi_\ell}) \le \Big(\frac{2^{p-1}}{N}\sum_{i=1}^N\big(M(z^i,\omega^i)^p + (E^i)^p\big)\Big)^{\frac1p}.$$

Next, using $(a+b)^{\frac1p}\le a^{\frac1p}+b^{\frac1p}$ for $a,b\ge0$ and $p\ge1$, we have
$$W_p(\widehat P^N_{\xi_\ell}, P^N_{\xi_\ell}) \le \Big(\frac{2^{p-1}}{N}\sum_{i=1}^N M(z^i,\omega^i)^p\Big)^{\frac1p} + \Big(\frac{2^{p-1}}{N}\sum_{i=1}^N(E^i)^p\Big)^{\frac1p}. \qquad (8.1)$$
Finally, since $(z,\omega)\in B^{Nd}_\infty(\rho_{\xi_0})\times B^{N\ell q}_\infty(\rho_w)$, we get $M(z^i,\omega^i)\le\|\Psi_\ell\|\sqrt d\,\|z^i\|_\infty + \sum_{k=1}^\ell\|\Psi_{\ell,\ell-k+1}G_{\ell-k}\|\sqrt q\,\|\omega^i_{\ell-k}\|_\infty\le M_w$. This combined with (8.1) yields (4.9a).
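The first inequality of the proof, which bounds the Wasserstein distance by the index-paired estimation errors, can be illustrated in one dimension, where the optimal coupling between two equally weighted empirical measures simply sorts both supports. The sketch below is ours; all sample values are synthetic.

```python
import random

# Quick 1-D illustration (ours, not from the paper) of the bound used in the
# proof of Lemma 4.2: pairing the i-th true sample with the i-th estimate gives
# a feasible coupling, so W_p is at most the averaged p-th power of the errors.
p = 2
random.seed(2)
xi    = [random.uniform(0, 1) for _ in range(50)]    # true states (synthetic)
e     = [random.gauss(0, 0.05) for _ in range(50)]   # estimation errors
xihat = [x + err for x, err in zip(xi, e)]           # estimated states

# In 1-D, the optimal coupling of two N-point empirical measures sorts both lists.
wp = (sum(abs(a - b) ** p for a, b in zip(sorted(xi), sorted(xihat))) / 50) ** (1 / p)
bound = (sum(abs(err) ** p for err in e) / 50) ** (1 / p)
assert wp <= bound + 1e-12
```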

Proof of Lemma 4.3. From H4 in Assumption 3.2, we obtain for each summand in (4.9c)
$$\big\|\|\Psi_{\ell,\ell-k+1}K_{\ell-k}\|\|v^i_{\ell-k}\|_1\big\|_{\psi_p} \le \|\Psi_{\ell,\ell-k+1}K_{\ell-k}\|\big(\|v^i_{\ell-k,1}\|_{\psi_p} + \cdots + \|v^i_{\ell-k,r}\|_{\psi_p}\big) \le C_v r\,\|\Psi_{\ell,\ell-k+1}K_{\ell-k}\|.$$
Hence, we deduce that
$$\|E^i\|_{\psi_p} \le \sum_{k=1}^\ell\big\|\|\Psi_{\ell,\ell-k+1}K_{\ell-k}\|\|v^i_{\ell-k}\|_1\big\|_{\psi_p} \le C_v r\sum_{k=1}^\ell\|\Psi_{\ell,\ell-k+1}K_{\ell-k}\|.$$

For the $L^p$ bounds, note that $\|E^i\|_p = \big\|\sum_{k\in[1:\ell],\,l\in[1:r]}\|\Psi_{\ell,\ell-k+1}K_{\ell-k}\|\,|v^i_{\ell-k,l}|\big\|_p$. Thus, from the inequality $\|\sum_i c_iX_i\|_p \le \sum_i c_i\|X_i\|_p$, which holds for any nonnegative $c_i$ and $X_i\in L^p$,
$$\|E^i\|_p \le \sum_{k\in[1:\ell],\,l\in[1:r]}\|\Psi_{\ell,\ell-k+1}K_{\ell-k}\|\,\|v^i_{\ell-k,l}\|_p,$$


which, by the upper bound in H4 of Assumption 3.2, implies (4.10a). For the other bound, we exploit linearity of the expectation and the inequality $\big(\sum_i c_i\big)^p \ge \sum_i c_i^p$, which holds for any nonnegative $c_i$, to get
$$\big(\mathbb{E}[(E^i)^p]\big)^{\frac1p} \ge \Big(\sum_{k\in[1:\ell],\,l\in[1:r]}\|\Psi_{\ell,\ell-k+1}K_{\ell-k}\|^p\,\mathbb{E}\big[|v^i_{\ell-k,l}|\big]^p\Big)^{\frac1p}.$$
Thus, from the lower bound in H4 of Assumption 3.2 we also obtain (4.10c).

We next prove Proposition 4.4, along the lines of the proof of [41, Theorem 3.1.1], which considers the special case of sub-Gaussian distributions. We rely on the following concentration inequality [41, Corollary 2.8.3].

Proposition 8.1 (Bernstein inequality). Let $X_1,\ldots,X_N$ be scalar, mean-zero, sub-exponential, independent random variables. Then, for every $t\ge0$ we have
$$\mathbb{P}\Big(\Big|\frac{1}{N}\sum_{i=1}^N X_i\Big|\ge t\Big) \le 2\exp\Big(-c'\min\Big\{\frac{t^2}{R^2},\frac{t}{R}\Big\}N\Big),$$
where $c' = 1/10$ and $R := \max_{i\in[1:N]}\|X_i\|_{\psi_1}$.

The precise constant $c'$ above is not specified in [41], but we provide an independent proof of this result in the online version [9, Section 8.2].
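The Bernstein bound with $c' = 1/10$ can be sanity-checked by Monte Carlo. The sketch below is ours: it uses centered $\mathrm{Exp}(1)$ variables, which are sub-exponential, and $R = 3.5$ is an assumed (offline-verified) upper bound on their $\psi_1$ norm.

```python
import math, random

# Monte Carlo sanity check (ours, not from the paper) of the Bernstein bound in
# Proposition 8.1 with c' = 1/10, for centered Exp(1) variables, which are
# sub-exponential. R = 3.5 is an assumed (offline-verified) upper bound on the
# psi_1 norm of Exp(1) - 1.
random.seed(3)
N, trials, t, R = 50, 2000, 2.0, 3.5
hits = sum(
    abs(sum(random.expovariate(1.0) - 1.0 for _ in range(N)) / N) >= t
    for _ in range(trials))
bound = 2 * math.exp(-0.1 * min(t ** 2 / R ** 2, t / R) * N)
# The empirical tail frequency must respect the theoretical bound.
assert hits / trials <= bound
```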

Proof of Proposition 4.4. Note that each random variable $X_i^p - 1$ is mean zero by assumption. Additionally, we have that $\|X_i^p - 1\|_{\psi_1} \le \|X_i^p\|_{\psi_1} + \|1\|_{\psi_1} = \|X_i\|_{\psi_p}^p + 1/\ln 2 \le R$, where we took into account that
$$\mathbb{E}[\psi_1(X_i^p/t^p)] = \mathbb{E}[\psi_p(X_i/t)] \implies \|X_i^p\|_{\psi_1} = \|X_i\|_{\psi_p}^p,$$
and the following fact, shown after the proof.

Fact I. For any constant random variable $X = \mu\in\mathbb{R}$, it holds that $\|X\|_{\psi_p} = |\mu|/(\ln 2)^{\frac1p}$.

Thus, we get from Proposition 8.1 that
$$\mathbb{P}\Big(\Big|\frac{1}{N}\sum_{i=1}^N X_i^p - 1\Big|\ge t\Big) \le 2\exp\Big(-\frac{c'N}{R^2}\min\{t^2,t\}\Big), \qquad (8.2)$$
where we used the fact that $R > 1$. We will further leverage the following facts, shown after the proof of the proposition.

Fact II. For all $p\ge1$ and $z\ge0$, it holds that $|z-1|\ge\delta \Rightarrow |z^p-1|\ge\max\{\delta,\delta^p\}$.

Fact III. For any $\delta\ge0$, if $u = \max\{\delta,\delta^p\}$, then $\min\{u,u^2\} = \alpha_p(\delta)$, with $\alpha_p$ as given by (4.12).

By exploiting Fact II, we get
$$\mathbb{P}\Big(\Big|\Big(\frac{1}{N}\sum_{i=1}^N X_i^p\Big)^{\frac1p} - 1\Big|\ge t\Big) \le \mathbb{P}\Big(\Big|\frac{1}{N}\sum_{i=1}^N X_i^p - 1\Big|\ge\max\{t,t^p\}\Big)$$
$$\le 2\exp\Big(-\frac{c'N}{R^2}\min\big\{\max\{t,t^p\}^2,\max\{t,t^p\}\big\}\Big).$$
Thus, since $\mathbb{P}(|Y|\ge t)\ge\mathbb{P}(Y\ge t)$ for any random variable $Y$, we obtain (4.11) from Fact III and conclude the proof.

Proof of Fact I. From the $\psi_p$ norm definition, $\|X\|_{\psi_p} = \inf\{t>0 \mid \mathbb{E}[e^{(|X|/t)^p}]\le 2\} = \inf\{t>0 \mid t\ge|\mu|/(\ln 2)^{\frac1p}\} = |\mu|/(\ln 2)^{\frac1p}$, which establishes the result.

Proof of Fact II. Assume first that $z<1$. Then we have $|z^p-1| = 1-z^p \ge 1-z \ge\delta\ge\delta^p$. Next, let $z\ge1$. Then we get $|z^p-1| = z^p-1\ge z-1\ge\delta$. In addition, when $\delta^p\ge\delta$, namely when $\delta\ge1$, we have that $z^p-(z-1)^p\ge1$, and hence $|z^p-1| = z^p-1\ge(z-1)^p\ge\delta^p$.

Proof of Fact III. We consider two cases. Case (i): $0\le\delta\le1 \Rightarrow \delta\ge\delta^p \Rightarrow u = \max\{\delta,\delta^p\} = \delta$. Then $\min\{u,u^2\} = \min\{\delta,\delta^2\} = \delta^2$. Case (ii): $\delta>1 \Rightarrow \delta\le\delta^p \Rightarrow u = \max\{\delta,\delta^p\} = \delta^p$. Then $\min\{u,u^2\} = \min\{\delta^p,\delta^{2p}\} = \delta^p$. Thus, we get that $\min\{u,u^2\} = \alpha_p(\delta)$ for all $\delta\ge0$.
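The case analysis in Fact III can be checked on a grid. In the sketch below (ours), we encode $\alpha_p$ directly from the two cases above, i.e., $\alpha_p(\delta) = \delta^2$ for $\delta\le1$ and $\delta^p$ for $\delta>1$, which we take as the definition (4.12).

```python
# Grid check (ours) of Fact III: min{u, u^2} with u = max{delta, delta^p}
# equals alpha_p(delta) = delta^2 for delta <= 1 and delta^p for delta > 1,
# which is how we read (4.12) from the case analysis above.
def alpha_p(delta, p):
    return delta ** 2 if delta <= 1 else delta ** p

for p in (1.0, 1.5, 2.0, 3.0):
    for k in range(0, 61):
        delta = k / 20  # grid over [0, 3]
        u = max(delta, delta ** p)
        assert abs(min(u, u * u) - alpha_p(delta, p)) < 1e-9
```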

Proof of Lemma 4.11. The proof of the first part is found inside the proof of [34, Lemma 1]. To prove the second part, let $\pi$ be an optimal coupling of $P_1$ and $P_2$ for the Wasserstein distance $W_p(P_1,P_2)$ and define
$$\overline\pi(B) := \int_{\mathbb{R}^d\times\mathbb{R}^d}\int_{\mathbb{R}^d}\mathbf{1}_B(x, y+z)\,Q(dz)\,\pi(dx,dy)$$
for any measurable subset $B$ of $\mathbb{R}^d\times\mathbb{R}^d$. Then its marginals satisfy
$$\overline\pi(\mathbb{R}^d\times B_y) = \int_{\mathbb{R}^d\times\mathbb{R}^d}\int_{\mathbb{R}^d}\mathbf{1}_{B_y}(y+z)\,Q(dz)\,\pi(dx,dy) = \int_{\mathbb{R}^d}\int_{\mathbb{R}^d}\mathbf{1}_{B_y}(y+z)\,Q(dz)\,P_2(dy) = P_2\star Q(B_y)$$

and $\overline\pi(B_x\times\mathbb{R}^d) = P_1(B_x)$, which is derived by analogous arguments. In addition,
$$\Big(\int_{\mathbb{R}^d\times\mathbb{R}^d}\|x-y\|^p\,\overline\pi(dx,dy)\Big)^{\frac1p} = \Big(\int_{\mathbb{R}^d\times\mathbb{R}^d}\int_{\mathbb{R}^d}\|x-(y+z)\|^p\,Q(dz)\,\pi(dx,dy)\Big)^{\frac1p}$$
$$= \Big(\int_{\mathbb{R}^d\times\mathbb{R}^d}\int_{\mathbb{R}^d}\|(x-y)-z\|^p\,Q(dz)\,\pi(dx,dy)\Big)^{\frac1p}$$
$$\le \Big(\int_{\mathbb{R}^d\times\mathbb{R}^d}\int_{\mathbb{R}^d}\|x-y\|^p\,Q(dz)\,\pi(dx,dy)\Big)^{\frac1p} + \Big(\int_{\mathbb{R}^d\times\mathbb{R}^d}\int_{\mathbb{R}^d}\|z\|^p\,Q(dz)\,\pi(dx,dy)\Big)^{\frac1p}$$
$$\le W_p(P_1,P_2) + \Big(\int_{\mathbb{R}^d\times\mathbb{R}^d} q^p\,\pi(dx,dy)\Big)^{\frac1p} = W_p(P_1,P_2) + q,$$
where we used the definition of $\overline\pi$ in the first equality, the triangle inequality for the space $L^p(\mathbb{R}^d\times\mathbb{R}^{2d};Q\otimes\pi)$ in the first inequality, and the bound $\big(\int_{\mathbb{R}^d}\|x\|^p\,Q(dx)\big)^{\frac1p}\le q$ in the last inequality. This concludes also the second part of the proof.
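The second part of Lemma 4.11 can also be observed empirically. The one-dimensional sketch below is ours: with $p = 1$, both empirical measures are uniform over finitely many atoms, and the exact 1-D Wasserstein distance between same-size uniform discrete measures is obtained by sorting both supports.

```python
import random

# Empirical 1-D check (ours) of the second part of Lemma 4.11 with p = 1:
# W_1(P1, P2 * Q) <= W_1(P1, P2) + q, where q is the first absolute moment of Q.
random.seed(4)
N, M = 30, 20
P1 = [random.gauss(0.0, 1.0) for _ in range(N)]
P2 = [random.gauss(0.5, 1.0) for _ in range(N)]
Q  = [random.uniform(-0.3, 0.3) for _ in range(M)]
q  = sum(abs(z) for z in Q) / M

# For uniform discrete measures with equally many atoms, the optimal 1-D
# coupling sorts both supports; replicating P1's atoms M times matches sizes.
conv   = sorted(y + z for y in P2 for z in Q)   # atoms of the convolution P2 * Q
P1_rep = sorted(x for x in P1 for _ in range(M))
w1_conv = sum(abs(a - b) for a, b in zip(P1_rep, conv)) / (N * M)
w1_12   = sum(abs(a - b) for a, b in zip(sorted(P1), sorted(P2))) / N
assert w1_conv <= w1_12 + q + 1e-12
```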


REFERENCES

[1] R. B. Ash, Real Analysis and Probability, Academic Press, 1972.
[2] A. Ben-Tal, L. El Ghaoui, and A. Nemirovski, Robust Optimization, Princeton University Press, 2009.
[3] D. Bertsimas, V. Gupta, and N. Kallus, Robust sample average approximation, Mathematical Programming, 171 (2018), pp. 217-282.
[4] J. Blanchet, Y. Kang, and K. Murthy, Robust Wasserstein profile inference and applications to machine learning, Journal of Applied Probability, 56 (2019), pp. 830-857.
[5] J. Blanchet and K. Murthy, Quantifying distributional model risk via optimal transport, Mathematics of Operations Research, 44 (2019), pp. 565-600.
[6] V. Bogachev, Measure Theory, vol. 1, Springer, 2007.
[7] D. Boskos, J. Cortés, and S. Martínez, Data-driven ambiguity sets for linear systems under disturbances and noisy observations, in American Control Conference, Denver, CO, July 2020, pp. 4491-4496.
[8] D. Boskos, J. Cortés, and S. Martínez, Data-driven ambiguity sets with probabilistic guarantees for dynamic processes, IEEE Transactions on Automatic Control, 66 (2021), pp. 2991-3006.
[9] D. Boskos, J. Cortés, and S. Martínez, High-confidence data-driven ambiguity sets for time-varying linear systems, https://arxiv.org/abs/2102.01142, 2021.
[10] F. Boso, D. Boskos, J. Cortés, S. Martínez, and D. M. Tartakovsky, Dynamics of data-driven ambiguity sets for hyperbolic conservation laws with uncertain inputs, 2021. To appear.
[11] Z. Chen, D. Kuhn, and W. Wiesemann, Data-driven chance constrained programs over Wasserstein balls, arXiv preprint arXiv:1809.00210, 2018.
[12] A. Cherukuri and J. Cortés, Distributed coordination of DERs with storage for dynamic economic dispatch, IEEE Transactions on Automatic Control, 63 (2018), pp. 835-842.
[13] A. Cherukuri and J. Cortés, Cooperative data-driven distributionally robust optimization, IEEE Transactions on Automatic Control, 65 (2020), pp. 4400-4407.
[14] J. Coulson, J. Lygeros, and F. Dörfler, Data-enabled predictive control: In the shallows of the DeePC, in European Control Conference, 2019, pp. 307-312.
[15] J. Dedecker and F. Merlevède, Behavior of the empirical Wasserstein distance in R^d under moment conditions, Electronic Journal of Probability, 24 (2019).
[16] S. Dereich, M. Scheutzow, and R. Schottstedt, Constructive quantization: Approximation by empirical measures, Annales de l'Institut Henri Poincaré, Probabilités et Statistiques, 49 (2013), pp. 1183-1203.
[17] P. M. Esfahani and D. Kuhn, Data-driven distributionally robust optimization using the Wasserstein metric: performance guarantees and tractable reformulations, Mathematical Programming, 171 (2018), pp. 115-166.
[18] N. Fournier and A. Guillin, On the rate of convergence in Wasserstein distance of the empirical measure, Probability Theory and Related Fields, 162 (2015), pp. 707-738.
[19] R. Gao, X. Chen, and A. J. Kleywegt, Wasserstein distributional robustness and regularization in statistical learning, arXiv preprint arXiv:1712.06050, 2017.
[20] R. Gao and A. Kleywegt, Distributionally robust stochastic optimization with Wasserstein distance, arXiv preprint arXiv:1604.02199, 2016.
[21] M. Grant and S. Boyd, CVX: Matlab software for disciplined convex programming, version 2.1, http://cvxr.com/cvx, Mar. 2014.
[22] Y. Guo, K. Baker, E. Dall'Anese, Z. Hu, and T. H. Summers, Data-based distributionally robust stochastic optimal power flow - Part I: Methodologies, IEEE Transactions on Power Systems, 34 (2018), pp. 1483-1492.
[23] A. Hota, A. Cherukuri, and J. Lygeros, Data-driven chance constrained optimization under Wasserstein ambiguity sets, in American Control Conference, Philadelphia, PA, USA, 2019, pp. 1501-1506.
[24] B. Kloeckner, Empirical measures: regularity is a counter-curse to dimensionality, arXiv preprint arXiv:1802.04038, 2019, http://dx.doi.org/https://doi.org/10.1051/ps/2019025.
[25] B. Li, J. Mathieu, and R. Jiang, Distributionally robust chance constrained optimal power flow assuming log-concave distributions, in Power Systems Computation Conference, 2018, pp. 1-7.
[26] D. Li, D. Fooladivanda, and S. Martínez, Data-driven variable speed limit design for highways via distributionally robust optimization, in European Control Conference, Napoli, Italy, June 2019, pp. 1055-1061.
[27] D. Li, D. Fooladivanda, and S. Martínez, Online learning of parameterized uncertain dynamical environments with finite-sample guarantees, IEEE Control Systems Letters, 2020.
[28] D. Li and S. Martínez, Online data assimilation in distributionally robust optimization, in IEEE Int. Conf. on Decision and Control, Miami, FL, USA, December 2018, pp. 1961-1966.
[29] M. Li, Li-ion dynamics and state of charge estimation, Renewable Energy, 100 (2017), pp. 44-52.
[30] B. Y. Liaw, G. Nagasubramanian, R. G. Jungst, and D. H. Doughty, Modeling of lithium ion cells - A simple equivalent-circuit model approach, Solid State Ionics, 175 (2004), pp. 835-839.
[31] J. Liu, Y. Chen, C. Duan, J. Lin, and J. Lyu, Distributionally robust optimal reactive power dispatch with Wasserstein distance in active distribution network, Journal of Modern Power Systems and Clean Energy, 8 (2020), pp. 426-436.
[32] S. Liu, Matrix results on the Khatri-Rao and Tracy-Singh products, Linear Algebra and its Applications, 289 (1999), pp. 267-277.
[33] J. B. Moore and B. D. O. Anderson, Coping with singular transition matrices in estimation and control stability theory, International Journal of Control, 31 (1980), pp. 571-586.
[34] G. C. Pflug and A. Pichler, From empirical observations to tree models for stochastic optimization: convergence properties, SIAM Journal on Optimization, 26 (2016), pp. 1715-1740.
[35] B. K. Poolla, A. R. Hota, S. Bolognani, D. S. Callaway, and A. Cherukuri, Wasserstein distributionally robust look-ahead economic dispatch, arXiv preprint arXiv:2003.04874, 2020.
[36] S. Shafieezadeh-Abadeh, D. Kuhn, and P. M. Esfahani, Regularization via mass transportation, Journal of Machine Learning Research, 20 (2019), pp. 1-68.
[37] S. Shafieezadeh-Abadeh, V. A. Nguyen, D. Kuhn, and P. M. Esfahani, Wasserstein distributionally robust Kalman filtering, in Advances in Neural Information Processing Systems, 2018, pp. 8474-8483.
[38] A. Shapiro, Distributionally robust stochastic programming, SIAM Journal on Optimization, 27 (2017), pp. 2258-2275.
[39] A. Shapiro, D. Dentcheva, and A. Ruszczyński, Lectures on Stochastic Programming: Modeling and Theory, vol. 16, SIAM, Philadelphia, PA, 2014.
[40] E. D. Sontag, Mathematical Control Theory: Deterministic Finite Dimensional Systems, Springer, 1998.
[41] R. Vershynin, High-Dimensional Probability: An Introduction with Applications in Data Science, vol. 47, Cambridge University Press, 2018.
[42] C. Villani, Topics in Optimal Transportation, no. 58 in Graduate Studies in Mathematics, American Mathematical Society, 2003.
[43] J. Weed and F. Bach, Sharp asymptotic and finite-sample rates of convergence of empirical measures in Wasserstein distance, Bernoulli, 25 (2019), pp. 2620-2648.
[44] J. Weed and Q. Berthet, Estimation of smooth densities in Wasserstein distance, arXiv preprint arXiv:1902.01778, 2019.
[45] F. Xin, B.-M. Hodge, L. Fangxing, D. Ershun, and K. Chongqing, Adjustable and distributionally robust chance-constrained economic dispatch considering wind power uncertainty, Journal of Modern Power Systems and Clean Energy, 7 (2019), pp. 658-664.
[46] I. Yang, A convex optimization approach to distributionally robust Markov decision processes with Wasserstein distance, IEEE Control Systems Letters, 1 (2017), pp. 164-169.
[47] I. Yang, Wasserstein distributionally robust stochastic control: A data-driven approach, arXiv preprint arXiv:1812.09808, 2018.
[48] S. Zeng, Sample-based population observers, Automatica, 101 (2019), pp. 166-174.
