Universal approximators for Direct Policy Search in multi-purpose water reservoir management: A comparative analysis

Matteo Giuliani, Emanuele Mason, Andrea Castelletti, Francesca Pianosi, Rodolfo Soncini-Sessa
Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Milano, Italy
Hydroinformatics Lab, Como Campus, Politecnico di Milano, Italy
Department of Civil and Environmental Engineering, University of Bristol, Bristol, UK

IFAC 2014, Cape Town, ZA. Modelling and Control of Water Systems


Page 1: Universal approximators for Direct Policy Search in multi-purpose water reservoir management

Matteo Giuliani, Emanuele Mason, Andrea Castelletti, Francesca Pianosi, Rodolfo Soncini-Sessa Dipartimento di Elettronica, Informazione, e Bioingegneria, Politecnico di Milano, Milano, Italy Hydroinformatics Lab, Como Campus, Politecnico di Milano, Italy Department of Civil and Environmental Engineering, University of Bristol, Bristol, UK

Universal approximators for Direct Policy Search in multi-purpose water reservoir management: A comparative analysis

IFAC 2014, Cape Town, ZA

Modelling and Control of Water Systems

Page 2: Universal approximators for Direct Policy Search in multi-purpose water reservoir management

Controlling hydro-environmental systems

The long-term optimal operation of hydro-environmental systems can be formulated as a q-objective stochastic optimal control problem:

$$\min_{\mu_t(\cdot)} \; \mathbf{J} = \left| J^1, J^2, \ldots, J^q \right|$$

where the i-th objective is the expected long-term discounted cost

$$J^i = \lim_{h \to \infty} \mathbb{E}_{\varepsilon_1, \ldots, \varepsilon_h} \left[ \sum_{t=0}^{h-1} \gamma^t \, g^i_t(x_t, u_t, \varepsilon_{t+1}) \right] \qquad \forall i = 1, \ldots, q$$

with $\gamma$ the discount factor and $g^i_t$ the i-th immediate cost, subject to

$$x_{t+1} = f_t(x_t, u_t, \varepsilon_{t+1}) \qquad \varepsilon_{t+1} \sim \phi(\cdot) \qquad u_t = \mu_t(x_t)$$

with state $x_t \in \mathbb{R}^{n_x}$, control $u_t \in \mathbb{R}^{n_u}$, and disturbance $\varepsilon_t \in \mathbb{R}^{n_\varepsilon}$.
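The formulation above can be made concrete with a toy closed-loop simulation. Everything below (the one-dimensional storage dynamics, the half-storage rule, the deficit cost, and the uniform disturbance) is illustrative, not the paper's model: it only shows how an objective J is estimated by accumulating discounted immediate costs along a simulated trajectory.

```python
import random

def step(x, u, eps, capacity=100.0):
    """Mass-balance dynamics x_{t+1} = f(x_t, u_t, eps_{t+1})."""
    return min(max(x - u + eps, 0.0), capacity)

def policy(x):
    """A hypothetical operating rule mu(x): release half the storage."""
    return 0.5 * x

def estimate_objective(g, x0=50.0, gamma=0.99, horizon=5000, seed=1):
    """Monte Carlo estimate of J = E[sum_t gamma^t g(x_t, u_t, eps_{t+1})]."""
    rng = random.Random(seed)
    x, J = x0, 0.0
    for t in range(horizon):
        u = policy(x)
        eps = rng.uniform(0.0, 40.0)  # disturbance eps_{t+1} ~ phi(.)
        J += (gamma ** t) * g(x, u, eps)
        x = step(x, u, eps)
    return J

# Illustrative immediate cost: squared deficit w.r.t. a demand of 20 units.
deficit_cost = lambda x, u, eps: max(20.0 - u, 0.0) ** 2
J = estimate_objective(deficit_cost)
```

In practice the expectation is taken over many disturbance realizations; a single long run is shown here only to keep the sketch short.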

Page 3: Universal approximators for Direct Policy Search in multi-purpose water reservoir management

SDP and the 3 curses

Stochastic Dynamic Programming is, in principle, the best approach to solve the problem; in practice, it suffers from 3 major shortcomings.

1)  Curse of dimensionality: computational cost grows exponentially with state, control and disturbance dimension [Bellman, 1967];

The unknown Q-function Q_t(x_t, u_t) is represented as a look-up table: computations are numerically performed on a discretized variable domain, and the optimal control u*_t is selected by enumeration over the discrete grid.
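The exponential growth is easy to quantify: with d discretization points per dimension, one SDP stage sweeps d^(n_x + n_u + n_eps) grid combinations. A back-of-the-envelope sketch (the dimensions are illustrative):

```python
def grid_points(d, n_x, n_u, n_eps):
    """Q-function evaluations per stage on a d-points-per-dimension grid."""
    return d ** (n_x + n_u + n_eps)

# 20 points per dimension:
small = grid_points(20, n_x=1, n_u=1, n_eps=1)  # 8,000 evaluations
large = grid_points(20, n_x=5, n_u=1, n_eps=2)  # 25.6 billion evaluations
```

Moving from a single-reservoir toy problem to a modest multi-variable system already pushes the sweep far beyond what is computable per stage.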

Page 4: Universal approximators for Direct Policy Search in multi-purpose water reservoir management

SDP and the 3 curses


2)  Curse of modelling: any variable considered among the operating rule’s arguments has to be modelled [Bertsekas and Tsitsiklis, 1996];

Models are used in a multiple one-step-ahead simulation mode: given the state x_t, the controls u_t and disturbances ε_{t+1} are enumerated to propagate the system from time t to t+1.

Page 5: Universal approximators for Direct Policy Search in multi-purpose water reservoir management

SDP and the 3 curses


3)  Curse of multiple objectives: computational cost grows exponentially with the number of objectives considered [Powell, 2011].

Multi-objective problems are solved by repeatedly solving single-objective problems, each run recovering one point of the PARETO frontier in the objective space (J1, J2, J3).

Page 6: Universal approximators for Direct Policy Search in multi-purpose water reservoir management

Beyond SDP: ADP and RL

Approximate Dynamic Programming and Reinforcement Learning provide a framework to overcome some or all of the SDP curses [Powell, 2007; Busoniu et al., 2011].

VALUE FUNCTION-BASED APPROACHES:

•  Approximate value iteration

•  Approximate policy iteration

•  Approximate policy evaluation

Model-free or model-based // parametric or non-parametric

POLICY SEARCH-BASED APPROACHES:

•  Direct policy search

Simulation-based optimization // parametric


Page 8: Universal approximators for Direct Policy Search in multi-purpose water reservoir management

Multi-objective Direct Policy Search (MODPS)

Assume the operating rule belongs to a given family of parameterized functions and search for the optimal solution in the policy's parameter space.

ORIGINAL PROBLEM:

$$\min_{\mu_t(\cdot)} \; \mathbf{J} = \left| J^1, J^2, \ldots, J^q \right|$$

subject to

$$x_{t+1} = f_t(x_t, u_t, \varepsilon_{t+1}) \qquad \varepsilon_{t+1} \sim \phi(\cdot) \qquad u_t = \mu_t(x_t)$$

with $x_t \in \mathbb{R}^{n_x}$, $u_t \in \mathbb{R}^{n_u}$, $\varepsilon_t \in \mathbb{R}^{n_\varepsilon}$.

POLICY SEARCH PROBLEM:

$$\min_{\theta} \; \mathbf{J}(\theta) = \left| J^1, J^2, \ldots, J^q \right|$$

subject to

$$x_{t+1} = f_t(x_t, u_t, \varepsilon_{t+1}) \qquad \varepsilon_{t+1} \sim \phi(\cdot) \qquad u_t = \mu_t(x_t, \theta_t)$$

with $x_t \in \mathbb{R}^{n_x}$, $u_t \in \mathbb{R}^{n_u}$, $\varepsilon_t \in \mathbb{R}^{n_\varepsilon}$, and policy parameters $\theta_t \in \Theta_t \subseteq \mathbb{R}^{n_\theta}$.
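The reformulation can be sketched end-to-end: pick a parameterized rule mu(x; theta), evaluate theta by simulation, and search the parameter space. The two-parameter linear rule, the toy dynamics, and the random-search optimizer below are all illustrative stand-ins (the study itself uses the Borg MOEA and richer ANN/RBF policies):

```python
import random

def mu(x, theta):
    """Hypothetical two-parameter linear rule, clipped at zero."""
    a, b = theta
    return max(a * x + b, 0.0)

def simulate(theta, x0=50.0, horizon=2000, seed=7):
    """Evaluate one objective (mean squared supply deficit) for theta.
    A fixed seed gives common random numbers across candidate policies."""
    rng = random.Random(seed)
    x, cost = x0, 0.0
    for _ in range(horizon):
        u = mu(x, theta)
        eps = rng.uniform(0.0, 40.0)
        cost += max(20.0 - u, 0.0) ** 2
        x = min(max(x - u + eps, 0.0), 100.0)
    return cost / horizon

def random_search(n_iters=200, seed=0):
    """Simulation-based optimization over the parameter space Theta."""
    rng = random.Random(seed)
    best_theta, best_J = None, float("inf")
    for _ in range(n_iters):
        theta = (rng.uniform(0.0, 1.0), rng.uniform(-10.0, 10.0))
        J = simulate(theta)
        if J < best_J:
            best_theta, best_J = theta, J
    return best_theta, best_J
```

Note that the search never needs the value function or a discretized grid: it only requires the ability to simulate the closed-loop system for a given theta.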

Page 9: Universal approximators for Direct Policy Search in multi-purpose water reservoir management

Selecting the policy approximation: Ad hoc / Empiricism

WHEN:
1. the system is already operated;
2. the system is simple (i.e. one reservoir) AND/OR the system has one single objective (e.g. water supply).

[Figure: piecewise-linear operating rule, release (m3/s) vs. storage (Mm3), parameterized by θ1, ..., θ5]

•  New York City rule [Clark, 1950]

•  Space rule [Clark, 1956]

•  Standard Operating Policy [Draper, 2004]

•  …..

Identify existing regularities in a sample of the operator's behaviour

Empirical rules identified in the past
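Of these, the Standard Operating Policy is easily stated: meet the demand when enough water is available, release everything when it is not, and spill what exceeds capacity. A minimal sketch with illustrative variable names:

```python
def sop_release(storage, inflow, demand, capacity):
    """Standard Operating Policy release for one time step."""
    available = storage + inflow
    if available <= demand:
        return available                # no hedging: release all you have
    release = demand
    if available - release > capacity:  # spill what the reservoir cannot hold
        release = available - capacity
    return release

# Water-short step: release everything available.
assert sop_release(storage=5.0, inflow=2.0, demand=10.0, capacity=50.0) == 7.0
# Normal step: meet the demand exactly.
assert sop_release(storage=30.0, inflow=5.0, demand=10.0, capacity=50.0) == 10.0
# Flood step: release more than the demand to stay within capacity.
assert sop_release(storage=48.0, inflow=20.0, demand=10.0, capacity=50.0) == 18.0
```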

Page 10: Universal approximators for Direct Policy Search in multi-purpose water reservoir management

Selecting the policy approximation: Universal Approx.

Provided that some conditions are met, a Universal Approximator can approximate arbitrarily closely any continuous function.

ARTIFICIAL NEURAL NETWORKS [Cybenko, 1989; Funahashi, 1989; Hornik et al., 1989]
Parameter dimension: n_θ = n_u (N (n_x + 2) + 1), with N the number of NEURONS.

GAUSSIAN RADIAL BASIS FUNCTIONS [Busoniu et al., 2011]
Parameter dimension: n_θ = N (2 n_x + n_u), with N the number of BASES.

[Diagram: both architectures map the inputs (x1, x2, x3) to the output u1]
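With the two parameter-dimension formulas above and the case-study dimensions used later (n_x = 5, n_u = 1), the size of the search space can be tabulated directly:

```python
def n_theta_ann(N, n_x, n_u):
    """ANN with N neurons: n_theta = n_u * (N * (n_x + 2) + 1)."""
    return n_u * (N * (n_x + 2) + 1)

def n_theta_rbf(N, n_x, n_u):
    """RBF with N bases: n_theta = N * (2 * n_x + n_u)."""
    return N * (2 * n_x + n_u)

for N in (4, 8, 16):
    print(N, n_theta_ann(N, 5, 1), n_theta_rbf(N, 5, 1))
# N=4:  ANN 29,  RBF 44
# N=8:  ANN 57,  RBF 88
# N=16: ANN 113, RBF 176
```

Both counts grow linearly in N, so doubling the architecture size doubles the dimension of the parameter space the optimizer must search.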

Page 11: Universal approximators for Direct Policy Search in multi-purpose water reservoir management

Selecting the optimization algorithm

Key problem features
•  High-dimensional search spaces (rich parameterizations)
•  Complex search spaces (many local minima)
•  Sensitivity to parameter initialization (no preconditioning)
•  Multiple objectives
•  Non-differentiable objective functions
•  Sensitivity to noise

Page 12: Universal approximators for Direct Policy Search in multi-purpose water reservoir management

Selecting the optimization algorithm

BORG [Hadka and Reed, 2012; Reed et al., 2013], a MULTI-OBJECTIVE EVOLUTIONARY ALGORITHM, is self-adaptive and employs:
•  multiple search operators, adaptively selected during the optimization
•  ε-dominance archiving with internal operators to detect search stagnation
•  randomized restarts to escape local optima
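The ε-dominance idea can be sketched as follows: the objective space is partitioned into boxes of side ε, and comparisons happen between boxes rather than raw points, which bounds the archive size and gives a built-in resolution. The helper below is an illustrative minimization-form sketch, not Borg's actual implementation:

```python
import math

def eps_box(objectives, eps=0.1):
    """Map an objective vector to its epsilon-box index (minimization)."""
    return tuple(math.floor(f / eps) for f in objectives)

def eps_dominates(a, b, eps=0.1):
    """True if a's box weakly dominates b's box in every objective."""
    box_a, box_b = eps_box(a, eps), eps_box(b, eps)
    return all(i <= j for i, j in zip(box_a, box_b)) and box_a != box_b

# A clearly better point epsilon-dominates a clearly worse one...
assert eps_dominates((0.05, 0.05), (0.95, 0.95))
# ...but two points in the same box are considered equivalent.
assert not eps_dominates((0.05, 0.05), (0.08, 0.09))
```

An archive that only admits points whose box is not dominated by any archived box keeps at most one representative per box, which is what lets the algorithm detect stagnation when no new boxes are filled.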

Page 13: Universal approximators for Direct Policy Search in multi-purpose water reservoir management

CASE STUDY

Page 14: Universal approximators for Direct Policy Search in multi-purpose water reservoir management

[Map: the Red-Thai Binh river system, with the Da, Thao and Lo rivers, the HoaBinh reservoir, Hanoi, and the surrounding countries (Vietnam, China, Laos, Thailand, Cambodia)]

Red-Thai Binh River System - Vietnam

Integrated Management of Red-Thai Binh Rivers System (IMRR) funded by the Italian Ministry of Foreign Affairs http://www.imrr.info/

Page 15: Universal approximators for Direct Policy Search in multi-purpose water reservoir management

Hoa Binh reservoir - Vietnam

Main characteristics

•  Catchment area 52,000 km2

•  Active capacity 6 x 109 m3

•  8 penstocks 2,360 m3/s (240 MW)

•  12 bottom gates 22,000 m3/s

•  6 spillways 14,000 m3/s

•  15% national energy (7,800 GWh)

source: IWRP2008

Operating objectives
•  Hydropower production

•  Flood control (Hanoi)

[Schematic: Da, Thao and Lo catchments, the HoaBinh reservoir and power plant, the diversion dam, and the consumptive use]

Page 16: Universal approximators for Direct Policy Search in multi-purpose water reservoir management

Experimental Setting: ANN vs RBF

STATE VECTOR (n_x = 5)
•  2 time indexes (sine, cosine)
•  storage
•  previous day inflow to the reservoir
•  previous day lateral inflow

CONTROL VECTOR (n_u = 1)
•  release from the reservoir
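The two time indexes are the standard cyclic encoding of the day of the year, which keeps 31 December and 1 January close in the policy's input space; a plain day counter would place them a full year apart. A sketch (the exact convention used in the study may differ):

```python
import math

def time_indexes(day_of_year, days_per_year=365):
    """Encode the day of the year as a point on the unit circle."""
    angle = 2.0 * math.pi * day_of_year / days_per_year
    return math.sin(angle), math.cos(angle)

s_end, c_end = time_indexes(364)
s_start, c_start = time_indexes(0)
# Euclidean distance between year-end and year-start encodings is small:
dist = math.hypot(s_end - s_start, c_end - c_start)
assert dist < 0.05
```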


ALGORITHM SETTING and RUNNING

•  Default Borg MOEA parameterization [Hadka and Reed 2013]

•  NFE = 500,000 per replication

•  20 replications to avoid dependence on randomness

•  Historical horizon 1962-1969, which comprises normal, wet and dry years

Page 17: Universal approximators for Direct Policy Search in multi-purpose water reservoir management

Policy performance – operating objectives

[Figure 2: Policy performance (Jhyd, hydropower in kWh/d, vs. Jflo, floods in cm2/d) obtained with different ANN and RBF structures (a), and evaluation of the associated Pareto fronts in terms of generational distance (b), additive ε-indicator (c), and hypervolume (d), for 4 to 16 neurons/bases. Solid bars represent the best performance across the multiple runs, transparent ones the average performance for each policy architecture.]
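Of the three metrics, hypervolume can be sketched compactly for the two-objective minimization case: it is the area dominated by the front and bounded by a reference point, so larger values indicate a front that is both closer to the ideal point and better spread. An illustrative implementation:

```python
def hypervolume_2d(front, ref):
    """Hypervolume of a nondominated 2-objective front (minimization),
    computed by sweeping the points in ascending f1 (descending f2)."""
    pts = sorted(front)
    area, prev_f2 = 0.0, ref[1]
    for f1, f2 in pts:
        area += (ref[0] - f1) * (prev_f2 - f2)  # slab dominated by this point
        prev_f2 = f2
    return area

front = [(1.0, 4.0), (2.0, 2.0), (4.0, 1.0)]
print(hypervolume_2d(front, ref=(5.0, 5.0)))  # 11.0
```

Generational distance and the additive ε-indicator instead compare the front against a reference set; all three are reported here normalized so that values in [0, 1] are comparable across architectures.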

Page 18: Universal approximators for Direct Policy Search in multi-purpose water reservoir management

Policy performance – front approximation quality

[Figure 2, repeated from the previous page: the three metrics measure CONVERGENCE (generational distance), CONSISTENCY (additive ε-indicator), and DIVERSITY (hypervolume).]

Page 19: Universal approximators for Direct Policy Search in multi-purpose water reservoir management

Policy reliability

[Figure 3: Probability of attainment with a threshold equal to 75% (a) and to 95% (b) of the best metric values for different ANN (blue bars) and RBF (red bars) architectures, in terms of number of neurons/bases (4 to 16), for generational distance (CONVERGENCE), additive ε-indicator (CONSISTENCY), and hypervolume (DIVERSITY).]

Page 20: Universal approximators for Direct Policy Search in multi-purpose water reservoir management

Run time search dynamics (NFE = 2M)

[Figure 4: Analysis of runtime search dynamics for ANN (6 neurons, red lines) and RBF (6 bases, blue lines) operating policy optimization in terms of generational distance (a), additive ε-indicator (b), and hypervolume (c), versus NFE (x10^6). The three panels measure CONVERGENCE, CONSISTENCY, and DIVERSITY.]

Page 21: Universal approximators for Direct Policy Search in multi-purpose water reservoir management

Policy validation

[Figure 5: Comparison of ANN and RBF policy performance (Jhyd, hydropower in kWh/d, vs. Jflo, floods in cm2/d) over the optimization horizon 1962-1969 (a) and the validation horizon 1995-2004 (b).]

Page 22: Universal approximators for Direct Policy Search in multi-purpose water reservoir management

Conclusions

§  MODPS is an interesting alternative to SDP-family methods for a number of good reasons:

1.  No discretization required: NO curse of dimensionality;

2.  Does not require separability in time of constraints and objective functions (e.g. duration curves): NO curse of dimensionality;

3.  Can easily include any model-free information as long as this is control-independent: NO curse of modelling;

4.  Can be combined with any simulation model (also high fidelity ones): NO curse of modelling;

5.  Can be easily combined with truly multi-objective optimization algorithms: NO curse of multiple objectives.

Page 23: Universal approximators for Direct Policy Search in multi-purpose water reservoir management

Conclusions

§  RBFs and ANNs seem to perform comparably well when evaluated in terms of policy performance

§  RBFs outperform ANNs in terms of quality of the Pareto front approximation, reliability, and runtime search dynamics

§  Future work will focus on exploring multiple-output policies (e.g. networks of reservoirs)

Page 24: Universal approximators for Direct Policy Search in multi-purpose water reservoir management

THANKS