
Event-Based Stochastic

Learning and Optimization

Xi-Ren Cao

Hong Kong University of Science and Technology

Example 1: Admission Control in Communication Networks

[Figure: an open queueing network with arrival rate λ; an arriving customer is admitted with probability b(n) and rejected with probability 1−b(n); admitted customers enter server i with probability q_{0i} and route between servers with probabilities q_{ij}.]

n: total number of customers in the network
n_i: number of customers at server i
n = (n_1, …, n_M): the state
N: capacity

How do we choose the admission probability b(n)?

The Problem

- Non-standard: not a Markov Decision Process (MDP)
  - The action depends only on the population n; different states n with the same population must take the same action
- Not a Partially Observable MDP (POMDP): when a customer arrives, we know something about the next state (the population will not decrease)
- Additional new features
  - Control is applied only when a customer arrives
  - Relatively small policy space: population n → a rather than state n → a
- New formulation and approach
  - Event-based and sensitivity view, not the traditional MDP-with-constraints formulation
  - Generally applicable to many other problems

[Overview diagram: a sensitivity view of learning and optimization connects perturbation analysis (i. queueing systems, ii. Markov systems), Markov decision processes (policy iteration), and event-based optimization (i. PA, ii. policy iteration), leading to applications.]

Table of Contents

- Formulation of event-based optimization
  - Formulating events
  - Event-based optimization
- Introduction to a sensitivity-based view of learning and optimization
  - The problem
  - Two fundamental sensitivity-based approaches
  - Limitations of state-based approaches
- Solutions to event-based optimization
- Examples
- Summary and discussion

Event-Based Approaches

- X = {X_n, n = 1, 2, …}, ergodic, X_n in S = {1, 2, …, M}
- Transition probability matrix P = [p(i,j)], i, j = 1, …, M
- Steady-state probability vector π = (π(1), π(2), …, π(M))
- Performance function f = (f(1), …, f(M))^T
- Steady-state performance: η = πf = Σ_i π(i)f(i)

Pe = e, π(I − P) = 0, πe = 1, where e = (1, 1, …, 1)^T and I is the identity matrix. (A small numerical sketch follows.)
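As a quick numerical illustration (not from the slides), here is a minimal Python sketch that computes π and η = πf for a hypothetical 3-state chain like the one in the figure below; the transition probabilities and f are made up.

```python
# Minimal sketch (hypothetical numbers): steady-state probability pi and
# average performance eta = pi f for a small ergodic Markov chain.
import numpy as np

P = np.array([[0.0, 0.5, 0.5],      # assumed 3-state transition matrix
              [0.3, 0.0, 0.7],
              [0.6, 0.4, 0.0]])
f = np.array([1.0, 2.0, 3.0])        # assumed performance function

# Solve pi(I - P) = 0 with pi e = 1: take the left eigenvector of P for
# eigenvalue 1 and normalize it.
eigvals, eigvecs = np.linalg.eig(P.T)
pi = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
pi = pi / pi.sum()

eta = pi @ f                          # eta = sum_i pi(i) f(i)
print("pi =", pi, " eta =", eta)
```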

The Markov Model

[Figure: a three-state Markov chain with transition probabilities p(1,2), p(2,1), p(2,3), p(3,2), p(3,1), p(1,3).]

Formulating Events


An event may contain information about both the current state and the next state after the transition.

A customer arrival: the population is not reduced after the transition.

A transition: <i, j>. An event: a set of transitions (see the sketch below).

For n = (n_1, …, n_M), write n_{+i} = (n_1, …, n_i + 1, …, n_M) and n_{-i} = (n_1, …, n_i − 1, …, n_M).

An arriving customer is accepted: a_+ = {<n, n_{+i}>, all n and i}
A customer departs: d_- = {<n, n_{-i}>, all n and i}
An arriving customer is rejected: a_- = {<n, n>, all n}
A customer arrival: a = a_+ ∪ a_-
An arriving customer finds population n: the event a_n (defined formally below)
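To make the "event = set of transitions" idea concrete, here is a small illustrative Python sketch (not from the slides) that enumerates the events a_+, a_-, and d_- for a hypothetical network with M = 2 servers and capacity N = 2.

```python
# Illustrative sketch: events as sets of transitions <n, n'> for a tiny
# hypothetical network (M = 2 servers, capacity N = 2).
from itertools import product

M, N = 2, 2
states = [n for n in product(range(N + 1), repeat=M) if sum(n) <= N]

def plus(n, i):    # n_{+i}: one more customer at server i
    return tuple(c + 1 if k == i else c for k, c in enumerate(n))

def minus(n, i):   # n_{-i}: one less customer at server i
    return tuple(c - 1 if k == i else c for k, c in enumerate(n))

# a_plus: an arriving customer is accepted and joins some server i
a_plus  = {(n, plus(n, i))  for n in states for i in range(M) if sum(n) < N}
# a_minus: an arriving customer is rejected (the state does not change)
a_minus = {(n, n) for n in states}
# d_minus: a customer departs from some server i
d_minus = {(n, minus(n, i)) for n in states for i in range(M) if n[i] > 0}
# a: a customer arrival (accepted or rejected)
a = a_plus | a_minus
```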

Three Types of Events

Observable Event:

1. We may observe the event
2. It contains some information

[Figure: state-transition diagram of a network with 3 servers and 2 customers; the states n = (n_1, n_2, n_3) are grouped by population n = 0, 1, 2; arrows mark arrival events a_1 (an arrival finds population 1) and departure events d.]

Controllable event: an arriving customer is accepted: a_{n,+} = {<n, n_{+i}>, all i and all n in S_n}, where S_n = {n : Σ_i n_i = n}.

An arriving customer is rejected: a_{n,-} = {<n, n>, n in S_n}.

We can control the probabilities of these events: b(n) = p(a_{n,+} | a_n).

[Figure: the same state-transition diagram with the arrival event a_1 split into the controllable events a_{1,+} (accept) and a_{1,-} (reject); a_1 = a_{1,+} ∪ a_{1,-}.]

Natural transition event: a customer enters server i: a_{n,+i} = {<n, n_{+i}>, n in S_n}.

Nature determines this transition randomly.

[Figure: the same diagram with the accepted-arrival event a_{1,+} further split into the natural transition events a_{1,+1}, a_{1,+2}, a_{1,+3} (the customer enters server 1, 2, or 3).]

Logical Relations in Events

The three events, observable, controllable, and natural transition, happen simultaneously in time but have a logical order:

[Flowchart: observable event: is this an arrival? If no, do nothing. If yes, observe the number of customers n and determine b(n). Controllable event (the action): accept with probability b(n) or reject with probability 1−b(n). If rejected, the customer leaves; if accepted, the natural transition event follows and the customer joins some server j.]

→ The state-dependent Markov model does not fit this structure well.

Event-Based Optimization

Underlying Markov process {X_n, n = 0, 1, …}.

At time n, we observe an observable event, e.g., a_n (not the state X_n).

We determine an action L(a_n), which controls the probabilities of the controllable events.

A controllable event occurs: a_{n,+} or a_{n,-}.

A natural transition event follows, which determines the state transition.

Goal: find a policy L that optimizes the performance. (A simulation sketch follows.)
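The sequence above can be read directly as a simulation loop. The following is a minimal sketch under simplifying assumptions (a uniformized discrete-time model, internal routing q_{ij} omitted for brevity); the rates, routing probabilities q_{0i}, and policy b(n) below are hypothetical.

```python
# Minimal event-driven sketch (hypothetical parameters, uniformized in
# discrete time; internal routing between servers is omitted for brevity).
import random

lam, mu = 0.3, 0.2                 # arrival prob. and per-server service prob.
M, N = 3, 5                        # number of servers and network capacity
q0 = [0.5, 0.3, 0.2]               # routing probabilities q_{0i}
b = [1.0, 1.0, 0.8, 0.5, 0.2]      # event-based policy: b(n), n = 0, ..., N-1

n = [0] * M                        # n_i: number of customers at server i
for _ in range(100_000):
    if random.random() < lam:                      # observable event: an arrival...
        pop = sum(n)                               # ...finds population pop
        if pop < N and random.random() < b[pop]:   # controllable event: accept
            i = random.choices(range(M), weights=q0)[0]
            n[i] += 1                              # natural transition: join server i
        # otherwise the arriving customer is rejected and leaves
    for i in range(M):                             # each busy server may finish a job
        if n[i] > 0 and random.random() < mu:
            n[i] -= 1
```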

Partially Observable MDP (POMDP)

Underlying Markov process {Xn , n=0,1,…}

At time n, we observe a random variable Yn (not Xn )

Determine an action L(Y_0, …, Y_n), which controls the state transition probabilities.

The state transition follows.

Goal: find a policy L to optimize the performance.

How to Find a Solution?

A Sensitivity-Based View of Learning and Optimization

The System

[Diagram: a dynamic system with internal state x, inputs a_l, l = 0, 1, … (actions), outputs y_l, l = 0, 1, … (observations), and system performance η.]

Policy: action = function of information, a = L(y).

L may depend on parameters θ.

Optimization: determine a policy L that maximizes η^L.

Learning and Optimization

Learning:

1. Observing and analyzing sample paths (with or without estimating system parameters)
2. Studying the behavior of a system under one policy to learn how to improve system performance

Difficulties:

1. The policy space is too large (# states: N, # actions: M, # policies: M^N; see the example below)
2. The system is too complicated; its structure and parameters are unknown
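To give a rough sense of scale (a hypothetical illustration): even a modest system with N = 100 states and only M = 2 actions per state has M^N = 2^100 ≈ 1.3 × 10^30 deterministic policies, far too many to search exhaustively.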

A Philosophical Boundary

- We can study or observe only one policy at a time.
- By studying or observing the behavior of one policy, in general we cannot obtain the performance of other policies.
- So we can do very little!
- Two settings: continuous parameters vs. a discrete policy space.

Two Basic Approaches

(policy gradient) θ → θ + Δθ: dη/dθ = πQg (for P_θ = P + θQ)

(policy iteration) P → P': η' − η = π'Qg (with Q = P' − P)

We can only start with these two sensitivity formulas!

Performance Difference Formula

For two Markov chains (P, f) and (P', f) with performance η, η' and steady-state probabilities π, π', let Q = P' − P.

Poisson equation and performance potentials g:

    (I − P + eπ)g = f

Multiplying by π: πg = πf = η. For the other chain, η' = π'f.

Multiplying by π': η' − η = π'(P' − P)g = π'Qg.

The potentials g can be estimated from a sample path of (P, f):

    g(i) = E{ Σ_{k=0}^{∞} [f(X_k) − η] | X_0 = i }

(A sample-path estimator sketch follows.)
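The potential formula above suggests a straightforward sample-path estimator. Below is a minimal Python sketch (not from the slides) that truncates the infinite sum at a hypothetical horizon L; the function and argument names are illustrative.

```python
# Minimal sketch: estimate the potentials g(i) from one sample path of (P, f)
# by truncating the sum at a horizon L (a hypothetical choice).
import numpy as np

def estimate_potentials(path, f, L=50):
    """path: visited states X_0, X_1, ...; f: reward per state (array)."""
    path = np.asarray(path)
    rewards = np.asarray([f[x] for x in path], dtype=float)
    eta = rewards.mean()                        # estimate of eta = pi f
    n_states = int(path.max()) + 1
    sums = np.zeros(n_states)
    counts = np.zeros(n_states)
    for t in range(len(path) - L):
        i = path[t]
        sums[i] += np.sum(rewards[t:t + L] - eta)   # sum_{k=0}^{L-1} [f(X_k) - eta]
        counts[i] += 1
    return sums / np.maximum(counts, 1.0)       # average over visits to each state i
```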

Policy Iteration

η' − η = π'Qg = π'(P' − P)g

1. η' > η if P'g > Pg (componentwise)

2. Policy iteration: at every state, find a policy P' with P'g > Pg (a sketch follows)

3. Multi-chain (average performance): a simple, intuitive approach without discounting
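A minimal sketch of potential-based policy iteration (illustrative, not the slides' algorithm verbatim): each iteration solves the Poisson equation for the current policy and then improves greedily. With action-dependent rewards the comparison is f + Pg; if f is the same for all actions this reduces to comparing P'g with Pg as above. All inputs are hypothetical.

```python
# Illustrative sketch: policy iteration for a small average-reward MDP using
# performance potentials g. P_a[a][i] is the transition row at state i under
# action a; f_a[a][i] is the corresponding reward.
import numpy as np

def evaluate(P, f):
    """Return (pi, eta, g) for one policy: pi(I-P)=0, eta=pi f, (I-P+e pi)g=f."""
    m = len(f)
    eigvals, eigvecs = np.linalg.eig(P.T)
    pi = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
    pi = pi / pi.sum()
    g = np.linalg.solve(np.eye(m) - P + np.outer(np.ones(m), pi), f)
    return pi, pi @ f, g

def policy_iteration(P_a, f_a, max_iter=50):
    n_actions, m = len(P_a), len(f_a[0])
    policy = np.zeros(m, dtype=int)
    eta = None
    for _ in range(max_iter):
        P = np.array([P_a[policy[i]][i] for i in range(m)])
        f = np.array([f_a[policy[i]][i] for i in range(m)])
        _, eta, g = evaluate(P, f)
        # Improvement: at each state pick the action maximizing f(i,a) + P_a(i,.) g
        new = np.array([max(range(n_actions),
                            key=lambda a: f_a[a][i] + P_a[a][i] @ g)
                        for i in range(m)], dtype=int)
        if np.array_equal(new, policy):
            break
        policy = new
    return policy, eta
```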

Limitations of State-Based Approaches

- Curse of dimensionality: the state space is too large
- Independent actions: actions at different states must be chosen independently
- Structural properties are lost

Event-Based Optimization

Event-based policy: a_n → b(n)

Problem: find a policy that maximizes the steady-state performance.

Method: start with the performance difference formula (PDF).

How to derive the PDF? By a construction method with the performance potentials as building blocks.

Two policies: b(n) and b'(n).

Potential aggregation, where π(n) is the probability of the event a_n (an arrival finds population n):

    d(n) = Σ_{n in S_n} p(n | a_n) { Σ_{i=1}^{M} q_{0i} [ g(n_{+i}) − g(n) ] }

where p(n | a_n) is the conditional steady-state probability of state n given that an arrival finds population n.

1. Performance difference formula (PDF):

    η' − η = Σ_{n=0}^{N−1} π'(n) [ b'(n) − b(n) ] d(n)

2. Performance derivative (for a policy b_θ(n) depending on a parameter θ):

    dη/dθ = Σ_{n=0}^{N−1} π(n) [ d b_θ(n)/dθ ] d(n)

1 → policy iteration: reduced computation, only N aggregated potentials d(n) (linear in system size), even though the same action must be applied at different states.

2 → gradient-based optimization (perturbation analysis).


3. Learning and on-line implementation: d(n) can be estimated with the same amount of computation as g(n)! (A sketch follows.)
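As a sketch of how items 1 and 3 might be implemented (illustrative only; the function and variable names are hypothetical, and the potentials g, conditional probabilities p(n | a_n), and routing probabilities q_{0i} are assumed to be available from estimation or computation):

```python
# Illustrative sketch: aggregate the potentials into d(n) and improve the
# event-based policy b(n) using the performance difference formula.
def plus(n, i):                     # n_{+i}: one more customer at server i
    return tuple(c + 1 if k == i else c for k, c in enumerate(n))

def aggregate_potential(pop, states_with_pop, g, cond_prob, q0):
    """d(pop) = sum_{n in S_pop} p(n | a_pop) * sum_i q0[i] [g(n_{+i}) - g(n)]."""
    d = 0.0
    for n in states_with_pop:       # all states n with population pop
        inner = sum(q0[i] * (g[plus(n, i)] - g[n]) for i in range(len(q0)))
        d += cond_prob[n] * inner
    return d

def improve_policy(b, d, b_min=0.0, b_max=1.0):
    # eta' - eta = sum_n pi'(n) [b'(n) - b(n)] d(n): making every term
    # non-negative (raise b(n) where d(n) > 0, lower it where d(n) < 0)
    # cannot decrease the performance.
    return [b_max if d[n] > 0 else b_min for n in range(len(b))]
```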

Summary:

Why?
- The action is taken when an event happens
- The same event may happen at different states, i.e., the actions may be the same for different states

How?
- Event-based policy
- Performance difference formula for event-based policies
- Utilize the system structure (aggregation)

Benefit?
- Event-based policy iteration becomes possible
- Reduced computation (e.g., linear in system size)
- Intuitive, with a clear physical meaning

Other Examples

Example 1: Admission Control in Communication Networks

[Figure: the open queueing network from Example 1, with arrival rate λ, admission probability b(n), rejection probability 1−b(n), and routing probabilities q_{0i}, q_{ij}, q_{i0}.]

n: total number of customers in the network
n_i: number of customers at server i
n = (n_1, …, n_M): the state
N: capacity

Event: a customer arrives and finds population n

How do we choose the admission probability b(n)?


Example 2: Manufacturing

M products: i = 1, 2, …, M. The operation on product i consists of N_i phases, j = 1, …, N_i.

State: (i, j)

Two-level hierarchical control: when a product i is finished, how do we choose the next product i' (with probability c(i, i'))?

Event: product i is finished

Example 3: Control with Lebesgue Sampling

[Figure: Riemann sampling samples the state x(t), with performance f(x), at fixed time instants; Lebesgue sampling samples when x(t) crosses one of the levels x_1, x_2, x_3, x_4.]

LSC is more efficient than RSC (Åström et al., for X = {x_1}).

Riemann sampling: X = (−∞, +∞); Lebesgue sampling: X = {x_1, x_2, x_3, x_4}.

How do we develop an optimal control policy for LSC with X = {x_1, …}?

Event: the system reaches a state in X = {x_1, …}. (A small sampling sketch follows.)
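A small illustrative sketch of the two sampling schemes (the signal and levels are hypothetical): Riemann sampling records x(t) at fixed time steps, while Lebesgue sampling generates an event only when x(t) crosses a level in X.

```python
# Illustrative sketch: Riemann sampling vs. Lebesgue sampling of a signal.
import numpy as np

t = np.linspace(0.0, 10.0, 2001)
x = np.sin(t) + 0.1 * t                          # hypothetical state trajectory

riemann_samples = list(zip(t[::200], x[::200]))  # fixed sampling period

levels = [-0.5, 0.0, 0.5, 1.0]                   # X = {x1, x2, x3, x4}
lebesgue_events = []
for k in range(1, len(x)):
    for lv in levels:
        if (x[k - 1] - lv) * (x[k] - lv) < 0:    # x(t) crossed the level lv
            lebesgue_events.append((t[k], lv))   # event: state reaches a point of X
```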

Example 4: Large Communication Networks

[Figure: a large communication network decomposed into Subnet 1, Subnet 2, Subnet 3, and Subnet 4.]

Conclusion and Discussion

- Learning and optimization are based on two performance sensitivity formulas
- They lead to two fundamental approaches: gradient-based (PA) and policy-iteration-based
- The state-based model has limitations: the state space is too large, actions must be independent across states, and structure is not exploited
- Event-based optimization:
  - An event is a set of transitions, an extension of a set of states
  - Three types of events, logically related, carrying structural properties
  - The solution is based on the performance difference formula (PDF)
  - Computation is reduced, possibly to linear in system size
  - On-line implementation
  - Widely applicable (examples: some do not fit the standard Markov model, others utilize special features)

Thank You!
