
Page 1: Testing Stochastic Processes  Through  Reinforcement Learning

1

Testing Stochastic Processes Through Reinforcement Learning

François Laviolette, Sami Zhioua, Josée Desharnais

NIPS Workshop, December 9th, 2006

Page 2: Testing Stochastic Processes  Through  Reinforcement Learning

2

Outline

• Program Verification Problem

• The Approach for trace-equivalence

• Other equivalences

• Application on MDPs

• Conclusion

Page 3: Testing Stochastic Processes  Through  Reinforcement Learning

3

Stochastic Program Verification

Specification (LMP): an MDP without rewards

Implementation

[Figure: an LMP with states s0–s6 and labelled probabilistic transitions such as a[0.5], a[0.3], b[0.9], c.]

How far is the Implementation from the Specification?

(A distance or divergence.)

The Specification model is available.

The Implementation is available only for interaction (no model).
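To make the setting concrete, here is a minimal Python sketch (not from the slides) of the two objects involved: the Specification as an explicit LMP, and the Implementation as a black box that can only be reset and probed. All class and method names are hypothetical.

```python
import random

class SpecLMP:
    """Hypothetical explicit model of the Specification: a labelled Markov
    process given as state -> action -> list of (next_state, probability).
    Probabilities for an action may sum to less than 1; the remainder is
    the chance that the action is refused (the button does not go down)."""
    def __init__(self, transitions, initial="s0"):
        self.transitions = transitions
        self.initial = initial

class BlackBoxImpl:
    """Hypothetical wrapper around the Implementation: no model is visible,
    the tester can only reset it and try to execute actions."""
    def __init__(self, transitions, initial="s0"):
        self._lmp = SpecLMP(transitions, initial)   # hidden from the tester
        self.reset()

    def reset(self):
        self._state = self._lmp.initial

    def try_action(self, action):
        """Push a button: returns True if the action is accepted (and moves
        to a sampled next state), False if it is refused."""
        choices = self._lmp.transitions.get(self._state, {}).get(action, [])
        r, acc = random.random(), 0.0
        for nxt, p in choices:
            acc += p
            if r < acc:
                self._state = nxt
                return True
        return False

# Toy specification resembling the slide's example (states s0..s6).
spec = SpecLMP({
    "s0": {"a": [("s1", 0.5), ("s2", 0.3)]},
    "s1": {"b": [("s3", 0.9)], "c": [("s4", 1.0)]},
    "s2": {"b": [("s5", 0.9)], "c": [("s6", 1.0)]},
})
```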

Page 4: Testing Stochastic Processes  Through  Reinforcement Learning

4

1. Non-deterministic trace equivalence

[Figure: two non-deterministic processes P and Q, drawn as trees labelled with actions a, b, c.]

Trace Equivalence

Two systems are trace equivalent iff they accept the same set of traces

T(P) = {a, aa, aac, ac, b, ba, bab, c, cb, cc}

T(Q) = {a, ab, ac, abc, abca, ba, bab, c, ca}

2. Probabilistic trace equivalence

Two systems are trace equivalent iff they accept the same set of traces, with the same probabilities.

[Figure: probabilistic processes P and Q with transitions such as a[2/3], a[1/3], b[2/3], a[1/4], a[3/4], b[1/2], c[1/2].]

Trace probabilities for P: a → 7/12, aa → 5/12, aac → 1/6, bc → 2/3

Trace probabilities for Q: a → 1, aa → 1/2, aac → 0, bc → 0
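When both models are known, probabilistic trace equivalence can be checked directly. A small sketch, reusing the hypothetical SpecLMP structure introduced earlier, computes the probability that a given trace is accepted:

```python
def trace_probability(lmp, trace, state=None):
    """Probability that the LMP accepts the whole trace (a list of actions)
    starting from `state`, summing over all runs of the trace."""
    if state is None:
        state = lmp.initial
    if not trace:
        return 1.0
    first, rest = trace[0], trace[1:]
    total = 0.0
    for nxt, p in lmp.transitions.get(state, {}).get(first, []):
        total += p * trace_probability(lmp, rest, nxt)
    return total

# Two finite processes are probabilistically trace equivalent iff
# trace_probability(P, t) == trace_probability(Q, t) for every trace t.
print(trace_probability(spec, ["a", "b"]))
```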

Page 5: Testing Stochastic Processes  Through  Reinforcement Learning

5

Testing (Trace Equivalence)

The system is a black box.

When a button is pushed (action execution), either the button goes down (a transition occurs) or it does not go down (no transition).

Grammar (trace equivalence):

t ::= ω | a.t

Observations: when a test t is executed, several observations are possible; O_t denotes their set.

[Figure: example process with transitions a[0.2] and a[0.5] from s0 and b[0.7] from s3.]

Example (t = a.b): O_t contains three observations: a is refused; a succeeds and then b is refused; a succeeds and then b succeeds, with probabilities 0.3, 0.56 and 0.14 respectively.

[Figure: the black box with one button per action a, b, …, z.]
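On a black box, the distribution over observations O_t can only be estimated by running the test repeatedly. A sketch, again using the hypothetical BlackBoxImpl interface from above:

```python
from collections import Counter

def run_test(box, test):
    """Execute a sequential test t = a1.a2...an on the black box once.
    Returns the observation: the prefix that was executed, with a success
    or failure mark on each action."""
    box.reset()
    obs = []
    for action in test:
        if box.try_action(action):
            obs.append(action + "+")
        else:
            obs.append(action + "-")
            break
    return " ".join(obs)

def estimate_observation_distribution(box, test, n=10000):
    """Empirical distribution over O_t from n independent runs."""
    counts = Counter(run_test(box, test) for _ in range(n))
    return {o: c / n for o, c in counts.items()}

# For t = a.b the observations 'a-', 'a+ b-', 'a+ b+' should appear with
# frequencies close to the probabilities on the slide (0.3, 0.56, 0.14).
```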

Page 6: Testing Stochastic Processes  Through  Reinforcement Learning

6

Outline

• Program Verification Problem

• The Approach for trace-equivalence

• Other equivalences

• Application on MDPs

• Conclusion

Page 7: Testing Stochastic Processes  Through  Reinforcement Learning

7

Why Reinforcement Learning?

[Figure: an LMP with states s0–s8 and probabilistic transitions (a[0.2], a[0.5], a[0.3], a[0.7], b[0.7], b[0.9], a, b), and the MDP it induces.]

Reinforcement Learning is particularly efficient in the absence of the full model.


Reinforcement Learning can deal with bigger systems.

Analogy:

LMP ↔ MDP

Trace ↔ Policy

Divergence ↔ Optimal Value (V*)

Page 8: Testing Stochastic Processes  Through  Reinforcement Learning

8

A Stochastic Game towards RL

[Figure: example sequences of Success/Failure observations produced by the game, together with their rewards.]

[Figure: the Implementation, the Specification, and a clone of the Specification, each an LMP over states s0–s10 with transitions such as a[0.2], a[0.3], a[0.5], b[0.3], b[0.7], b[0.9], c[0.2], c[0.4], c[0.7], c[0.8].]

Reward +1 when the Implementation is distinguished from the Specification.

Reward −1 when the Specification is distinguished from its clone.
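The game can be sketched as follows; this is one reading of the slide's reward scheme (+1 for a difference between Implementation and Specification, −1 for a difference between the Specification and its clone), using the hypothetical black-box interface from earlier:

```python
def play_episode(impl, spec, clone, policy, max_steps=10):
    """One episode of the game, under the reading above: the same action is
    tried on the Implementation, the Specification and a clone of the
    Specification.  A difference Impl-vs-Spec pays +1, a difference
    Spec-vs-Clone pays -1, so a policy that only exploits statistical noise
    earns 0 on average."""
    for box in (impl, spec, clone):
        box.reset()
    total = 0.0
    for _ in range(max_steps):
        action = policy()                     # e.g. a random choice of button
        o_impl  = impl.try_action(action)
        o_spec  = spec.try_action(action)
        o_clone = clone.try_action(action)
        if o_impl != o_spec:
            total += 1.0
        if o_spec != o_clone:
            total -= 1.0
    return total
```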

Page 9: Testing Stochastic Processes  Through  Reinforcement Learning

9

MDP Definition

MDP induced by the Specification LMP: its states, actions, and next-state probability distributions.

[Figure: the Implementation and Specification LMPs and the MDP they induce, with states s0–s10, actions a, b, c, transition probabilities, and an absorbing Dead state.]

Page 10: Testing Stochastic Processes  Through  Reinforcement Learning

10

Divergence Computation

[Figure: observation sequences (Success/Failure) and the associated rewards +1, 0, −1.]

V*(s0) = 0: Equivalent;  V*(s0) = 1: Different.

[Figure: the Implementation and Specification LMPs and the induced MDP (states s0–s10, actions a, b, c, transition probabilities, Dead state), repeated from the previous slide.]

Page 11: Testing Stochastic Processes  Through  Reinforcement Learning

11

Symmetry Problem

[Figure: symmetric Success/Failure observation sequences on the Implementation and the Specification, rewarded +1 and −1.]

Create two variants of each action a: a success variant and a failure variant.

[Figure: three one-transition processes, one with a[1] and the Specification and its clone each with a[0.5].]

At each step, the agent selects an action and makes a prediction (success or failure), then executes the action. If the prediction matches the observation, the reward is computed and given; if it does not, the reward is 0.

With a[0.5] and a uniformly random prediction, the probability of earning the reward is 0 × 0.5 × 0.5 + 1 × 0.5 × 0.5 = 0.25 on the Specification, and likewise 0.25 on its clone.
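A sketch of the prediction mechanism; the exact reward formula is not shown on the slide and is abstracted as a callback here:

```python
import random

def predicted_step(box, action, compute_reward):
    """Prediction trick: the agent commits to a prediction ('success' or
    'failure') before pushing the button.  If the prediction matches the
    observation, a reward is computed (by whatever rule the game uses,
    abstracted as compute_reward); otherwise the reward is 0.  Guessing at
    random therefore earns the reward only half of the time on *both*
    processes, which removes the symmetry problem."""
    prediction = random.choice(["success", "failure"])
    observed = "success" if box.try_action(action) else "failure"
    if observed == prediction:
        return compute_reward(action, observed)
    return 0.0

# Slide example: with a[0.5] and a unit reward, the expected payoff of a
# random prediction is 0*0.5*0.5 + 1*0.5*0.5 = 0.25.
```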

Page 12: Testing Stochastic Processes  Through  Reinforcement Learning

12

The Divergence (with the symmetry problem fixed)

Theorem. Let "Spec" and "Impl" be two LMPs, and M their induced MDP.

V*(s0) ≥ 0, and

V*(s0) = 0 iff "Spec" and "Impl" are trace-equivalent.

Page 13: Testing Stochastic Processes  Through  Reinforcement Learning

13

Implementation and PAC Guarantee

There exists a PAC guarantee for the Q-Learning algorithm, but it is involved; Fiechter's algorithm has a simpler PAC guarantee.

Besides, a lower bound can be obtained from the Hoeffding inequality.

Implementation:

RL algorithm: Q-Learning

γ = 0.8

Action selection: softmax (temperature decreasing from 0.8 to 0.01)

Learning rate α decreasing according to the function 1/x

PAC guarantee:
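A minimal tabular Q-Learning loop matching the settings listed above (γ = 0.8, softmax exploration with a decaying temperature, a 1/x learning-rate schedule); the env interface (reset, step) is an assumption for the sketch, not something the slides define:

```python
import math, random
from collections import defaultdict

def softmax_choice(q_values, actions, temperature):
    """Pick an action with probability proportional to exp(Q / temperature)."""
    best = max(q_values[a] for a in actions)
    weights = [math.exp((q_values[a] - best) / temperature) for a in actions]
    return random.choices(actions, weights=weights)[0]

def q_learning(env, actions, episodes=5000, gamma=0.8):
    """Tabular Q-Learning sketch.  `env` is assumed to expose reset() -> state
    and step(action) -> (next_state, reward, done)."""
    Q = defaultdict(float)
    visits = defaultdict(int)
    for episode in range(1, episodes + 1):
        # Temperature decays from about 0.8 towards 0.01 over the run.
        temperature = max(0.01, 0.8 * (1 - episode / episodes))
        state, done = env.reset(), False
        while not done:
            action = softmax_choice({a: Q[(state, a)] for a in actions},
                                    actions, temperature)
            next_state, reward, done = env.step(action)
            visits[(state, action)] += 1
            alpha = 1.0 / visits[(state, action)]      # the 1/x schedule
            target = reward if done else reward + gamma * max(
                Q[(next_state, a)] for a in actions)
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state = next_state
    return Q   # the divergence estimate is then max_a Q[(s0, a)]
```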

Page 14: Testing Stochastic Processes  Through  Reinforcement Learning

14

Outline

• Program Verification Problem

• The Approach for trace-equivalence

• Other equivalences

• Application on MDPs

• Conclusion

Page 15: Testing Stochastic Processes  Through  Reinforcement Learning

15

Testing (Bisimulation)

The system is a black box.

Grammar (bisimulation):

t ::= ω | a.t | (t1, …, tn)

[Figure: example process with transitions a[0.2] and a[0.5] from s0 and b[0.7] from s3.]

Example (t = a.(b,b)): O_t contains five observations: a is refused, or a succeeds followed by one of the four success/failure combinations of the two replicated b's; under P_{t,s0} their probabilities are 0.3, 0.518, 0.042, 0.042 and 0.098.

The construct (t1, …, tn) is the replication operator; it is what characterizes bisimulation testing.

Page 16: Testing Stochastic Processes  Through  Reinforcement Learning

16

[Figure: two probabilistic processes P and Q with transitions a, b, c, b[1/3], c[2/3], a[1/3], a[2/3].]

New Equivalence Notion

"By-Level Equivalence"

Page 17: Testing Stochastic Processes  Through  Reinforcement Learning

17

K-Moment Equivalence

1-moment (trace): t ::= ω | a.t

2-moment: t ::= ω | a^k.t,  k ≤ 2

3-moment: t ::= ω | a^k.t,  k ≤ 3

X_{tr,a} is a random variable such that Pr(X_{tr,a} = p_i) is the probability of performing the trace tr and moving to a state that accepts action a with probability p_i. Two systems are "by-level" equivalent when these random variables are equal in distribution for every trace and action.

Recall: the kth moment of X is E(X^k) = Σ_i x_i^k · Pr(X = x_i).
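On a known LMP, the k-th moment of X_{tr,a} can be computed by pushing the trace probabilities through the model; a sketch reusing the hypothetical SpecLMP structure from earlier:

```python
def kth_moment_after_trace(lmp, trace, action, k, state=None, prob=1.0):
    """k-th moment of X_{tr,a}: X takes value p_i (the probability with which
    a state reached after `trace` accepts `action`) with probability equal to
    the chance of running the trace into that state, and the k-th moment is
    sum_i p_i^k * Pr(X = p_i)."""
    if state is None:
        state = lmp.initial
    if not trace:
        accept = sum(p for _, p in lmp.transitions.get(state, {}).get(action, []))
        return prob * (accept ** k)
    first, rest = trace[0], trace[1:]
    return sum(kth_moment_after_trace(lmp, rest, action, k, nxt, prob * p)
               for nxt, p in lmp.transitions.get(state, {}).get(first, []))

# With k = 1 this reduces to the probability of the trace followed by the
# action, i.e. the ordinary (trace) equivalence quantity.
```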

Page 18: Testing Stochastic Processes  Through  Reinforcement Learning

18

Ready Equivalence and Failure Equivalence

1. Ready Equivalence

Two systems are Ready equivalent iff for any trace tr and any set of actions A, they have the same probability of running tr successfully and reaching a process accepting all actions from A.

[Figure: processes P and Q with transitions a[1/3], a[2/3], a[1/2], a[1/4], a[3/4], b[1/2], b, c.]

P: (⟨a⟩, {b,c}) → 2/3    Q: (⟨a⟩, {b,c}) → 1/2

Test grammar: t ::= ω | a.t | {a1, …, an}

2. Failure Equivalence

[Figure: the same processes P and Q as above.]

P: (⟨a⟩, {b,c}) → 1/3    Q: (⟨a⟩, {b,c}) → 1/2

Two systems are Failure equivalent iff for any trace tr and any set of actions A, they have the same probability of running tr successfully and reaching a process refusing all actions from A.

Test grammar: t ::= ω | a.t | {a1, …, an}
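Ready and failure probabilities can likewise be computed on a known model. One possible reading (treating "accepts" as a nonzero acceptance probability), sketched with the same hypothetical LMP structure:

```python
def ready_probability(lmp, trace, action_set, state=None, refuse=False):
    """Probability of running `trace` successfully and ending in a state that
    accepts every action of `action_set` (ready equivalence) or, with
    refuse=True, refuses every action of it (failure equivalence)."""
    if state is None:
        state = lmp.initial
    if not trace:
        def accepts(a):
            return sum(p for _, p in lmp.transitions.get(state, {}).get(a, [])) > 0
        if refuse:
            return 1.0 if not any(accepts(a) for a in action_set) else 0.0
        return 1.0 if all(accepts(a) for a in action_set) else 0.0
    first, rest = trace[0], trace[1:]
    return sum(p * ready_probability(lmp, rest, action_set, nxt, refuse)
               for nxt, p in lmp.transitions.get(state, {}).get(first, []))
```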

Page 19: Testing Stochastic Processes  Through  Reinforcement Learning

19

1. Barb Acceptance

[Figure: processes P and Q with transitions a[1/3], a[2/3], a[1/2], a[1/4], a[3/4], b[1/2], b, c.]

Barb equivalence

(⟨a,b⟩, ⟨{a,b}, {b,c}⟩) → 2/3

2. Barb Refusal

[Figure: the same processes P and Q as above.]

(⟨a,b⟩, ⟨{b,c}, {b,c}⟩) → 1/3

Test grammar: t ::= ω | a.t | {a1, …, an} a.t

Page 20: Testing Stochastic Processes  Through  Reinforcement Learning

20

Outline

• Program Verification Problem

• The Approach for trace-equivalence

• Other equivalences

• Application on MDPs

• Conclusion

Page 21: Testing Stochastic Processes  Through  Reinforcement Learning

21

Application on MDPs

[Figure: two MDPs, MDP 1 and MDP 2, with states s0–s9, actions a, b, c, transition probabilities, and rewards r1–r8.]

Case 1: the reward space contains 2 values (binary): 0 and 1

Case 2: the reward space is small (discrete): {r1, r2, r3, r4, r5}

Case 3: the reward space is very large (continuous): w.l.o.g. [0,1]

Page 22: Testing Stochastic Processes  Through  Reinforcement Learning

22

Application on MDPs

Case 1: the reward space contains 2 values (binary). Reward 0 is mapped to Failure (F) and reward 1 to Success (S).

Case 2: the reward space is small (discrete): {r1, r2, r3, r4, r5}. Each action is split into one variant per reward value (a_r1, …, a_r5 and b_r1, …, b_r5); executing a variant is observed as Success (S) or Failure (F) according to the reward actually received.

Case 3: the reward space is very large (continuous). Intuition: a reward r = 3/4 behaves like reward 1 with probability 3/4 and reward 0 with probability 1/4. When an action yields reward r, pick a value ranVal uniformly at random; if ranVal ≤ r the observation is Success (S), and if ranVal > r it is Failure (F).
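A sketch of the Case 3 trick: a continuous reward in [0,1] is compared with a uniformly drawn value, so that Success occurs with probability exactly r:

```python
import random

def reward_to_outcome(r):
    """Turn a continuous reward r in [0,1] into a binary Success/Failure
    observation: Success with probability r, Failure otherwise, so that
    r = 3/4 behaves like reward 1 with probability 3/4 and 0 otherwise."""
    return "S" if random.random() <= r else "F"

# Sanity check: the empirical Success frequency should be close to r.
samples = [reward_to_outcome(0.75) for _ in range(100000)]
print(samples.count("S") / len(samples))   # about 0.75
```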

Page 23: Testing Stochastic Processes  Through  Reinforcement Learning

23

Current and Future Work

Application to different equivalence notions: Failure equivalence, Ready equivalence, Barb equivalence, etc.

Experimental analysis on realistic systems

Applying the approach to compute the divergence between HMMs, POMDPs, and probabilistic automata

Studying the properties of the divergence