
Page 1: Testing Stochastic Processes  Through  Reinforcement Learning

1

Testing Stochastic Processes Through Reinforcement Learning

François Laviolette, Sami Zhioua, Josée Desharnais

NIPS Workshop, December 9th, 2006

Page 2: Testing Stochastic Processes  Through  Reinforcement Learning

2

Outline

• Program Verification Problem

• The Approach for trace-equivalence

• Other equivalences

• Application on MDPs

• Conclusion

Page 3: Testing Stochastic Processes  Through  Reinforcement Learning

3

Stochastic Program Verification

Specification (LMP): an MDP without rewards

Implementation

[Figure: an LMP with states s0–s6 and labelled probabilistic transitions such as a[0.5], a[0.3], b[0.9], c.]

How far is the Implementation from the Specification?

(A distance or divergence.)

The Specification model is available.

The Implementation is available only for interaction (no model).
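To make the setting concrete, here is a minimal Python sketch (not from the slides) of the two objects involved: the Specification as an explicit LMP, and the Implementation as a black box that can only be reset and probed. All class and method names are hypothetical.

```python
import random

class SpecLMP:
    """Hypothetical explicit model of the Specification: a labelled Markov
    process given as state -> action -> list of (next_state, probability).
    Probabilities for an action may sum to less than 1; the remainder is
    the chance that the action is refused (the button does not go down)."""
    def __init__(self, transitions, initial="s0"):
        self.transitions = transitions
        self.initial = initial

class BlackBoxImpl:
    """Hypothetical wrapper around the Implementation: no model is visible,
    the tester can only reset it and try to execute actions."""
    def __init__(self, transitions, initial="s0"):
        self._lmp = SpecLMP(transitions, initial)   # hidden from the tester
        self.reset()

    def reset(self):
        self._state = self._lmp.initial

    def try_action(self, action):
        """Push a button: returns True if the action is accepted (and moves
        to a sampled next state), False if it is refused."""
        choices = self._lmp.transitions.get(self._state, {}).get(action, [])
        r, acc = random.random(), 0.0
        for nxt, p in choices:
            acc += p
            if r < acc:
                self._state = nxt
                return True
        return False

# Toy specification resembling the slide's example (states s0..s6).
spec = SpecLMP({
    "s0": {"a": [("s1", 0.5), ("s2", 0.3)]},
    "s1": {"b": [("s3", 0.9)], "c": [("s4", 1.0)]},
    "s2": {"b": [("s5", 0.9)], "c": [("s6", 1.0)]},
})
```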

Page 4: Testing Stochastic Processes  Through  Reinforcement Learning

4

1. Non-deterministic trace equivalence

[Figure: two non-deterministic processes P and Q, drawn as trees labelled with actions a, b, c.]

Trace Equivalence

Two systems are trace equivalent iff they accept the same set of traces

T(P) = {a, aa, aac, ac, b, ba, bab, c, cb, cc}

T(Q) = {a, ab, ac, abc, abca, ba, bab, c, ca}

2. Probabilistic trace equivalence

Two systems are trace equivalent iff they accept the same set of traces, with the same probabilities.

[Figure: probabilistic processes P and Q with transitions such as a[2/3], a[1/3], b[2/3], a[1/4], a[3/4], b[1/2], c[1/2].]

Trace probabilities for P: a → 7/12, aa → 5/12, aac → 1/6, bc → 2/3

Trace probabilities for Q: a → 1, aa → 1/2, aac → 0, bc → 0
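When both models are known, probabilistic trace equivalence can be checked directly. A small sketch, reusing the hypothetical SpecLMP structure introduced earlier, computes the probability that a given trace is accepted:

```python
def trace_probability(lmp, trace, state=None):
    """Probability that the LMP accepts the whole trace (a list of actions)
    starting from `state`, summing over all runs of the trace."""
    if state is None:
        state = lmp.initial
    if not trace:
        return 1.0
    first, rest = trace[0], trace[1:]
    total = 0.0
    for nxt, p in lmp.transitions.get(state, {}).get(first, []):
        total += p * trace_probability(lmp, rest, nxt)
    return total

# Two finite processes are probabilistically trace equivalent iff
# trace_probability(P, t) == trace_probability(Q, t) for every trace t.
print(trace_probability(spec, ["a", "b"]))
```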

Page 5: Testing Stochastic Processes  Through  Reinforcement Learning

5

Testing (Trace Equivalence)

The system is a black box.

When a button is pushed (action execution), either the button goes down (a transition occurs) or it does not go down (no transition).

Grammar (trace equivalence):

t ::= ω | a.t

Observations: when a test t is executed, several observations are possible; O_t denotes their set.

[Figure: example process with transitions a[0.2] and a[0.5] from s0 and b[0.7] from s3.]

Example (t = a.b): O_t contains three observations: a is refused; a succeeds and then b is refused; a succeeds and then b succeeds, with probabilities 0.3, 0.56 and 0.14 respectively.

[Figure: the black box with one button per action a, b, …, z.]
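On a black box, the distribution over observations O_t can only be estimated by running the test repeatedly. A sketch, again using the hypothetical BlackBoxImpl interface from above:

```python
from collections import Counter

def run_test(box, test):
    """Execute a sequential test t = a1.a2...an on the black box once.
    Returns the observation: the prefix that was executed, with a success
    or failure mark on each action."""
    box.reset()
    obs = []
    for action in test:
        if box.try_action(action):
            obs.append(action + "+")
        else:
            obs.append(action + "-")
            break
    return " ".join(obs)

def estimate_observation_distribution(box, test, n=10000):
    """Empirical distribution over O_t from n independent runs."""
    counts = Counter(run_test(box, test) for _ in range(n))
    return {o: c / n for o, c in counts.items()}

# For t = a.b the observations 'a-', 'a+ b-', 'a+ b+' should appear with
# frequencies close to the probabilities on the slide (0.3, 0.56, 0.14).
```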

Page 6: Testing Stochastic Processes  Through  Reinforcement Learning

6

Outline

• Program Verification Problem

• The Approach for trace-equivalence

• Other equivalences

• Application on MDPs

• Conclusion

Page 7: Testing Stochastic Processes  Through  Reinforcement Learning

7

Why Reinforcement Learning?

[Figure: an LMP with states s0–s8 and probabilistic transitions (a[0.2], a[0.5], a[0.3], a[0.7], b[0.7], b[0.9], a, b), and the MDP it induces.]

Reinforcement Learning is particularly efficient in the absence of the full model.


Reinforcement Learning can deal with bigger systems.

Analogy:

LMP ↔ MDP

Trace ↔ Policy

Divergence ↔ Optimal Value (V*)

Page 8: Testing Stochastic Processes  Through  Reinforcement Learning

8

A Stochastic Game towards RL

[Figure: example sequences of Success/Failure observations produced by the game, together with their rewards.]

[Figure: the Implementation, the Specification, and a clone of the Specification, each an LMP over states s0–s10 with transitions such as a[0.2], a[0.3], a[0.5], b[0.3], b[0.7], b[0.9], c[0.2], c[0.4], c[0.7], c[0.8].]

Reward +1 when the Implementation is distinguished from the Specification.

Reward −1 when the Specification is distinguished from its clone.
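The game can be sketched as follows; this is one reading of the slide's reward scheme (+1 for a difference between Implementation and Specification, −1 for a difference between the Specification and its clone), using the hypothetical black-box interface from earlier:

```python
def play_episode(impl, spec, clone, policy, max_steps=10):
    """One episode of the game, under the reading above: the same action is
    tried on the Implementation, the Specification and a clone of the
    Specification.  A difference Impl-vs-Spec pays +1, a difference
    Spec-vs-Clone pays -1, so a policy that only exploits statistical noise
    earns 0 on average."""
    for box in (impl, spec, clone):
        box.reset()
    total = 0.0
    for _ in range(max_steps):
        action = policy()                     # e.g. a random choice of button
        o_impl  = impl.try_action(action)
        o_spec  = spec.try_action(action)
        o_clone = clone.try_action(action)
        if o_impl != o_spec:
            total += 1.0
        if o_spec != o_clone:
            total -= 1.0
    return total
```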

Page 9: Testing Stochastic Processes  Through  Reinforcement Learning

9

MDP Definition

MDP induced by the Specification LMP: its states, actions, and next-state probability distributions.

[Figure: the Implementation and Specification LMPs and the MDP they induce, with states s0–s10, actions a, b, c, transition probabilities, and an absorbing Dead state.]

Page 10: Testing Stochastic Processes  Through  Reinforcement Learning

10

Divergence Computation

[Figure: observation sequences (Success/Failure) and the associated rewards +1, 0, −1.]

V*(s0) = 0: Equivalent;  V*(s0) = 1: Different.

[Figure: the Implementation and Specification LMPs and the induced MDP (states s0–s10, actions a, b, c, transition probabilities, Dead state), repeated from the previous slide.]

Page 11: Testing Stochastic Processes  Through  Reinforcement Learning

11

Symmetry Problem

[Figure: symmetric Success/Failure observation sequences on the Implementation and the Specification, rewarded +1 and −1.]

Create two variants of each action a: a success variant and a failure variant.

[Figure: three one-transition processes, one with a[1] and the Specification and its clone each with a[0.5].]

At each step, the agent selects an action and makes a prediction (success or failure), then executes the action. If the prediction matches the observation, the reward is computed and given; if it does not, the reward is 0.

With a[0.5] and a uniformly random prediction, the probability of earning the reward is 0 × 0.5 × 0.5 + 1 × 0.5 × 0.5 = 0.25 on the Specification, and likewise 0.25 on its clone.
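A sketch of the prediction mechanism; the exact reward formula is not shown on the slide and is abstracted as a callback here:

```python
import random

def predicted_step(box, action, compute_reward):
    """Prediction trick: the agent commits to a prediction ('success' or
    'failure') before pushing the button.  If the prediction matches the
    observation, a reward is computed (by whatever rule the game uses,
    abstracted as compute_reward); otherwise the reward is 0.  Guessing at
    random therefore earns the reward only half of the time on *both*
    processes, which removes the symmetry problem."""
    prediction = random.choice(["success", "failure"])
    observed = "success" if box.try_action(action) else "failure"
    if observed == prediction:
        return compute_reward(action, observed)
    return 0.0

# Slide example: with a[0.5] and a unit reward, the expected payoff of a
# random prediction is 0*0.5*0.5 + 1*0.5*0.5 = 0.25.
```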

Page 12: Testing Stochastic Processes  Through  Reinforcement Learning

12

The Divergence (with the symmetry problem fixed)

Theorem. Let "Spec" and "Impl" be two LMPs, and M their induced MDP.

V*(s0) ≥ 0, and

V*(s0) = 0 iff "Spec" and "Impl" are trace-equivalent.

Page 13: Testing Stochastic Processes  Through  Reinforcement Learning

13

Implementation and PAC Guarantee

There exists a PAC guarantee for the Q-Learning algorithm, but it is involved; Fiechter's algorithm has a simpler PAC guarantee.

Besides, a lower bound can be obtained from the Hoeffding inequality.

Implementation:

RL algorithm: Q-Learning

γ = 0.8

Action selection: softmax (temperature decreasing from 0.8 to 0.01)

Learning rate α decreasing according to the function 1/x

PAC guarantee:
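A minimal tabular Q-Learning loop matching the settings listed above (γ = 0.8, softmax exploration with a decaying temperature, a 1/x learning-rate schedule); the env interface (reset, step) is an assumption for the sketch, not something the slides define:

```python
import math, random
from collections import defaultdict

def softmax_choice(q_values, actions, temperature):
    """Pick an action with probability proportional to exp(Q / temperature)."""
    best = max(q_values[a] for a in actions)
    weights = [math.exp((q_values[a] - best) / temperature) for a in actions]
    return random.choices(actions, weights=weights)[0]

def q_learning(env, actions, episodes=5000, gamma=0.8):
    """Tabular Q-Learning sketch.  `env` is assumed to expose reset() -> state
    and step(action) -> (next_state, reward, done)."""
    Q = defaultdict(float)
    visits = defaultdict(int)
    for episode in range(1, episodes + 1):
        # Temperature decays from about 0.8 towards 0.01 over the run.
        temperature = max(0.01, 0.8 * (1 - episode / episodes))
        state, done = env.reset(), False
        while not done:
            action = softmax_choice({a: Q[(state, a)] for a in actions},
                                    actions, temperature)
            next_state, reward, done = env.step(action)
            visits[(state, action)] += 1
            alpha = 1.0 / visits[(state, action)]      # the 1/x schedule
            target = reward if done else reward + gamma * max(
                Q[(next_state, a)] for a in actions)
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state = next_state
    return Q   # the divergence estimate is then max_a Q[(s0, a)]
```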

Page 14: Testing Stochastic Processes  Through  Reinforcement Learning

14

Outline

• Program Verification Problem

• The Approach for trace-equivalence

• Other equivalences

• Application on MDPs

• Conclusion

Page 15: Testing Stochastic Processes  Through  Reinforcement Learning

15

Testing (Bisimulation)

The system is a black box.

Grammar (bisimulation):

t ::= ω | a.t | (t1, …, tn)

[Figure: example process with transitions a[0.2] and a[0.5] from s0 and b[0.7] from s3.]

Example (t = a.(b,b)): O_t contains five observations: a is refused, or a succeeds followed by one of the four success/failure combinations of the two replicated b's; under P_{t,s0} their probabilities are 0.3, 0.518, 0.042, 0.042 and 0.098.

The construct (t1, …, tn) is the replication operator; it is what characterizes bisimulation testing.

Page 16: Testing Stochastic Processes  Through  Reinforcement Learning

16

[Figure: two probabilistic processes P and Q with transitions a, b, c, b[1/3], c[2/3], a[1/3], a[2/3].]

New Equivalence Notion

"By-Level Equivalence"

Page 17: Testing Stochastic Processes  Through  Reinforcement Learning

17

K-Moment Equivalence

1-moment (trace): t ::= ω | a.t

2-moment: t ::= ω | a^k.t,  k ≤ 2

3-moment: t ::= ω | a^k.t,  k ≤ 3

X_{tr,a} is a random variable such that Pr(X_{tr,a} = p_i) is the probability of performing the trace tr and moving to a state that accepts action a with probability p_i. Two systems are "by-level" equivalent when these random variables are equal in distribution for every trace and action.

Recall: the kth moment of X is E(X^k) = Σ_i x_i^k · Pr(X = x_i).
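On a known LMP, the k-th moment of X_{tr,a} can be computed by pushing the trace probabilities through the model; a sketch reusing the hypothetical SpecLMP structure from earlier:

```python
def kth_moment_after_trace(lmp, trace, action, k, state=None, prob=1.0):
    """k-th moment of X_{tr,a}: X takes value p_i (the probability with which
    a state reached after `trace` accepts `action`) with probability equal to
    the chance of running the trace into that state, and the k-th moment is
    sum_i p_i^k * Pr(X = p_i)."""
    if state is None:
        state = lmp.initial
    if not trace:
        accept = sum(p for _, p in lmp.transitions.get(state, {}).get(action, []))
        return prob * (accept ** k)
    first, rest = trace[0], trace[1:]
    return sum(kth_moment_after_trace(lmp, rest, action, k, nxt, prob * p)
               for nxt, p in lmp.transitions.get(state, {}).get(first, []))

# With k = 1 this reduces to the probability of the trace followed by the
# action, i.e. the ordinary (trace) equivalence quantity.
```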

Page 18: Testing Stochastic Processes  Through  Reinforcement Learning

18

Ready Equivalence and Failure Equivalence

1. Ready Equivalence

Two systems are Ready equivalent iff for any trace tr and any set of actions A, they have the same probability of running tr successfully and reaching a process accepting all actions from A.

[Figure: processes P and Q with transitions a[1/3], a[2/3], a[1/2], a[1/4], a[3/4], b[1/2], b, c.]

P: (⟨a⟩, {b,c}) → 2/3    Q: (⟨a⟩, {b,c}) → 1/2

Test grammar: t ::= ω | a.t | {a1, …, an}

2. Failure Equivalence

[Figure: the same processes P and Q as above.]

P: (⟨a⟩, {b,c}) → 1/3    Q: (⟨a⟩, {b,c}) → 1/2

Two systems are Failure equivalent iff for any trace tr and any set of actions A, they have the same probability of running tr successfully and reaching a process refusing all actions from A.

Test grammar: t ::= ω | a.t | {a1, …, an}
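Ready and failure probabilities can likewise be computed on a known model. One possible reading (treating "accepts" as a nonzero acceptance probability), sketched with the same hypothetical LMP structure:

```python
def ready_probability(lmp, trace, action_set, state=None, refuse=False):
    """Probability of running `trace` successfully and ending in a state that
    accepts every action of `action_set` (ready equivalence) or, with
    refuse=True, refuses every action of it (failure equivalence)."""
    if state is None:
        state = lmp.initial
    if not trace:
        def accepts(a):
            return sum(p for _, p in lmp.transitions.get(state, {}).get(a, [])) > 0
        if refuse:
            return 1.0 if not any(accepts(a) for a in action_set) else 0.0
        return 1.0 if all(accepts(a) for a in action_set) else 0.0
    first, rest = trace[0], trace[1:]
    return sum(p * ready_probability(lmp, rest, action_set, nxt, refuse)
               for nxt, p in lmp.transitions.get(state, {}).get(first, []))
```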

Page 19: Testing Stochastic Processes  Through  Reinforcement Learning

19

1. Barb Acceptance

[Figure: processes P and Q with transitions a[1/3], a[2/3], a[1/2], a[1/4], a[3/4], b[1/2], b, c.]

Barb equivalence

(⟨a,b⟩, ⟨{a,b}, {b,c}⟩) → 2/3

2. Barb Refusal

[Figure: the same processes P and Q as above.]

(⟨a,b⟩, ⟨{b,c}, {b,c}⟩) → 1/3

Test grammar: t ::= ω | a.t | {a1, …, an} a.t

Page 20: Testing Stochastic Processes  Through  Reinforcement Learning

20

Outline

• Program Verification Problem

• The Approach for trace-equivalence

• Other equivalences

• Application on MDPs

• Conclusion

Page 21: Testing Stochastic Processes  Through  Reinforcement Learning

21

Application on MDPs

[Figure: two MDPs, MDP 1 and MDP 2, with states s0–s9, actions a, b, c, transition probabilities, and rewards r1–r8.]

Case 1: the reward space contains 2 values (binary): 0 and 1

Case 2: the reward space is small (discrete): {r1, r2, r3, r4, r5}

Case 3: the reward space is very large (continuous): w.l.o.g. [0,1]

Page 22: Testing Stochastic Processes  Through  Reinforcement Learning

22

Application on MDPs

Case 1: the reward space contains 2 values (binary). Reward 0 is mapped to Failure (F) and reward 1 to Success (S).

Case 2: the reward space is small (discrete): {r1, r2, r3, r4, r5}. Each action is split into one variant per reward value (a_r1, …, a_r5 and b_r1, …, b_r5); executing a variant is observed as Success (S) or Failure (F) according to the reward actually received.

Case 3: the reward space is very large (continuous). Intuition: a reward r = 3/4 behaves like reward 1 with probability 3/4 and reward 0 with probability 1/4. When an action yields reward r, pick a value ranVal uniformly at random; if ranVal ≤ r the observation is Success (S), and if ranVal > r it is Failure (F).
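A sketch of the Case 3 trick: a continuous reward in [0,1] is compared with a uniformly drawn value, so that Success occurs with probability exactly r:

```python
import random

def reward_to_outcome(r):
    """Turn a continuous reward r in [0,1] into a binary Success/Failure
    observation: Success with probability r, Failure otherwise, so that
    r = 3/4 behaves like reward 1 with probability 3/4 and 0 otherwise."""
    return "S" if random.random() <= r else "F"

# Sanity check: the empirical Success frequency should be close to r.
samples = [reward_to_outcome(0.75) for _ in range(100000)]
print(samples.count("S") / len(samples))   # about 0.75
```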

Page 23: Testing Stochastic Processes  Through  Reinforcement Learning

23

Current and Future Work

Application to different equivalence notions: Failure equivalence, Ready equivalence, Barb equivalence, etc.

Experimental analysis on realistic systems

Applying the approach to compute the divergence between HMMs, POMDPs, and probabilistic automata

Studying the properties of the divergence