
ETHEM ALPAYDIN © The MIT Press, 2010
alpaydin@boun.edu.tr
http://www.cmpe.boun.edu.tr/~ethem/i2ml2e

Lecture Slides for Introduction to Machine Learning 2e

Introduction

Game-playing: Sequence of moves to win a game

Robot in a maze: Sequence of actions to find a goal

Agent has a state in an environment, takes an action, and sometimes receives a reward; the state then changes

Credit assignment: deciding which of the earlier actions deserves credit for a delayed reward

Learn a policy


Single State: K-armed Bandit

Among K levers, choose the one that pays best

Q(a): value of action a; reward is r_a

Set Q(a) = r_a

Choose a* if Q(a*) = max_a Q(a)

Rewards stochastic (keep a running estimate of the expected reward):

$Q_{t+1}(a) \leftarrow Q_t(a) + \eta\,[\,r_{t+1}(a) - Q_t(a)\,]$
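A minimal sketch of this update in Python, assuming Gaussian reward noise and an ε-greedy choice of lever (both are illustrative assumptions, not part of the slides):

    import random

    def run_bandit(reward_means, n_steps=1000, eta=0.1, epsilon=0.1):
        # Keep a running estimate Q(a) of the expected reward of each lever
        K = len(reward_means)
        Q = [0.0] * K
        for _ in range(n_steps):
            if random.random() < epsilon:              # explore occasionally
                a = random.randrange(K)
            else:                                      # otherwise pull the best-looking lever
                a = max(range(K), key=lambda i: Q[i])
            r = random.gauss(reward_means[a], 1.0)     # stochastic reward r_{t+1}(a)
            Q[a] += eta * (r - Q[a])                   # Q_{t+1}(a) = Q_t(a) + eta*(r - Q_t(a))
        return Q

    print(run_bandit([1.0, 2.0, 1.5]))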


Elements of RL (Markov Decision Processes)

$s_t$: state of agent at time t

$a_t$: action taken at time t

In $s_t$, action $a_t$ is taken, the clock ticks, reward $r_{t+1}$ is received, and the state changes to $s_{t+1}$

Next state probability: $P(s_{t+1} \mid s_t, a_t)$

Reward probability: $p(r_{t+1} \mid s_t, a_t)$

Initial state(s), goal state(s)

Episode (trial) of actions from initial state to goal

(Sutton and Barto, 1998; Kaelbling et al., 1996)


Policy and Cumulative Reward

Policy: $\pi : S \rightarrow A$, $a_t = \pi(s_t)$

Value of a policy: $V^{\pi}(s_t)$

Finite-horizon:
$V^{\pi}(s_t) = E[r_{t+1} + r_{t+2} + \cdots + r_{t+T}] = E\!\left[\sum_{i=1}^{T} r_{t+i}\right]$

Infinite horizon:
$V^{\pi}(s_t) = E[r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots] = E\!\left[\sum_{i=1}^{\infty} \gamma^{i-1} r_{t+i}\right]$

$0 \le \gamma < 1$ is the discount rate
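As a small numeric illustration of the two return definitions, the sketch below computes them for a made-up reward sequence:

    def finite_horizon_return(rewards, T):
        # sum of the first T rewards: r_{t+1} + ... + r_{t+T}
        return sum(rewards[:T])

    def discounted_return(rewards, gamma=0.9):
        # sum_i gamma^{i-1} r_{t+i}; rewards[0] corresponds to r_{t+1}
        return sum(gamma ** i * r for i, r in enumerate(rewards))

    rewards = [0, 0, 1, 0, 5]                      # hypothetical reward sequence
    print(finite_horizon_return(rewards, T=3))     # 1
    print(discounted_return(rewards, gamma=0.9))   # 0.9**2 * 1 + 0.9**4 * 5 = 4.09...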


Bellman's equation

$V^*(s_t) = \max_{a_t} E\!\left[\sum_{i=1}^{\infty} \gamma^{i-1} r_{t+i}\right] = \max_{a_t} E\!\left[r_{t+1} + \gamma \sum_{i=1}^{\infty} \gamma^{i-1} r_{t+i+1}\right] = \max_{a_t} E\!\left[r_{t+1} + \gamma V^*(s_{t+1})\right]$

$V^*(s_t) = \max_{a_t} \left( E[r_{t+1}] + \gamma \sum_{s_{t+1}} P(s_{t+1} \mid s_t, a_t)\, V^*(s_{t+1}) \right)$

$Q^*(s_t, a_t) = E[r_{t+1}] + \gamma \sum_{s_{t+1}} P(s_{t+1} \mid s_t, a_t)\, \max_{a_{t+1}} Q^*(s_{t+1}, a_{t+1})$

$V^*(s_t) = \max_{a_t} Q^*(s_t, a_t)$: value of the best possible action in $s_t$


Model-Based Learning

Environment, $P(s_{t+1} \mid s_t, a_t)$ and $p(r_{t+1} \mid s_t, a_t)$, is known

There is no need for exploration

Can be solved using dynamic programming

Solve for
$V^*(s_t) = \max_{a_t} \left( E[r_{t+1}] + \gamma \sum_{s_{t+1}} P(s_{t+1} \mid s_t, a_t)\, V^*(s_{t+1}) \right)$

Optimal policy
$\pi^*(s_t) = \arg\max_{a_t} \left( E[r_{t+1} \mid s_t, a_t] + \gamma \sum_{s_{t+1}} P(s_{t+1} \mid s_t, a_t)\, V^*(s_{t+1}) \right)$


Value Iteration
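The original slide shows value iteration as a pseudocode figure. Below is a minimal tabular sketch, assuming the model is given as dictionaries P[s][a] = {next_state: prob} and R[s][a] = expected immediate reward (this representation is an assumption for illustration):

    def value_iteration(states, actions, P, R, gamma=0.9, tol=1e-6):
        # Repeatedly apply the Bellman optimality backup until values stop changing
        V = {s: 0.0 for s in states}
        while True:
            delta = 0.0
            for s in states:
                v_new = max(R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items())
                            for a in actions)
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < tol:
                return V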


Policy Iteration
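The policy-iteration slide is likewise a pseudocode figure; a compact sketch under the same assumed model representation, using repeated sweeps for policy evaluation instead of an exact linear solve:

    def policy_iteration(states, actions, P, R, gamma=0.9, eval_sweeps=100):
        pi = {s: actions[0] for s in states}
        V = {s: 0.0 for s in states}
        while True:
            # policy evaluation (approximate, by repeated sweeps)
            for _ in range(eval_sweeps):
                for s in states:
                    a = pi[s]
                    V[s] = R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items())
            # policy improvement: act greedily with respect to the current V
            stable = True
            for s in states:
                best = max(actions, key=lambda a: R[s][a] +
                           gamma * sum(p * V[s2] for s2, p in P[s][a].items()))
                if best != pi[s]:
                    pi[s] = best
                    stable = False
            if stable:
                return pi, V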


Temporal Difference Learning

Environment, $P(s_{t+1} \mid s_t, a_t)$ and $p(r_{t+1} \mid s_t, a_t)$, is not known; model-free learning

There is a need for exploration to sample from $P(s_{t+1} \mid s_t, a_t)$ and $p(r_{t+1} \mid s_t, a_t)$

Use the reward received in the next time step to update the value of the current state (action)

The temporal difference is the difference between the value of the current action and the value discounted from the next state


Exploration Strategies

ε-greedy: with probability ε, choose one action at random uniformly; choose the best action with probability 1 − ε

Probabilistic (softmax):
$P(a \mid s) = \dfrac{\exp Q(s,a)}{\sum_{b=1}^{A} \exp Q(s,b)}$

Move smoothly from exploration to exploitation.

Decrease ε over time, or anneal a temperature T:
$P(a \mid s) = \dfrac{\exp\left[Q(s,a)/T\right]}{\sum_{b=1}^{A} \exp\left[Q(s,b)/T\right]}$
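A small sketch of both strategies, assuming a tabular Q indexed by (state, action) pairs:

    import math
    import random

    def epsilon_greedy(Q, s, actions, epsilon=0.1):
        # with probability epsilon explore uniformly, otherwise exploit Q
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    def softmax_action(Q, s, actions, T=1.0):
        # P(a|s) proportional to exp(Q(s,a)/T); large T explores, small T exploits
        prefs = [Q[(s, a)] / T for a in actions]
        m = max(prefs)                              # subtract max for numerical stability
        weights = [math.exp(x - m) for x in prefs]
        return random.choices(actions, weights=weights, k=1)[0]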


Deterministic Rewards and Actions

$Q^*(s_t, a_t) = E[r_{t+1}] + \gamma \sum_{s_{t+1}} P(s_{t+1} \mid s_t, a_t)\, \max_{a_{t+1}} Q^*(s_{t+1}, a_{t+1})$

Deterministic: single possible reward and next state
$Q(s_t, a_t) = r_{t+1} + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1})$

Used as an update rule (backup):
$\hat{Q}(s_t, a_t) \leftarrow r_{t+1} + \gamma \max_{a_{t+1}} \hat{Q}(s_{t+1}, a_{t+1})$

Starting at zero, Q values increase, never decrease


(γ = 0.9)

Consider the value of the action marked by '*':
If path A is seen first, Q(*) = 0.9 × max(0, 81) = 73; then B is seen, Q(*) = 0.9 × max(100, 81) = 90
Or, if path B is seen first, Q(*) = 0.9 × max(100, 0) = 90; then A is seen, Q(*) = 0.9 × max(100, 81) = 90

Q values increase but never decrease


Nondeterministic Rewards and Actions

When next states and rewards are nondeterministic (there is an opponent or randomness in the environment), we keep averages (expected values) instead of assignments

Q-learning (Watkins and Dayan, 1992):
$\hat{Q}(s_t, a_t) \leftarrow \hat{Q}(s_t, a_t) + \eta \left( r_{t+1} + \gamma \max_{a_{t+1}} \hat{Q}(s_{t+1}, a_{t+1}) - \hat{Q}(s_t, a_t) \right)$

Off-policy vs on-policy (Sarsa)

Learning V (TD-learning: Sutton, 1988):
$V(s_t) \leftarrow V(s_t) + \eta \left( r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right)$

Backup: the estimated value of the next step is "backed up" to revise the estimate for the current one


Q-learning
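The Q-learning slide in the deck is an algorithm box. A minimal tabular sketch, assuming an environment object with reset() and step(a) -> (next_state, reward, done) methods (this interface is an assumption, not part of the slides):

    import random
    from collections import defaultdict

    def q_learning(env, actions, episodes=500, eta=0.1, gamma=0.9, epsilon=0.1):
        Q = defaultdict(float)                      # Q[(s, a)], initialized to 0
        for _ in range(episodes):
            s, done = env.reset(), False
            while not done:
                # epsilon-greedy behaviour policy
                if random.random() < epsilon:
                    a = random.choice(actions)
                else:
                    a = max(actions, key=lambda x: Q[(s, x)])
                s2, r, done = env.step(a)
                # off-policy target: best action in the next state
                target = r + (0.0 if done else gamma * max(Q[(s2, x)] for x in actions))
                Q[(s, a)] += eta * (target - Q[(s, a)])
                s = s2
        return Q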


Sarsa
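The Sarsa slide is also an algorithm box; the on-policy variant differs from Q-learning only in the target, which uses the action actually chosen in the next state (same assumed environment interface):

    import random
    from collections import defaultdict

    def sarsa(env, actions, episodes=500, eta=0.1, gamma=0.9, epsilon=0.1):
        Q = defaultdict(float)
        choose = lambda s: (random.choice(actions) if random.random() < epsilon
                            else max(actions, key=lambda x: Q[(s, x)]))
        for _ in range(episodes):
            s, done = env.reset(), False
            a = choose(s)
            while not done:
                s2, r, done = env.step(a)
                a2 = choose(s2) if not done else None
                # on-policy target: the action a2 that will actually be taken
                target = r + (0.0 if done else gamma * Q[(s2, a2)])
                Q[(s, a)] += eta * (target - Q[(s, a)])
                s, a = s2, a2
        return Q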


Eligibility Traces

Keep a record of previously visited states (actions):

$e_t(s, a) = \begin{cases} 1 & \text{if } s = s_t \text{ and } a = a_t \\ \gamma \lambda\, e_{t-1}(s, a) & \text{otherwise} \end{cases}$

$\delta_t = r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)$

$Q(s, a) \leftarrow Q(s, a) + \eta\, \delta_t\, e_t(s, a), \quad \forall s, a$


Sarsa (λ)
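Sarsa(λ) appears in the deck as an algorithm box as well; a tabular sketch that combines the Sarsa update with the eligibility-trace decay above (same assumed environment interface as the earlier sketches):

    import random
    from collections import defaultdict

    def sarsa_lambda(env, actions, episodes=500, eta=0.1, gamma=0.9, lam=0.8, epsilon=0.1):
        Q = defaultdict(float)
        choose = lambda s: (random.choice(actions) if random.random() < epsilon
                            else max(actions, key=lambda x: Q[(s, x)]))
        for _ in range(episodes):
            e = defaultdict(float)                  # eligibility traces, reset each episode
            s, done = env.reset(), False
            a = choose(s)
            while not done:
                s2, r, done = env.step(a)
                a2 = choose(s2) if not done else None
                delta = r + (0.0 if done else gamma * Q[(s2, a2)]) - Q[(s, a)]
                e[(s, a)] = 1.0                     # trace set to 1 for the visited pair
                for key in list(e):                 # update all previously visited pairs
                    Q[key] += eta * delta * e[key]
                    e[key] *= gamma * lam           # decay traces
                s, a = s2, a2
        return Q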


Generalization

Tabular: Q(s, a) or V(s) stored in a table

Regressor: use a learner to estimate Q(s, a) or V(s), parameterized by $\theta$

$E^t(\theta) = \left[ r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]^2$

$\Delta\theta = \eta \left[ r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right] \nabla_{\theta} Q(s_t, a_t)$

Eligibility traces:
$\Delta\theta = \eta\, \delta_t\, \mathbf{e}_t, \qquad \delta_t = r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t), \qquad \mathbf{e}_t = \gamma \lambda\, \mathbf{e}_{t-1} + \nabla_{\theta} Q(s_t, a_t)$, with $\mathbf{e}_0$ all zeros
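A minimal sketch of this gradient update for a linear regressor $Q(s, a \mid \theta) = \theta \cdot \phi(s, a)$, where the feature map phi is an assumed user-supplied function:

    import numpy as np

    def sarsa_linear_update(theta, phi, s, a, r, s2, a2, eta=0.05, gamma=0.9):
        # One gradient step on E(theta) = [r + gamma*Q(s2,a2) - Q(s,a)]^2
        # for a linear approximator Q(s,a|theta) = theta . phi(s,a)
        q = theta @ phi(s, a)
        q_next = theta @ phi(s2, a2) if a2 is not None else 0.0   # 0 at episode end
        delta = r + gamma * q_next - q
        return theta + eta * delta * phi(s, a)      # grad_theta Q(s,a) = phi(s,a)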


Partially Observable States

The agent does not know its state, but receives an observation $p(o_{t+1} \mid s_t, a_t)$ which can be used to infer a belief about states

Partially observable MDP (POMDP)


The Tiger Problem

Two doors, behind one of which there is a tiger

p: prob that tiger is behind the left door

$R(a_L) = -100\,p + 80\,(1-p)$, $\quad R(a_R) = 90\,p - 100\,(1-p)$

We can sense with a reward of $R(a_S) = -1$

We have unreliable sensors


If we sense $o_L$, our belief in the tiger's position changes:

$p' = P(z_L \mid o_L) = \dfrac{P(o_L \mid z_L)\, P(z_L)}{P(o_L)} = \dfrac{0.7\,p}{0.7\,p + 0.3\,(1-p)}$

$R(a_L \mid o_L) = r(a_L, z_L) P(z_L \mid o_L) + r(a_L, z_R) P(z_R \mid o_L) = -100\,p' + 80\,(1-p') = \dfrac{-100\,(0.7\,p) + 80\,(0.3)(1-p)}{P(o_L)}$

$R(a_R \mid o_L) = r(a_R, z_L) P(z_L \mid o_L) + r(a_R, z_R) P(z_R \mid o_L) = 90\,p' - 100\,(1-p') = \dfrac{90\,(0.7\,p) - 100\,(0.3)(1-p)}{P(o_L)}$

$R(a_S \mid o_L) = -1$
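A small sketch that evaluates these quantities for a given prior p, assuming the 0.7/0.3 sensor reliability and the payoffs stated above:

    def tiger_after_sense_left(p):
        # Belief update and expected rewards after observing o_L (sensor correct w.p. 0.7)
        P_oL = 0.7 * p + 0.3 * (1 - p)            # P(o_L)
        p_prime = 0.7 * p / P_oL                  # P(z_L | o_L)
        R_aL = -100 * p_prime + 80 * (1 - p_prime)
        R_aR = 90 * p_prime - 100 * (1 - p_prime)
        R_aS = -1
        return p_prime, R_aL, R_aR, R_aS

    print(tiger_after_sense_left(0.5))   # belief shifts toward the tiger being on the left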


The value of sensing and then acting on the observation chooses the best action separately for each possible observation:

$V' = \sum_j \max_i R(a_i \mid o_j)\, P(o_j)$
$\quad = \max\big( R(a_L \mid o_L), R(a_R \mid o_L), R(a_S \mid o_L) \big) P(o_L) + \max\big( R(a_L \mid o_R), R(a_R \mid o_R), R(a_S \mid o_R) \big) P(o_R)$
$\quad = \max\big( -100\,p + 80\,(1-p),\; 90\,p - 100\,(1-p),\; 33\,p + 26\,(1-p),\; -43\,p - 46\,(1-p) \big)$


Let us say the tiger can move from one room to the other with probability 0.8; the belief is then updated accordingly:

$p' = 0.2\,p + 0.8\,(1-p)$

$V' = \max\big( -100\,p' + 80\,(1-p'),\; 90\,p' - 100\,(1-p'),\; 33\,p' + 26\,(1-p') \big)$


When planning for episodes of two, we can take $a_L$, $a_R$, or sense and wait:

$V_2 = \max\big( -100\,p + 80\,(1-p),\; 90\,p - 100\,(1-p),\; -1 + \gamma\, V' \big)$
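Putting the pieces together numerically (a sketch: the discount γ = 0.9 is an assumed example value, and V' here is the sense-then-act value computed for the stationary tiger):

    def tiger_V2(p, gamma=0.9):                 # gamma is an assumed example value
        # value of sensing and then acting optimally on the resulting observation
        V_sense = max(-100*p + 80*(1-p), 90*p - 100*(1-p),
                      33*p + 26*(1-p), -43*p - 46*(1-p))
        # horizon-2 value: open left now, open right now, or pay -1 to sense and wait
        return max(-100*p + 80*(1-p), 90*p - 100*(1-p), -1 + gamma * V_sense)

    for p in (0.1, 0.5, 0.9):
        print(p, tiger_V2(p))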
