
Reinforcement Learning

Mitchell, Ch. 13

(see also Barto & Sutton book on-line)

Rationale

• Learning from experience

• Adaptive control

• Examples not explicitly labeled, delayed feedback

• Problem of credit assignment – which action(s) led to payoff?

• Tradeoff between short-term thinking (immediate reward) and long-term consequences

• Transition function – T: S x A -> S (the environment)

• Reward function – R: S x A -> real (the payoff)

• Stochastic but Markov

• Policy = decision function, π: S -> A

• “Rationality” – maximize long-term expected reward

  – Discounted long-term reward (convergent series)

  – Alternatives: finite time horizon, uniform weights
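The discounted objective referred to above, in standard notation (my reconstruction, not verbatim from the slides):

V^\pi(s) = E\big[\textstyle\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s\big], \qquad 0 \le \gamma < 1

Bounded rewards and γ < 1 make this a convergent series, which is the point of discounting.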

Agent Model

[Figure: agent-environment loop – the agent chooses actions according to its policy π; the environment, defined by R and T, returns a reward and the next state]

Markov Decision Processes (MDPs)

• If R and T (= P) are known, solve for the value function V(s)

• Policy evaluation

• Bellman equations

• Dynamic programming (|S| equations in |S| unknowns)
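For reference, the policy-evaluation Bellman equations in standard notation (my phrasing, not copied from the slide):

V^\pi(s) = R(s, \pi(s)) + \gamma \sum_{s'} P(s' \mid s, \pi(s)) \, V^\pi(s')

With |S| states this is exactly the system of |S| linear equations in |S| unknowns mentioned above.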

• Finding optimal policies

• Value iteration – update V(s) iteratively until the greedy policy π(s) = argmax_a [R(s,a) + γ Σ_s' P(s'|s,a) V(s')] stops changing (see the sketch after this list)

• Policy iteration – iterate between choosing π greedily and updating V over all states

• Monte Carlo sampling: run random scenarios using π and take the average rewards as V(s)
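A minimal value-iteration sketch for a finite MDP (illustrative only; the data structures P and R, the parameters, and the function name are my assumptions, not from the slides):

def value_iteration(states, actions, P, R, gamma=0.9, tol=1e-6):
    # P[s][a] is a list of (prob, next_state) pairs; R[s][a] is the immediate reward.
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # Bellman optimality backup for state s
            best = max(
                R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                for a in actions
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    # Greedy policy read off the converged value function
    pi = {
        s: max(actions,
               key=lambda a: R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a]))
        for s in states
    }
    return V, pi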


Q-learning: model-free

• Q-function: reformulate as a value function of S and A, independent of R and T
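In standard notation (my reconstruction, writing δ for the deterministic transition function as in Mitchell):

Q(s,a) = r(s,a) + \gamma V^*(\delta(s,a)), \qquad V^*(s) = \max_{a'} Q(s,a'), \qquad \pi^*(s) = \arg\max_a Q(s,a)

The agent can act optimally from Q alone, without ever evaluating R or δ explicitly – hence “model-free”.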

Q-learning algorithm
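The pseudocode on this slide did not survive extraction; a minimal tabular sketch of the standard deterministic update Q(s,a) <- r + γ max_a' Q(s',a') might look like the following (the env interface, function name, and parameters are my assumptions, not the slide's):

import random
from collections import defaultdict

# Tabular Q-learning sketch (deterministic update rule, as in Mitchell Ch. 13).
# `env` with reset() -> s and step(a) -> (s2, r, done) is an assumed interface.
def q_learning(env, actions, episodes=1000, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)                      # Q[(s, a)], all entries start at 0
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy choice: mostly exploit the table, sometimes explore
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: Q[(s, act)])
            s2, r, done = env.step(a)
            # deterministic update: Q(s,a) <- r + gamma * max_a' Q(s',a')
            best_next = 0.0 if done else max(Q[(s2, act)] for act in actions)
            Q[(s, a)] = r + gamma * best_next
            s = s2
    return Q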

Convergence

• Theorem: Q converges to Q*, provided each state–action pair is visited infinitely often (assuming bounded rewards, |r| ≤ c)

• Proof idea: over each interval in which every (s,a) pair is visited, the magnitude of the largest error in the Q table decreases by at least a factor of γ
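The contraction step behind that claim, in standard notation (my reconstruction): writing Δ_n for the largest table error after the n-th such interval,

|\hat{Q}_{n+1}(s,a) - Q(s,a)| = \gamma \,\big|\max_{a'} \hat{Q}_n(s',a') - \max_{a'} Q(s',a')\big| \le \gamma \max_{a'} |\hat{Q}_n(s',a') - Q(s',a')| \le \gamma \Delta_n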

Training

• “On-policy”

  – Exploitation vs. exploration

  – Will relevant parts of the space be explored if we stick to the current (sub-optimal) policy?

  – ε-greedy policies: choose the action with the max Q value most of the time, or a random action ε% of the time (see the sketch below)
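A minimal ε-greedy selection rule, as a sketch (names and signature are mine, not the slide's):

import random

# With probability epsilon explore (random action); otherwise exploit (greedy action).
def epsilon_greedy(Q, s, actions, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda act: Q[(s, act)])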

• “Off-policy”

  – Learn from simulations or traces

  – SARSA: database of training examples <s, a, r, s’, a’>

• Actor-critic

Non-deterministic case
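The equations on this slide were lost in extraction; for the non-deterministic case, Mitchell's update (my reconstruction in standard notation) replaces the hard assignment with a decaying learning rate:

\hat{Q}_n(s,a) \leftarrow (1-\alpha_n)\,\hat{Q}_{n-1}(s,a) + \alpha_n\big[r + \gamma \max_{a'} \hat{Q}_{n-1}(s',a')\big], \qquad \alpha_n = \frac{1}{1 + \text{visits}_n(s,a)}

Convergence still holds, but only in the limit as each state–action pair is visited infinitely often and the learning rates decay appropriately.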

Temporal Difference Learning

• Convergence is not the problem

• Representing a large Q table is the problem (domains with many states or continuous actions)

• How to represent large Q tables?

  – Neural network

  – Function approximation (see the linear sketch below)

  – Basis functions

  – Hierarchical decomposition of the state space
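As one concrete, purely illustrative instance of function approximation with basis functions: represent Q(s,a) ≈ w_a · φ(s) and nudge the weights toward the TD target. The feature map φ, the step size α, and the function name are my assumptions, not from the slides.

import numpy as np

# Semi-gradient Q-learning step with a linear approximator instead of a table.
# w: dict mapping each action to a weight vector; phi: feature map s -> np.ndarray.
def td_update(w, phi, s, a, r, s2, actions, gamma=0.9, alpha=0.01, done=False):
    q_sa = w[a] @ phi(s)
    target = r if done else r + gamma * max(w[a2] @ phi(s2) for a2 in actions)
    w[a] = w[a] + alpha * (target - q_sa) * phi(s)   # gradient step on the squared TD error
    return w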