Reinforcement Learning

MSc in Informatics Engineering

11-10-2010 Machine Learning (Aprendizagem Automática)

Search

Given a solution space, search for the best solution (or an acceptable one)

Reinforcement learning

Interact with an environment and discover the best action for each state

Unsupervised learning

Given several patterns, discover similarities among them and group them

Reduce the number of attributes considered

Supervised learning

Knowing what happened in the past, predict what comes next

Induce a rule from given examples


Reinforcement learning is applied to problems in which an "agent" has to interact with an environment (sometimes a dynamic one)

The agent has to learn, by trial and error, which actions give it the highest rewards

The agent can observe the state of the environment (or part of it) and its own internal state

The agent evaluates the quality of each state and each action


Receive the state of the environment (s) at time t

Choose the action (a)

Receive the reward (r: positive, negative, binary or decimal)

Update the action-selection parameters (using the selection policy, π(s,a))
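The four steps of the interaction cycle can be sketched as a short loop. This is only an illustrative sketch: the `CoinFlipEnv` environment, the `run_loop` function, the fixed 10% random-action rate and the running-average update are assumptions made for the example and are not taken from the slides.

```python
import random


class CoinFlipEnv:
    """Hypothetical two-state environment, used only for illustration."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.state = 0

    def step(self, action):
        # Reward 1.0 when the action matches the current state, else 0.0.
        reward = 1.0 if action == self.state else 0.0
        self.state = self.rng.randint(0, 1)  # next state is random
        return self.state, reward


def run_loop(env, steps=1000, alpha=0.1, seed=1):
    rng = random.Random(seed)
    # Action-selection parameters: one estimate per (state, action) pair.
    q = {(s, a): 0.0 for s in (0, 1) for a in (0, 1)}
    s = env.state                                # 1. receive state s at time t
    for _ in range(steps):
        if rng.random() < 0.1:                   # 2. choose action a
            a = rng.randint(0, 1)                #    (sometimes at random)
        else:
            a = max((0, 1), key=lambda act: q[(s, act)])
        s_next, r = env.step(a)                  # 3. receive reward r
        q[(s, a)] += alpha * (r - q[(s, a)])     # 4. update the parameters
        s = s_next
    return q
```

After enough steps the estimate for the matching action in each state should dominate, so the greedy choice becomes the rewarded one.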


The reward can be external (based on more information than the state) or internal (heuristic critic)

It evaluates the state/action carried out by the agent

Typically the agent tries to optimize the discounted sum of long-term rewards:

$$R = \sum_{t} \gamma^{t} r_{t}$$

R: total reward

r_t: reward at time t

γ: discount factor for future rewards, 0 < γ < 1
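As a small worked example of the discounted sum of rewards, the sketch below computes R for a finite list of rewards. The function name `discounted_return` and the sample rewards are assumptions made for illustration.

```python
def discounted_return(rewards, gamma):
    """Return sum over t of gamma**t * rewards[t], for a finite episode."""
    total = 0.0
    # Work backwards so each step is total = r_t + gamma * (return from t+1).
    for r in reversed(rewards):
        total = r + gamma * total
    return total
```

With rewards `[1, 1, 1]` and γ = 0.5 this gives 1 + 0.5 + 0.25 = 1.75: rewards further in the future contribute less.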

Bellman equation

$$V^{*}(s) = \max_{a} \Big( r_{s,a} + \gamma \sum_{s'} P(s,a,s')\, V^{*}(s') \Big)$$

V*(s): optimal value (or utility) of state s

P(s,a,s'): probability of moving from state s to state s' after performing action a

If r, s, a and P(s,a,s') are known, the equation can be solved analytically, or by value iteration: iterate over the states, updating their values (there are several policies for doing so), until convergence.
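Value iteration can be sketched on a tiny, fully known model. The three-state chain below, the dictionary layout of `P` and `r`, and the stopping tolerance are all assumptions made for illustration; the sweep order (states in dictionary order, updated in place) is just one of the possible update policies.

```python
def value_iteration(P, r, gamma, tol=1e-9):
    """P[s][a] is a list of (probability, next_state); r[s][a] is the reward."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            # Bellman backup: best action value under the current estimates.
            best = max(
                r[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                for a in P[s]
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:  # stop once a full sweep changes nothing noticeably
            return V


# Hypothetical chain 0 -> 1 -> 2, where state 2 is absorbing.
# Moving "right" out of state 1 pays 1; everything else pays 0.
P = {
    0: {"stay": [(1.0, 0)], "right": [(1.0, 1)]},
    1: {"stay": [(1.0, 1)], "right": [(1.0, 2)]},
    2: {"stay": [(1.0, 2)]},
}
r = {
    0: {"stay": 0.0, "right": 0.0},
    1: {"stay": 0.0, "right": 1.0},
    2: {"stay": 0.0},
}
V = value_iteration(P, r, gamma=0.9)
```

For this model the values settle at V(1) = 1 and V(0) = 0.9, one discount step away from the reward.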

Temporal Difference Learning, TD(λ) [Sutton 88]

A method for discovering the utility of states by interacting with the environment (no model is given)

TD(0) with eligibility traces

$$V(s) \leftarrow (1-\alpha)\,V(s) + \alpha \big( r + \gamma V(s') \big)\, e(s)$$

s': the state following s (observed) (note that this update is performed only when the goal is reached)

e(s): eligibility of state s for updating; it can also be seen as the degree to which s has been visited in the recent past

α: learning rate

http://en.wikipedia.org/wiki/Temporal_difference_learning
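The eligibility-trace idea can be sketched as follows. This sketch uses the error-correction form of TD(λ) commonly found in textbooks, V(u) ← V(u) + α (r + γV(s') − V(s)) e(u) applied to every recently visited state u at every step, with accumulating traces that decay by γλ. That arrangement, the function name `td_lambda_episode` and the episode format are assumptions of the example, not something taken from the slides.

```python
from collections import defaultdict


def td_lambda_episode(V, episode, alpha, gamma, lam):
    """Update V in place from one episode of (state, reward, next_state) steps."""
    e = defaultdict(float)            # eligibility trace per visited state
    for s, r, s_next in episode:
        delta = r + gamma * V[s_next] - V[s]   # temporal-difference error
        e[s] += 1.0                   # s has just been visited
        for u in list(e):
            V[u] += alpha * delta * e[u]       # credit recently visited states
            e[u] *= gamma * lam                # traces fade with time
    return V
```

With λ = 0 only the state just left is updated (plain TD(0)); with λ = 1 a reward at the end of an episode is credited to every state visited on the way.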

Q(s,a): quality of choosing action a in state s

$$Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha \big( r_{s,a} + \gamma \max_{a'} Q(s',a') \big)$$

To guarantee optimal performance, every pair (s,a) must be visited an infinite number of times and α must decrease over the course of training.
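A single tabular update of Q(s,a) can be sketched as below. The function name `q_update`, the dictionary keyed by (state, action) pairs and the sample numbers are assumptions made for illustration.

```python
from collections import defaultdict


def q_update(Q, s, a, r, s_next, actions, alpha, gamma):
    """Blend the old estimate with the reward plus the best next-state value."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * best_next)
    return Q[(s, a)]


Q = defaultdict(float)        # unseen (state, action) pairs start at 0
Q[(1, 0)] = 2.0               # pretend state 1 already has a good action
new_value = q_update(Q, s=0, a=1, r=1.0, s_next=1,
                     actions=[0, 1], alpha=0.5, gamma=0.9)
```

Here the new estimate is 0.5 · 0 + 0.5 · (1 + 0.9 · 2) = 1.4: half of the old value plus half of the observed target.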

Problem: Exploration vs. Exploitation

Choose the best action (exploitation)

Boltzmann exploration (with decreasing temperature T). Same principle as Simulated Annealing.

$$P(a) = \frac{e^{Q(a)/T}}{\sum_{a'} e^{Q(a')/T}}$$

ε-greedy

Choose the best action with probability 1-ε

Choose a random action with probability ε

ε should decrease over the course of training

http://en.wikipedia.org/wiki/Boltzmann_machine
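Both action-selection rules can be sketched in a few lines. The function names and the list-of-Q-values interface are assumptions made for illustration. Subtracting the largest Q value before exponentiating is a standard numerical precaution added here; it leaves the probabilities unchanged but avoids overflow when Q/T is large.

```python
import math
import random


def boltzmann_probs(q_values, T):
    """P(a) proportional to exp(Q(a)/T), computed in a numerically stable way."""
    m = max(q_values)
    weights = [math.exp((q - m) / T) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def boltzmann_choice(q_values, T, rng=random):
    """Sample an action index according to the Boltzmann probabilities."""
    probs = boltzmann_probs(q_values, T)
    return rng.choices(range(len(q_values)), weights=probs)[0]


def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a random action, otherwise the best one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

A high temperature makes the choice close to uniform, while a low temperature makes it close to greedy, which is why T (like ε) is typically reduced as training progresses.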




Problems

Boltzmann exploration: computing exponentials

Systems with high variance (dynamic problems)

Partial observations of the state

Coordination in multi-agent systems (Tragedy of the Commons), Nash equilibrium

Large number of states and actions; possible solutions:

▪ ANN

▪ Hierarchical RL (MSc/PhD)

Replaying experiences in reverse order [Lin 92]

Dyna: learn a model and replay experiences [Sutton 90]
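Reverse-order replay can be sketched as follows: a stored episode is replayed from its last step to its first, so the final reward propagates back along the whole trajectory in a single pass. The function name `replay_backward`, the episode format and the tabular update used inside it are assumptions made for illustration; this is not a reproduction of the cited method.

```python
def replay_backward(Q, episode, actions, alpha, gamma, terminal):
    """Replay stored (s, a, r, s_next) steps from last to first, updating Q."""
    for s, a, r, s_next in reversed(episode):
        if s_next in terminal:
            best_next = 0.0           # nothing more to gain after the end
        else:
            best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
        old = Q.get((s, a), 0.0)
        Q[(s, a)] = (1 - alpha) * old + alpha * (r + gamma * best_next)
    return Q


# Hypothetical two-step episode ending in terminal state 2 with reward 1.
episode = [(0, "R", 0.0, 1), (1, "R", 1.0, 2)]
Q = replay_backward({}, episode, actions=["R"], alpha=1.0, gamma=0.9,
                    terminal={2})
```

After one backward pass the first step already has value 0.9; replaying the same episode in forward order would leave it at 0 until a second pass.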


Given a maze that can be traversed repeatedly, is it possible to write a program that learns the best way out of that maze?
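One possible answer is a tabular learner with ε-greedy exploration, sketched below on a tiny grid. The maze layout, the reward of 1 on reaching the exit, and every parameter value (`alpha`, `gamma`, `epsilon`, number of episodes, step limit) are assumptions chosen for illustration.

```python
import random

# Hypothetical 3x3 maze: S = start, G = exit, # = wall, . = free cell.
MAZE = ["S.#",
        "..#",
        "#.G"]
MOVES = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}


def find(ch):
    for r, row in enumerate(MAZE):
        c = row.find(ch)
        if c >= 0:
            return (r, c)
    raise ValueError(ch)


def step(state, action):
    dr, dc = MOVES[action]
    r, c = state[0] + dr, state[1] + dc
    # Moves into a wall or off the grid leave the agent where it is.
    if not (0 <= r < len(MAZE) and 0 <= c < len(MAZE[0])) or MAZE[r][c] == "#":
        r, c = state
    reward = 1.0 if MAZE[r][c] == "G" else 0.0
    return (r, c), reward


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    Q = {}
    start, goal = find("S"), find("G")
    for _ in range(episodes):
        s = start
        for _ in range(200):                      # step limit per episode
            if rng.random() < epsilon:            # explore
                a = rng.choice(list(MOVES))
            else:                                 # exploit
                a = max(MOVES, key=lambda m: Q.get((s, m), 0.0))
            s2, r = step(s, a)
            best_next = 0.0 if s2 == goal else max(
                Q.get((s2, m), 0.0) for m in MOVES)
            Q[(s, a)] = ((1 - alpha) * Q.get((s, a), 0.0)
                         + alpha * (r + gamma * best_next))
            s = s2
            if s == goal:
                break
    return Q


def greedy_path(Q, max_len=20):
    """Follow the best learned action from the start until the exit."""
    s, goal = find("S"), find("G")
    path = [s]
    while s != goal and len(path) <= max_len:
        a = max(MOVES, key=lambda m: Q.get((s, m), 0.0))
        s, _ = step(s, a)
        path.append(s)
    return path
```

Calling `greedy_path(train())` follows the learned values from the start cell; because the discount makes shorter routes worth more, the learned path should be a shortest one.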


[Bellman 57] R. Bellman, Dynamic Programming, Princeton University Press, 1957

[Sutton 88] R. S. Sutton, Learning to predict by the methods of temporal differences, Machine Learning 3(1), pp. 9-44, 1988

[Watkins 89] C. J. C. H. Watkins, Learning from Delayed Rewards, Ph.D. thesis, King's College, Cambridge, UK, 1989

[Sutton 90] R. S. Sutton, Integrated architectures for learning, planning, and reacting based on approximating dynamic programming, in Proc. of the Seventh International Conference on Machine Learning, Austin TX, Morgan Kaufmann, 1990

[Kaelbling et al. 96] Leslie P. Kaelbling, Michael L. Littman, Andrew W. Moore, Reinforcement Learning: A Survey, in Journal of Artificial Intelligence Research 4: 237–285, 1996

[Sutton, Barto 98] Richard S. Sutton, Andrew G. Barto, Reinforcement Learning: An Introduction. MIT Press. ISBN 0-262-19398-1, 1998

[Rummery, Niranjan 94] G. A. Rummery, M. Niranjan, On-line Q-learning using connectionist systems, Tech. report CUED/F-INFENG/TR166, Cambridge University, 1994

[Whitehead 91] S. D. Whitehead, A complexity analysis of cooperative mechanisms in reinforcement learning, in Proc. of the 9th National Conf. on AI (AAAI-91), pp. 607–613, 1991

[Lin 92] L.-J. Lin, Self-improving reactive agents based on reinforcement learning, planning and teaching, Machine Learning 8, pp. 293–321, 1992

http://www.cs.washington.edu/research/jair/abstracts/kaelbling96a.html

http://www.cs.ualberta.ca/~sutton/book/ebook/the-book.html

[Stone, Veloso 00] Peter Stone, Manuela Veloso, Multiagent systems: A survey from a machine learning perspective, Autonomous Robots 8(3), pp. 345-383, 2000 (original version 1996/97)

[Wiering et al. 99] M. Wiering, B. Krose, F. Groen, Learning in multiagent systems, Technical report, University of Amsterdam, 1999

[Hoen 06] Pieter Jan ’t Hoen, Karl Tuyls, Liviu Panait, Sean Luke, and Johannes La Poutre. An Overview of Cooperative and Competitive Multiagent Learning. In K. Tuyls, P.J. ’t Hoen, K. Verbeeck, and S. Sen, editors, Learning and Adaptation in Multi-Agent Systems, Lecture Notes in Artificial Intelligence, pp. 1–49, Springer Verlag, Berlin, 2006.

[Panait, Luke 05] L. Panait, S. Luke, Cooperative Multi-Agent Learning: The State of the Art, in Autonomous Agents and Multi-Agent Systems, 2005

[Brooks 86] R. A. Brooks, "A Robust Layered Control System For A Mobile Robot", IEEE Journal of Robotics and Automation, RA-2, April, pp. 14-23, 1986.

[Brooks 87] R. A. Brooks, "Planning is just a way of avoiding figuring out what to do next", Technical report, MIT Artificial Intelligence Laboratory, 1987.

[Brooks 90] R. A. Brooks, "Elephants Don't Play Chess", Robotics and Autonomous Systems 6, pp. 3-15, 1990

http://citeseer.ist.psu.edu/stone97multiagent.html

http://citeseer.ist.psu.edu/wiering00learning.html

http://www.cs.unimaas.nl/k.tuyls/publications/papers/Hoenbook06.pdf

http://www.ece.osu.edu/~fasiha/Brooks_Planning.html

[Littman 94] M. L. Littman, Markov games as a framework for multi-agent reinforcement learning, in Proceedings of the Eleventh International Conference on Machine Learning, pp. 157–163, San Francisco: Morgan Kaufmann, 1994.

[Bowling 00] Michael Bowling, Convergence problems of general-sum multiagent reinforcement learning, in Proceedings of the Seventeenth International Conference on Machine Learning (ICML), pp. 89-94, Morgan Kaufmann, June 2000

[Haynes et al. 95] T. Haynes, S. Sen, D. Schoenefeld, and R. Wainwright, Evolving multiagent coordination strategies with genetic programming, Technical Report UTULSA-MCS-95-04, The University of Tulsa, May 31, 1995.

[Potter et al. 95] M. Potter, K. De Jong, and J. J. Grefenstette, A coevolutionary approach to learning sequential decision rules, in Proceedings of the Sixth International Conference on Genetic Algorithms, pp. 366–372, Morgan Kaufmann Publishers, Inc., 1995.

[Bowling, Veloso 00] M. Bowling, M. Veloso, An analysis of stochastic game theory for multiagent reinforcement learning, Technical Report CMU-CS-00-165, Computer Science Department, Carnegie Mellon University, 2000.

[Wolpert et al. 99] D. Wolpert, K. Tumer, and J. Frank, Using collective intelligence to route internet traffic, in Advances in Neural Information Processing Systems 11, MIT Press, 1999.

http://www.cs.ualberta.ca/~bowling/papers/00icml.pdf

http://citeseer.ist.psu.edu/26626.html

http://citeseer.ist.psu.edu/potter95coevolutionary.html

http://citeseer.ist.psu.edu/article/wolpert99using.html


