Transcript

Generalization in Deep Reinforcement Learning

Acknowledgments — Thanks to Kaleigh Clary, Reilly Grant, and the anonymous reviewers for thoughtful comments and contributions. This material is based upon work supported by the United States Air Force under Contract No. FA8750-17-C-0120. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the United States Air Force.

References
[1] Huang, S., Papernot, N., Goodfellow, I., Duan, Y., and Abbeel, P. Adversarial attacks on neural network policies. arXiv preprint arXiv:1702.02284, 2017.
[2] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A., Veness, J., Bellemare, M., Graves, A., Riedmiller, M., Fidjeland, A., and Ostrovski, G. Human-level control through deep reinforcement learning. Nature, 2015.
[3] Schaul, T., Quan, J., Antonoglou, I., and Silver, D. Prioritized experience replay. International Conference on Learning Representations, 2016.
[4] Van Hasselt, H., Guez, A., and Silver, D. Deep reinforcement learning with double Q-learning. Proceedings of the Association for the Advancement of Artificial Intelligence, 2016.
[5] Wang, Z., Schaul, T., Hessel, M., Van Hasselt, H., Lanctot, M., and De Freitas, N. Dueling network architectures for deep reinforcement learning. International Conference on Machine Learning, 2016.

1. Motivation

To what extent do the accomplishments of deep RL agents demonstrate generalization, and how can we recognize such a capability when presented with only a black-box controller?


Sam Witty1, JunKi Lee2, Emma Tosch1, Akanksha Atrey1, Michael Littman2, David Jensen1

1 University of Massachusetts Amherst, 2 Brown University

For example, an agent trained to play the Atari game of Amidar achieves large rewards when evaluated from the default initial state, but small non-adversarial modifications dramatically degrade performance.

2. Recasting Generalization

Naïve evaluation of a policy on held-out training states only measures an agent's ability to use data after it is collected. Using this method, we could incorrectly claim that an agent has generalized, even if it only performs well on a small subset of states.

In this grid-world example, the agent can take actions up-right, right, and down-right.

We partition the universe of possible input states into three sets, according to how the agent can encounter them following its learned policy π from s_0 ∈ S_0, the set of initial states (see the sketch after this list).

• On-policy states, S_on, can be encountered by following π from some s_0.

• Off-policy states, S_off, can be encountered by following any π′ ∈ Π, the set of all policy functions.

• Unreachable states, S_unreachable, cannot be encountered by following any π′ ∈ Π, but are still in the domain of the state transition function T(s, a, s′).
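For a small, deterministic environment like the grid-world above, this partition can be computed by exhaustive search. The sketch below is only an illustration under assumed interfaces (a deterministic transition(s, a) that returns the next state or None, hashable states, and callable policies); it is not part of the poster's method.

```python
def reachable_states(transition, start_states, policies):
    """Enumerate every state reachable by following the given policies from the start states."""
    frontier, seen = list(start_states), set(start_states)
    while frontier:
        s = frontier.pop()
        for pi in policies:
            s_next = transition(s, pi(s))
            if s_next is not None and s_next not in seen:
                seen.add(s_next)
                frontier.append(s_next)
    return seen


def partition_states(all_states, transition, actions, start_states, learned_policy):
    """Split the state space into S_on, S_off, and S_unreachable."""
    # S_on: states the agent can encounter by following its learned policy pi from some s_0.
    s_on = reachable_states(transition, start_states, [learned_policy])
    # States reachable under *some* policy: explore every action from every reachable state.
    all_action_policies = [(lambda s, a=a: a) for a in actions]
    reachable_any = reachable_states(transition, start_states, all_action_policies)
    # S_off: reachable by some pi' in Pi, but never encountered by the learned policy itself.
    s_off = reachable_any - s_on
    # S_unreachable: in the domain of T(s, a, s') but not reachable from S_0 under any policy.
    s_unreachable = set(all_states) - reachable_any
    return s_on, s_off, s_unreachable
```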

We define a q-value-based agent's generalization abilities via the following, where ε and δ are small positive values. v*(s) is the optimal state-value, v^π(s) is the actual state-value obtained by following π, and v̂(s) is the estimated state-value. q*(s, a), q^π(s, a), and q̂(s, a) are the corresponding state-action values.

Definition 1 (Repetition). An RL agent has high repetition performance, G_R, if ε > |v̂(s) − v^π(s)| and δ > v*(s) − v^π(s), ∀s ∈ S_on.

Definition 2 (Interpolation). An RL agent has high interpolation performance, G_I, if ε > |q̂(s, a) − q^π(s, a)| and δ > q*(s, a) − q^π(s, a), ∀s ∈ S_off, a ∈ A.

Definition 3 (Extrapolation). An RL agent has high extrapolation performance, G_E, if ε > |q̂(s, a) − q^π(s, a)| and δ > q*(s, a) − q^π(s, a), ∀s ∈ S_unreachable, a ∈ A.
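Operationally, each definition bounds two quantities over a set of states: the agent's value-estimation error and its suboptimality. Below is a minimal sketch of those checks; the function names are hypothetical, and in practice v̂/q̂ would come from the agent's network, v^π/q^π from empirical rollouts, and v*/q* from an oracle or exhaustive search.

```python
def high_repetition_performance(states_on, v_hat, v_pi, v_star, eps, delta):
    """Definition 1: small estimation error and small suboptimality on every on-policy state."""
    return all(
        abs(v_hat(s) - v_pi(s)) < eps and (v_star(s) - v_pi(s)) < delta
        for s in states_on
    )


def high_interpolation_performance(states_off, actions, q_hat, q_pi, q_star, eps, delta):
    """Definition 2: the same bounds on q-values, over off-policy states and all actions."""
    return all(
        abs(q_hat(s, a) - q_pi(s, a)) < eps and (q_star(s, a) - q_pi(s, a)) < delta
        for s in states_off for a in actions
    )

# Definition 3 (extrapolation) is the same check as Definition 2 with the
# off-policy states replaced by the unreachable states.
```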

3. Empirical Methodology

Given a parameterized simulator, we can intervene on individual components of latent state and forward-simulate an agent's trajectory through the environment.

Interventions on latent state include:

• Player random start (PRS)
• Added line segment (ALS)
• Enemy removal (ER)
• Enemy shifted (ES)

[Figure: default start, default death, modified start, and modified death.]
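As an illustration of this intervene-and-forward-simulate procedure, the sketch below uses a hypothetical parameterized-simulator interface; the reset/observe/step methods and the enemies attribute are assumptions for illustration, not a real API.

```python
def evaluate_under_intervention(make_simulator, policy, intervention, max_steps=10_000):
    """Intervene on latent state at the start of an episode, then forward-simulate the agent."""
    sim = make_simulator()
    state = sim.reset()
    intervention(sim)           # modify latent state directly, e.g. move an enemy
    state = sim.observe()       # re-render the observation after the intervention
    total_reward = 0.0
    for _ in range(max_steps):
        state, reward, done = sim.step(policy(state))
        total_reward += reward
        if done:
            break
    return total_reward


# Example corresponding to the "enemy shifted" (ES) intervention above;
# the attribute names are placeholders.
def shift_enemy(sim, enemy_index=0, dx=5):
    sim.enemies[enemy_index].x += dx
```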

We can generate off-policy states by having the agent take k random actions during its trajectory (k-OPA), or by extracting states from the trajectories of alternative agents (AS) and human players [1] (HS).
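A sketch of the k-OPA procedure under the same hypothetical simulator interface as above: follow the learned policy for some number of steps, inject k random actions, and record the state the agent lands in. The step counts and interface are assumptions.

```python
import random

def k_opa_state(sim, policy, actions, k, switch_step):
    """Generate an off-policy state by taking k random actions mid-trajectory (k-OPA)."""
    state = sim.reset()
    for _ in range(switch_step):
        state, _, done = sim.step(policy(state))
        if done:
            return None   # episode ended before the switch point; retry with another switch_step
    for _ in range(k):
        state, _, done = sim.step(random.choice(actions))
        if done:
            return None
    return state  # resume the learned policy from here to measure off-policy performance
```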

4. Analysis Case-Study

To demonstrate these ideas, we implement Intervenidar, a fully parameterized version of the Atari game of Amidar. Unlike previous work on adversarial attacks [1], interventions in Intervenidar change the latent state itself, not only the agent's perception of state.

We train an agent using the state-of-the-art dueling network architecture, double Q-loss function, and prioritized experience replay [3, 4, 5], with the standard pixel-based Atari MDP specification [2] and the default start position of the original Amidar game. After convergence, we expose the agent to off-policy and unreachable states.
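For context, the dueling architecture [5] splits q-value estimation into a state-value stream and an advantage stream. Below is a minimal PyTorch-style sketch of that head only; the layer sizes are placeholders, and the convolutional trunk, double-Q loss [4], and prioritized replay [3] used by the agent are omitted.

```python
import torch.nn as nn

class DuelingHead(nn.Module):
    """Dueling network head: q(s, a) = v(s) + A(s, a) - mean_a A(s, a)."""
    def __init__(self, feature_dim, n_actions, hidden=512):
        super().__init__()
        self.value = nn.Sequential(
            nn.Linear(feature_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.advantage = nn.Sequential(
            nn.Linear(feature_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_actions))

    def forward(self, features):
        v = self.value(features)                    # shape: (batch, 1)
        a = self.advantage(features)                # shape: (batch, n_actions)
        return v + a - a.mean(dim=1, keepdim=True)  # mean-advantage constraint from [5]
```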

5. Conclusions

• We propose a novel characterization of a black-box RL agent's generalization abilities based on performance from on-policy, off-policy, and unreachable states.

• We provide empirical methods for evaluating an RL agent's generalization abilities using intervenable, parameterized simulators.

• We demonstrate these empirical methods using Intervenidar, a parameterized version of the Atari game of Amidar. We find that the state-of-the-art dueling DQN architecture fails to generalize to small changes in latent state.