
Off-Policy Imitation Learning from Observations


Objective of GAIL

- Requires expert actions => learn from observations only (BCO, GAIfO)

- On-policy training (low sample efficiency) =>
  - off-policy algorithms rather than TRPO (SQIL, DAC)
  - offline reward estimation (PWIL)
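For concreteness, a minimal PyTorch sketch of the GAIL discriminator update (the class and function names here are illustrative assumptions, not code from the talk): the discriminator is trained to separate expert (s, a) pairs from the policy's, and its output then serves as the imitation reward.

import torch
import torch.nn as nn

class Discriminator(nn.Module):
    # Classifies (s, a) pairs: expert (label 1) vs. policy (label 0).
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))  # raw logits

def gail_disc_loss(disc, expert_s, expert_a, policy_s, policy_a):
    bce = nn.BCEWithLogitsLoss()
    e_logits = disc(expert_s, expert_a)
    p_logits = disc(policy_s, policy_a)
    return bce(e_logits, torch.ones_like(e_logits)) + \
           bce(p_logits, torch.zeros_like(p_logits))

The policy is then trained (with TRPO in the original paper) on a reward derived from the discriminator, e.g. -log(1 - sigmoid(logits)) under the labeling above; this is exactly what ties GAIL to expert actions and to on-policy updates.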


Learning from Demonstration (LfD): expert demonstrations contain state-action pairs (s, a).
Learning from Observation (LfO): expert demonstrations contain only states / state transitions (s, s'); actions are unobserved.

Connection between LfD and LfO
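As a hedged reconstruction of this connection (following the paper's lemma; the exact statement on the slide did not survive extraction): by the data-processing inequality, the state-transition divergence minimized in LfO lower-bounds the state-action divergence minimized in LfD,

\[
D_{\mathrm{KL}}\big(\rho_\pi(s, s') \,\|\, \rho_E(s, s')\big)
\;\le\;
D_{\mathrm{KL}}\big(\rho_\pi(s, a) \,\|\, \rho_E(s, a)\big),
\]

because \(\rho(s, s') = \sum_a \rho(s, a)\, T(s' \mid s, a)\) applies the same Markov kernel to both occupancy measures. Equality holds in an injective MDP, where each transition (s, s') is induced by at most one action; in a non-injective MDP a gap remains, so matching transition occupancies does not by itself guarantee matching the expert's state-action occupancy.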



surrogate expert buffer: D_S = {(s, a)_i}

perform AIRL on D_S
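One plausible realization of this step, sketched in PyTorch (an assumption for illustration; the model and function names are mine): train a BCO-style inverse action model on the agent's own transitions, where actions are known, then use it to label the expert's (s, s') pairs, yielding the surrogate buffer D_S on which AIRL can run as if actions had been observed.

import torch
import torch.nn as nn

class InverseActionModel(nn.Module):
    # Predicts the action that produced a transition: a ~ g(s, s').
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, s, s_next):
        return self.net(torch.cat([s, s_next], dim=-1))

def inverse_model_loss(inv, agent_s, agent_a, agent_s_next):
    # Supervised regression on the agent's replay data, where actions are known.
    return ((inv(agent_s, agent_s_next) - agent_a) ** 2).mean()

def build_surrogate_buffer(inv, expert_s, expert_s_next):
    # Label expert state transitions with inferred actions: D_S = {(s, a)_i}.
    with torch.no_grad():
        return expert_s, inv(expert_s, expert_s_next)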

Similar “Surrogate Distribution”


pseudo-reward: the objective is optimized as RL under an estimated reward, since no environment reward is available
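As a sketch of one standard construction (an assumption; the exact form is not shown on the slide), an AIRL-style pseudo-reward can be read off a discriminator trained on state transitions: with expert transitions labeled 1 and a sigmoid read-out, the raw logit estimates the log density ratio.

import torch
import torch.nn as nn

def make_transition_disc(obs_dim, hidden=256):
    # Discriminator over state transitions (s, s') rather than (s, a).
    return nn.Sequential(
        nn.Linear(2 * obs_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, 1),
    )

def pseudo_reward(disc, s, s_next):
    # The logit equals log d - log(1 - d), i.e. an estimate of
    # log(rho_E(s, s') / rho_pi(s, s')) under the labeling above.
    with torch.no_grad():
        return disc(torch.cat([s, s_next], dim=-1))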


no on-policy expectation is left in the objective, so it can be estimated entirely from off-policy data
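One standard route to this property (a sketch under the assumption that a DICE-style dual is used, as in ValueDICE by Kostrikov et al., 2020; the talk's exact derivation may differ): write the KL in its Donsker-Varadhan form and substitute \(x = \nu - \mathcal{B}^\pi \nu\), so the on-policy term telescopes into an expectation over initial states:

\[
D_{\mathrm{KL}}\big(\rho_\pi \,\|\, \rho_E\big)
= \max_{\nu}\;
(1-\gamma)\, \mathbb{E}_{s_0 \sim p_0,\; a_0 \sim \pi(\cdot \mid s_0)}\big[\nu(s_0, a_0)\big]
\;-\;
\log \mathbb{E}_{(s,a) \sim \rho_E}\big[e^{\nu(s,a) - \mathcal{B}^\pi \nu(s,a)}\big],
\quad
\mathcal{B}^\pi \nu(s,a) = \gamma\, \mathbb{E}_{s' \sim T,\, a' \sim \pi}\big[\nu(s', a')\big].
\]

Both remaining expectations are over the initial-state distribution and the expert (or replay) data, so no rollout of the current policy is needed to estimate the objective.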


Implementation Details

- optimize the density ratio with a GAN-style discriminator
- additional regularization (see the sketch below)
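A minimal sketch of such a discriminator update; the gradient penalty stands in for the unspecified "additional regularization" (an assumption here; a gradient penalty is the choice popularized by DAC, mentioned above).

import torch
import torch.nn as nn

def disc_update(disc, opt, expert_x, policy_x, gp_coef=10.0):
    # One GAN-style step of the density-ratio estimator. Inputs x could be
    # concatenated (s, s') pairs, matching the transition discriminator above.
    bce = nn.BCEWithLogitsLoss()
    e_logits, p_logits = disc(expert_x), disc(policy_x)
    loss = bce(e_logits, torch.ones_like(e_logits)) + \
           bce(p_logits, torch.zeros_like(p_logits))

    # Gradient penalty on interpolates between expert and policy batches.
    eps = torch.rand(expert_x.size(0), 1, device=expert_x.device)
    mid = (eps * expert_x + (1 - eps) * policy_x).requires_grad_(True)
    grad = torch.autograd.grad(disc(mid).sum(), mid, create_graph=True)[0]
    loss = loss + gp_coef * ((grad.norm(2, dim=1) - 1.0) ** 2).mean()

    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()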


End