
Copyright

by

Qiping Zhang

2021


The Thesis Committee for Qiping Zhang certifies that this is the approved version of the following thesis:

Interactive Learning from Implicit Human Feedback:

The EMPATHIC Framework

SUPERVISING COMMITTEE:

Peter Stone, Supervisor

Scott Niekum


Interactive Learning from Implicit Human Feedback:

The EMPATHIC Framework

by

Qiping Zhang

Thesis

Presented to the Faculty of the Graduate School

of The University of Texas at Austin

in Partial Fulfillment

of the Requirements

for the Degree of

Master of Science in Computer Science

The University of Texas at Austin

August 2021


To my parents, Mingjia Zhang and Xiaozhou Li.


Acknowledgments

I would first like to express my deepest appreciation to my advisor,

Dr. Peter Stone. Since I started to work with him at the beginning of my

master’s study, Peter has been patiently instructing me and guiding me into

the research world of reinforcement learning and robotics. No matter how busy

he was, Peter always managed to be responsive and available for any problems I

had with my research, participating in all meetings of our project with detailed

feedback every time. Without him, I could not have found my field of interest

or become determined to pursue my path of academic research. Meanwhile, I am

extremely grateful for the precious lessons learned from him on how to be a

successful researcher and supervisor, which will be my lifelong treasure in my

future career. Peter has shown me what an ideal advisor should be like, and

I hope someday I can become as good a mentor as he is.

I would also like to give my sincere thanks to my second reader, Dr.

Scott Niekum. As my co-advisor, Scott has been continuously providing in-

sightful ideas and feedback on our work, which would not have been possible

without his guidance and persistent help. Furthermore, I am extremely grate-

ful for his strong support and valuable suggestions when I was working on my

graduate school application.

I owe a deep sense of gratitude to the principal investigator of our lihf


project, Dr. Brad Knox. Under his supervision, we developed the empathic

framework from the very beginning of his conception. This work would not

have been well completed and successfully published without Brad’s instruc-

tions on every detail, including the experiment design and writings. I am also

thankful for his valuable advice on how to do better research in many previ-

ous chats. I wish him the best of luck on his new journey at Google after

he finishes his job at Bosch.

I must thank my collaborator and great friend, Yuchen Cui, who has

been working together with me on this project over the past two years. I

sincerely appreciate all the time she spent discussing the research challenges

we were confronted with, shedding light on the concepts I was not familiar

with, and clearing up all my confusion while composing this work. Thank

you, Yuchen, for being a patient mentor of mine as a senior Ph.D. student,

and helping me improve multiple skills for academic research such as writing

and formulating ideas. I hope our work together has also been a fun experience for

you, and makes a nice contribution to your dissertation.

I am also very thankful to Dr. Alessandro Allievi for his continuous help

with our research, for both his advice on the project and his timely support

whenever any resource was needed for the experiments (I still remember his

phone call on Christmas day when we urgently needed access to an account).

In addition, I would like to express my gratitude to my dear friends.

Bo Liu and Yifeng Zhu have always been my closest labmates to talk with,

especially when I got a bit lost or felt too stressed about my research. Also,


special thanks to Zhihong Luo for his guidance on conducting research, apply-

ing for graduate schools, and finding suitable advisors, together with the joy

and encouragement he brought.

I have hugely enjoyed the two years I spent in the Learning Agents Re-

search Group (LARG) during my master’s study. It is my genuine pleasure to

know all these great friends, who are all nice and smart people: Yunshu Du, Is-

han Durugkar, Justin Hart, Yuqian Jiang, Haresh Karnan, Sai Kiran, William

Macke, Reuth Mirsky, Brahma Pavse, Faraz Torabi, Xuesu Xiao, Harel Yedid-

sion, and Ruohan Zhang (in alphabetical order). Also, many thanks to the

other friends I met at UT: Suna Guo, Zhaoyuan He, Sahil Jain, Jordan Schnei-

der, Siming Yan, Chenxi Yang, and Zaiwei Zhang.

Finally, deepest thanks to my family, especially my parents Mingjia

Zhang and Xiaozhou Li, and my grandma Yuxin Miao, for their love and

support over these years that kept me going.


Interactive Learning from Implicit Human Feedback:

The EMPATHIC Framework

Qiping Zhang, M.S.Comp.Sci.

The University of Texas at Austin, 2021

SUPERVISOR: Peter Stone

Reactions such as gestures and facial expressions are an abundant, nat-

ural source of signal emitted by humans during interactions. An autonomous

agent could leverage an interpretation of such implicit human feedback to

improve its task performance at no cost to the human, contrasting with tradi-

tional agent teaching methods based on demonstrations or other intentionally

provided guidance. In this thesis, we first define the general problem of learning

from implicit human feedback, and propose a data-driven framework named

empathic as a solution, which includes two stages: (1) mapping implicit hu-

man feedback to corresponding task statistics; and (2) learning a task with the

constructed mapping. We first collect a human facial reaction dataset while

participants observe an agent execute a sub-optimal policy for a prescribed

training task. Then, we train a neural network to instantiate and demonstrate

the ability of the empathic framework to (1) infer reward ranking of events

from offline human reaction data in the training task; (2) improve the online

agent policy with live human reactions as they observe the training task; and

(3) generalize to a novel domain in which robot manipulation trajectories are

evaluated by the learned reaction mappings.


Table of Contents

Acknowledgments v

Abstract viii

List of Tables xi

List of Figures xii

Chapter 1. Introduction 1

1.1 Research Questions and Contributions . . . . . . . . . . . . . . 2

1.2 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

Chapter 2. Background 5

2.1 Markov Decision Processes . . . . . . . . . . . . . . . . . . . . 5

2.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

2.2.1 Interactive Reinforcement Learning . . . . . . . . . . . . 6

2.2.2 Facial Expression Recognition (FER) . . . . . . . . . . . 7

2.2.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . 8

Chapter 3. The LIHF Problem and The EMPATHIC Framework 9

3.1 The LIHF Problem . . . . . . . . . . . . . . . . . . . . . . . . 9

3.2 The EMPATHIC Framework . . . . . . . . . . . . . . . . . . . 11

3.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

Chapter 4. Instantiation of the EMPATHIC Framework 15

4.1 Data Collection Domains and Protocol . . . . . . . . . . . . . 15

4.1.1 Robotaxi . . . . . . . . . . . . . . . . . . . . . . . . . . 15

4.1.2 Robotic Sorting Task . . . . . . . . . . . . . . . . . . . 16


4.1.3 Data Collection . . . . . . . . . . . . . . . . . . . . . . 17

4.2 Reward Mapping Design . . . . . . . . . . . . . . . . . . . . . 18

4.2.1 Human Exploration of the Data . . . . . . . . . . . . . 18

4.2.2 Reaction Mapping Architecture . . . . . . . . . . . . . . 20

4.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

Chapter 5. Reaction Mapping Results and Evaluation 25

5.1 Reward-ranking Performance in the Robotaxi Domain . . . . . 26

5.1.1 Learning Outcome of the Reaction Mapping . . . . . . . 27

5.1.2 Ablation Study for Predictive Model Design . . . . . . . 28

5.1.3 Preliminary Modeling of Other Task Statistics . . . . . 32

5.1.4 Effects of Different Belief Priors . . . . . . . . . . . . . 33

5.2 Online Learning in the Robotaxi Domain . . . . . . . . . . . . 37

5.3 Trajectory Ranking in Robotic Sorting Domain . . . . . . . . . 39

5.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

Chapter 6. Conclusion and Future Work 43

Appendices 45

Appendix A. Supplemental Material 46

A.1 Experimental Domains and Data Collection Details . . . . . . 46

A.1.1 Robotaxi . . . . . . . . . . . . . . . . . . . . . . . . . . 46

A.1.2 Robotic Sorting Task . . . . . . . . . . . . . . . . . . . 48

A.1.3 Experimental Design . . . . . . . . . . . . . . . . . . . . 50

A.1.4 Participant Recruitment . . . . . . . . . . . . . . . . . . 52

A.2 Annotations of Human Reactions . . . . . . . . . . . . . . . . 53

A.2.1 The Annotation Interface . . . . . . . . . . . . . . . . . 53

A.2.2 Visualizations of Annotated Data . . . . . . . . . . . . . 56

A.3 Model Design . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

A.3.1 Feature Extraction . . . . . . . . . . . . . . . . . . . . . 57

A.3.2 Data Split of k-fold Cross Validation for Random Search 57

A.3.3 Hyperparameters . . . . . . . . . . . . . . . . . . . . . . 59


List of Tables

4.1 Human proxy test result: average τ values across participants are shown; a baseline that randomly picks rankings has an expected τ value of 0. . . . . . . . . . . . . . . . . . . . . . . . . . 20


List of Figures

3.1 Graphical model for lihf (colored boxes and their identically colored labels represent conditional probability tables) . . . . 10

3.2 Overview of empathic . . . . . . . . . . . . . . . . . . . . . . 12

4.1 Robotaxi environment . . . . . . . . . . . . . . . . . . . . . . . 16

4.2 Robotic sorting task . . . . . . . . . . . . . . . . . . . . . . . 17

4.3 Human proxy's view: semantics are hidden with color masks; the dark green circle is the agent; observer's reaction is displayed; detected face is enlarged; background is colored by last pickup. The left frame is the same game state shown in Fig 4.1. 20

4.4 The feature extraction pipeline and architecture of the reaction mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

5.1 Sorted per-subject Kendall's τ for Robotaxi reward-ranking task 26

5.2 Loss profiles for training final models for reward ranking evaluation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

5.3 Sorted per-subject Kendall's τ for Robotaxi reward-ranking task on the human proxy test episodes . . . . . . . . . . . . . . . . 30

5.4 Loss profiles for training different models (each model has its own set of best hyperparameters found through random search except the end-to-end model) . . . . . . . . . . . . . . . . . . 31

5.5 Loss profiles for training with other task statistics . . . . . . . 34

5.6 Performance of reward inference starting from different priors over reward rankings. . . . . . . . . . . . . . . . . . . . . . . . 35

5.7 Trials of informal online learning sessions in Robotaxi . . . . . 38

5.8 Sample plot of trajectory positivity score over aggregated frames. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

5.9 Sorted per-subject Kendall's τ for evaluating robotic sorting trajectories . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

5.10 Overall trajectory ranking by mean positivity score across subjects (each entry is colored by the trajectory's final return: green for +2, yellow for 0, and red for −1) . . . . . . . . . . . . . . 41


A.1 Robotaxi environment . . . . . . . . . . . . . . . . . . . . . . . 48

A.2 Robotic task table layout with object labels, from the perspective of the human observer . . . . . . . . . . . . . . . . . . . . 48

A.3 Robotic sorting task trajectories with optimality segmentation 49

A.4 View of the annotation interface. The corresponding trajectory of Robotaxi is not displayed. . . . . . . . . . . . . . . . . . . . 54

A.5 Proportion of annotated gestures . . . . . . . . . . . . . . . . 54

A.6 Histograms of non-zero reward events around feature onset . . 55

A.7 Diagram of data split for subject k . . . . . . . . . . . . . . . 58


Chapter 1

Introduction

When human observers are interested in the outcome of an agent’s

behavior, they often provide reactions that are not necessarily intended to

communicate to the agent. However, information carried by such reactions

about the perceived quality of the agent's performance could nonetheless be used

to improve task learning if an agent can interpret them effectively.

Furthermore, since humans instinctively generate such reactions,

learning from implicit human feedback places no extra burden on the

users.

In this work, learning from implicit human feedback (lihf) is con-

sidered complementary to learning from explicit human teaching, which in-

cludes demonstrations [4], evaluative feedback [22, 23], or other communica-

tive modalities [2, 8, 11, 24, 36]. Though in most scenarios implicit feedback is

less informative and leads to more interpretation challenges than explicit sig-

nals, lihf induces no additional cost to human users by making use of existing

reactions.

In this thesis, we define the lihf problem, propose a general data-

driven framework to solve it, and implement and validate an instantiation of


this framework using facial expressions and head poses (referred to jointly as

facial reactions) as the modalities of human reactions.

1.1 Research Questions and Contributions

While great success has been witnessed in previous computer vision

research on basic facial expression recognition tasks [14, 15, 27], it is not trivial

for a learning agent to interpret human expressions. Major challenges exist in

the following aspects:

1. The same facial expression can be interpreted in various ways, and each

interpretation could lead the agent to distinct learning behaviors. For instance, a smile could mean

satisfaction, encouragement, amusement, or frustration [16]. As is sug-

gested by recent cognitive science research [9, 12, 18, 33], facial expres-

sions also play an important role in regulating social interactions and

signaling contingent social action. Hence, different contexts and individ-

uals will also lead to various interpretations of facial expressions.

2. There is generally a variable delay in the occurrence of human reac-

tions after an event takes place. Also, reactions could occur in advance,

when a human is anticipating an event. These cases result in additional

challenges of interpreting which event has induced the person’s reactions.

3. Natural human reactions can contain spontaneous micro-expressions con-

sisting of minor facial muscle movements lasting for less than 500 mil-

liseconds [34, 44], which can be hard to detect by computer vision systems

trained with common datasets containing only exaggerated facial expres-

sions [13, 28].

4. In many use cases, the human's surroundings contain components other

than the agent and its task environment. Thus, it becomes

more challenging to infer what a person's reactions are directed at.

Therefore, this thesis aims to answer the following question:

Can an agent learn a task directly from implicit human feed-

back, without requiring explicit signals from humans?

In this thesis, we approach lihf with data-driven modeling that creates

a general reaction mapping from implicit human feedback to task statistics.

In doing so, we make the following contributions:

1. We motivate and frame the general problem of Learning from Implicit

Human Feedback (lihf), with the aim of better exploiting an under-utilized

data modality already existing in natural human reactions.

2. We propose a broad data-driven framework to solve this problem, called

Evaluative MaPping for Affective Task-learning via Human Implicit Cues

(empathic).

3. We experimentally validate an instantiation of the empathic framework,

using human facial reactions as the data modality, and rewards as the

target task statistic.


1.2 Thesis Outline

The thesis is organized as follows: Chapter 2 reviews related

work and the necessary background on Markov Decision Processes. Chapter 3

defines the problem of Learning from Implicit Human Feedback and presents

the formulation of the two-stage empathic framework. Chapter 4 provides the

details of the task domains and the data collection process for instantiating the

framework, and then demonstrates the reaction mapping design, including the

human exploration of the data and the architecture of the reaction mapping

pipeline. Chapter 5 shows the empirical results to validate that the learned

mappings from our instantiation of stage 1 effectively enable task learning in

stage 2, by testing three hypotheses in different settings. Finally, Chapter 6

discusses future work and concludes.

The content of this thesis mainly follows our paper published at Con-

ference on Robot Learning [10] on the empathic framework, which is the col-

laborative work with Yuchen Cui, under the supervision of Alessandro Allievi,

Peter Stone, Scott Niekum, and W. Bradley Knox. In this thesis, both of us

have jointly contributed to the conceptual formulation of lihf and empathic,

the human exploration of data, the instantiation pipeline of empathic, and

the online learning in Robotaxi. Yuchen’s contributions cover more of the con-

struction and investigation of trajectory ranking in the robotic sorting domain,

preliminary modeling of other task statistics, and effects of different belief pri-

ors. My work includes more of the design and analysis of the Robotaxi domain,

together with the reward-ranking outcome of the learned reaction mapping.


Chapter 2

Background

This chapter introduces the basic concept of Markov Decision Pro-

cesses, which is used for formally specifying the Learning from Implicit

Human Feedback (lihf) problem, and discusses related work.

2.1 Markov Decision Processes

Markov Decision Processes (MDPs) are often used to model se-

quential decision making of autonomous agents. An MDP is given by the

tuple 〈S,A, T,R, d0, γ〉, where:

• S is a set of states; A is a set of actions an agent can take;

• T : S×A×S → [0, 1] is a probability function describing state transition

based on actions;

• R : S × A× S → R is a real-valued reward function;

• d0 is a starting state distribution and γ ∈ [0, 1) is the discount factor.

• π : S × A → [0, 1] is a policy mapping from any state and action to a

probability of choosing that action.


An agent aims to find a policy that maximizes the expected return

E[∑_{t=0}^{∞} γ^t r_t], where r_t is the reward at time step t.
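For concreteness, the following sketch estimates this expected return for a toy MDP by Monte Carlo rollouts; the two-state MDP, its tables, and all names here are illustrative placeholders rather than anything used in the thesis.

    import random

    # Hypothetical two-state, two-action MDP (illustrative only).
    GAMMA = 0.95
    T = {  # T[state][action] -> list of (next_state, probability)
        0: {0: [(0, 0.9), (1, 0.1)], 1: [(1, 1.0)]},
        1: {0: [(0, 1.0)], 1: [(0, 0.5), (1, 0.5)]},
    }
    R = {  # R[state][action] -> immediate reward
        0: {0: 0.0, 1: 1.0},
        1: {0: 0.0, 1: 2.0},
    }

    def sample_next_state(s, a):
        r, acc = random.random(), 0.0
        for s_next, p in T[s][a]:
            acc += p
            if r <= acc:
                return s_next
        return T[s][a][-1][0]

    def discounted_return(policy, s0=0, horizon=200):
        """Accumulate sum_t gamma^t * r_t along a single rollout."""
        s, ret, disc = s0, 0.0, 1.0
        for _ in range(horizon):
            a = policy(s)
            ret += disc * R[s][a]
            disc *= GAMMA
            s = sample_next_state(s, a)
        return ret

    # Monte Carlo estimate of the expected return of a fixed (here random) policy.
    random_policy = lambda s: random.choice([0, 1])
    estimate = sum(discounted_return(random_policy) for _ in range(1000)) / 1000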

2.2 Related Work

Our work relates closely to the growing literature of interactive rein-

forcement learning (RL), or human-centered RL [17, 22, 25, 26, 29, 31, 35, 37,

40, 47] where agents learn from interactions with humans, with or without pre-

defined environmental rewards. In the empathic framework, implicit human

feedback refers to any multi-modal evaluative signals humans naturally emit

during human-robot interactions, such as facial expressions and head gestures,

as well as other vocalization modalities not intended for explicit communica-

tion. Others’ usage of “implicit feedback” has referred to the implied feedback

when a human refrains from giving explicit feedback [20, 30], to human bio-

magnetic signals [43], or to facial expressions [3, 19, 38]. This thesis focuses

on predicting task statistics from human facial features. Hence, this work also

intersects with the research field of facial expression recognition.

2.2.1 Interactive Reinforcement Learning

The tamer framework proposed by Knox et al. [22, 23] is the first to

enable RL agents to learn from explicit human feedback without environmental

rewards, by explicitly modeling human feedback with button clicks, drawing

inspiration from animal clicker-training. Veeriah et al. [39] propose learning

a value function only based on human facial expressions and agent actions,


using manual negative feedback as supervision. However, the policy of the

RL agent depends on the trainer’s facial expression only, without actually

reasoning about the task state. Arakawa et al. [3] detect human emotions

by adopting an existing facial expression recognition system, and then apply

a predefined mapping from emotions to tamer feedback. Nonetheless, they

do not optimize the mapping to be effective for the downstream task.

Likewise, Zadok et al. [46] improve exploration of an RL agent by biasing

its behavior to increase the probability of smiling for human demonstrators,

which is modeled within the task. Meanwhile, Li et al. [26] show the possibility

of learning only from human facial expressions by generalizing tamer, using

a deep neural network to map the facial expressions of the trainer to positive

or negative feedback.

The approach proposed in this work differs from prior work by learning

a direct mapping from facial reactions to task statistics, and is not dependent

on any specific state-actions. Without requiring explicit signals throughout the

RL tasks, our work is the first to learn from subjects who are not told to

react or serve as trainers.

2.2.2 Facial Expression Recognition (FER)

Facial expression recognition is a broad research field that spans multi-

ple areas, including psychology, neuroscience, cognitive science and computer

vision. Fasel and Luettin [15] provide an overview of traditional fer systems

and Li and Deng [27] introduce the latest fer systems based on deep learning.


Instead of conducting fer explicitly, our proposed method maps extracted

facial features to reward values.

Our work is closely related to the problem of dynamic fer, in which

time-series data are taken as inputs for temporal predictions. As is discussed

by Li and Deng [27], modern fer systems often consist of two stages: data

pre-processing and predictive modeling with deep neural networks. Given our

small dataset, we directly extract certain facial features sufficiently informa-

tive for modeling human reactions, by employing an existing toolkit named

OpenFace 2.0 [5, 6, 45]. Furthermore, we also model the temporal aspect of

our task by extracting facial features in the frequency domain.

2.2.3 Summary

In this chapter, we established the basic notations and definitions in

Markov Decision Processes, which will be applied and further extended to

specify the lihf problem, described the two major fields relevant to our work,

and investigated the existing literature in these research fields.


Chapter 3

The LIHF Problem and The EMPATHIC

Framework

In this chapter, we formally define the problem of

Learning from Implicit Human Feedback (lihf), building on the defi-

nitions and notation for Markov Decision Processes (MDPs) introduced in Chapter 2.

We then describe the detailed structure of the empathic framework, consisting

of two stages and several necessary elements for instantiating it in specific task

domains. This formulation, which was initially presented in our paper published

at Conference on Robot Learning [10], leads to our actual instantiation of the

framework and our proposed reaction mapping algorithm, to be discussed in

Chapter 4.

3.1 The LIHF Problem

The problem of Learning from Implicit Human Feedback (lihf)

asks how an agent can learn a task with information derived from human

reactions to its behavior.

We define lihf with the tuple 〈S,A, T,RH, XH,Ξ, d0, γ〉, in which S,

A, T , d0, and γ are defined identically as in MDPs. Observations x ∈ XH


containing implicit feedback from some human H are received by the agent.

An observation function Ξ denotes the conditional probability over XH of

observing x, given a trajectory of state-actions and the human’s hidden reward

functionRH. Furthermore, lihf states are generally defined more broadly than

task states, including all environmental and human factors that influence the

conditional probability of observing x.

The goal of an agent is to maximize the return under RH. How to

ground observations x ∈ XH containing implicit human feedback to evaluative

task statistics is at the core of solving lihf.

Figure 3.1: Graphical model for lihf (colored boxes and their identically colored labels represent conditional probability tables)

Figure 3.1 shows a graphical model for lihf to demonstrate how the


data generation process is modeled, whose formulation resembles the definition

of Partially Observable MDPs, whereas the partially observable variable is the

human’s reward function instead of state. We assume the human H’s reward

function RH is a temporally invariant component of the human’s internal state

SH. However, the observation XH containing implicit human feedback to agent

behavior can change over time since it is influenced by the human’s internal

model of the task domain and the agent’s behavior policy at a certain time

step.

Given an observation x ∈ XH, the current state s ∈ S, and the previous

action a ∈ A, the agent constructs its belief b ∈ B as a probabilistic memory

of arbitrary form and scope over the task domain. A belief could include, for

example, the probability distribution over possible reward functions, which the

agent could use to generate a policy (and therefore an action given the current

state) that maximizes expected return (aggregated single-step rewards r ∈ R)

under the unobserved human reward function RH. Note that reward is not

directly dependent on state and action – it is determined by the human entirely

(who can, for generality, internally maintain a history of states and actions,

and therefore can give non-Markovian rewards).
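As a concrete illustration of such a belief, the sketch below maintains a discrete distribution over a finite set of candidate reward functions, updates it from a likelihood term, and returns the maximum a posteriori candidate. The class and its names are illustrative, not the thesis implementation.

    import numpy as np

    class RewardBelief:
        """Discrete belief b over a finite set of candidate reward functions."""

        def __init__(self, candidates):
            self.candidates = list(candidates)          # e.g., all reward rankings
            self.probs = np.full(len(self.candidates), 1.0 / len(self.candidates))

        def update(self, likelihoods):
            """Bayes rule: p(m) <- p(observation | m) * p(m), then renormalize."""
            self.probs *= np.asarray(likelihoods, dtype=float)
            self.probs /= self.probs.sum()

        def map_candidate(self):
            """Most probable reward function, e.g., to plan against until the next update."""
            return self.candidates[int(np.argmax(self.probs))]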

3.2 The EMPATHIC Framework

With the mathematical formulation of lihf defined, a data-driven so-

lution to the problem is then proposed to infer relevant task statistics from

human reactions.

Figure 3.2: Overview of empathic

Fig. 3.2 presents the formulation of the empathic framework, consisting of two stages: (1) learning a mapping from implicit human

feedback to corresponding task statistics; and (2) using the trained reaction

mappings for task learning.

In the first stage, human observers are incentivized to want an agent to

succeed – to align the person’s RH with a known task reward function R – and

they are then recorded while watching the agent’s behaviors. Task statistics

are computed from R for every timestep to serve as supervisory labels, which

are used for training a mapping from synchronized human reaction recordings.

Since the state-actions of a task are not inputs, the learned reaction mappings

could be conveniently deployed on other tasks.

In the second stage, a human observes an agent attempt a task with

sparse or no environmental reward, and the human observer’s reactions to its

behavior are input to the mapping trained in the previous stage, to predict


unknown task statistics for improving the agent’s policy, either directly or with

other approaches, such as guiding exploration or inferring the reward function

RH that describes the human’s utility.

In general, any instantiation of empathic can be achieved through spec-

ifying these elements:

• the reaction modality and the target task statistic(s);

• the end-user population(s);

• training task(s) for stage 1 and deployment task(s) for stage 2;

• an incentive structure for stage 1 to align human interests with task

performance; and

• policies or RL algorithms to control the observed agent in both stages.

Any specific task or person can be part of either of these two stages.

Note that empathic is defined broadly enough to include instantiations with

varying degrees of personalization – from learning a single reaction mapping

applicable to all humans to training a person-specific model – and of across-

task generalization. Though a single reaction mapping will be generally useful,

we hypothesize that personalized training on certain users or tasks will achieve

even more effective mappings, which may also alleviate negative effects of

potential dataset bias from the first stage if it is used amongst underserved

populations.


3.3 Summary

In this chapter, we provided a formal definition of the lihf problem by

establishing its mathematical formulation, together with a graphical model,

which resembles the concept of Partially Observable MDPs. We further pre-

sented an overview of empathic, specifying the two stages and the necessary

elements for achieving any instantiation of the framework, which are used to

motivate the focus of Chapter 4 that details our instantiation, using facial

reactions as the modality for implicit human feedback.


Chapter 4

Instantiation of the EMPATHIC Framework

In this chapter, we describe the details of our instantiation of the em-

pathic framework. We first describe the task domains in our experiments, to-

gether with the data collection protocols. Then, we illustrate the design of our

reaction mapping algorithm, by discussing the outcome of data exploration

using a human proxy test, and providing the mathematical formulas of the

reaction mapping architecture.

4.1 Data Collection Domains and Protocol

In this section we describe the experimental domains and data collection

process of our instantiation of empathic.

4.1.1 Robotaxi

We create Robotaxi as a simulated domain to collect implicit human

feedback data with known task statistics, instantiating the training task for

stage 1 of empathic presented in Chapter 3. Fig. 4.1 shows the visualization

viewed by the human observer. An agent (depicted as a yellow bus) acts in

a grid-based map. Rewards are connected to objects: +6 for picking up a


passenger; −1 for crashing into a roadblock; and −5 for crashing into a parked

car. Reward is 0 otherwise. An object disappears after the agent moves onto

it, and another object of the same type is spawned with a short delay at a

random unoccupied location. An episode starts with two objects of each type.
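The per-object rewards above can be written down directly; the short sketch below (illustrative names only) is the kind of lookup used to label events with reward categories later in the pipeline.

    # Robotaxi per-event rewards as described above; all other steps yield 0.
    ROBOTAXI_REWARDS = {
        "passenger": +6,    # picking up a passenger
        "roadblock": -1,    # crashing into a roadblock
        "parked_car": -5,   # crashing into a parked car
    }

    def step_reward(event=None):
        """Reward for the object (if any) the agent moved onto at this step."""
        return ROBOTAXI_REWARDS.get(event, 0)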

Figure 4.1: Robotaxi environment

4.1.2 Robotic Sorting Task

To test whether or not the learned reaction mapping is able to transfer

across task domains, a robotic manipulation task is further developed in our

paper published at Conference on Robot Learning by Cui et al. [10], which

instantiates the deployment task for stage 2 of empathic discussed in Chap-

ter 3. As is shown in Fig. 4.2, the robot arm aims at recycling the aluminum

cans into the bin, without sorting the other items. Recycling a can yields a

reward of +2, while recycling any other object yields −1; the reward is 0

in all other cases. The short episodes consist of fixed pick-up trajecto-

ries. In each trajectory, there is at most one non-zero reward event. Appx.A.1


provides further details of the task domains, especially our instantiation of the

policies and RL algorithms to control the observed agent in both stages, which

is another necessary element for empathic discussed in Chapter 3.

Figure 4.2: Robotic sorting task

4.1.3 Data Collection

Human subjects were recruited to participate in the interactions with

both agents. We informed them that their financial payout would be propor-

tional to the agent’s total reward in the end, after which they started to observe

the agents performing the tasks. This payout rule aligns human

interests with the task performance and links human reactions to task statis-

tics, by creating a mapping from the in-game rewards to the real-world value

to humans, which instantiates the incentive structure for stage 1 of empathic

in Chapter 3. Furthermore, we told the participants that their “reactions are

being recorded for research purposes”, without letting them know how we plan

to use their facial features, in order to make sure the explicit teaching from


participants is minimized. By using this experimental design, our work differs

from previous works that involve explicit human training [3, 26, 39], aligning

with the research direction of the lihf problem by leveraging existing data

modality in human-robot interactions.

Seventeen human participants observed 3 episodes of Robotaxi, and 14 of the

participants observed 7 episodes of the robotic trash sorting task. After obtain-

ing the human participant’s consent, we conducted user studies in an isolated

room, where participants were asked to watch the agents perform predetermined tra-

jectories generated by suboptimal policies, with videos of their facial reactions

recorded. At the end of their sessions, the subjects were asked to complete

an exit survey. Further details are given in Appx.A.1, including the end-user

populations for instantiating empathic discussed in Chapter 3.

4.2 Reward Mapping Design

In this section, we describe the result of a human proxy test for data

exploration, and introduce the architecture of our reaction mapping pipeline.

4.2.1 Human Exploration of the Data

As an initial step after data collection, we perform human exploration

of the data, in which the authors of our paper published at Conference on

Robot Learning [10] serve as proxies for a mapping. As is shown in Fig. 4.3, a

semantically anonymized version of each agent trajectory is created and viewed

by the authors, alongside a synchronized recording of the human participant’s


reactions. Afterwards, we rank the reward values of the 3 object types, with

one truncated episode from each of the 17 participants watched by all human

proxies. Kendall’s rank correlation coefficient τ ∈ [−1, 1] [1] is then applied

to evaluate the quality of our inference in comparison with the ground truth,

in which a higher τ value corresponds to a higher correlation between two

rankings.

Table 4.1 shows the mean τ scores for the human proxy test across

17 human participants, with a mean for each author. We also compare each

human proxy’s 17 τ scores with the baseline value τ = 0 for uniformly random

reward ranking, and show the p-values computed with Wilcoxon signed-rank

tests [42]. The results demonstrate that 5 out of 6 human proxies surpassed

uniformly random reward ranking, with 1 author performing significantly bet-

ter, even after adjusting a p < 0.05 threshold for multiple testing to p < 0.0083

using a Bonferroni correction [41]. This person’s success indicates that al-

though humans vary in their ability to perceive implicit feedback signals,

sufficient information is contained in human facial reactions for reward rank-

ing.
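Both statistics used here are available in SciPy. The sketch below, with made-up placeholder rankings and τ values, shows how a per-participant Kendall's τ and the Wilcoxon signed-rank comparison against the random-ranking baseline (τ = 0) can be computed, including the Bonferroni-adjusted threshold.

    from scipy.stats import kendalltau, wilcoxon

    # Hypothetical reward rankings of the 3 object types (higher = better).
    true_ranking = [3, 2, 1]        # passenger > roadblock > parked car
    inferred_ranking = [3, 1, 2]    # one proxy's guess for one participant
    tau, _ = kendalltau(true_ranking, inferred_ranking)

    # One tau per participant for a single proxy, compared against the
    # tau = 0 baseline of uniformly random ranking with a signed-rank test.
    per_participant_taus = [0.33, 1.0, -0.33, 1.0, 0.33, 1.0, 0.33]  # placeholders
    _, p_value = wilcoxon(per_participant_taus)

    # Bonferroni correction for 6 proxies: require p < 0.05 / 6 (about 0.0083).
    significant = p_value < 0.05 / 6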

Meanwhile, based on our experience throughout the proxy test, we iden-

tify 7 common reaction gestures that helped us with reward ranking inference:

smile, pout, eyebrow-raise, eyebrow-frown, (vertical) head nod, head shake,

and eye-roll. We also annotate the video data with frame onsets and offsets

of these 7 gestures, together with the positive, negative, or neutral sentiment

of the gesture, without watching the corresponding trajectories beforehand.


A detailed analysis of the annotation outcomes that provides insight for the

computational modeling is included in Appx.A.2.

Figure 4.3: Human proxy's view: semantics are hidden with color masks; the dark green circle is the agent; observer's reaction is displayed; detected face is enlarged; background is colored by last pickup. The left frame is the same game state shown in Fig 4.1.

                   Avg. τ   p-value
                    .569     .004
                    .216     .185
 Human Proxies      .098     .319
                   −.176     .179
                    .255     .123
                    .294     .059
 Avg.               .209     .078

Table 4.1: Human proxy test result: average τ values across participants are shown; a baseline that randomly picks rankings has an expected τ value of 0.

4.2.2 Reaction Mapping Architecture

With the dataset of human facial reactions collected, we would like to

show the possibility of computationally modeling the implicit human feedback.


Figure 4.4: The feature extraction pipeline and architecture of the reaction mapping

To do so, a reaction mapping from temporal series of extracted facial features

(retrieved using a pre-trained toolkit) to the probability of reward categories

is constructed, by training a deep neural network for reward prediction with

supervised learning. Note that the input and output of this neural network

serve as our instantiation of the reaction modality and the target task statistics

for empathic respectively (as is listed in Chapter 3).

Fig. 4.4 shows the feature extraction pipeline and architecture of the

proposed deep network model to construct the reaction mapping. We use

OpenFace 2.0 [5, 6, 45] for facial feature extraction from human reaction videos

recorded at 30 image frames per second. Head pose and activation of facial

action units (FAUs) are extracted from each frame with the toolkit. Further,

we detect head nods and shakes by maintaining a running average of extracted

head-pose features and subtracting it from each incoming feature vector. We


then use fast Fourier transform to compute the changing frequency of head-

pose, whose coefficients are taken as head-motion features.

With the set of facial features described above, we apply max pooling

on each dimension to merge the feature vectors of consecutive image frames.

This step outputs temporally aggregated feature vectors of the same size, also

enabling the sequence of input features to cover a sufficiently large time window

of human facial reactions. Further feature extraction details can be found in

Appx.A.3.1.
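A minimal sketch of these two steps, assuming the per-frame OpenFace outputs have already been loaded into NumPy arrays; the window length and array names are illustrative, not the exact values used in the thesis.

    import numpy as np

    def max_pool_frames(frame_features, window=30):
        """Max-pool each feature dimension over consecutive frames (e.g., 1 s at 30 fps)."""
        n = (len(frame_features) // window) * window
        chunks = frame_features[:n].reshape(-1, window, frame_features.shape[1])
        return chunks.max(axis=1)                      # one aggregated vector per window

    def head_motion_features(head_pose, window=30):
        """FFT coefficients of running-average-subtracted head pose, per window."""
        counts = np.arange(1, len(head_pose) + 1)[:, None]
        centered = head_pose - np.cumsum(head_pose, axis=0) / counts   # subtract running average
        n = (len(centered) // window) * window
        chunks = centered[:n].reshape(-1, window, centered.shape[1])
        spectra = np.abs(np.fft.rfft(chunks, axis=1))  # magnitude of frequency components
        return spectra.reshape(len(spectra), -1)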

We denote the sequence of raw input image frames from time t0 to t

as {Xt0 , ..., Xt}, with t0 being the starting time of an episode, and t being

the time of the last image frame for calculating the T -th aggregated frame.

Aggregated FAU features ϕ_FAU ∈ R^m and head-motion features ϕ_head ∈ R^n

are extracted by the feature extractor Φ: (ϕ_FAU, ϕ_head)_T = Φ({X_{t0}, ..., X_t}).

The input for a data sample is a temporal window of sequential ag-

gregated features, which starts from the (T−k)-th and ends at the (T+ℓ)-th

aggregated frame, while the corresponding label is the reward category (i.e.,

−5, −1, or +6) that occurred during the time step containing the T-th aggregated

frame. Note that future reaction data is also included for prediction since some

reactions occur after a reward event. We then encode FAU features and the

head-motion features separately, by flattening the input temporal sequences

into a single vector and encoding with their own linear layers respectively.

After concatenating the two encodings, we input the resulting vector to a

multilayer perceptron (MLP).


An auxiliary task of predicting the corresponding annotations {A_{T−k}, ..., A_{T+ℓ}}

is also included to accelerate representation learning and serve as a regular-

izer, where each binary element of A refers to whether a reaction gesture

takes place. The anticipated prediction output is a single flattened vector

a ∈ {0, 1}^{10(k+ℓ+1)}. Empirical results show that adding the auxiliary task

leads to the lowest test loss, yet it is not required in the reward-ranking task

for achieving better-than-random performance. Meanwhile, we also take into

account the reward ordinality by adding a binary classification loss that com-

bines the two negative reward classes (i.e., −5, −1), so that the prediction

outcomes with a wrong sign can be further penalized. These results will be

discussed in detail in Chapter 5.

Let gθ(·) represent the function of our MLP model, and z ∈ R^c denote

the output of the network, which is the raw log probabilities over the c reward

categories. The ground-truth one-hot classification label can be denoted by

y ∈ {0, 1}^c. We then represent the output of the auxiliary task with o, while

ybin is the ground-truth binary class mentioned before, with zbin being the

corresponding binary prediction computed directly from z. Thus, we have:

(z, o)_T = gθ({(ϕ_FAU, ϕ_head)_{T−k}, ..., (ϕ_FAU, ϕ_head)_{T+ℓ}})    (4.1)

The loss for optimization is given by:

L(θ) = −y · log(softmax(z)) − λ_1 y_bin · log(softmax(z_bin)) + λ_2 ||a − o||²    (4.2)
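The following PyTorch sketch mirrors the structure of Eq. 4.1 and 4.2: two linear encoders for the flattened FAU and head-motion windows, an MLP over their concatenation, a reward-class head z, an auxiliary annotation head o, and the combined loss with coefficients λ_1 and λ_2. Here window denotes the k+ℓ+1 aggregated frames; the layer widths, dropout, and the assumption that the two negative-reward classes occupy the first two output indices are illustrative, not the configuration found by random search.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ReactionMapping(nn.Module):
        def __init__(self, fau_dim, head_dim, window, n_classes=3, n_gestures=10, hidden=128):
            super().__init__()
            self.fau_enc = nn.Linear(fau_dim * window, hidden)    # flattened FAU window
            self.head_enc = nn.Linear(head_dim * window, hidden)  # flattened head-motion window
            self.mlp = nn.Sequential(
                nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Dropout(0.3),
                nn.Linear(hidden, hidden), nn.ReLU())
            self.reward_head = nn.Linear(hidden, n_classes)        # z: logits over reward classes
            self.aux_head = nn.Linear(hidden, n_gestures * window) # o: gesture annotations

        def forward(self, fau_win, head_win):
            h = torch.cat([F.relu(self.fau_enc(fau_win.flatten(1))),
                           F.relu(self.head_enc(head_win.flatten(1)))], dim=-1)
            h = self.mlp(h)
            return self.reward_head(h), self.aux_head(h)

    def empathic_loss(z, o, y, y_bin, a, lambda1=0.5, lambda2=0.5):
        """Eq. 4.2: class cross-entropy + binary (sign) cross-entropy + auxiliary MSE."""
        # Collapse the two negative-reward logits (assumed to be indices 0 and 1).
        z_bin = torch.stack([torch.logsumexp(z[:, :2], dim=1), z[:, 2]], dim=1)
        return (F.cross_entropy(z, y)
                + lambda1 * F.cross_entropy(z_bin, y_bin)
                + lambda2 * F.mse_loss(o, a))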


We train our network with Adam [21], and apply random search [7] to

find the best set of hyper-parameters, consisting of the input’s window size (k

and ℓ), learning rate, dropout rate, loss coefficients (λ_1 and λ_2), together with

the depth and widths of the MLP. Our annotations of human facial reactions

provide the candidate window sizes for random search, which is further intro-

duced in Appx.A.2. We use k-fold cross validation to facilitate the random

search, in view of our small dataset, from which we also randomly sample one

episode of data given by each human participant to form a holdout set for final

evaluation. By evaluating across train-test data folds, we eventually select the

set of parameters with the lowest mean test loss. See Appx.A.3 for further

details of reaction mapping construction.

4.3 Summary

In this chapter, we revealed the design details of our empathic instan-

tiation. We first provided the task specifications and data collection protocols

for creating a human reaction dataset. Further, we presented the outcomes of

a human proxy test to show that the reactions contain sufficient information

for training a reaction mapping network and performing task learning. Finally,

we illustrated the feature extraction pipeline and architecture of the proposed

deep network model, which are used for constructing the reaction mappings.


Chapter 5

Reaction Mapping Results and Evaluation

To validate that the learned mappings from our instantiation of stage

1 effectively enable task learning in stage 2, we test the following hypotheses

(here we refer to observers from stage 1 who have created data in the training

set as “known subjects”), which were proposed in our paper published

at Conference on Robot Learning by Cui et al. [10]:

• Hypothesis 1 [deployment setting is the same as training setting]. The

learned reaction mappings will outperform uniformly random reward

ranking, using reaction data from known subjects watching the Robo-

taxi task.

• Hypothesis 2 [generalizing H1 to online data from novel subjects]. The

learned reaction mappings will improve the online policy of a Robotaxi

agent via updates to its belief over reward functions, based on online

data from novel human observers.

• Hypothesis 3 [generalizing H1 to a different deployment task]. The

learned reaction mappings can be adapted to evaluate robotic-sorting-

task trajectories and will outperform uniformly random guessing on


return-based rankings of these trajectories, using reaction data from

known subjects.

Figure 5.1: Sorted per-subject Kendall’s τ for Robotaxi reward-ranking task

5.1 Reward-ranking Performance in the Robotaxi Domain

In this section, we focus on the result of reward-ranking in the Robotaxi

domain. We first demonstrate the effectiveness of our learned reaction mapping

network in predicting the reward function used for the task, with an ablation

study conducted. Further, we provide the result of preliminary modeling of

other task statistics. Lastly, we show the influence of different initial belief

priors on the inference of reward rankings.


5.1.1 Learning Outcome of the Reaction Mapping

To evaluate the effectiveness of the learned reaction mapping, we test

its performance on a reward-ranking task. Let q be the random variable for

reward event and x denote the detected human reactions. Let m represent the

discrete random variable over all possible reward functions, which we can regard

as a reward ranking in the Robotaxi task. Our neural network gθ(·) models

P(q | x,m), which is the probability of a reward event given the human’s reac-

tion and a fixed reward ranking m. We aim to find the posterior distribution

over m: P(m | q, x). The proof for P(m | q, x) ∝ P(q | x,m)P(m) is given as

follows:

P(q, x,m) = P(q | x,m)P(x | m)P(m) (5.1)

= P(m | q, x)P(q, x) (5.2)

P(m | q, x)P(q | x)P(x) = P(q | x,m)P(x | m)P(m) (5.3)

x and m are conditionally independent since the human observes the reward,

therefore:

P(x | m) = P(x) (5.4)

Hence,

P(m | q, x)P(q | x) = P(q | x,m)P(m) (5.5)

P(q | x) is constant across all values of m, therefore P(m | q, x) ∝ P(q | x, m)P(m).
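A minimal sketch of one way to apply this relation: for each non-zero reward event, the reaction mapping's predicted class probabilities are re-indexed under every candidate ranking m and multiplied into the belief, which is then renormalized. The re-indexing convention, names, and shapes are illustrative assumptions, not the exact implementation.

    import numpy as np
    from itertools import permutations

    REWARD_CLASSES = [+6, -1, -5]
    OBJECT_TYPES = ["passenger", "roadblock", "parked_car"]

    # Each candidate ranking m assigns one reward class to each object type.
    RANKINGS = [dict(zip(OBJECT_TYPES, perm)) for perm in permutations(REWARD_CLASSES)]
    belief = np.full(len(RANKINGS), 1.0 / len(RANKINGS))      # uniform prior P(m)

    def update_belief(belief, event_object, predicted_probs):
        """predicted_probs: reaction-mapping output over REWARD_CLASSES for this event."""
        for i, m in enumerate(RANKINGS):
            q_under_m = REWARD_CLASSES.index(m[event_object])
            belief[i] *= predicted_probs[q_under_m]           # P(q | x, m) P(m)
        return belief / belief.sum()

    # After all events in an episode, select the maximum a posteriori ranking:
    # map_ranking = RANKINGS[int(np.argmax(belief))]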

With the formula derived above, the target distribution P(m | q, x) can

be found given the learned mapping gθ(·) and a uniform prior over m. After


incorporating predicted mappings from all human reaction data in an episode,

we select the maximum a posteriori reward ranking as the estimation out-

come. Furthermore, the neural network is trained 4 times with the mean

result reported, to mitigate the effect of training randomness. The corresponding

loss profiles are shown in Fig. 5.2. The red dotted line marks the mean valida-

tion loss across 4 repetitions using the model selected with the lowest testing

loss.

The performance of the learned reaction mapping is evaluated with

Wilcoxon Signed-Rank test, on the holdout set of Robotaxi episodes. Fig. 5.1

shows that we get p = 0.0024 with the annotation-reliant auxiliary task and

p = 0.0207 without it, demonstrating that the mappings achieve significantly

better performance than uniformly random guessing (τ = 0), which supports

H1.

Similarly, Fig. 5.3 shows the reward ranking result on the human proxy

test episodes, in which we use the same Robotaxi episodes as the human proxies

viewed for evaluation, while the remaining episodes are used for training the

network. With few exceptions, our model performs poorly on those

participants that the human proxies also found difficult, and well on those

participants on whom the human proxies performed well.

5.1.2 Ablation Study for Predictive Model Design

An ablation study is further conducted to validate the effectiveness of

our model design, focusing on the use of the auxiliary task and input features. We


(a) Holdout data as evaluation set

(b) Human proxy episodes as evaluation set

Figure 5.2: Loss profiles for training final models for reward ranking evaluation.


Figure 5.3: Sorted per-subject Kendall's τ for Robotaxi reward-ranking task on the human proxy test episodes

generate the loss profiles along the training epochs across 17 subject train-

test-validation sets, respectively using the proposed model, the model without

auxiliary loss, the model with only FAU features, and the model with only

head-motion features. The hyperparameters are determined by random search

performed independently on each scenario, and we select the model with the

lowest mean test loss using 17-fold cross validation based on each subject.

Fig. 5.4 indicates that our full model achieved the lowest test loss, as well as the lowest

mean and variance of validation loss, in comparison with all the other cases.

As the last test case, we also train an end-to-end model with a Resnet-

18 CNN as feature extractor and an LSTM model for feature processing in a

time window. Fig. 5.4 shows that the model fails to achieve a training loss

lower than that obtained by directly outputting the label distribution. In

this experiment, random hyperparameter search cannot be performed due to


(a) Proposed model (b) Model trained without auxiliary task

(c) Only use FAU features as input (d) Only use head-motion features as input

(e) End-to-end model (Resnet+LSTM architecture)

Figure 5.4: Loss profiles for training different models (each model has its own set of best hyperparameters found through random search except the end-to-end model)


the high cost of training an end-to-end model, hence only manual tuning is

conducted, which is likely an important reason for the failure. In addition, the

small size of the dataset might also have been a constraint, which motivates our

use of OpenFace [5, 6, 45] for feature extraction to alleviate the modeling

burden.

5.1.3 Preliminary Modeling of Other Task Statistics

We also try modeling the following task statistics other than reward.

πb denotes the agent’s behavior policy, and π∗ represents the optimal policy

using the ground-truth reward function:

• Q-value of an action under the optimal policy: Q∗(s, a) = R(s, a) + V∗(s′)

• Optimality (0/1) of an action (1 is the indicator function): O(s, a) = 1_[Q∗(s,a)](Q(s, a))

• Q-value of an action under the behavior policy: Q^{πb}(s, a) = R(s, a) + V^{πb}(s′)

• Advantage of an action under the optimal policy: A∗(s, a) = Q∗(s, a) − V∗(s)

• Advantage of an action under the behavior policy: A^{πb}(s, a) = Q^{πb}(s, a) − V^{πb}(s)

• Surprise, modeled as the difference in Q: S(s, a) = Q^{πb}(s, a) − Q∗(s, a)

To compute the Robotaxi agent’s policy, we apply an approximate opti-

mal policy by assuming a static gridworld map for each time step and running

32

Page 46: Copyright by Qiping Zhang 2021

value iteration. Such a policy could be considered optimal given that no more

than 2 objects of the same type exist in the map. Monte Carlo rollouts are

then used for value and Q-value estimation.
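A generic value-iteration sketch of the kind described, operating on tabular transition and reward arrays for a map treated as static at the current time step; the arrays are placeholders, not the actual Robotaxi implementation.

    import numpy as np

    def value_iteration(T, R, gamma=0.95, tol=1e-6):
        """T: [S, A, S] transition probabilities; R: [S, A] rewards. Returns V and Q."""
        n_states, n_actions, _ = T.shape
        V = np.zeros(n_states)
        while True:
            Q = R + gamma * (T @ V)   # Q[s, a] = R[s, a] + gamma * sum_s' T[s, a, s'] * V[s']
            V_new = Q.max(axis=1)
            if np.max(np.abs(V_new - V)) < tol:
                return V_new, Q
            V = V_new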

We continue to use cross entropy loss for classifying optimality, while

mean square error is employed for the other continuous statistics, with auxil-

iary loss added as in reward prediction. Fig. 5.5 shows that neither the testing

nor the validation loss manages to get below the baseline that simply outputs

the mean of the labels, with the two losses increasing while the training loss

decreases.

The overfitting issue observed when training models for these task

statistics is considered the result of time-aliasing in the training data, in which

two adjacent training inputs in two consecutive timesteps are very similar yet

have different labels determined by each timestep's task statistics; addressing

this issue is an important future research direction. Moreover, the discount

factor in our experiments treats Robotaxi as an infinite-horizon MDP. How-

ever, the episodes actually used are trajectories with finite length, which might

also lead to the failure in modeling the other statistics.

5.1.4 Effects of Different Belief Priors

As the last experiment on Robotaxi in our previous paper [10], we infer

over all possible reward rankings starting from various priors, to study how fast

our reaction mapping approach could recover from different (and potentially

wrong) initial beliefs. To do so, a pool of predictions from the learned Robotaxi

33

Page 47: Copyright by Qiping Zhang 2021

(a) O(s, a) (b) Q∗(s, a)

(c) Qπb(s, a) (d) A∗(s, a)

(e) Aπb(s, a) (f) S(s, a)

Figure 5.5: Loss profiles for training with other task statistics

34

Page 48: Copyright by Qiping Zhang 2021

Figure 5.6: Performance of reward inference starting from different priors over reward rankings.

reaction mapping on the holdout set is retrieved, from which we randomly

sample without replacement to update the reward belief. We test the following

cases of reward ranking priors:

• all uniform: uniform prior over all possible reward rankings (used in

all other experiments);

• P(worst ranking) = p: the reward mapping that ranks events in the

reverse of the correct ranking has prior probability mass p, and the rest

of the reward rankings uniformly share 1− p probability mass;

• P(best ranking) = p: the correct reward ranking has prior probabil-

ity mass p, and the rest of the reward rankings uniformly share 1 − p

35

Page 49: Copyright by Qiping Zhang 2021

probability mass; and

• P(second best ranking) = p: the reward ranking that correctly ranks

the positive-reward event first but incorrectly ranks the two negative-

reward events has prior probability mass p, and the rest of the reward

rankings uniformly share 1− p probability mass.

The results are then evaluated with two metrics: the average Kendall’s τ

score of the most probable reward ranking, and the KL divergence between the

current belief and a soft true belief, in which we let the prior probability of a

certain reward ranking be exp(λτ)/Z, with τ representing the corresponding

Kendall’s τ value. Values of these two variables are computed and recorded

after different numbers of non-zero reward events are observed and used by

our model for prediction.
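A sketch (illustrative parameter values and names) of how such a biased prior and the soft true belief exp(λτ)/Z used for the KL-divergence metric could be constructed.

    import numpy as np
    from scipy.special import rel_entr

    def biased_prior(n_rankings, target_index, p):
        """Put probability mass p on one ranking; the rest share 1 - p uniformly."""
        prior = np.full(n_rankings, (1.0 - p) / (n_rankings - 1))
        prior[target_index] = p
        return prior

    def soft_true_belief(taus, lam=2.0):
        """P(m) proportional to exp(lambda * tau_m); Z is the normalizing sum."""
        weights = np.exp(lam * np.asarray(taus, dtype=float))
        return weights / weights.sum()

    def kl_divergence(belief, soft_truth):
        """KL(belief || soft true belief), summed elementwise."""
        return float(np.sum(rel_entr(belief, soft_truth)))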

Fig. 5.6 shows the mean performance over 100 repetitions. We observe

that the case of P(second best ranking) = 0.9 is the most

difficult prior to recover from, since the order of the two negative rewards

is mistaken. After receiving enough data inputs, our model converges to the

ground-truth reward ranking in 4 out of the 6 scenarios mentioned above, while

higher weight on the predicted likelihood generally leads to faster convergence.

In view of the fact that these final results are sensitive to the choice of belief

prior, initializing with a uniform distribution is preferable when we have

no strong evidence for selecting a specific biased prior.

Throughout the study above we are using offline reaction data of human

36

Page 50: Copyright by Qiping Zhang 2021

subjects observing predetermined trajectories, while the learning performance

will differ if we instead use live human data to update the behavior policy

using the belief in real time, which will be discussed in the following section.

5.2 Online Learning in the Robotaxi Domain

Given a learned reaction mapping model, we can further use the online

human facial reactions to interactively update the belief over reward rankings,

hence optimizing the behavior policy by having the agent follow the most likely

reward function at every timestep. To do so, we employ the reaction mapping

trained in stage 1 on human observers who have never appeared in the training

data, to validate that our model can improve the agent's policy in real time in stage

2. This experiment is conducted jointly in our paper published at Conference

on Robot Learning by Cui et al. [10].

Due to social distancing during the COVID-19 pandemic, the au-

thors recruited their friends as participants and conducted the user study in

their own homes, resulting in two aspects of informality: 1) Participants were

unpaid, hence the human internal model RH and the real reward function R

are not as aligned as in the collected dataset; 2) Human observers did not get a

chance to control the Robotaxi agent before watching the trajectories, a step

otherwise included to help them better understand the task domain. Though it is likely that the

overall performance is worsened by these constraints, we still observe positive

experiment outcomes, which can be found in Fig. 5.7.

In 9 out of 10 trials, the agent receives a positive final return (p =

37

Page 51: Copyright by Qiping Zhang 2021

(a) Probability (b) Entropy over rewards

(c) Return (d) Kendall’s τ values

Figure 5.7: Trials of informal online learning sessions in Robotaxi

0.0134 under a binomial test), among which 8 trials achieve significantly higher

return than a uniformly random policy. The belief also converges to the optimal

and the second optimal rankings (passenger has the highest rank) in 3 out of

10 trials with low entropy. Meanwhile, after only a small number of initial

timesteps, the mean Kendall’s τ value of reward ranking becomes higher than

that of random guessing. Furthermore, 7 of the 10 trials terminate with the

probability of reward mappings that correspond to optimal behaviors being

38

Page 52: Copyright by Qiping Zhang 2021

the highest. These results moderately support H2.

5.3 Trajectory Ranking in the Robotic Sorting Domain

As discussed in our paper published at the Conference on Robot Learning by Cui et al. [10], we extend the trained Robotaxi reaction mapping to the robotic trash-sorting domain by using only the binary reward classification loss. In this way, the binary prediction can be regarded as a “positivity score” for each aggregated frame input.

In the user study, we select 8 different predetermined trajectories, 7 of which are combined into an episode watched by each human subject. A trajectory contains one of three possible return events: 1) recycling a can (+2); 2) recycling any other, wrong item (−1); 3) no object placed in the trash bin (0). Empirically, human reactions correspond to high-level behaviors in the trajectories (e.g., picking and placing object X) rather than to low-level joint torques. Therefore, the small time window size used in the Robotaxi domain is no longer sufficient to capture all of the facial reactions necessary for prediction.

Confronted with this additional challenge, we calculate the mean positivity score over all aggregated frames in each trajectory, treating the trajectory as an extended action. We also assume that the facial reactions along each trajectory are generated by a single latent state that reflects the human's internal model of the agent's behavior. Hence, we resolve the temporal incompatibility with Robotaxi by using this per-trajectory positivity score as the overall score of the entire trajectory.
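The aggregation just described can be summarized by the short Python sketch below, in which binary_reaction_mapping is a hypothetical stand-in for the stage-1 model restricted to its binary (positive versus non-positive) output; the function and variable names are illustrative assumptions.

    import numpy as np

    def trajectory_positivity(aggregated_frames, binary_reaction_mapping):
        # Mean of the per-frame positivity scores over the whole trajectory.
        return float(np.mean([binary_reaction_mapping(f) for f in aggregated_frames]))

    def rank_trajectories(trajectories, binary_reaction_mapping):
        # trajectories: dict mapping a trajectory id to its aggregated feature frames.
        scores = {name: trajectory_positivity(frames, binary_reaction_mapping)
                  for name, frames in trajectories.items()}
        # A higher mean positivity score is taken to indicate a better-received trajectory.
        return sorted(scores, key=scores.get, reverse=True)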


Figure 5.8: Sample plot of trajectory positivity score over aggregated frames

Figure 5.9: Sorted per-subject Kendall's τ for evaluating robotic sorting trajectories


Figure 5.10: Overall trajectory ranking by mean positivity score across subjects (each entry is colored by the trajectory's final return: green for +2, yellow for 0, and red for −1)

The positivity scores along all 7 candidate trash-sorting trajectories (as observed by one participant) are shown in Fig. 5.8. The Kendall's τ values of the trajectory rankings induced by each human participant are then presented in Fig. 5.9, where each per-trajectory mean positivity score is computed across the temporal domain. After further averaging the positivity scores across all human subjects, we obtain the final ranking of all trajectories shown in Fig. 5.10, where a Kendall's τ independence test yields τ = 0.70 with p = 0.034, significantly outperforming uniformly random guessing, whose expected τ is 0. This result supports H3.
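For reference, this statistic can be computed with scipy.stats.kendalltau, as in the minimal sketch below; the return and score values shown are placeholders for illustration, not the data from the study.

    from scipy.stats import kendalltau

    # Placeholder values only: ground-truth returns for 7 trajectories and the
    # corresponding mean positivity scores averaged over subjects.
    returns = [2, 2, 0, 0, -1, -1, -1]
    mean_positivity = [0.71, 0.64, 0.52, 0.48, 0.44, 0.39, 0.41]
    tau, p_value = kendalltau(returns, mean_positivity)
    print(f"Kendall's tau = {tau:.2f}, p = {p_value:.3f}")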

5.4 Summary

In this chapter, we presented and analyzed the experimental results of our reaction mapping within the empathic framework. By proposing and validating three hypotheses, we demonstrated that the learned mappings from our instantiation of stage 1 effectively enable task learning in stage 2, both when the deployment setting matches the training setting and when we generalize to online data from novel subjects and to a different deployment task.


Chapter 6

Conclusion and Future Work

In this thesis, we introduce the lihf problem and the empathic framework for solving it. We demonstrate that our instantiation interprets human facial reactions in both the training task and the deployment task, taking a significant step towards the ultimate goal of task learning from implicit human feedback. It does so by successfully applying a learned mapping from human facial reactions to reward types, both for online agent learning and for evaluating trajectories from a different domain. In this final section, we discuss the limitations of our work and directions for future study.

• Experimental Design In this work, we demonstrate a single instantiation of empathic, yet in the current domains agent behaviors have no long-term consequences on the expected return, and the training and testing tasks are similar in temporal characteristics and reward structure, which together limit the generality of the proposed framework for solving the lihf problem. Future work could therefore design richer task environments to further investigate facial reaction modeling for predicting long-term returns, which closely relates to changes in human observers' expectations.

• Human Models In many real-world scenarios, human observers usually concentrate on their own main tasks, only occasionally switching their attention and providing implicit feedback on our task of interest. Therefore, we aim to further generalize our work so that the empathic framework can accurately infer the relevance of human reactions to the agent's behavior. Moreover, our current instantiation still lacks the ability to efficiently model the latent human states that reflect changing expectations of agent behavior in the lihf problem, rather than simply focusing on the recent and anticipated agent experience.

• Data Modalities In this work, only discrete reward categories are taken as targets of the facial reaction mapping. Future extensions may incorporate richer implicit human feedback modalities, including gaze and gestures, to more effectively model various task statistics in a broader range of practical tasks.


Appendices


Appendix A

Supplemental Material

A.1 Experimental Domains and Data Collection Details

A.1.1 Robotaxi

• Agent Transition Dynamics In the 8×8 grid-based map, the agent

has three actions available at each timestep: maintain direction, turn

left, or turn right. When the agent runs into the boundary of the map,

it is forced to turn left or right, in the direction of the farther boundary.

• Rewards There are three different types of objects associated with

non-zero rewards when encountered in the Robotaxi environment: if the

agent picks up a passenger, it gains a large reward of +6; if it runs into

a roadblock, it receives a small penalizing reward of −1; if the agent

crashes into a parked car, it receives a large penalizing reward of −5. All

other actions result in 0 reward.

• Object Regeneration At most 2 instances of the same object type are

present in the environment at any given time. An object disappears after

the agent moves onto it (a “pickup”), and another object of the same

type is spawned at a random unoccupied location 2 time steps after the

corresponding pickup.


• Agent Policy The agent executes a stochastic policy by choosing from

a set of 3 pseudo-optimal policies under 3 different reward mappings

from objects to the 3 reward values:

– Go for passenger: {passenger: +6, road-block: −1, parked-car: −5}

– Go for road-block: {passenger: −1, road-block: +6, parked-car: −5}

– Go for parked-car: {passenger: −1, road-block: −5, parked-car: +6}

The pseudo-optimal policies are computed in real time via value iteration (discount factor γ = 0.95) on a static version of the current map, meaning that objects neither disappear nor respawn when the agent moves onto them. We simplify the state space in this manner because the true state space is too large to evaluate and would yield too large a Q-function to store; nevertheless, this simplification finds a near-optimal policy that almost always takes the shortest path to an object of the type that its corresponding reward function considers most rewarding (a minimal sketch of this computation is given after this list).

At the start of an episode, the agent selects 1 of these 3 policies. The agent then follows the selected policy, with a 0.1 probability at each time step of reselecting from the 3 policies. This 0.1 probability of changing plans was included because we speculated, roughly, that it would help elicit human reactions by making the agent typically exhibit plan-based behavior but sometimes change course, violating human expectations. All selections among the 3 policies are made uniformly at random.
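The following Python fragment is a minimal sketch of how such a pseudo-optimal policy could be obtained, under the stated assumptions of a static map and deterministic transitions; transition(state, action) and reward(state, action) are hypothetical helpers, and the fragment is illustrative rather than the exact implementation used in the experiments.

    GAMMA = 0.95

    def value_iteration(states, actions, transition, reward, gamma=GAMMA, tol=1e-4):
        # V maps each state of the static grid to its estimated value.
        V = {s: 0.0 for s in states}
        while True:
            delta = 0.0
            for s in states:
                best = max(reward(s, a) + gamma * V[transition(s, a)] for a in actions)
                delta = max(delta, abs(best - V[s]))
                V[s] = best
            if delta < tol:
                break
        # Greedy (pseudo-optimal) policy with respect to the converged values.
        return {s: max(actions, key=lambda a: reward(s, a) + gamma * V[transition(s, a)])
                for s in states}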


Figure A.1: Robotaxi environment. (a) Car choice screen; (b) Robotaxi game view.

A.1.2 Robotic Sorting Task

Figure A.2: Robotic task table layout with object labels, from the perspective of the human observer

As described in our paper published at the Conference on Robot Learning by Cui et al. [10], the robot in the trash-sorting task executes predetermined trajectories programmed through keyframe-based kinesthetic teaching. The 7-DOF robotic arm is controlled at 40 Hz with torque commands. The initial table layout is shown in Fig. A.2, in which the objects in the green circles give +2 reward when moved into the bin and the others give −1 reward. Therefore, the robot's task is to sort the aluminum cans into the recycling bin without sorting the remaining objects by mistake.

Figure A.3: Robotic sorting task trajectories with optimality segmentation

Fig. A.3 shows snapshots of the 8 arm trajectories used in the robotic trash-sorting task. Each trajectory consists of a fixed sequence of torque commands, which may produce small variations in the executed motion. We therefore discard results from executions that depart qualitatively from the intended trajectory, such as failures in object grasping. One or two target objects are involved in each episode, which terminates with a total reward of −1, 0, or +2.

Each trajectory of an episode can be further segmented into reaching, grasping, transferring, placing, and retracting sub-trajectories. The relative optimality of these sub-trajectories can be determined by whether the projected outcome is desired. For example, reaching for a correct object and retracting from picking up a wrong object are both considered optimal, while reaching for or transferring a wrong object is sub-optimal. Fig. A.3 also annotates the optimality of these sub-trajectories under each trajectory. Note that these segmentations are for illustration only and are not used in our reaction mapping algorithm.

A.1.3 Experimental Design

The instructions we give the participants in Robotaxi are as follows:

– Hello human! Welcome to Robo Valley, an experimental city where

humans don’t work but make money through hiring robots!

– You’ll start with $12 for hiring Robotaxi, and after each session you will

be paid or fined according to the performance of the autonomous vehicle

or robot.

– Your initial $12 will be given to you in poker chips. After each session,

we will add or take away your chips based on your vehicle’s score. At


the end, you can exchange your final count of poker chips for an Amazon

gift card of the same dollar value.

– For the 3 sessions with a Robotaxi, you begin by choosing an autonomous

vehicle to lease.

– The cost to lease one of these vehicles will be $4 each session.

– The vehicle earns $6 for every passenger it picks up, but it will be fined

$1 each time it hits a roadblock and fined $5 each time it hits a parked

car.

– You will watch the Robotaxi earn money for you, and your reactions to

its performance will be recorded for research purposes.

– You will have a chance to practice driving in this world, but the amount

earned during the test session won’t count towards your final payout.

The instructions we give the participants in the robotic sorting task are as follows:

– For the robotic task, the robot is trying to sort recyclable cans out of a

set of objects.

– You will earn $2 for each correct item it sorts and be penalized $1 for each wrong item it puts in the trash bin.

– You will watch the robot earn money for you, and your reactions to its

performance will be recorded for research purposes.


The participants first control the agent themselves in a test session to familiarize themselves with the Robotaxi task, removing a source of human reactions changing in ways we cannot easily model. For the agent-controlled sessions, the participants select an agent at the beginning of each episode of Robotaxi. Fig. A.1a shows the view of this agent selection. Unbeknownst to the subjects, their selection of a vehicle affects only its appearance, not its policy. This vehicle choice was included in the experimental design as a speculatively justified tactic to increase the subjects' emotional investment in the agent's success, thereby better aligning R and RH as well as increasing their reactions.

At the start of the session, participants are given $12, which they must

soon spend to purchase a Robotaxi agent before it begins its task. To make

their earnings and losses more tangible (and therefore, we speculate, elicit

greater reactions), participants are given poker chips equal to their current

total earnings. After each session they are paid or fined according to the

performance of the agent: their chips are increased or decreased based on the

score of Robotaxi. At the end of the entire experimental session, participants

exchange their final count of poker chips for an Amazon gift card of the same

dollar value.

A.1.4 Participant Recruitment

The participants we recruited are mostly college students in the computer science department. Each participant filled out an exit survey about their background after completing all episodes of observing an agent. The statistics of these 17 subjects are given below:

• Gender: 10 participants are male and 7 are female.

• Age: The participants’ average age is 20. Ages range from 18 to 28

(inclusive).

• Robotics/AI background: 1 participant is not familiar with AI/robotics

technologies at all. 2 have neither worked in AI nor studied it technically,

but are familiar with AI and robotics. 13 have not worked in AI but have

taken classes related to AI or otherwise studied it technically. Only 1

has worked or done major research in robotics and/or AI.

• Ownership of robotics/AI-related products: 7 participants own robotics

or AI-related products, while 10 do not. The products include Google

Home, Roomba, and Amazon Echo.

A.2 Annotations of Human Reactions

A.2.1 The Annotation Interface

To gain a better understanding of the dataset and the lihf problem as a whole, two of the authors annotated the collected dataset. These annotations are not intended to serve as ground truth and are only used as labels for an auxiliary task in training the reaction mappings. Therefore, training and calibration of annotators, evaluation of annotators via inter-rater reliability scores, and similar measures are not important here. The interface for annotating the data is displayed in Fig. A.4. A human annotator marks whether facial gestures and head gestures are occurring in each frame, effectively marking the onset and offset of such gestures. Annotation is performed without any visibility of the corresponding game trajectory. The proportion of the 7 reaction gestures in the annotation is shown in Fig. A.5. Annotations provide several benefits in this study: in our search for a modeling strategy, we found our first successful reaction mapping while using annotations directly as the only supervisory labels; annotations provide labels for an auxiliary task that regularizes training of, and speeds representation learning by, the reaction mapping learned from the features extracted automatically via OpenFace [5, 6, 45] (both important for a small dataset); and an annotation-based analysis of our data helped us find a temporal window of reaction data around an event that is effective for inferring the reward types for that event.

Figure A.4: View of the annotation interface. The corresponding trajectory of Robotaxi is not displayed.

Figure A.5: Proportion of annotated gestures

Figure A.6: Histograms of non-zero reward events around feature onset


A.2.2 Visualizations of Annotated Data

The annotations can be used to visualize the temporal relationship

between reaction onsets and events (rewards). Fig. A.6 shows example his-

tograms of reward events binned into time windows around feature onsets. As

demonstrated by Fig. A.6, the onsets of certain gestures such as eyebrow-frown

and head-nod correlate with negative or positive events respectively (peaking

around 1.47s before the onset). While smile accounts for a large portion of

overall gestures (Fig. A.5), it does not correlate strongly with either positive

or negative events, contradicting the assumption made by several prior studies

that smile should always be treated as positive feedback [3, 26, 39, 46]. While

this observation could be specific to our experimental setting or domain, it

agrees with established research on the emotional meanings of smiles as shown

in the work of Hoque et al. [16] and Niedenthal et al. [32].

In these histograms, the contours of red and yellow bars are strikingly

similar in most subplots of Fig. A.6, which suggests that although an individual

may react differently to the events that provide −1 and −5 reward, it may be

hard to distinguish between them through single gestures. We also find that

reactions (across all gestures) are likely to occur between 2.8s before and 3.6s

after an event (shown as a peak in the histograms), which we use as a prior

for designing the set of candidate time windows that random hyperparameter

search draws from (see Appendix A.3.3).
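A minimal sketch of this binning analysis is given below, assuming onset_times holds the annotated onset times (in seconds) of one gesture and event_times holds the times of one class of reward events; the window and bin count are illustrative choices rather than the exact values used to produce Fig. A.6.

    import numpy as np

    def offsets_around_onsets(onset_times, event_times, window=(-6.0, 6.0), bins=24):
        # Collect (event time - gesture onset time) offsets that fall inside the window.
        offsets = [e - o
                   for o in onset_times
                   for e in event_times
                   if window[0] <= e - o <= window[1]]
        # The binned counts correspond to one histogram of the kind shown in Fig. A.6.
        counts, edges = np.histogram(offsets, bins=bins, range=window)
        return counts, edges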


A.3 Model Design

A.3.1 Feature Extraction

The specific output data we use from OpenFace [5, 6, 45] are: [success, AU01_c, AU02_c, AU04_c, AU05_c, AU06_c, AU07_c, AU09_c, AU10_c, AU12_c, AU14_c, AU15_c, AU17_c, AU20_c, AU23_c, AU25_c, AU26_c, AU28_c, AU45_c, AU01_r, AU02_r, AU04_r, AU05_r, AU06_r, AU07_r, AU09_r, AU10_r, AU12_r, AU14_r, AU15_r, AU17_r, AU20_r, AU23_r, AU25_r, AU26_r, AU45_r, pose_Tx, pose_Ty, pose_Tz, pose_Rx, pose_Ry, pose_Rz].

The AUx_c signals are outputs from classifiers of facial action unit (FAU) activation, and the AUx_r signals come from regression models designed to capture the intensity of that activation. The pose_T and pose_R signals are the detected head translation and rotation with respect to the camera. Since the camera pose and the person's position relative to the camera vary from training time to application time, we explicitly model the change in the detected head pose by maintaining a running average and subtracting it from all incoming pose features. We then take a time window of the past 50 feature frames and compute the Fourier transform coefficients as the head-motion features fed into the neural network.
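The sketch below illustrates this head-motion feature pipeline; the buffer handling, the use of magnitude coefficients, and the resulting feature dimensionality are assumptions for illustration rather than the exact implementation.

    import numpy as np
    from collections import deque

    class HeadMotionFeatures:
        # Sketch: running-average normalization of head pose followed by Fourier
        # coefficients over a window of the past 50 pose frames.
        def __init__(self, window_size=50, n_pose=6):
            self.buffer = deque(maxlen=window_size)
            self.count = 0
            self.mean = np.zeros(n_pose)

        def __call__(self, pose_frame):
            # Update the running average and subtract it from the incoming frame.
            self.count += 1
            self.mean += (np.asarray(pose_frame) - self.mean) / self.count
            self.buffer.append(np.asarray(pose_frame) - self.mean)
            if len(self.buffer) < self.buffer.maxlen:
                return None  # not enough history yet
            # Real FFT over the time axis; magnitudes serve as head-motion features.
            spectrum = np.fft.rfft(np.stack(list(self.buffer)), axis=0)
            return np.abs(spectrum).flatten()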

A.3.2 Data Split of k-fold Cross Validation for Random Search

During our search for hyperparameters that learn an effective reaction mapping from data gathered in Robotaxi, we used a data-splitting method designed to avoid overfitting and to retain relatively large training sets despite our small dataset. Recall that each participant observes and reacts to 3 episodes. Of these 3 episodes, 1 is randomly chosen as a holdout episode and is not used for training or testing except for final evaluation. With the remaining 2 episodes per subject, we split the data such that a different train-test-validation split is created for each subject, as shown in Fig. A.7. Specifically, we construct a train-test-validation split for each target subject by assigning one episode of the target subject as the validation set, randomly sampling half (either the first or the second half) of an unused episode from each subject into the test set, and using the remaining data as the training set.

Figure A.7: Diagram of data split for subject k

For a target subject, a model is trained on the subject's corresponding training set and tested after each epoch on the test set. The epoch with the lowest test loss is chosen as the early stopping point, and the model trained at this epoch is then evaluated on the validation set. The performance of a hyperparameter set is defined as the mean cross-entropy loss across all subjects' validation sets. The hyperparameter set with the lowest such mean cross-entropy loss is selected for evaluation on the holdout set. The data split for evaluation on the holdout set is similar but simpler. From the 2 episodes per subject that are not in the holdout set, half an episode is randomly sampled into the test set and the rest is placed in the training set. A single model is trained (stopping with the lowest cross-entropy loss on the test set) and then evaluated on the holdout set.
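A simplified sketch of this per-subject split is given below, assuming episodes[subject] holds that subject's two non-holdout episodes, each represented as a list of examples; the helper name and random seeding are illustrative assumptions.

    import random

    def split_for_subject(target, episodes, seed=0):
        rng = random.Random(seed)
        validation = episodes[target][0]              # one target-subject episode
        train, test = [], []
        for eps in episodes.values():
            # Episodes of this subject that are not the validation episode.
            candidates = [ep for ep in eps if ep is not validation]
            sampled = rng.choice(candidates)          # episode to cut in half
            half = len(sampled) // 2
            cut_first = rng.random() < 0.5            # send first or second half to test
            test.extend(sampled[:half] if cut_first else sampled[half:])
            train.extend(sampled[half:] if cut_first else sampled[:half])
            for ep in candidates:                     # any remaining episode goes to train
                if ep is not sampled:
                    train.extend(ep)
        return train, test, validation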

A.3.3 Hyperparameters

Random search is used to find the best set of hyper-parameters, including the input window size (k and l), learning rate, dropout rate, loss coefficients (λ1 and λ2), and the depth and widths of the MLP hidden layers. Fig. A.6 indicates that reactions are likely to onset between 2.8 s before and 3.6 s after an event. We therefore convert this temporal range into a number of aggregated frames before and after a particular prediction point (aggregated frame) and use it as the range from which input windows are sampled. Each set of randomly sampled hyperparameters is evaluated on all 17 train-test folds, and the set with the lowest average test loss is selected. For each model, the weights with the lowest test loss are saved and evaluated on the validation set.
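A minimal sketch of such a random search loop is shown below; the sampling ranges are illustrative assumptions informed by the reaction-onset window above, and evaluate is a hypothetical function returning the mean test loss over the 17 folds (see Bergstra and Bengio [7] for the general method).

    import random

    def sample_hyperparameters(rng):
        return {
            "learning_rate": 10 ** rng.uniform(-4, -2),
            "dropout": rng.uniform(0.2, 0.8),
            "k": rng.randint(0, 6),                   # aggregated frames before the event
            "l": rng.randint(4, 16),                  # aggregated frames after the event
            "lambda1": rng.choice([0.5, 1, 2]),
            "lambda2": rng.choice([0.5, 1, 2]),
            "hidden_widths": [rng.choice([64, 128]) for _ in range(rng.randint(2, 4))],
        }

    def random_search(evaluate, n_trials=100, seed=0):
        rng = random.Random(seed)
        candidates = [sample_hyperparameters(rng) for _ in range(n_trials)]
        # evaluate(params) is assumed to return the mean test loss over all 17 folds.
        return min(candidates, key=evaluate)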

The best hyper-parameters found through random search are: {learning rate

= 0.001, batch size = 8, k = 0, l = 12, dropout rate = 0.6314, λ1 = 2, λ2 =

1}. Below is the best model architecture found through random search:

(facial_action_unit_input): Linear(in_features=455, out_features=64, bias=True)

(head_pose_input): Linear(in_features=702, out_features=32, bias=True)

(hidden): ModuleList(


(0): Linear(in_features=96, out_features=128, bias=True)

(1): BatchNorm1d(128, eps=1e-05, momentum=0.1)

(2): LeakyReLU(negative_slope=0.01)

(3): Dropout(p=0.63, inplace=False)

(4): Linear(in_features=128, out_features=128, bias=True)

(5): BatchNorm1d(128, eps=1e-05, momentum=0.1)

(6): LeakyReLU(negative_slope=0.01)

(7): Dropout(p=0.63, inplace=False)

(8): Linear(in_features=128, out_features=64, bias=True)

(9): BatchNorm1d(64, eps=1e-05, momentum=0.1)

(10): LeakyReLU(negative_slope=0.01)

(11): Dropout(p=0.63, inplace=False)

(12): Linear(in_features=64, out_features=8, bias=True)

(13): BatchNorm1d(8, eps=1e-05, momentum=0.1)

(14): LeakyReLU(negative_slope=0.01)

(15): Dropout(p=0.63, inplace=False))

(out): Linear(in_features=8, out_features=3, bias=True)

(auxiliary_task): Linear(in_features=128, out_features=130, bias=True)


Bibliography

[1] Herve Abdi. The kendall rank correlation coefficient. Encyclopedia of

Measurement and Statistics. Sage, Thousand Oaks, CA, pages 508–510,

2007.

[2] Henny Admoni and Brian Scassellati. Social eye gaze in human-robot

interaction: a review. Journal of Human-Robot Interaction, 6(1):25–63,

2017.

[3] Riku Arakawa, Sosuke Kobayashi, Yuya Unno, Yuta Tsuboi, and Shin-

ichi Maeda. Dqn-tamer: Human-in-the-loop reinforcement learning with

intractable feedback. arXiv preprint arXiv:1810.11748, 2018.

[4] Brenna D Argall, Sonia Chernova, Manuela Veloso, and Brett Browning.

A survey of robot learning from demonstration. Robotics and autonomous

systems, 57(5):469–483, 2009.

[5] Tadas Baltrusaitis, Marwa Mahmoud, and Peter Robinson. Cross-dataset

learning and person-specific normalisation for automatic action unit de-

tection. In 2015 11th IEEE International Conference and Workshops

on Automatic Face and Gesture Recognition (FG), volume 6, pages 1–6.

IEEE, 2015.


[6] Tadas Baltrusaitis, Amir Zadeh, Yao Chong Lim, and Louis-Philippe

Morency. Openface 2.0: Facial behavior analysis toolkit. In 2018 13th

IEEE International Conference on Automatic Face & Gesture Recognition

(FG 2018), pages 59–66. IEEE, 2018.

[7] James Bergstra and Yoshua Bengio. Random search for hyper-parameter

optimization. The Journal of Machine Learning Research, 13(1):281–305,

2012.

[8] Sonia Chernova and Andrea L Thomaz. Robot learning from human

teachers. Synthesis Lectures on Artificial Intelligence and Machine Learn-

ing, 8(3):1–121, 2014.

[9] Carlos Crivelli and Alan J Fridlund. Facial displays are tools for social

influence. Trends in Cognitive Sciences, 22(5):388–399, 2018.

[10] Yuchen Cui, Qiping Zhang, Alessandro Allievi, Peter Stone, Scott Niekum, and W Bradley Knox. The empathic framework for task learning from implicit human feedback. In Conference on Robot Learning, 2020.

[11] Yuchen Cui and Scott Niekum. Active reward learning from critiques.

In 2018 IEEE International Conference on Robotics and Automation

(ICRA), pages 6907–6914. IEEE, 2018.

[12] Matthew N Dailey, Carrie Joyce, Michael J Lyons, Miyuki Kamachi,

Hanae Ishi, Jiro Gyoba, and Garrison W Cottrell. Evidence and a compu-


tational explanation of cultural differences in facial expression recognition.

Emotion, 10(6):874, 2010.

[13] Adrian K Davison, Walied Merghani, and Moi Hoon Yap. Objective

classes for micro-facial expression recognition. Journal of Imaging, 4(10):

119, 2018.

[14] Paul Ekman. Facial expressions. Handbook of cognition and emotion, 16

(301):e320, 1999.

[15] Beat Fasel and Juergen Luettin. Automatic facial expression analysis: a

survey. Pattern recognition, 36(1):259–275, 2003.

[16] Mohammed Ehsan Hoque, Daniel J McDuff, and Rosalind W Picard.

Exploring temporal patterns in classifying frustrated and delighted smiles.

IEEE Transactions on Affective Computing, 3(3):323–334, 2012.

[17] Charles Isbell, Christian R Shelton, Michael Kearns, Satinder Singh, and

Peter Stone. A social reinforcement learning agent. In Proceedings of

the fifth international conference on Autonomous agents, pages 377–384.

ACM, 2001.

[18] Rachael E Jack and Philippe G Schyns. The human face as a dynamic

tool for social communication. Current Biology, 25(14):R621–R634, 2015.

[19] Natasha Jaques, Jennifer McCleary, Jesse Engel, David Ha, Fred Bertsch,

Rosalind Picard, and Douglas Eck. Learning via social awareness: Im-


proving a deep generative sketching model with facial feedback. arXiv

preprint arXiv:1802.04877, 2018.

[20] Thorsten Joachims, Laura Granka, Bing Pan, Helene Hembrooke, and

Geri Gay. Accurately interpreting clickthrough data as implicit feedback.

In ACM SIGIR Forum, volume 51, pages 4–11. ACM, New York, NY, USA,

2017.

[21] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic

optimization. arXiv preprint arXiv:1412.6980, 2014.

[22] W Bradley Knox and Peter Stone. Interactively shaping agents via hu-

man reinforcement: The tamer framework. In Proceedings of the fifth

international conference on Knowledge capture, pages 9–16. ACM, 2009.

[23] W Bradley Knox, Peter Stone, and Cynthia Breazeal. Training a robot

via human feedback: A case study. In International Conference on Social

Robotics, pages 460–470. Springer, 2013.

[24] Oliver Kroemer, Scott Niekum, and George Konidaris. A review of robot

learning for manipulation: Challenges, representations, and algorithms.

arXiv preprint arXiv:1907.03146, 2019.

[25] Guangliang Li, Randy Gomez, Keisuke Nakamura, and Bo He. Human-

centered reinforcement learning: A survey. IEEE Transactions on

Human-Machine Systems, 2019.


[26] Guangliang Li, Hamdi Dibeklioglu, Shimon Whiteson, and Hayley Hung.

Facial feedback for reinforcement learning: a case study and offline anal-

ysis using the tamer framework. Autonomous Agents and Multi-Agent

Systems, 34(1):1–29, 2020.

[27] Shan Li and Weihong Deng. Deep facial expression recognition: A survey.

arXiv preprint arXiv:1804.08348, 2018.

[28] Xiaobai Li, Tomas Pfister, Xiaohua Huang, Guoying Zhao, and Matti

Pietikainen. A spontaneous micro-expression database: Inducement, col-

lection and baseline. In 2013 10th IEEE International Conference and

Workshops on Automatic Face and Gesture Recognition (FG), pages 1–6.

IEEE, 2013.

[29] Jinying Lin, Zhen Ma, Randy Gomez, Keisuke Nakamura, Bo He, and

Guangliang Li. A review on interactive reinforcement learning from hu-

man social feedback. IEEE Access, 8:120757–120765, 2020.

[30] Robert Loftin, Bei Peng, James MacGlashan, Michael L Littman,

Matthew E Taylor, Jeff Huang, and David L Roberts. Learning some-

thing from nothing: Leveraging implicit human feedback strategies. In

The 23rd IEEE International Symposium on Robot and Human Interac-

tive Communication, pages 607–612. IEEE, 2014.

[31] James MacGlashan, Mark K Ho, Robert Loftin, Bei Peng, David Roberts,

Matthew E Taylor, and Michael L Littman. Interactive learning from


policy-dependent human feedback. arXiv preprint arXiv:1701.06049,

2017.

[32] Paula M Niedenthal, Martial Mermillod, Marcus Maringer, and Ursula

Hess. The simulation of smiles (sims) model: Embodied simulation and

the meaning of facial expression. Behavioral and brain sciences, 33(6):

417, 2010.

[33] Jaak Panksepp and Douglas Watt. What is basic about basic emotions?

lasting lessons from affective neuroscience. Emotion review, 3(4):387–396,

2011.

[34] Tomas Pfister, Xiaobai Li, Guoying Zhao, and Matti Pietikainen. Recog-

nising spontaneous facial micro-expressions. In 2011 international con-

ference on computer vision, pages 1449–1456. IEEE, 2011.

[35] Patrick M Pilarski, Michael R Dawson, Thomas Degris, Farbod Fahimi,

Jason P Carey, and Richard S Sutton. Online human training of a my-

oelectric prosthesis controller via actor-critic reinforcement learning. In

2011 IEEE International Conference on Rehabilitation Robotics, pages

1–7. IEEE, 2011.

[36] Dorsa Sadigh, Anca D Dragan, Shankar Sastry, and Sanjit A Seshia.

Active preference-based learning of reward functions. In Robotics: Science

and Systems, 2017.


[37] Halit Bener Suay and Sonia Chernova. Effect of human guidance and

state space size on interactive reinforcement learning. In 2011 Ro-Man,

pages 1–6. IEEE, 2011.

[38] Vivek Veeriah. Beyond clever hans: Learning from people without their

really trying. 2018.

[39] Vivek Veeriah, Patrick M Pilarski, and Richard S Sutton. Face valuing:

Training user interfaces with facial expressions and reinforcement learn-

ing. arXiv preprint arXiv:1606.02807, 2016.

[40] Garrett Warnell, Nicholas Waytowich, Vernon Lawhern, and Peter Stone.

Deep tamer: Interactive agent shaping in high-dimensional state spaces.

In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

[41] Eric W Weisstein. Bonferroni correction. https://mathworld.wolfram.com/, 2004.

[42] RF Woolson. Wilcoxon signed-rank test. Wiley encyclopedia of clinical

trials, pages 1–3, 2007.

[43] Duo Xu, Mohit Agarwal, Faramarz Fekri, and Raghupathy Sivakumar.

Playing games with implicit human feedback.

[44] Wen-Jing Yan, Qi Wu, Jing Liang, Yu-Hsin Chen, and Xiaolan Fu. How

fast are the leaked facial expressions: The duration of micro-expressions.

Journal of Nonverbal Behavior, 37(4):217–230, 2013.


[45] Amir Zadeh, Yao Chong Lim, Tadas Baltrusaitis, and Louis-Philippe

Morency. Convolutional experts constrained local model for 3d facial

landmark detection. In Proceedings of the IEEE International Confer-

ence on Computer Vision Workshops, pages 2519–2528, 2017.

[46] Dean Zadok, Daniel McDuff, and Ashish Kapoor. Affect-based in-

trinsic rewards for learning general representations. arXiv preprint

arXiv:1912.00403, 2019.

[47] Ruohan Zhang, Faraz Torabi, Lin Guan, Dana H Ballard, and Peter

Stone. Leveraging human guidance for deep reinforcement learning tasks.

arXiv preprint arXiv:1909.09906, 2019.
