
A Bayesian Approach to Policy Recognition and State Representation Learning

Adrian Šošić, Abdelhak M. Zoubir and Heinz Koeppl

Abstract—Learning from demonstration (LfD) is the process of building behavioral models of a task from demonstrations provided by an expert. These models can be used e.g. for system control by generalizing the expert demonstrations to previously unencountered situations. Most LfD methods, however, make strong assumptions about the expert behavior, e.g. they assume the existence of a deterministic optimal ground truth policy or require direct monitoring of the expert’s controls, which limits their practical use as part of a general system identification framework. In this work, we consider the LfD problem in a more general setting where we allow for arbitrary stochastic expert policies, without reasoning about the optimality of the demonstrations. Following a Bayesian methodology, we model the full posterior distribution of possible expert controllers that explain the provided demonstration data. Moreover, we show that our methodology can be applied in a nonparametric context to infer the complexity of the state representation used by the expert, and to learn task-appropriate partitionings of the system state space.

Index Terms—learning from demonstration, policy recognition, imitation learning, Bayesian nonparametric modeling, Markov chain Monte Carlo, Gibbs sampling, distance dependent Chinese restaurant process


1 INTRODUCTION

LEARNING FROM DEMONSTRATION (LfD) has become a viable alternative to classical reinforcement learning as a new data-driven learning paradigm for building behavioral models based on demonstration data. By exploiting the domain knowledge provided by an expert demonstrator, LfD-built models can focus on the relevant parts of a system’s state space [1] and hence avoid the need for tedious exploration steps performed by reinforcement learners, which often require an impractically high number of interactions with the system environment [2] and always come with the risk of letting the system run into undesired or unsafe states [3]. In addition, LfD-built models have been shown to outperform the expert in several experiments [4], [5], [6].

However, most existing LfD methods come with strong requirements that limit their practical use in real-world scenarios. In particular, they often require direct monitoring of the expert’s controls (e.g. [5], [7], [8]), which is possible only under laboratory-like conditions, or they need to interact with the target system via a simulator, if not by controlling the system directly (e.g. [9]). Moreover, many methods are restricted to problems with finite state spaces (e.g. [10]), or they compute only point estimates of the relevant system parameters without providing any information about their level of confidence (e.g. [9], [11], [12]). Last but not least, the expert is typically assumed to follow an optimal deterministic policy (e.g. [13]) or to at least approximate one, based on some presupposed degree of confidence in the optimal behavior (e.g. [14]).

• Adrian Šošić is a member of the Signal Processing Group and an associate member of the Bioinspired Communication Systems Lab, Technische Universität Darmstadt, Germany. E-mail: [email protected]

• Abdelhak M. Zoubir is the head of the Signal Processing Group, Technische Universität Darmstadt, Germany. E-mail: [email protected]

• Heinz Koeppl is the head of the Bioinspired Communication Systems Lab and a member of the Centre for Cognitive Science, Technische Universität Darmstadt, Germany. E-mail: [email protected]

While such an assumption may be reasonable in some situations (e.g. for problems in robotics involving a human demonstrator [1]), it is not appropriate in many others, such as in multi-agent environments, where an optimal deterministic policy often does not exist [15]. In fact, there are many situations in which the assumption of a deterministic expert behavior is violated. In a more general system identification setting, our goal could be, for instance, to detect the deviation of an agent’s policy from its known nominal behavior, e.g. for the purpose of fault or fraud detection (note that the term “expert” is slightly misleading in this context). Also, there are situations in which we might not want to reason about the optimality of the demonstrations; for instance, when studying the exploration strategy of an agent who tries to model its environment (or the reactions of other agents [16]) by randomly triggering different events. In all these cases, existing LfD methods can at best approximate the behavior of the expert as they presuppose the existence of some underlying deterministic ground truth policy.

In this work, we present a novel approach to LfD in order to address the above-mentioned shortcomings of existing methods. Central to our work is the problem of policy recognition, that is, extracting the (possibly stochastic and non-optimal) policy of a system from observations of its behavior. Taking a general system identification view on the problem, our goal is herein to make as few assumptions about the expert behavior as possible. In particular, we consider the whole class of stochastic expert policies, without ever reasoning about the optimality of the demonstrations. As a result, our hypothesis space is not restricted to a certain class of ground truth policies, such as deterministic or softmax policies (cf. [14]). This is in contrast to inverse reinforcement learning approaches (see Section 1.2), which interpret the observed demonstrations as the result of some preceding planning procedure conducted by the expert, which they try to invert.



In the above-mentioned case of fault detection, for example, such an inversion attempt will generally fail since the demonstrated behavior can be arbitrarily far from optimal, which renders an explanation of the data in terms of a simple reward function impossible.

Another advantage of our problem formulation is that the resulting inference machinery is entirely passive, in the sense that we require no active control of the target system nor access to the action sequence performed by the expert. Accordingly, our method is applicable to a broader range of problems than targeted by most existing LfD frameworks and can be used for system identification in cases where we cannot interact with the target system. However, our objective in this paper is twofold: we not only attempt to answer the question of how the expert performs a given task but also to infer which information is used by the expert to solve it. This knowledge is captured in the form of a joint posterior distribution over possible expert state representations and corresponding state controllers. As the complexity of the expert’s state representation is unknown a priori, we finally present a Bayesian nonparametric approach to explore the underlying structure of the system space based on the available demonstration data.

1.1 Problem statement

Given a set of expert demonstrations in the form of a system trajectory s = (s1, s2, . . . , sT) ∈ S^T of length T, where S denotes the system state space, our goal is to determine the latent control policy used by the expert to generate the state sequence.¹ We formalize this problem as a discrete-time decision-making process (i.e. we assume that the expert executes exactly one control action per trajectory state) and adopt the Markov decision process (MDP) formalism [17] as the underlying framework describing the dynamics of our system. More specifically, we consider a reduced MDP (S, A, T, π) which consists of a countable or uncountable system state space S, a finite set of actions A containing |A| elements, a transition model T : S × S × A → R≥0, where T(s′ | s, a) denotes the probability (density) assigned to the event of reaching state s′ after taking action a in state s, and a policy π modeling the expert’s choice of actions.² In the following, we assume that the expert policy is parametrized by a parameter ω ∈ Ω, which we call the global control parameter of the system, and we write π(a | s, ω), π : A × S × Ω → [0, 1], to denote the expert’s local policy (i.e. the distribution of actions a played by the expert) at any given state s under ω. The set Ω is called the parameter space of the policy, which specifies the class of feasible action distributions. The specific form of Ω will be discussed later.

Using a parametric description for π is convenient as it shifts the recognition task from determining the possibly infinite set of local policies at all states in S to inferring the posterior distribution p(ω | s), which contains all information that is relevant for predicting the expert behavior,

p(a | s∗, s) = ∫_Ω π(a | s∗, ω) p(ω | s) dω.

1. The generalization to multiple trajectories is straightforward as they are conditionally independent given the system parameters.

2. This reduced model is sometimes referred to as an MDP\R (see e.g. [9], [18], [19]) to emphasize the nonexistence of a reward function.

Herein, s∗ ∈ S is some arbitrary query point and p(a | s∗, s) is the corresponding predictive action distribution. Since the local policies are coupled through the global control parameter ω as indicated by the above integral equation, inferring ω means not only determining the individual local policies but also their spatial dependencies. Consequently, learning the structure of ω from demonstration data can also be interpreted as learning a suitable state representation for the task performed by the expert. This relationship will be discussed in detail in the forthcoming sections. In Section 3, we further extend this reasoning to a nonparametric policy model whose hypothesis class finally covers all stochastic policies on S.
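To make the role of the posterior p(ω | s) concrete, the following minimal sketch (Python, not part of the paper) approximates the predictive action distribution by averaging the local policy over posterior samples of ω. The names policy_fn and omega_samples are illustrative placeholders for whatever policy parametrization and posterior sampler is used.

```python
import numpy as np

def predictive_action_distribution(policy_fn, omega_samples, s_query):
    """Monte Carlo estimate of p(a | s*, s) from Q posterior samples of omega.

    policy_fn(s, omega) is assumed to return a length-|A| probability vector
    pi(. | s, omega); omega_samples are draws from p(omega | s).
    """
    probs = np.mean([policy_fn(s_query, omega) for omega in omega_samples], axis=0)
    return probs / probs.sum()  # guard against numerical drift
```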

For the remainder of this paper, we make the common assumptions that the transition model T as well as the system state space S and the action set A are known. The assumption of knowing S follows naturally because we already assumed that we can observe the expert acting in S. In the proposed Bayesian framework, the latter assumption can be easily relaxed by considering noisy or incomplete trajectory data. However, as this would not provide additional insights into the main principles of our method, we do not consider such an extension in this work.

The assumption of knowing the transition dynamics T is a simplifying one but prevents us from running into model identifiability problems: if we do not constrain our system transition model in some reasonable way, any observed state transition in S could be trivially explained by a corresponding local adaptation of the assumed transition model T and, thus, there would be little hope to extract the true expert policy from the demonstration data. Assuming a fixed transition model is the easiest way to resolve this model ambiguity. However, there are alternatives which we leave for future work, for example, using a parametrized family of transition models for joint inference. This extension can be integrated seamlessly into our Bayesian framework and is useful in cases where we can constrain the system dynamics in a natural way, e.g. when modeling physical processes. Also, it should be mentioned that we can tolerate deviations from the true system dynamics as long as our model T is sufficiently accurate to extract information about the expert action sequence locally, because our inference algorithm naturally processes the demonstration data piece-wise in the form of one-step state transitions (st, st+1) (see algorithmic details in Section 2 and results in Section 4.2). This is in contrast to planning-based approaches, where small modeling errors in the dynamics can accumulate and yield consistently wrong policy estimates [8], [20].

The requirement of knowing the action set A is less stringent: if A is unknown a priori, we can still assume a potentially rich class of actions, as long as the transition model can provide the corresponding dynamics (see example in Section 4.2). For instance, we might be able to provide a model which describes the movement of a robotic arm even if the maximum torque that can be generated by the system is unknown. Figuring out which of the hypothetical actions are actually performed by the expert and, more importantly, how they are used in a given context, shall be the task of our inference algorithm.


1.2 Related work

The idea of learning from demonstration has now been around for several decades. Most of the work on LfD has been presented by the robotics community (see [1] for a survey), but recent advances in the field have triggered developments in other research areas, such as cognitive science [21] and human-machine interaction [22]. Depending on the setup, the problem is referred to as imitation learning [23], apprenticeship learning [9], inverse reinforcement learning [13], inverse optimal control [24], preference elicitation [21], plan recognition [25] or behavioral cloning [5]. Most LfD models can be categorized as intentional models (with inverse reinforcement learning models as the primary example), or sub-intentional models (e.g. behavioral cloning models). While the latter class only predicts an agent’s behavior via a learned policy representation, intentional models (additionally) attempt to capture the agent’s beliefs and intentions, e.g. in the form of a reward function. For this reason, intentional models are often reputed to have better generalization abilities³; however, they typically require a certain amount of task-specific prior knowledge in order to resolve the ambiguous relationship between intention and behavior, since there are often many ways to solve a certain task [13]. Also, albeit being interesting from a psychological point of view [21], intentional models target a much harder problem than what is actually required in many LfD scenarios. For instance, it is not necessary to understand an agent’s intention if we only wish to analyze its behavior locally.

Answering the question of whether an intention-based modeling of the LfD problem is advantageous is beyond the scope of this paper; however, we point to the comprehensive discussion in [26]. Rather, we present a hybrid solution containing both intentional and sub-intentional elements. More specifically, our method does not explicitly capture the expert’s goals in the form of a reward function but infers a policy model directly from the demonstration data; nonetheless, the presented algorithm learns a task-specific representation of the system state space which encodes the structure of the underlying control problem to facilitate the policy prediction task. An early version of this idea can be found in [27], where the authors proposed a simple method to partition a system’s state space into a set of so-called control situations to learn a global system controller based on a small set of informative states. However, their framework does not incorporate any demonstration data and the proposed partitioning is based on heuristics. A more sophisticated partitioning approach utilizing expert demonstrations is shown in [11]; yet, the proposed expectation-maximization framework applies to deterministic policies and finite state spaces only.

The closest methods to ours can probably be found in [19] and [10]. The authors of [19] presented a nonparametric inverse reinforcement learning approach to cluster the expert data based on a set of learned subgoals encoded in the form of local rewards. Unfortunately, the required subgoal assignments are learned only for the demonstration set and, thus, the algorithm cannot be used for action prediction at unvisited states unless it is extended with a non-trivial post-processing step which solves the subgoal assignment problem. Moreover, the algorithm requires an MDP solver, which causes difficulties for systems with uncountable state spaces. The sub-intentional model in [10], on the other hand, can be used to learn a class of finite state controllers directly from the expert demonstrations. Like our framework, the algorithm can handle various kinds of uncertainty about the data but, again, the proposed approach is limited to discrete settings. In the context of reinforcement learning, we further point to the work presented in [28], whose authors follow a nonparametric strategy similar to ours to learn a distribution over predictive state representations for decision-making.

3. The rationale behind this is that an agent’s intention is always specific to the task being performed and can hence serve as a compact description of it [13]. However, if the intention of the agent is misunderstood, then the subsequent generalization step will also trivially fail.

1.3 Paper outline

The outline of the paper is as follows: In Section 2, we introduce our parametric policy recognition framework and derive inference algorithms for both countable and uncountable state spaces. In Section 3, we consider the policy recognition problem from a nonparametric viewpoint and provide insights into the state representation learning problem. Simulation results are presented in Section 4 and we conclude our work in Section 5. In the supplement, we provide additional simulation results, a note on the computational complexity of our model, as well as an in-depth discussion on the issue of marginal invariance and the problem of policy prediction in large state spaces.

2 PARAMETRIC POLICY RECOGNITION

2.1 Finite state spaces: the static model

First, let us assume that the expert system can be modeled on a finite state space S and let |S| denote its cardinality. For notational convenience, we represent both states and actions by integer values. Starting with the most general case, we assume that the expert executes an individual control strategy at each possible system state. Accordingly, we introduce a set of local control parameters or local controllers {θi}_{i=1}^{|S|} by which we describe the expert’s choice of actions. More specifically, we model the executed actions as categorical random variables and let the jth element of θi represent the probability that the expert chooses action j at state i. Consequently, θi lies in the (|A| − 1)-simplex, which we denote by the symbol ∆ for brevity of notation, i.e. θi ∈ ∆ ⊆ R^{|A|}. Summarizing all local control parameters in a single matrix, Θ ∈ Ω ⊆ ∆^{|S|}, we obtain the global control parameter of the system as already introduced in Section 1.1, which compactly captures the expert behavior. Note that we denote the global control parameter here by Θ instead of ω, for reasons that will become clear later. Each action a is thus characterized by the local policy that is induced by the control parameter of the underlying state,

π(a | s = i, Θ) = CAT(a | θi).

For simplicity, we will write π(a | θi) since the state information is used only to select the appropriate local controller.


Considering a finite set of actions, it is convenient to place a symmetric Dirichlet prior on the local control parameters,

pθ(θi | α) = DIR(θi | α · 1|A|),

which forms the conjugate distribution to the categorical distribution over actions. Here, 1|A| denotes the vector of all ones of length |A|. The prior is itself parametrized by a concentration parameter α which can be further described by a hyperprior pα(α), giving rise to a Bayesian hierarchical model. For simplicity, we assume that the value of α is fixed for the remainder of this paper, but the extension to a full Bayesian treatment is straightforward. The joint distribution of all remaining model variables is, therefore, given as

p(s, a, Θ | α) = p1(s1) ∏_{i=1}^{|S|} pθ(θi | α) ∏_{t=1}^{T−1} T(st+1 | st, at) π(at | θst),   (1)

where a = (a1, a2, . . . , aT−1) denotes the latent action sequence taken by the expert and p1(s1) is the initial state distribution of the system. Throughout the rest of the paper, we refer to this model as the static model. The corresponding graphical visualization is depicted in Fig. 1.

2.1.1 Gibbs sampling

Following a Bayesian methodology, our goal is to determine the posterior distribution p(Θ | s, α), which contains all information necessary to make predictions about the expert behavior. For the static model in Eq. (1), the required marginalization of the latent action sequence a can be computed efficiently because the joint distribution factorizes over time instants. For the extended models presented in later sections, however, a direct marginalization becomes computationally intractable due to the exponential growth of latent variable configurations. As a solution to this problem, we follow a sampling-based inference strategy which is later on generalized to more complex settings.

For the simple model described above, we first approximate the joint posterior distribution p(Θ, a | s, α) over both controllers and actions using a finite number Q of samples, and then marginalize over a in a second step,

p(Θ | s, α) = ∑_a p(Θ, a | s, α)   (2)
            ≈ ∑_a (1/Q) ∑_{q=1}^{Q} δ_{(Θq, aq)}(Θ, a)
            = (1/Q) ∑_{q=1}^{Q} δ_{Θq}(Θ),

where (Θq, aq) ∼ p(Θ, a | s, α), and δx(·) denotes Dirac’s delta function centered at x. This two-step approach gives rise to a simple inference procedure since the joint samples {(Θq, aq)}_{q=1}^{Q} can be easily obtained from a Gibbs sampling scheme, i.e. by sampling iteratively from the following two conditional distributions,

p(at | a−t, s, Θ, α) ∝ T(st+1 | st, at) π(at | θst),
p(θi | Θ−i, s, a, α) ∝ pθ(θi | α) ∏_{t: st=i} π(at | θi).

Fig. 1: Graphical model of the policy recognition framework. The underlying dynamical structure is that of an MDP whose global control parameter Θ is treated as a random variable with prior distribution parametrized by α. The indicator node z is used for the clustering model in Section 2.2. Observed variables are highlighted in gray.

Herein, a−t and Θ−i refer to all actions/controllers except at and θi, respectively. The latter of the two expressions reveals that, in order to sample θi, we need to consider only those actions played at the corresponding state i. Furthermore, the first expression shows that, given Θ, all actions at can be sampled independently of each other. Therefore, inference can be done in parallel for all θi. This can also be seen from the posterior distribution of the global control parameter, which factorizes over states,

p(Θ | s, a, α) ∝ ∏_{i=1}^{|S|} pθ(θi | α) ∏_{t: st=i} π(at | θi).   (3)

From the conjugacy of pθ(θi | α) and π(at | θi), it follows that the posterior over θi is again Dirichlet distributed with an updated concentration parameter. In particular, denoting by φi,j the number of times that action j is played at state i for the current assignment of actions a,

φi,j := ∑_{t: st=i} 1(at = j),   (4)

and by collecting these quantities in the form of vectors, i.e. φi := [φi,1, . . . , φi,|A|], we can rewrite Eq. (3) as

p(Θ | s, a, α) = ∏_{i=1}^{|S|} DIR(θi | φi + α · 1|A|).   (5)
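As a concrete illustration of this sampling scheme, the following sketch alternates the two conditional updates for the static model. It assumes a finite state space, a known transition array T[s, s_next, a] = T(s′ | s, a), and an observed state sequence; all function and variable names are assumptions of this illustration, not the authors’ reference implementation.

```python
import numpy as np

def gibbs_static(states, T, n_actions, alpha, n_iter, rng=np.random.default_rng(0)):
    """Uncollapsed Gibbs sampler for the static model (illustrative sketch)."""
    n_states = T.shape[0]
    T_len = len(states)
    theta = rng.dirichlet(alpha * np.ones(n_actions), size=n_states)  # local policies
    actions = rng.integers(n_actions, size=T_len - 1)                 # latent actions
    theta_samples = []
    for _ in range(n_iter):
        # 1) resample each latent action, weights proportional to
        #    T(s_{t+1} | s_t, a_t) * pi(a_t | theta_{s_t})
        for t in range(T_len - 1):
            s, s_next = states[t], states[t + 1]
            w = T[s, s_next, :] * theta[s]
            actions[t] = rng.choice(n_actions, p=w / w.sum())
        # 2) resample each local controller from its Dirichlet posterior, Eq. (5)
        phi = np.zeros((n_states, n_actions))
        for t in range(T_len - 1):
            phi[states[t], actions[t]] += 1
        theta = np.array([rng.dirichlet(phi[i] + alpha) for i in range(n_states)])
        theta_samples.append(theta.copy())
    return theta_samples
```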

2.1.2 Collapsed Gibbs sampling

Choosing a Dirichlet distribution as prior model for the local controllers is convenient as it allows us to arrive at analytic expressions for the conditional distributions that are required to run the Gibbs sampler. As an alternative, we can exploit the conjugacy property of pθ(θi | α) and π(at | θi) to marginalize out the control parameters during the sampling process, giving rise to a collapsed sampling scheme. Collapsed sampling is advantageous in two different respects: first, it reduces the total number of variables to be sampled and, hence, the number of computations required per Gibbs iteration; second, it increases the mixing speed of the underlying Markov chain that governs the sampling process, reducing the correlation of the obtained samples and, with it, the variance of the resulting policy estimate.

Formally, collapsing means that we no longer approximate the joint distribution p(Θ, a | s, α) as done in Eq. (2), but instead sample from the marginal density p(a | s, α),

p(Θ | s, α) = ∑_a p(Θ | s, a, α) p(a | s, α)
            ≈ ∑_a p(Θ | s, a, α) (1/Q) ∑_{q=1}^{Q} δ_{aq}(a)
            = (1/Q) ∑_{q=1}^{Q} p(Θ | s, aq, α),   (6)

where aq ∼ p(a | s, α). In contrast to the previous approach, the target distribution is no longer represented by a sum of Dirac measures but described by a product of Dirichlet mixtures (compare Eq. (5)). The required samples aq can be obtained from a collapsed Gibbs sampler with

p(at | a−t, s, α) ∝ ∫_{∆^{|S|}} p(s, a, Θ | α) dΘ
                 ∝ T(st+1 | st, at) ∫_∆ pθ(θst | α) ∏_{t′: st′=st} π(at′ | θst) dθst.

It turns out that the above distribution provides an easy sampling mechanism since the integral part, when viewed as a function of action at only, can be identified as the conditional of a Dirichlet-multinomial distribution. This distribution is then reweighted by the likelihood T(st+1 | st, at) of the observed transition. The final (unnormalized) weights of the resulting categorical distribution are hence given as

p(at = j | a−t, s, α) ∝ T(st+1 | st, at = j) · (ϕt,j + α),   (7)

where ϕt,j counts the number of occurrences of action j among all actions in a−t played at the same state as at (that is, st). Explicitly,

ϕt,j := ∑_{t′: st′=st, t′≠t} 1(at′ = j).

Note that these values can also be expressed in terms of the sufficient statistics introduced in the last section,

ϕt,j = φst,j − 1(at = j).

As before, actions played at different states may be sampled independently of each other because they are generated by different local controllers. Consequently, inference about Θ again decouples for all states.
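The following sketch illustrates the collapsed variant: the controllers are integrated out and only the latent actions are resampled according to Eq. (7), with the counts φ maintained incrementally. Data layout and names are again assumptions of this illustration.

```python
import numpy as np

def collapsed_gibbs_static(states, T, n_actions, alpha, n_iter,
                           rng=np.random.default_rng(0)):
    """Collapsed Gibbs sampler for the static model (illustrative sketch)."""
    n_states = T.shape[0]
    T_len = len(states)
    actions = rng.integers(n_actions, size=T_len - 1)
    phi = np.zeros((n_states, n_actions))          # action counts per state
    for t in range(T_len - 1):
        phi[states[t], actions[t]] += 1
    action_samples = []
    for _ in range(n_iter):
        for t in range(T_len - 1):
            s, s_next = states[t], states[t + 1]
            phi[s, actions[t]] -= 1                # exclude a_t from its own counts
            w = T[s, s_next, :] * (phi[s] + alpha) # unnormalized weights of Eq. (7)
            actions[t] = rng.choice(n_actions, p=w / w.sum())
            phi[s, actions[t]] += 1                # add the new value back
        action_samples.append(actions.copy())
    return action_samples  # the posterior over Theta then follows via Eq. (6)
```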

2.2 Towards large state spaces: a clustering approach

While the methodology introduced so far provides a means to solve the policy recognition problem in finite state spaces, the presented approaches quickly become infeasible for large spaces as, in the continuous limit, the number of parameters to be learned (i.e. the size of Θ) will grow unbounded. In that sense, the presented methodology is prone to overfitting because, for larger problems, we will never have enough demonstration data to sufficiently cover the whole system state space.

Fig. 2: Schematic illustration of the clustering model. The state space S is partitioned into a set of clusters {Ck}, each governed by its own local control parameter θk.

In particular, the static model makes no assumptions about the structure of Θ but treats all local policies separately (see Eq. (5)); hence, we are not able to generalize the demonstrated behavior to regions of the state space that are not directly visited by the expert. Yet, we would certainly like to predict the expert behavior also at states for which there is no trajectory data available. Moreover, we should expect a well-designed model to produce increasingly accurate predictions at regions closer to the observed trajectories (with the precise definition of “closeness” being left open for the moment).

A simple way to counteract the overfitting problem, in general, is to restrict the complexity of a model by limiting the number of its free parameters. In our case, we can prevent the parameter space from growing unbounded by considering only a finite number of local policies that need to be shared between the states. The underlying assumption is that, at each state, the expert selects an action according to one of K local policies, with corresponding control parameters {θk}_{k=1}^{K}. Accordingly, we introduce a set of indicator or cluster assignment variables, {zi}_{i=1}^{|S|}, zi ∈ {1, . . . , K}, which map the states to their local controllers (Fig. 1). Obviously, such an assignment implies a partitioning of the state space (Fig. 2), resulting in the following K clusters,

Ck := {i : zi = k},   k ∈ {1, . . . , K}.

Although we motivated the clustering of states by the problem of overfitting, partitioning a system’s space is not only convenient from a statistical point of view; mapping the inference problem down to a lower-dimensional space is also reasonable for practical reasons as we are typically interested in understanding an agent’s behavior on a certain task-appropriate scale. The following paragraphs discuss these reasons in detail:

• In practice, the observed trajectory data will always be noisy since we can take our measurements only up to a certain finite precision. Even though we do not explicitly consider observation noise in this paper, clustering the data appears reasonable in order to robustify the model against small perturbations in our observations.

• Considering the LfD problem from a control perspective, the complexity of subsequent planning steps can potentially be reduced if the system dynamics can be approximately described on a lower-dimensional manifold of the state space, meaning that the system behavior can be well represented by a smaller set of informative states (cf. finite state controllers [29], control situations [27]). The LfD problem can then be interpreted as the problem of learning a (near-optimal) controller based on a small set of local policies that together provide a good approximation of the global agent behavior. What remains is the question of how we can find such a representation. The clustering approach described above offers one possible solution to this problem.

• Finally, in any real setup, it is reasonable to assume that the expert itself can only execute a finite-precision policy due to its own limited sensing abilities of the system state space. Consequently, the demonstrated behavior is going to be optimal only up to a certain finite precision because the agent is generally not able to discriminate between arbitrarily small differences of states. An interesting question in this context is whether we can infer the underlying state representation of the expert by observing its reactions to the environment in the form of the resulting state trajectory. We will discuss this issue in detail in Section 3.

By introducing the cluster assignment variables {zi}, the joint distribution in Eq. (1) changes into

p(s, a, z, Θ | α) = p1(s1) ∏_{k=1}^{K} pθ(θk | α) ∏_{t=1}^{T−1} T(st+1 | st, at) π(at | θzst) pz(z),   (8)

where z = (z1, z2, . . . , z|S|) denotes the collection of all indicator variables and pz(z) is the corresponding prior distribution to be further discussed in Section 2.2.3. Note that the static model can be recovered as a special case of the above when each state describes its own cluster, i.e. by setting K = |S| and fixing zi = i (hence the name static).

In contrast to the static model, we now require both the indicator zi and the corresponding control parameter θzi in order to characterize the expert’s behavior at a given state i. Accordingly, the global control parameter of the model is ω = (Θ, z) with underlying parameter space Ω ⊆ ∆^K × {1, . . . , K}^{|S|} (see Section 1.1), and our target distribution becomes p(Θ, z | s, α). In what follows, we derive the Gibbs and the collapsed Gibbs sampler as mechanisms for approximate inference in this setting.

2.2.1 Gibbs sampling

As shown by the following equations, the expressions for the conditional distributions over actions and controllers take a similar form to those of the static model. Here, the only difference is that we no longer group the actions by their states but according to their generating local policies or, equivalently, the clusters {Ck},

p(at | a−t, s, z, Θ, α) ∝ T(st+1 | st, at) · π(at | θzst),
p(θk | Θ−k, s, a, z, α) ∝ pθ(θk | α) ∏_{t: zst=k} π(at | θk) = pθ(θk | α) ∏_{t: st∈Ck} π(at | θk).

The latter expression again takes the form of a Dirichlet distribution with updated concentration parameter,

p(θk | Θ−k, s, a, z, α) = DIR(θk | ξk + α · 1|A|),

where ξk := [ξk,1, . . . , ξk,|A|], and ξk,j denotes the number of times that action j is played at states belonging to cluster Ck in the current assignment of a. Explicitly,

ξk,j := ∑_{t: zst=k} 1(at = j) = ∑_{i∈Ck} ∑_{t: st=i} 1(at = j),   (9)

which is nothing but the sum of the φi,j’s of the corresponding states,

ξk,j = ∑_{i∈Ck} φi,j.

In addition to the actions and control parameters, we now also need to sample the indicators zi, whose conditional distributions can be expressed in terms of the corresponding prior model and the likelihood of the triggered actions,

p(zi | z−i, s, a, Θ, α) ∝ p(zi | z−i) ∏_{t: st=i} π(at | θzi).   (10)
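As an illustration of Eq. (10), the sketch below draws a single indicator given the action counts observed at state i and the current set of controllers. The helper prior_fn stands in for any of the prior models of Section 2.2.3 and, like the other names, is a placeholder of this illustration.

```python
import numpy as np

def sample_indicator(i, z, theta, phi_i, prior_fn, rng=np.random.default_rng(0)):
    """One Gibbs draw of z_i, cf. Eq. (10) (illustrative sketch).

    theta:   (K, |A|) array of local controllers
    phi_i:   length-|A| vector of action counts at state i
    prior_fn(i, k, z): unnormalized prior weight p(z_i = k | z_-i)
    """
    K = theta.shape[0]
    log_w = np.empty(K)
    for k in range(K):
        # log prior + log likelihood of the actions at state i under controller k
        log_w[k] = np.log(prior_fn(i, k, z)) + phi_i @ np.log(theta[k])
    w = np.exp(log_w - log_w.max())    # stabilize before normalizing
    return rng.choice(K, p=w / w.sum())
```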

2.2.2 Collapsed Gibbs sampling

As before, we derive the collapsed Gibbs sampler by marginalizing out the control parameters,

p(zi | z−i, s, a, α) ∝ ∫_{∆^K} p(s, a, z, Θ | α) dΘ   (11)
∝ p(zi | z−i) ∫_{∆^K} ∏_{k=1}^{K} pθ(θk | α) ∏_{t=1}^{T−1} π(at | θzst) dΘ
∝ p(zi | z−i) ∫_{∆^K} ∏_{k=1}^{K} pθ(θk | α) ∏_{i′=1}^{|S|} ∏_{t: st=i′} π(at | θzi′) dΘ
∝ p(zi | z−i) ∫_{∆^K} ∏_{k=1}^{K} pθ(θk | α) ∏_{i′: zi′=k} ∏_{t: st=i′} π(at | θk) dΘ
∝ p(zi | z−i) ∏_{k=1}^{K} [ ∫_∆ pθ(θk | α) ∏_{t: st∈Ck} π(at | θk) dθk ].

Here, we first grouped the actions by their associated states and then grouped the states themselves by the clusters {Ck}. Again, this distribution admits an easy sampling mechanism as it takes the form of a product of Dirichlet-multinomials, reweighted by the conditional prior distribution over indicators. In particular, we observe that all actions played at some state i appear in exactly one of the K integrals of the last equation. In other words, by changing the value of zi (i.e. by assigning state i to another cluster), only two of the involved integrals are affected: the one belonging to the previously assigned cluster, and the one of the new cluster. Inference about the value of zi can thus be carried out using the following two sets of sufficient statistics:

• φi,j: the number of actions j played at state i,
• ψi,j,k: the number of actions j played at states assigned to cluster Ck, excluding state i.

The φi,j’s are the same as in Eq. (4) and their definition is repeated here just as a reminder. For the ψi,j,k’s, on the other hand, we find the following explicit expression,

ψi,j,k := ∑_{i′∈Ck, i′≠i} ∑_{t: st=i′} 1(at = j),


which can also be written in terms of the statistics used for the ordinary Gibbs sampler,

ψi,j,k = ξk,j − 1(i ∈ Ck) · φi,j.

By collecting these quantities in a vector, i.e. ψi,k := [ψi,1,k, . . . , ψi,|A|,k], we end up with the following simplified expression,

p(zi = k | z−i, s, a, α) ∝ p(zi = k | z−i) ∏_{k′=1}^{K} DIRMULT(ψi,k′ + 1(k′ = k) · φi | α).

Further, we obtain the following result for the conditional distribution of action at,

p(at | a−t, s, z, α) ∝ T(st+1 | st, at) ∫_∆ pθ(θzst | α) ∏_{t′: zst′=zst} π(at′ | θzst) dθzst.

By introducing the sufficient statistics {ϑt,j}, which count the number of occurrences of action j among all states that are currently assigned to the same cluster as st (i.e. the cluster Czst), excluding at itself,

ϑt,j := ∑_{t′: zst′=zst, t′≠t} 1(at′ = j),

we finally arrive at the following expression,

p(at = j | a−t, s, z, α) ∝ (ϑt,j + α) · T(st+1 | st, at = j).

As for the static model, we can establish a relationship between the statistics used for the ordinary and the collapsed sampler,

ϑt,j = ξzst,j − 1(at = j).
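The Dirichlet-multinomial terms appearing in the collapsed updates above can be evaluated in closed form thanks to the Dirichlet-categorical conjugacy. The following sketch computes the corresponding log marginal; it is a standard identity, shown here purely for illustration.

```python
import numpy as np
from scipy.special import gammaln

def log_dirichlet_multinomial(counts, alpha):
    """log of  int prod_j theta_j^counts_j * DIR(theta | alpha * 1) dtheta."""
    counts = np.asarray(counts, dtype=float)
    A = counts.size                                   # number of actions |A|
    return (gammaln(A * alpha) - gammaln(A * alpha + counts.sum())
            + np.sum(gammaln(counts + alpha)) - A * gammaln(alpha))
```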

2.2.3 Prior models

In order to complete our model, we need to specify a prior distribution over indicator variables pz(z). The following paragraphs present three such candidate models:

Non-informative prior
The simplest of all prior models is the non-informative prior over partitionings, reflecting the assumption that, a priori, all cluster assignments are equally likely and that the indicators zi are mutually independent. In this case, pz(z) is constant and, hence, the term p(zi | z−i) in Eq. (10) and Eq. (11) disappears, so that the conditional distribution of indicator zi becomes directly proportional to the likelihood of the inferred action sequence.

Mixing prior
Another simple yet expressive prior can be realized by the (finite) Dirichlet mixture model. Instead of assuming that the indicator variables are independent, the model uses a set of mixing coefficients q = [q1, . . . , qK], where qk represents the prior probability that an indicator variable takes on value k. The mixing coefficients are themselves modeled by a Dirichlet distribution, so that we finally have

q ∼ DIR(q | (γ/K) · 1K),
zi | q ∼ CAT(zi | q),

where γ is another concentration parameter, controlling the variability of the mixing coefficients. Note that the indicator variables are still conditionally independent given the mixing coefficients in this model. More specifically, for a fixed q, the conditional distribution of a single indicator in Eq. (10) and Eq. (11) takes the following simple form,

p(zi = k | z−i, q) = qk.

If the value of q is unknown, we have two options to include this prior in our model. One is to sample q in addition to the remaining variables by drawing values from the following conditional distribution during the Gibbs procedure,

p(q | s, a, z, Θ, α) ∝ DIR(q | (γ/K) · 1K) ∏_{i=1}^{|S|} CAT(zi | q) ∝ DIR(q | ζ + (γ/K) · 1K),

where ζ := [ζ1, . . . , ζK], and ζk denotes the number of variables zi that map to cluster Ck,

ζk = ∑_{i=1}^{|S|} 1(zi = k).

Alternatively, we can again make use of the conjugacy property to marginalize out the mixing proportions q during the inference process, just as we did for the control parameters in previous sections. The result is (additional) collapsing in q. In this case, we simply replace the factor p(zi = k | z−i) in the conditional distribution of zi by

p(zi = k | z−i, γ) ∝ ζk^(−i) + γ/K,   (12)

where ζk^(−i) is defined like ζk but without counting the current value of indicator zi,

ζk^(−i) := ∑_{i′=1, i′≠i}^{|S|} 1(zi′ = k) = ζk − 1(zi = k).

A detailed derivation is omitted here but follows the same style as for the collapsing in Section 2.1.2.

Spatial prior
Both previous prior models assume (conditional) independence of the indicator variables and, hence, make no specific assumptions about their dependency structure. However, we can also use the prior model to promote a certain type of spatial state clustering. A reasonable choice is, for instance, to use a model which preferably groups “similar” states together (in other words, a model which favors clusterings that assign those states the same local control parameter). Similarity of states can be expressed, for example, by a monotonically decreasing decay function f : [0,∞) → [0, 1] which takes as input the distance between two states. The required pairwise distances can be, in turn, defined via some distance metric χ : S × S → [0,∞).

In fact, apart from the reasons listed in Section 2.2, there is an additional motivation, more intrinsically related to the dynamics of the system, why such a clustering can be useful: given that the transition model of our system admits locally smooth dynamics (which is typically the case for real-world systems), the resulting optimal control policy often turns out to be spatially smooth, too [11]. More specifically, under an optimal policy, two nearby states are highly likely to experience similar controls; hence, it is reasonable to assume a priori that both share the same local control parameter. For the policy recognition task, it certainly makes sense to regularize the inference problem by encoding this particular structure of the solution space into our model. The Potts model [30], which is a special case of a Markov random field with pairwise clique potentials [31], offers one way to do this,

pz(z) ∝ ∏_{i=1}^{|S|} exp( (β/2) ∑_{j=1, j≠i}^{|S|} f(di,j) δ(zi, zj) ).

Here, δ denotes Kronecker’s delta, di,j := χ(si, sj), i, j ∈ {1, . . . , |S|}, are the state similarity values, and β ∈ [0,∞) is the (inverse) temperature of the model which controls the strength of the prior. From this equation, we can easily derive the conditional distribution of a single indicator variable zi as

p(zi | z−i) ∝ exp( β ∑_{j=1, j≠i}^{|S|} f(di,j) δ(zi, zj) ).   (13)
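A minimal sketch of the conditional above: the prior weight for assigning zi = k grows with the summed, distance-decayed similarity to all states currently carrying label k. The distance matrix dist and the decay function decay_fn correspond to χ and f; all names are assumptions of this illustration.

```python
import numpy as np

def potts_conditional(i, z, dist, beta, decay_fn, n_clusters):
    """Unnormalized p(z_i = k | z_-i) for k = 0..K-1 under the Potts prior, cf. Eq. (13)."""
    weights = np.zeros(n_clusters)
    for j in range(len(z)):
        if j != i:
            weights[z[j]] += decay_fn(dist[i, j])   # sum_j f(d_ij) * delta(z_i, z_j)
    return np.exp(beta * weights)
```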

This completes our inference framework for finite spaces.

2.3 Countably infinite and uncountable state spaces

A major advantage of the clustering approach presented in the last section is that, due to the limited number of local policies to be learned from the finite amount of demonstration data, we can now apply the same methodology to state spaces of arbitrary size, including countably infinite and uncountable state spaces. This extension had been practically impossible for the static model because of the overfitting problem explained in Section 2.2. Nevertheless, there remains a fundamental conceptual problem: a direct extension of the model to these spaces would imply that the distribution over possible state partitionings becomes an infinite-dimensional object (i.e., in the case of uncountable state spaces, a distribution over functional mappings from states to local controllers), requiring an infinite number of indicator variables. Certainly, such an object is non-trivial to handle computationally.

However, while the number of latent cluster assignments grows unbounded with the size of the state space, the amount of observed trajectory data always remains finite. A possible solution to the problem is, therefore, to reformulate the inference task on a reduced state space S := {s1, s2, . . . , sT} containing only states along the observed trajectories. Reducing the state space in this way means that we need to consider only a finite set of indicator variables {zt}_{t=1}^{T}, one for each expert state s ∈ S, which always induces a model of finite size. Assuming that no state is visited twice, we may further use the same index set for both variable types.⁴ In order to limit the complexity of the dependency structure of the indicator variables for larger data sets, we further let the value of indicator zt depend only on a subset of the remaining variables z−t as defined by some neighborhood rule N. The resulting joint distribution is then given as

pz(z | s) ∝ ∏_{t=1}^{T} exp( (β/2) ∑_{t′∈Nt} f(dt,t′) δ(zt, zt′) ),

which now implicitly depends on the state sequence s through the pairwise distances dt,t′ := χ(st, st′), t, t′ ∈ {1, . . . , T} (hence the conditioning on s).

The use of a finite number of indicator variables along the expert trajectories obviously circumvents the above-mentioned problem of representational complexity. Nevertheless, there are some caveats associated with this approach. First of all, using a reduced state space model raises the question of marginal invariance [32]: if we added a new trajectory point to the data set, would this change our belief about the expert policy at previously visited states? In particular, how is this different from modeling that new point together with the initial ones in the first place? And further, what does such a reduced model imply for unvisited states? Can we still use it to make predictions about their local policies? These questions are, in fact, important if we plan to use our model to generalize the expert demonstrations to new situations. For a detailed discussion on this issue, the reader is referred to the supplement. Here, we focus on the inferential aspects of the problem, which means to identify the system parameters at the given trajectory states.

Another (but related) issue resulting from the reduced modeling approach is that we lose the simple generative interpretation of the process that could be used to explain the data generation beforehand. In the case of finite state spaces, we could think of a trajectory as being constructed by the following step-wise mechanism: first, the prior pz(z) is used to generate a set of indicator variables for all states. Independently, we pick some value for α from pα(α) and sample K local control parameters from pθ(θk | α). To finally generate a trajectory, we start with an initial state s1, generated by p1(s1), select a random action a1 from π(a1 | s1, θzs1) and transition to a new state s2 according to T(s2 | s1, a1), where we select another action a2, and so on. Such a directed way of thinking is possible since the finite model naturally obeys a causal structure where later states depend on earlier ones and the decisions made there. Furthermore, the cluster assignments and the local controllers could be generated in advance and isolated from each other because they were modeled marginally independent.

For the reduced state space model, this interpretation no longer applies as the model has no natural directionality. In fact, its variables depend on each other in a cyclic fashion: altering the value of a particular indicator variable (say, the one corresponding to the last trajectory point) will have an effect on the values of all remaining indicators due to their spatial relationship encoded by the “prior distribution” pz(z | s). Changing the values of the other indicators, however, will influence the actions being played at the respective states which, in turn, alters the probability of ending up with the observed trajectory in the first place and, hence, the position and value of the indicator variable we started with. Explaining the data generation of this model using a simple generative process is, therefore, not possible.

4. Note that we make this assumption for notational convenience only and that it is not required from a mathematical point of view. Nonetheless, for uncountable state spaces the assumption is reasonable since the event of reaching the same state twice has zero probability for most dynamic models. In the general case, however, the indicator variables require their own index set to ensure that each system state is associated with exactly one cluster, even when visited multiple times.

Nevertheless, the individual building blocks of our model (that is, the policy, the transition model, etc.) together form a valid distribution over the model variables, which can be readily used for parameter inference. For the reasons explained above, it makes sense to define this distribution in the form of a discriminative model, ignoring the underlying generative aspects of the process. This is sufficient since we can always condition on the observed state sequence s,

p(a, Θ, z | s, α) = (1/Zs) pz(z | s) ∏_{k=1}^{K} pθ(θk | α) p1(s1) ∏_{t=1}^{T−1} T(st+1 | st, at) π(at | θzst).

Herein, Zs is a data-dependent normalizing constant. The structure of this distribution is illustrated by the factor graph shown in the supplement (Fig. S-1), which highlights the circular dependence between the variables. Note that, for any fixed state sequence s, this distribution indeed encodes the same basic properties as the finite model in Eq. (8). In particular, the conditional distributions of all remaining variables remain unchanged, which allows us to apply the same inference machinery that we already used in the finite case. For a deeper discussion on the difference between the two models, we again point to the supplement.

3 NONPARAMETRIC POLICY RECOGNITION

In the last section, we presented a probabilistic policy recognition framework for modeling the expert behavior using a finite mixture of K local policies. Basically, there are two situations when such a model is useful:

• either we know the true number of expert policies,
• or, irrespective of the true behavioral complexity, we want to find an approximate system description in terms of at most K distinct control situations [27] (cf. finite state controllers [29]).

In all other cases, we are faced with the non-trivial problem of choosing K. In fact, the choice of K should not just be considered a mathematical necessity to perform inference in our model. By selecting a certain value for K we can, of course, directly control the complexity class of potentially inferred expert controllers. However, from a system identification point of view, it is more reasonable to infer the required granularity of the state partitioning from the observed expert behavior itself, instead of enforcing a particular model complexity. This way, we can gain valuable information about the underlying control structure and state representation used by the expert, which offers a possibility to learn a state partitioning of task-appropriate complexity directly from the demonstration data. Hence, the problem of selecting the right model structure should be considered as part of the inference problem itself.

From a statistical modeling perspective, there are two common ways to approach this problem. One is to make use of model selection techniques in order to determine the most parsimonious model that is in agreement with the observed data. However, choosing a particular model complexity still means that we consider only one possible explanation for the data, although other explanations might be likewise plausible. For many inference tasks, including this one, the more elegant approach is to keep the complexity flexible and, hence, adaptable to the data. Mathematically, this can be achieved by assuming a potentially infinite set of model parameters (in our case controllers) from which we activate only a finite subset to explain the particular data set at hand. This alternative way of thinking opens the door to the rich class of nonparametric models, which provide an integrated framework to formulate the inference problem over both model parameters and model complexity as a joint learning problem.

3.1 A Dirichlet process mixture modelThe classical way to nonparametric clustering is to usea Dirichlet process mixture model (DPMM) [33]. Thesemodels can be obtained by starting from a finite mixturemodel and letting the number of mixture components (i.e.the number of local controllers) approach infinity. In ourcase, we start with the clustering model from Section 2.2,using a mixing prior over indicator variables,

q ∼ DIR(q | (γ/K) · 1_K)
θ_k ∼ DIR(θ_k | α · 1_{|A|})
s_1 ∼ p_1(s_1)
z_i | q ∼ CAT(z_i | q)
a_t | s_t, Θ, z ∼ π(a_t | θ_{z_{s_t}})
s_{t+1} | s_t, a_t ∼ T(s_{t+1} | s_t, a_t).    (14)

From these equations, we arrive at the corresponding nonparametric model as K goes to infinity. For the theoretical foundations of this limit, the reader is referred to the more general literature on Dirichlet processes, such as [33], [34]. In this paper, we restrict ourselves to providing the resulting sampling mechanisms for the policy recognition problem.
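To make the generative structure of Eq. (14) concrete, the following minimal sketch forward-samples the finite mixture model. The transition kernel is passed in as a generic callable, and the function name, the uniform initial state distribution, and the (γ/K)-parametrization of the symmetric Dirichlet prior are our own illustrative choices rather than the authors' reference implementation.

```python
import numpy as np

def sample_finite_model(K, A, S, T_len, transition, gamma=1.0, alpha=1.0, rng=None):
    """Forward-sample the finite mixture policy model of Eq. (14).

    transition(s, a, rng) -> next state index; S is the number of states,
    A the number of actions, K the number of local controllers.
    """
    rng = np.random.default_rng() if rng is None else rng
    q = rng.dirichlet(np.full(K, gamma / K))          # mixing proportions
    theta = rng.dirichlet(np.full(A, alpha), size=K)  # local control policies
    z = rng.choice(K, size=S, p=q)                    # controller index per state
    s = rng.integers(S)                               # s_1 ~ p_1 (uniform for illustration)
    states, actions = [s], []
    for _ in range(T_len - 1):
        a = rng.choice(A, p=theta[z[s]])              # a_t ~ pi(. | theta_{z_{s_t}})
        s = transition(s, a, rng)                     # s_{t+1} ~ T(. | s_t, a_t)
        states.append(s)
        actions.append(a)
    return np.array(states), np.array(actions), z, theta
```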

In a DPMM, the mixing proportions q of the local parameters are marginalized out (that is, we use a collapsed sampler). The resulting distribution over partitionings is described by a Chinese restaurant process (CRP) [35], which can be derived, for instance, by considering the limit K → ∞ of the mixing process induced by the Gibbs update in Eq. (12),

p(z_i = k | z_{-i}, γ) ∝
    ζ_k^{(-i)}   if k ∈ {1, . . . , K∗},
    γ            if k = K∗ + 1.    (15)

Here, K∗ denotes the number of distinct entries in z_{-i}, which are represented by the numerical values 1, . . . , K∗. In this model, a state joins an existing cluster (i.e. a group of states whose indicators have the same value) with probability proportional to the number of states already contained in that cluster. Alternatively, it may create a new cluster with probability proportional to γ.
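As a small illustration of Eq. (15), the sketch below computes the normalized CRP probabilities for a single indicator given the remaining assignments. The helper name is ours, and the snippet covers only the prior term of the update, not the action likelihood.

```python
import numpy as np

def crp_prior_probs(z_minus_i, gamma):
    """CRP probabilities for z_i given the remaining indicators (Eq. (15)):
    an existing cluster k is weighted by its current size zeta_k^(-i),
    a new cluster by gamma (last entry of the returned probability vector)."""
    labels, counts = np.unique(z_minus_i, return_counts=True)
    weights = np.append(counts.astype(float), gamma)
    return labels, weights / weights.sum()
```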

From the model equations (14) it is evident that, given a particular setting of indicators, the conditional distributions of all other variable types remain unchanged. Effectively, we only replaced the prior model p_z(z) by the CRP. Hence, we can apply the same Gibbs updates for the actions and controllers as before and need to rederive only the conditional distributions of the indicator variables under consideration of the above defined process. According to Eq. (15), we herein need to distinguish whether an indicator variable takes a value already occupied by other indicators (i.e. it joins an existing cluster) or it is assigned a new value (i.e. it creates a new cluster). Let {θ_k}_{k=1}^{K∗} denote the set of control parameters associated with z_{-i}. In the first case (k ∈ {1, . . . , K∗}), we can then write

p(z_i = k | z_{-i}, s, a, {θ_{k'}}_{k'=1}^{K∗}, α, γ)
  = p(z_i = k | z_{-i}, {a_t}_{t:s_t=i}, θ_k, α, γ)
  ∝ p(z_i = k | z_{-i}, θ_k, α, γ) · p({a_t}_{t:s_t=i} | z_i = k, z_{-i}, θ_k, α, γ)
  ∝ p(z_i = k | z_{-i}, γ) · p({a_t}_{t:s_t=i} | θ_k)
  ∝ ζ_k^{(-i)} · ∏_{t:s_t=i} π(a_t | θ_k).

In the second case (k = K∗ + 1), we instead obtain

p(z_i = K∗+1 | z_{-i}, s, a, {θ_k}_{k=1}^{K∗}, α, γ)
  = p(z_i = K∗+1 | z_{-i}, {a_t}_{t:s_t=i}, α, γ)
  ∝ p(z_i = K∗+1 | z_{-i}, α, γ) · p({a_t}_{t:s_t=i} | z_i = K∗+1, z_{-i}, α, γ)
  ∝ p(z_i = K∗+1 | z_{-i}, γ) · p({a_t}_{t:s_t=i} | z_i = K∗+1, α)
  ∝ γ · ∫_Δ p({a_t}_{t:s_t=i} | θ_{K∗+1}) p_θ(θ_{K∗+1} | α) dθ_{K∗+1}
  ∝ γ · ∫_Δ ∏_{t:s_t=i} π(a_t | θ_{K∗+1}) p_θ(θ_{K∗+1} | α) dθ_{K∗+1}
  ∝ γ · DIRMULT(φ_i | α).

If a new cluster is created, we further need to initialize the corresponding control parameter θ_{K∗+1} by performing the respective Gibbs update, i.e. by sampling from

p(θ_{K∗+1} | z, s, a, {θ_k}_{k=1}^{K∗}, α, γ)
  = p(θ_{K∗+1} | {a_t}_{t:z_{s_t}=K∗+1}, α)
  ∝ p_θ(θ_{K∗+1} | α) · p({a_t}_{t:z_{s_t}=K∗+1} | θ_{K∗+1})
  ∝ p_θ(θ_{K∗+1} | α) · ∏_{t:z_{s_t}=K∗+1} π(a_t | θ_{K∗+1})
  ∝ DIR(θ_{K∗+1} | ξ_{K∗+1} + α · 1_{|A|}).

Should a cluster become unoccupied during the sampling process, the corresponding control parameter may be removed from the stored parameter set {θ_k} and the index set for k needs to be updated accordingly. Note that this sampling mechanism is a specific instance of Algorithm 2 described in [33]. A collapsed variant can be derived in a similar fashion.

3.2 Policy recognition using the distance-dependent Chinese restaurant process

In the previous section, we have seen that the DPMM can be derived as the nonparametric limit model of a finite mixture using a set of latent mixing proportions q for the clusters. Although the DPMM allows us to keep the number of active controllers flexible and, hence, adaptable to the complexity of the demonstration data, the CRP as the underlying clustering mechanism does not capture any spatial dependencies between the indicator variables.

Fig. 3: Schematic illustration of the ddCRP-based clustering applied to the reduced state space model in Section 2.3. Each trajectory state is connected to some other state of the sequence. The connected components of the resulting graph implicitly define the state clustering. Coloring of the background illustrates the spatial cluster extrapolation (see Section A in the supplement). Note that the underlying decision-making process is assumed to be discrete in time; the continuous gray line shown in the figure is only to indicate the temporal ordering of the trajectory states.

In fact, in the CRP, the indicators z_i are coupled only via their relative frequencies (see Eq. (15)) but not through their individual locations in space, resulting in an exchangeable collection of random variables [35]. One could thus argue that the spatial structure of the clustering problem is a priori ignored.

The fact that DPMMs are nevertheless used for spatial clustering tasks can be explained by the particular form of data likelihood models that are typically used for the mixture components. In a Gaussian mixture model [36], for instance, the spatial clusters emerge due to the unimodal nature of the mixture components, which encodes the locality property of the model needed to obtain a meaningful spatial clustering of the data. For the policy recognition problem, however, the DPMM is not able to exploit any spatial information via the data likelihood since the clustering of states is performed at the level of the inferred action information (see Eq. (10)) and not on the state sequence itself. Consequently, we cannot expect to obtain a smooth clustering of the system state space, especially when the expert policies are overlapping (i.e. when they share one or more common actions) so that the action information alone is not sufficient to discriminate between policies. For uncountable state spaces, this problem is further complicated by the fact that we observe at most one expert state transition per system state. Here, the spatial context of the data is the only information which can resolve this ambiguity.

In order to facilitate a spatially smooth clustering, we therefore need to consider non-exchangeable distributions over partitionings. More specifically, we need to design our model in such a way that, whenever a state s is “close” to some other state s′ and assigned to some cluster C_k, then, a priori, s′ should belong to the same cluster C_k with high probability. In that sense, we are looking for the nonparametric counterpart of the Potts model. One model with such properties is the distance-dependent Chinese restaurant process (ddCRP) [32].⁵ As opposed to the traditional CRP, the ddCRP explicitly takes into account the spatial structure of the data. This is done in the form of pairwise distances between states, which can be obtained, for instance, by defining an appropriate distance metric on the state space (see Section 2.2.3). Instead of assigning states to clusters as done by the CRP, the ddCRP assigns states to other states according to their pairwise distances. More specifically, the probability that state i gets assigned to state j is defined as

p(c_i = j | D, ν) ∝
    ν           if i = j,
    f(d_{i,j})  otherwise,    (16)

where ν ∈ [0, ∞) is called the self-link parameter of the process, D denotes the collection of all pairwise state distances, and c_i is the “to-state” assignment of state i, which can be thought of as a directed edge on the graph defined on the set of all states (see Fig. 3). Accordingly, i and j in Eq. (16) can take values 1, . . . , |S| for the finite state space model and 1, . . . , T for our reduced state space model. The state clustering is then obtained as a byproduct of this mapping via the connected components of the resulting graph (see Fig. 3 again).
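The following sketch illustrates this construction: it draws the “to-state” links from the prior of Eq. (16) and recovers the induced clustering from the connected components of the link graph. The function names are illustrative and F is assumed to be a precomputed similarity matrix with F[i, j] = f(d_ij); this is not the authors' reference code.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def sample_ddcrp_links(F, nu, rng):
    """Draw the "to-state" assignments c_i from the ddCRP prior of Eq. (16):
    self-link with weight nu, link to state j with weight F[i, j] = f(d_ij)."""
    N = F.shape[0]
    c = np.empty(N, dtype=int)
    for i in range(N):
        w = F[i].astype(float)
        w[i] = nu
        c[i] = rng.choice(N, p=w / w.sum())
    return c

def links_to_clusters(c):
    """Cluster labels z_i given by the connected components of the link graph."""
    N = len(c)
    graph = csr_matrix((np.ones(N), (np.arange(N), c)), shape=(N, N))
    return connected_components(graph, directed=False)[1]
```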

Replacing the CRP by the ddCRP and following the same line of argument as in [32], we obtain the required conditional distribution of the state assignment c_i as

p(c_i = j | c_{-i}, s, a, α, D, ν) ∝
    ν                 if j = i,
    f(d_{i,j})        if no clusters merged,
    f(d_{i,j}) · L    if C_{z_i} and C_{z_j} merged,

where we use the shorthand notation

L = DIRMULT(ξ_{z_i} + ξ_{z_j} | α) / [ DIRMULT(ξ_{z_i} | α) · DIRMULT(ξ_{z_j} | α) ]

for the data likelihood term. The ξ_{k,j}'s are defined as in Eq. (9) but are based on the clustering which arises when we ignore the current link c_i. The resulting Gibbs sampler is a collapsed one, as the local control parameters are necessarily marginalized out during the inference process.
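A minimal sketch of this link update is given below, reusing the log_dirmult and links_to_clusters helpers from the earlier sketches; phi[s] again denotes the action counts at state s and F the precomputed similarity matrix. These names and the dictionary-based bookkeeping are our own simplifications.

```python
import numpy as np

def gibbs_update_link(i, c, phi, F, nu, alpha, rng):
    """Collapsed resampling of the ddCRP link c[i]: the prior weight of each
    candidate link (Eq. (16)) is multiplied by the likelihood ratio L whenever
    the link would merge two currently separate clusters."""
    c[i] = i                                   # temporarily remove the current link
    z = links_to_clusters(c)                   # clustering without link i
    cluster_counts = {k: phi[z == k].sum(axis=0) for k in np.unique(z)}
    N = len(c)
    logp = np.empty(N)
    for j in range(N):
        logp[j] = np.log(nu) if j == i else np.log(F[i, j])
        if j != i and z[j] != z[i]:            # linking to j merges C_{z_i} and C_{z_j}
            ki, kj = cluster_counts[z[i]], cluster_counts[z[j]]
            logp[j] += (log_dirmult(ki + kj, alpha)
                        - log_dirmult(ki, alpha) - log_dirmult(kj, alpha))
    p = np.exp(logp - logp.max())
    c[i] = rng.choice(N, p=p / p.sum())
    return c
```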

4 SIMULATION RESULTS

In this section, we present simulation results for two types of system dynamics. As a proof of concept, we first investigate the case of uncountable state spaces, which we consider the more challenging setting for reasons explained earlier. To compare our framework with existing methods, we further provide simulation results for the standard grid world benchmark (see e.g. [9], [11], [19]). It should be pointed out, however, that establishing a fair comparison between LfD models is generally difficult due to their different working principles (e.g. reward prediction vs. action prediction), objectives (system identification vs. optimal control), requirements (e.g. MDP solver, knowledge of the expert's discount factor, countable vs. uncountable state space), and assumptions (e.g. deterministic vs. stochastic expert behavior).

5. Note that the authors of [32] avoid calling this model nonparametric since it cannot be cast as a mixture model originating from a random measure. However, we stick to this term in order to make a clear distinction from the parametric models in Section 2, and to highlight the fact that there is no parameter K determining the number of controllers.

Fig. 4: Schematic illustration of the expert policy used in Section 4.1, which applies eight local controllers to sixteen distinct regions. A sample trajectory is shown in color.

Accordingly, our goal is rather to demonstrate the prediction abilities of the considered models than to push the models to their individual limits. Therefore, and to reduce the overall computational load, we tuned most model hyper-parameters by hand. Our code is available at https://github.com/AdrianSosic/BayesianPolicyRecognition.

4.1 Example 1: uncountable state space

As an illustrative example, we consider a dynamical system which describes the circular motion of an agent on a two-dimensional state space. The actions of the agent correspond to 24 directions that divide the space of possible angles [0, 2π) into equally-sized intervals. More specifically, action j corresponds to the angle (j − 1) · 2π/24. The transition model of the system is defined as follows: for each selected action, the agent first makes a step of length µ = 1 in the intended direction. The so-obtained position is then distorted by additive zero-mean isotropic Gaussian noise of variance σ². This defines our transition kernel as

T(s_{t+1} | s_t, a_t = j) = N(s_{t+1} | s_t + µ · e_j, σ²I),    (17)

where s_t, s_{t+1} ∈ R², e_j denotes the two-dimensional unit vector pointing in the direction of action j, and I is the two-dimensional identity matrix. The overall goal of our agent is to describe a circular motion around the origin in the best possible manner allowed by the available actions. However, due to limited sensory information, the agent is not able to observe its exact position on the plane but can only distinguish between certain regions of the state space, as illustrated by Fig. 4. Also, the agent is unsure about the optimal control strategy, i.e. it does not always make optimal decisions but selects its actions uniformly at random from a subset of actions, consisting of the optimal one and the two actions pointing to neighboring directions (see Fig. 4 again). To increase the difficulty of the prediction task, we further let the agent change the direction of travel whenever the critical distance of r = 5 to the origin is exceeded.

Having defined the expert behavior, we generate 10 sample trajectories of length T = 100. Herein, we assume a motion noise level of σ = 0.2 and initialize the agent's position uniformly at random on the unit circle. An example trajectory is shown in Fig. 4. The obtained trajectory data is fed into the presented inference algorithms to approximate the posterior distribution over expert controllers, and the whole experiment is repeated in 100 Monte Carlo runs.

For the spatial models, we use the Euclidean metric to compute the pairwise distances between states,

χ(s, s′) = ||s − s′||_2.    (18)

The corresponding similarity values are calculated using a Gaussian-shaped kernel. More specifically,

f_Potts(d) = exp(−d²/σ_f²)

for the Potts model and

f_ddCRP(d) = (1 − ε) · f_Potts(d) + ε

for the ddCRP model, with σ_f = 1 and a constant offset of ε = 0.01 which ensures that states with large distances can still join the same cluster. For the Potts model, we further use a neighborhood structure containing the eight closest trajectory points of a state. This way, we ensure that, in principle, each local expert policy may occur at least once in the neighborhood of a state. The concentration parameter for the local controls is set to α = 1, corresponding to a uniform prior belief over local policies.
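A minimal sketch of these precomputations, assuming a (T × 2) array of trajectory states and using SciPy's pairwise-distance routine, is given below; the function name is our own.

```python
import numpy as np
from scipy.spatial.distance import cdist

def similarity_matrices(states, sigma_f=1.0, eps=0.01):
    """Pairwise Euclidean distances (Eq. (18)) and the Gaussian-shaped
    similarities used for the Potts and ddCRP models."""
    D = cdist(states, states)                 # chi(s, s') = ||s - s'||_2
    F_potts = np.exp(-D**2 / sigma_f**2)
    F_ddcrp = (1 - eps) * F_potts + eps       # constant offset keeps all weights positive
    return D, F_potts, F_ddcrp
```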

A major drawback of the Potts model is that posterior inference about the temperature parameter β is complicated due to the nonlinear effect of the parameter on the normalization of the model. Therefore, we manually selected a temperature of β = 1.6 based on a minimization of the average policy prediction error (discussed below) via parameter sweeping. As opposed to this, we extend the inference problem for the ddCRP to the self-link parameter ν, as suggested in [32]. For this, we use an exponential prior,

p_ν(ν) = EXP(ν | λ),

with rate parameter λ = 0.1, and apply the independence Metropolis-Hastings algorithm [37] using p_ν(ν) as proposal distribution with an initial value of ν = 1. In all our simulations, the sampler quickly converged to its stationary distribution, yielding posterior values for ν with a mean of 0.024 and a standard deviation of 0.023.
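The sketch below illustrates one such independence Metropolis-Hastings step under the stated prior. Because the proposal equals the EXP(λ) prior, the acceptance ratio reduces to the likelihood ratio of the current link structure under Eq. (16); the helper name and the similarity-matrix representation F are our own assumptions.

```python
import numpy as np

def mh_update_nu(nu, c, F, lam=0.1, rng=None):
    """Independence Metropolis-Hastings step for the ddCRP self-link parameter."""
    rng = np.random.default_rng() if rng is None else rng

    def log_lik(v):
        # normalized version of Eq. (16): per-row normalizer is v + sum_{j != i} f(d_ij)
        row_norm = F.sum(axis=1) - np.diag(F) + v
        idx = np.arange(len(c))
        num = np.where(c == idx, v, F[idx, c])
        return np.sum(np.log(num) - np.log(row_norm))

    nu_prop = rng.exponential(1.0 / lam)               # proposal drawn from the prior
    if np.log(rng.uniform()) < log_lik(nu_prop) - log_lik(nu):
        return nu_prop                                 # accept
    return nu                                          # reject, keep current value
```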

To locally compare the predicted policy with the ground truth at a given state, we compute their earth mover's distance (EMD) [38] with a ground distance metric measuring the absolute angular difference between the involved actions. To track the learning progress of the algorithms, we calculate the average EMD over all states of the given trajectory set at each Gibbs iteration. Herein, the local policy predictions are computed from the single Gibbs sample of the respective iteration, consisting of all sampled actions, indicators and – in case of non-collapsed sampling – the local control parameters. The resulting mean EMDs and standard deviations are depicted in Fig. 5. The inset further shows the average EMD computed at non-trajectory states which are sampled on a regular grid (depicted in the supplement), reflecting the quality of the resulting spatial prediction.
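Such an EMD between two discrete action distributions can be computed as a small transportation linear program. The sketch below does so with SciPy's LP solver, taking the circular angular difference between action directions as ground distance; this is our own illustrative implementation of the evaluation metric, not the code used for the reported results.

```python
import numpy as np
from scipy.optimize import linprog

def angular_emd(p, q):
    """Earth mover's distance between two action distributions p and q, with the
    absolute angular difference between action directions as ground metric."""
    n = len(p)
    angles = 2 * np.pi * np.arange(n) / n
    diff = np.abs(angles[:, None] - angles[None, :])
    C = np.minimum(diff, 2 * np.pi - diff)            # circular angular distance
    # transportation LP: minimize <C, X> s.t. X 1 = p, X^T 1 = q, X >= 0
    A_eq = np.zeros((2 * n, n * n))
    for i in range(n):
        A_eq[i, i * n:(i + 1) * n] = 1                # row-sum constraints
        A_eq[n + i, i::n] = 1                         # column-sum constraints
    b_eq = np.concatenate([p, q])
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun
```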

Fig. 5: Average policy prediction error (average EMD over Gibbs iterations) at the simulated trajectory states (main figure) and at non-trajectory states (inset), comparing the mixing prior, Potts (complete), Potts (collapsed), and ddCRP models. Shown are the empirical mean values and standard deviations, estimated from 100 Monte Carlo runs.

As expected, the finite mixture model (using the true number of local policies, a collapsed mixing prior, and γ = 1) is not able to learn a reasonable policy representation from the expert demonstrations since it does not exploit the spatial structure of the data. In fact, the resulting prediction error shows only a slight improvement as compared to an untrained model. In contrast to this, all spatial models capture the expert behavior reasonably well. In agreement with our reasoning in Section 2.1.2, we observe that the collapsed Potts model mixes significantly faster and has a smaller prediction variance than the non-collapsed version. However, the ddCRP model gives the best result, both in terms of mixing speed (see [32] for an explanation of this phenomenon) and model accuracy. Interestingly, this is despite the fact that the ddCRP model additionally needs to infer the number of local controllers necessary to reproduce the expert behavior. The corresponding posterior distribution, which shows a pronounced peak at the true number, is depicted in the supplement. There, we also provide additional simulation results which give insights into the learned state partitioning and the resulting spatial policy prediction error. The results reveal that all expert motion patterns can be identified by our algorithm.

4.2 Example 2: finite state space

In this section, we compare the prediction capabilities of our model to existing LfD frameworks, in particular: the maximum margin method in [9] (max-margin), the maximum entropy approach in [12] (max-ent), and the expectation-maximization algorithm in [11] (EM). For the comparison, we restrict ourselves to the ddCRP model, which showed the best performance among all presented models.

As a first experiment, we compare all methods on a finite version of the setting in Section 4.1, which is obtained by discretizing the continuous state space into a regular grid S = {(x, y) ∈ Z² : |x|, |y| ≤ 10}, resulting in a total of 441 states. The transition probabilities are chosen proportional to the normal densities in Eq. (17) sampled at the grid points. Here, we used a noise level of σ = 1 and a reduced number of eight actions. Probability mass “lying outside” the finite grid area is shifted to the closest border states of the grid.
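A hedged sketch of this discretization is shown below. It samples the Gaussian density of Eq. (17) at the grid points and renormalizes; for brevity, it handles probability mass outside the grid by renormalization rather than by shifting it to the border states as described above, and the function name is our own.

```python
import numpy as np
from scipy.stats import multivariate_normal

def grid_transition_model(radius=10, n_actions=8, mu=1.0, sigma=1.0):
    """Discretized transition tensor T[s, a, s'] on the grid
    S = {(x, y) in Z^2 : |x|, |y| <= radius} (441 states for radius = 10)."""
    coords = np.array([(x, y) for x in range(-radius, radius + 1)
                               for y in range(-radius, radius + 1)], dtype=float)
    n_states = len(coords)
    angles = 2 * np.pi * np.arange(n_actions) / n_actions
    steps = mu * np.stack([np.cos(angles), np.sin(angles)], axis=1)
    T = np.empty((n_states, n_actions, n_states))
    for a, step in enumerate(steps):
        for s, pos in enumerate(coords):
            dens = multivariate_normal.pdf(coords, mean=pos + step, cov=sigma**2 * np.eye(2))
            T[s, a] = dens / dens.sum()       # renormalize over the finite grid
    return coords, T
```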

Figure 6a delineates the average EMD over the number of trajectories (each of length T = 10) provided for training.


Fig. 6: Average EMD values for the prediction task described in Section 4.2, comparing max-ent, max-margin, EM, and ddCRP: (a) learning curves (circular policy), (b) learning curves (MDP policy), (c) model robustness (MDP policy). Shown are the empirical mean values and standard deviations, estimated from 100 Monte Carlo runs. The EMD values are computed based on (a) the predicted action distributions and (b,c) the predicted next-state distributions. Note that the curves of max-ent (purple) and max-margin (yellow) in subfigure (a) lie on top of each other.

We observe that neither of the two intentional models (max-ent and max-margin) is able to capture the demonstrated expert behavior. This is due to the fact that the circular expert motion cannot be explained by a simple state-dependent reward structure but requires a more complex state-action reward model, which is not considered in the original formulations [9], [12]. While the EM model is indeed able to capture the general trend of the data, its prediction is less accurate than that of the ddCRP model, since it cannot reproduce the stochastic nature of the expert policy. In fact, this difference in performance will become even more pronounced for expert policies which distribute their probability mass on a larger subset of actions. Overall, the ddCRP model outperforms all other models since the provided expert behavior violates their assumptions.

To analyze how the ddCRP competes against the other models in their nominal situations, we further compare all algorithms on a standard grid world task where the expert behavior is obtained as the optimal response to a simple state-dependent reward function. Herein, each state on the grid receives a nonzero reward with a chance of 1%, drawn from a standard normal distribution; worlds which contain no reward are discarded. The discount factor of 0.9, which is used to compute the expert policy (see [17]), is provided as additional input for the intentional models. The results are shown in Figure 6b, which illustrates that the intention-based max-margin method outperforms all other methods for small amounts of training data. The sub-intentional methods (EM and ddCRP), on the other hand, yield better asymptotic estimates and smaller prediction variances. It should be pointed out that the three reference methods have a clear advantage over the ddCRP in this case because they assume a deterministic expert behavior a priori and do not need to infer this piece of information from the data. Despite this additional challenge, the ddCRP model yields a competitive performance.

Finally, we compare all approaches in terms of their robustness against modeling errors. For this purpose, we repeat the previous experiment with a fixed number of 1000 trajectories but employ a different transition model for inference than used for data generation. More specifically, we utilize an overly fine-grained model consisting of 24 directions, assuming that the true action set is unknown, as suggested in Section 1.1. Additionally, we perturb the assumed model by multiplying (and later renormalizing) each transition probability with a random number generated according to f(u) = tan(π/4 · (u + 1)), with u ∼ UNIFORM(−η, η) and perturbation strength η ∈ [0, 1]. Due to the resulting model mismatch, a comparison to the ground truth policy based on the predicted action distribution becomes meaningless. Instead, we compute the Euclidean EMDs between the true and the predicted next-state distributions, which we obtain by marginalizing the actions of the true/assumed transition model with respect to the true/learned policy. Figure 6c depicts the resulting prediction performance for different perturbation strengths η. The results confirm that our approach is not only less sensitive to modeling errors, as argued in Section 1.1, but also has a notably smaller prediction variance than the intentional models.
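For reference, the sketch below applies the described multiplicative perturbation to a transition tensor such as the one produced in the earlier discretization sketch; the function name is our own.

```python
import numpy as np

def perturb_transition_model(T, eta, rng=None):
    """Perturb each transition probability by a factor f(u) = tan(pi/4 * (u + 1))
    with u ~ UNIFORM(-eta, eta), then renormalize over next states."""
    rng = np.random.default_rng() if rng is None else rng
    u = rng.uniform(-eta, eta, size=T.shape)
    T_pert = T * np.tan(np.pi / 4 * (u + 1))        # f(0) = 1, i.e. no perturbation
    return T_pert / T_pert.sum(axis=-1, keepdims=True)
```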

5 CONCLUSION

In this work, we proposed a novel approach to LfD by jointly learning the latent control policy of an observed expert demonstrator together with a task-appropriate representation of the system state space. With the described parametric and nonparametric models, we presented two formulations of the same problem that can be used either to learn a global system controller of specified complexity, or to infer the required model complexity from the observed expert behavior itself. Simulation results for both countable and uncountable state spaces and a comparison to existing frameworks demonstrated the efficacy of our approach. Most notably, the results showed that our method is able to learn accurate predictive behavioral models in situations where intentional methods fail, i.e. when the expert behavior cannot be explained as the result of a simple planning procedure. This makes our method applicable to a broader range of problems and suggests its use in a more general system identification context where we have no such prior knowledge about the expert behavior. Additionally, the task-adapted state representation learned through our framework can be used for further reasoning.


REFERENCES

[1] B. D. Argall, S. Chernova, M. Veloso, and B. Browning, “A survey of robot learning from demonstration,” Robotics and Autonomous Systems, vol. 57, no. 5, pp. 469–483, 2009.
[2] M. P. Deisenroth, D. Fox, and C. E. Rasmussen, “Gaussian processes for data-efficient learning in robotics and control,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 2, pp. 408–423, 2015.
[3] P. Abbeel and A. Y. Ng, “Exploration and apprenticeship learning in reinforcement learning,” in Proc. 22nd International Conference on Machine Learning, 2005, pp. 1–8.
[4] D. Michie, M. Bain, and J. Hayes-Miches, “Cognitive models from subcognitive skills,” IEE Control Engineering Series, vol. 44, pp. 71–99, 1990.
[5] C. Sammut, S. Hurst, D. Kedzier, and D. Michie, “Learning to fly,” in Proc. 9th International Workshop on Machine Learning, 1992, pp. 385–393.
[6] P. Abbeel, A. Coates, and A. Y. Ng, “Autonomous helicopter aerobatics through apprenticeship learning,” The International Journal of Robotics Research, 2010.
[7] D. A. Pomerleau, “Efficient training of artificial neural networks for autonomous navigation,” Neural Computation, vol. 3, no. 1, pp. 88–97, 1991.
[8] C. G. Atkeson and S. Schaal, “Robot learning from demonstration,” in Proc. 14th International Conference on Machine Learning, vol. 97, 1997, pp. 12–20.
[9] P. Abbeel and A. Y. Ng, “Apprenticeship learning via inverse reinforcement learning,” in Proc. 21st International Conference on Machine Learning, 2004.
[10] A. Panella and P. J. Gmytrasiewicz, “Nonparametric Bayesian learning of other agents’ policies in interactive POMDPs,” in Proc. International Conference on Autonomous Agents and Multiagent Systems, 2015, pp. 1875–1876.
[11] A. Sosic, A. M. Zoubir, and H. Koeppl, “Policy recognition via expectation maximization,” in Proc. 41st IEEE International Conference on Acoustics, Speech and Signal Processing, 2016.
[12] B. D. Ziebart, A. L. Maas, J. A. Bagnell, and A. K. Dey, “Maximum entropy inverse reinforcement learning,” in Proc. 23rd AAAI Conference on Artificial Intelligence, 2008, pp. 1433–1438.
[13] A. Y. Ng and S. J. Russell, “Algorithms for inverse reinforcement learning,” in Proc. 17th International Conference on Machine Learning, 2000, pp. 663–670.
[14] D. Ramachandran and E. Amir, “Bayesian inverse reinforcement learning,” in Proc. 20th International Joint Conference on Artificial Intelligence, vol. 51, pp. 2586–2591, 2007.
[15] S. D. Parsons, P. Gmytrasiewicz, and M. J. Wooldridge, Game Theory and Decision Theory in Agent-Based Systems. Springer Science & Business Media, 2012, vol. 5.
[16] K. Hindriks and D. Tykhonov, “Opponent modelling in automated multi-issue negotiation using Bayesian learning,” in Proc. 7th International Joint Conference on Autonomous Agents and Multiagent Systems, 2008, pp. 331–338.
[17] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 1998.
[18] U. Syed and R. E. Schapire, “A game-theoretic approach to apprenticeship learning,” in Advances in Neural Information Processing Systems, 2007, pp. 1449–1456.
[19] B. Michini and J. P. How, “Bayesian nonparametric inverse reinforcement learning,” in Machine Learning and Knowledge Discovery in Databases. Springer, 2012, pp. 148–163.
[20] C. G. Atkeson and J. C. Santamaria, “A comparison of direct and model-based reinforcement learning,” in International Conference on Robotics and Automation, 1997.
[21] C. A. Rothkopf and C. Dimitrakakis, “Preference elicitation and inverse reinforcement learning,” in Machine Learning and Knowledge Discovery in Databases. Springer, 2011, pp. 34–48.
[22] O. Pietquin, “Inverse reinforcement learning for interactive systems,” in Proc. 2nd Workshop on Machine Learning for Interactive Systems, 2013, pp. 71–75.
[23] S. Schaal, “Is imitation learning the route to humanoid robots?” Trends in Cognitive Sciences, vol. 3, no. 6, pp. 233–242, 1999.
[24] K. Dvijotham and E. Todorov, “Inverse optimal control with linearly-solvable MDPs,” in Proc. 27th International Conference on Machine Learning, 2010, pp. 335–342.
[25] E. Charniak and R. P. Goldman, “A Bayesian model of plan recognition,” Artificial Intelligence, vol. 64, no. 1, pp. 53–79, 1993.
[26] B. Piot, M. Geist, and O. Pietquin, “Learning from demonstrations: Is it worth estimating a reward function?” in Machine Learning and Knowledge Discovery in Databases. Springer, 2013, pp. 17–32.
[27] M. Waltz and K. Fu, “A heuristic approach to reinforcement learning control systems,” IEEE Transactions on Automatic Control, vol. 10, no. 4, pp. 390–398, 1965.
[28] F. Doshi-Velez, D. Pfau, F. Wood, and N. Roy, “Bayesian nonparametric methods for partially-observable reinforcement learning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 2, pp. 394–407, 2015.
[29] N. Meuleau, L. Peshkin, K.-E. Kim, and L. P. Kaelbling, “Learning finite-state controllers for partially observable environments,” in Proc. 15th Conference on Uncertainty in Artificial Intelligence, 1999, pp. 427–436.
[30] R. B. Potts, “Some generalized order-disorder transformations,” in Mathematical Proceedings of the Cambridge Philosophical Society, vol. 48, no. 1, 1952, pp. 106–109.
[31] D. Koller and N. Friedman, Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.
[32] D. M. Blei and P. I. Frazier, “Distance dependent Chinese restaurant processes,” The Journal of Machine Learning Research, vol. 12, pp. 2461–2488, 2011.
[33] R. M. Neal, “Markov chain sampling methods for Dirichlet process mixture models,” Journal of Computational and Graphical Statistics, vol. 9, no. 2, pp. 249–265, 2000.
[34] T. S. Ferguson, “A Bayesian analysis of some nonparametric problems,” The Annals of Statistics, pp. 209–230, 1973.
[35] D. J. Aldous, Exchangeability and Related Topics. Springer, 1985.
[36] C. E. Rasmussen, “The infinite Gaussian mixture model,” in Advances in Neural Information Processing Systems. MIT Press, 2000, pp. 554–560.
[37] S. Chib and E. Greenberg, “Understanding the Metropolis-Hastings algorithm,” The American Statistician, vol. 49, no. 4, pp. 327–335, 1995.
[38] Y. Rubner, C. Tomasi, and L. J. Guibas, “A metric for distributions with applications to image databases,” in Proc. 6th International Conference on Computer Vision, 1998, pp. 59–66.

Adrian Sosic is a member of the Signal Processing Group and an associate member of the Bioinspired Communication Systems Lab at Technische Universität Darmstadt. Currently, he is working towards his Ph.D. degree under the supervision of Prof. Abdelhak M. Zoubir and Prof. Heinz Koeppl. His research interests center around topics from machine learning and (inverse) reinforcement learning, with a focus on probabilistic inference, multi-agent systems, and Bayesian nonparametrics.

Abdelhak M. Zoubir is professor at the Department of Electrical Engineering and Information Technology at Technische Universität Darmstadt, Germany. His research interest lies in statistical methods for signal processing with emphasis on bootstrap techniques, robust detection and estimation, and array processing applied to telecommunications, radar, sonar, automotive monitoring and biomedicine.

Heinz Koeppl is professor at the Department of Electrical Engineering and Information Technology at Technische Universität Darmstadt, Germany. His research interests include Bayesian inference methods for biomolecular data and methods for reconstructing large-scale biological or technological multi-agent systems from observational data.


A Bayesian Approach to Policy Recognition and State Representation Learning
(Supplement)

Adrian Sosic, Abdelhak M. Zoubir and Heinz Koeppl

A MARGINAL INVARIANCE & POLICY PREDICTION IN LARGE STATE SPACES

When we extended our reasoning to large state spaces in Section 2.3 using a reduced state space model (see Fig. S-1), we inevitably arrived at the following questions: By modeling the expert behavior only along observed trajectories, what does the resulting model imply for the remaining states of the state space? Can we still use it for predicting their local policies? The purpose of this section is to provide an in-depth discussion on the implications of this reduced modeling approach in the context of policy prediction.

When investigating the above-mentioned questions from a probabilistic perspective (i.e. by analyzing the induced joint distribution of our model), it turns out that they are strongly related to what is known as marginal invariance [1] (sometimes also referred to as marginalization property or simply consistency [2]). This property states that a model is consistent in the sense that it always provides the same marginal distributions for any subset of its variables, irrespective of the initial model size. In other words, a marginally invariant policy model yields the same answer for the given trajectory points, even if we include additional states into our reduced set S for which we have not observed any demonstrations.

For our spatial models, that is, the Potts model and the ddCRP, it can be shown that this consistency property is indeed lacking (see [1] for a detailed discussion). This means that we cannot expect to get compatible results when conducting our reduced model inference on two data sets of different sizes. On the contrary, making predictions for new states would always require rerunning our Gibbs sampler on the augmented data set, including all additional states. This brings us to the following practical dilemma: imagine an on-line policy recognition scenario where we observe an expert controlling our system. After a certain period of time, we are asked to take over control, using the experience we have acquired during the observation period. Each control command, whether performed by the expert or by us, will trigger a new state transition, meaning that new data points arrive sequentially one after another. Consequently, it is impossible to decide in advance which states to include in our reduced space S and which not.

Fig. S-1: (a) Illustration of the reduced state space model, which operates on the space S = {s_1, s_2, . . . , s_T} of visited trajectory states. Note that the underlying decision-making process is assumed to be discrete in time; the continuous gray line shown in the figure is only to indicate the temporal ordering of the trajectory states. (b) Corresponding factor graph, highlighting the circular dependence between the variables. The factors are defined by the same building blocks that are used for the finite state space model. Observed variables are shaded in gray.


Fig. S-2: Simulation results for the ddCRP model on the continuous state space task described in Section 4.1. (a) Mean values of the spatial policy prediction error. (b) Standard deviations of the spatial policy prediction error. (c) Example partitioning of the state space, based on the local controllers depicted in sub-figure (e). The expert partitioning is shown in Fig. 4 of the main paper. (d) Posterior distribution of the number of local controllers. (e) Posterior sample of the local controllers found by the model. The results in (a,b,d) are based on 100 Monte Carlo runs while (c,e) are obtained from a single posterior sample. The figures in the top row were rendered using a spatial resolution of 2000×2000.

A rigorous approach in the above-described sense would thus require recalibrating the model after each state transition – a costly operation.

However, it is evident that the resulting data set is naturally divided into two disjoint parts, namely the expert demonstrations and the subsequent states reached during execution of the learned policy. Clearly, transitions occurring after the learning phase should by no means affect our belief about the expert policy and, hence, they should be completely discarded from the model. The easiest way to achieve this is, indeed, to “freeze” the model after the demonstration phase and to use the learned parameters to extrapolate the gathered policy information to surrounding states. This can be done, for instance, by retaining the structure of the involved spatial prior model to compute the resulting maximum a posteriori estimates for the extrapolated indicators of the new states, based on the inferred model parameters. In the case of the ddCRP, this coincides with the nearest-neighbor estimate (see Eq. (16)),

c_new = argmax_{t ∈ {1,...,T}} f(d_{t,new}) = argmin_{t ∈ {1,...,T}} d_{t,new}.    (S-1)

Herein, c_new is the estimate for the indicator of the new state and d_{t,new} denotes the distance of that state to the t-th trajectory point.
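A minimal sketch of this extrapolation step, assuming arrays of query states, trajectory states, and inferred cluster labels, is given below; the function name is our own.

```python
import numpy as np
from scipy.spatial.distance import cdist

def extrapolate_indicators(new_states, traj_states, z):
    """Nearest-neighbour extrapolation of Eq. (S-1): each new state inherits the
    cluster indicator of the closest trajectory state."""
    nearest = cdist(new_states, traj_states).argmin(axis=1)   # argmin_t d_{t,new}
    return z[nearest]
```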

Now, one could argue that the comfort of retaining a finite model structure for modeling inference problems on countably infinite or uncountable state spaces comes at the cost of not being able to provide a consistent posterior predictive distribution. However, the reduced state space approach allows us to incorporate the spatial information of the data in a fairly natural manner (i.e. in the form of pairwise distances), providing an easy way to model the expert behavior. Furthermore, our results demonstrate that the reduced model is able to capture the relevant spatial properties of a policy sufficiently accurately to make well-founded predictions about unseen states (see also the subsequent section). Whether there exist alternative tractable models with similar properties remains to be seen.

B ADDITIONAL SIMULATION RESULTS

In this section, we provide additional simulation results for the ddCRP model on the continuous state space task described in Section 4.1.

Figure S-2a visualizes the spatial EMD prediction errors of the trained model in the form of a heat map, which compares the ground truth expert policy at non-trajectory points with the mean prediction provided by our model. The test points are placed on a regular grid of size 2000×2000 centered around the origin. The required indicator variables at the interpolated states are computed according to Eq. (S-1). In line with our expectation, the prediction error reaches its maximum at the policy boundaries but is comparably small within each policy region, indicating a good model fit. Note that the “windmill shape” of the error can be explained as a result of the reduced state space approach in combination with the inherent asymmetry of the used data generation scheme: regions of the state space containing trajectory endings are locally underrepresented in the data set (see example trajectory in Fig. 4 in the paper); this increases the chance of assigning the end points of a trajectory to the cluster of the preceding region, resulting in a smearing of the previous cluster into the next region.

Also, we can observe that the variance of the error (Fig. S-2b) reaches its maximum at the transition regions and generally grows with the distance to the supporting trajectory data, reflecting the increasing prediction uncertainty at cluster boundaries and regions far from the expert demonstrations. Both figures were computed based on the learned policy representations of 100 Monte Carlo runs. Figure S-2c illustrates an example state partitioning of one such experiment, using the inferred local controllers depicted in Fig. S-2e. The result reveals that all expert motion patterns could be identified by our model. Note, however, that the two figures correspond to a single Gibbs sample of the process, which is not representative of the whole posterior distribution. Averaging over several experiments as done in Fig. S-2a and Fig. S-2b is not possible at the sample level due to the varying dimensionality of the corresponding policy representations (i.e. the number of learned controllers). Even taking averages over samples of equal dimensionality is not meaningful due to the multimodality of the posterior distribution, which arises from the inherent symmetry of the representation (i.e. interchanging two local controllers together with their corresponding indices yields the same model). Hence, averaging samples is possible only at the prediction level.

Finally, Fig. S-2d depicts the posterior distribution of the number of local controllers used by the model, which shows a pronounced peak at the true number used by the expert.

C COMPUTATIONAL COMPLEXITY

The overall computational cost of performing inference in our model depends largely on two factors: the complexity per Gibbs iteration and the mixing speed of the underlying Markov chain. Each Gibbs iteration consists of up to three stages: 1) sampling T categorical action variables {a_t} from the set {1, . . . , |A|}, where T is the size of the demonstration set; 2) ddCRP model: sampling N_S categorical state assignments {c_i} from the set {1, . . . , N_S}, where N_S is the number of states (i.e. the cardinality of the finite state space or of the reduced state space, respectively); remaining models: sampling N_S categorical partition assignments {z_i} from the set {1, . . . , K}, where K is the number of local controllers; 3) for non-collapsed models: sampling K Dirichlet-distributed control parameters {θ_k} on the (|A| − 1)-simplex.

Collapsing the control parameters generally improves the mixing speed of the chain (see Fig. 5 in the paper) but requires that action variables belonging to the same cluster be updated sequentially; hence, a non-collapsed strategy can be advantageous for larger data sets. Sampling the variables {a_t}, {θ_k} and {z_i} is computationally cheap because the involved action likelihoods T(s′ | s, a) as well as the neighborhood structure N (Potts model) and the similarity values f(d_{i,j}) can be pre-computed. The most demanding operation is the update of {c_i}, which requires tracking the connected components of the underlying ddCRP graph. Using an appropriate graph representation, this can be done in polylogarithmic worst case time [3].

REFERENCES

[1] D. M. Blei and P. I. Frazier, “Distance dependent Chinese restaurant processes,” The Journal of Machine Learning Research, vol. 12, pp. 2461–2488, 2011.
[2] C. E. Rasmussen, “Gaussian processes for machine learning.” MIT Press, 2006.
[3] B. M. Kapron, V. King, and B. Mountjoy, “Dynamic graph connectivity in polylogarithmic worst case time,” in Proc. 24th Annual ACM-SIAM Symposium on Discrete Algorithms, 2013, pp. 1131–1142.