
OPERATIONS RESEARCH
Vol. 58, No. 1, January–February 2010, pp. 214–228
ISSN 0030-364X | EISSN 1526-5463
DOI 10.1287/opre.1090.0697
© 2010 INFORMS

Partially Observable Markov Decision Processes: A Geometric Technique and Analysis

Hao Zhang
Marshall School of Business, University of Southern California, Los Angeles, California 90089, [email protected]

This paper presents a novel framework for studying partially observable Markov decision processes (POMDPs) with finite state, action, and observation sets and discounted rewards. The new framework is solely based on future-reward vectors associated with future policies, which is more parsimonious than the traditional framework based on belief vectors. It reveals the connection between the POMDP problem and two computational geometry problems, i.e., finding the vertices of a convex hull and finding the Minkowski sum of convex polytopes, which can help solve the POMDP problem more efficiently. The new framework can clarify some existing algorithms over both finite and infinite horizons and shed new light on them. It also facilitates the comparison of POMDPs with respect to their degree of observability, as a useful structural result.

Subject classifications: dynamic programming: Markov; analysis of algorithms: computational complexity; mathematics: combinatorics; computers/computer science: artificial intelligence.
Area of review: Optimization.
History: Received January 2008; revision received July 2008; accepted November 2008. Published online in Articles in Advance July 29, 2009.

1. Introduction and Literature Review

Markov decision processes (MDPs) provide one of the fundamental models in operations research, in which a decision maker controls the evolution of a dynamic system. Although this model is mature, with well-developed theories, as in Puterman (1994), it is based on the assumption that the state of the system can be perfectly observed. Partially observable Markov decision processes (POMDPs) extend MDPs by relaxing this assumption. The first explicit POMDP model is commonly attributed to Drake (1962), and it has attracted the attention of researchers and practitioners in operations research, computer science, and beyond. However, this problem is well known for its computational complexity. The pioneering work of Sondik (1971) and Smallwood and Sondik (1973) addressed this issue first, and since then the study of POMDPs has mainly followed two directions: (1) finding computationally feasible algorithms for the general model over both finite and infinite horizons, and (2) finding structural results for special POMDP models with well-defined applications, such as machine replacement and quality control problems. On both fronts, the research has encountered significant obstacles. Strictly speaking, the problem does not have an efficiently computable solution, because computing an optimal policy is PSPACE-complete, as shown by Papadimitriou and Tsitsiklis (1987), and finding an ε-optimal policy is NP-hard, as pointed out by Lusena et al. (2001). However, because of its modeling power and potentially wide applications, the problem continues to draw attention from the operations research and computer science communities.

A central concept underlying most existing methods in the POMDP literature is the "belief vector," which is a distribution of the system state (assuming a finite state space). Because the system state cannot be observed, the decision maker maintains a belief about the state and updates it after each (imperfect) observation. It is known that the belief vector is a sufficient statistic of the complete information history (Aoki 1965, Astrom 1965, Bertsekas 1976). Thus, the problem can be formulated by dynamic programming based on belief vectors. The optimality condition implies that the part of an optimal policy from any period onward maximizes the expected value-to-go with respect to the belief vector at the beginning of that period. A belief vector of an n-state system belongs to the (n−1)-dimensional simplex of ℝ^n, which is the main source of difficulty in this approach. Much of the literature discusses techniques to replace this uncountable set with a finite or countable set.

In this paper, we propose a new perspective and framework for analyzing the POMDP problem, based on some strong geometric properties of the problem and free from belief vectors. We show that the key step of the problem formulation is equivalent to an existing problem in computational geometry, the "Minkowski sum" of convex polytopes. This connection opens the door to using state-of-the-art Minkowski-sum algorithms (e.g., Fukuda 2004) to solve the POMDP problem, which may improve the computational efficiency of the latter substantially.


Other steps of the problem formulation are also related to a computational geometry problem, i.e., identifying the vertices of the convex hull of a point set. Geometric intuitions can enhance our understanding of the problem and facilitate algorithm design and structure exploration.

The belief vectors are completely suppressed in this framework. We view the part of a policy from any period onward as a stand-alone object of its own, hereafter referred to as a "continuation policy." Because the system state is hidden, a continuation policy yields an expected value-to-go from each possible state, which gives rise to a vector of expected values-to-go, hereafter referred to as a "continuation-value vector." We present a dynamic programming formulation based on these continuation-value vectors. The backward induction yields a collection of continuation-value vectors in each period that form a so-called "continuation-value frontier," on which no vector dominates another (i.e., is weakly greater than another in all dimensions). For a finite-horizon problem, the backward induction ends when the continuation-value frontier in the first period is found, and each vector on that frontier is optimal for certain distributions of the initial system state. This is the only place where the (initial) belief vector comes into play in this framework. For an infinite-horizon problem with discounting, there exists a unique continuation-value frontier, which is the same in every period and can be approximated by successive value iterations or policy iterations. We note in the paper that the traditional framework based on belief vectors and the new framework based on continuation-value vectors can be reconciled.

Some researchers have used the notions of continuation policies and continuation-value vectors in the POMDP literature before. However, this is the first time these concepts have been systematically treated and embedded in a rigorous framework. To the best knowledge of the author, the connection between the POMDP problem and the computational geometry problems, the concept of the continuation-value frontier, the adoption of the Hausdorff metric, the new dynamic programming formulation, and the conceptual distinction between so-called "finite-state controllers" and "finite-memory policies" are new to the literature. The new framework also facilitates a rigorous pursuit of the notion of observability. For a class of POMDPs sharing the same underlying MDP, we define a partial order through their observation matrices, which translates into a partial order of their value frontiers.

In the remainder of this section, we provide a brief literature review to offer a bird's-eye view of the rich history of the problem. Monahan (1982), Lovejoy (1991b), and White (1991) provide excellent surveys of the solution techniques and applications of the model up to 1991, and Poupart (2005) includes a review of the literature up to 2005. Note that we only discuss the POMDP problem with discounted rewards in this paper. For analyses of the average-reward version of the problem, we refer the reader to Platzman (1980), Fernández-Gaucherand et al. (1991), Hsu et al. (2006), and Yu and Bertsekas (2008).

Based on researchers' affiliations, the literature can be broadly divided into two parts, one in operations research and one in computer science; we begin with the first part. A significant stream of research aims at the solution and structure of the general POMDP problem. Sondik (1971) and Smallwood and Sondik (1973) prove that the value functions (over belief vectors) are piecewise linear and convex and can be found through a value iteration algorithm, which establishes the foundation for many existing algorithms today. Monahan (1982) improves this algorithm by systematically removing the redundant elements in the definition of the value functions. Sondik (1978) presents a policy iteration algorithm for the infinite-horizon model, focusing on a special type of policy called "finitely transient policies." Platzman (1977) presents another policy iteration algorithm based on finite-memory policies. However, these policy iteration algorithms are not easy to comprehend (by the criteria of Lovejoy 1991b, they are not "transparent"), which explains the lack of follow-up research. White and Scherer (1989) present three successive approximation algorithms to solve the infinite-horizon problem, and in White and Scherer (1994) they also investigate finite-memory policies. White (1979, 1980) provides sufficient conditions for the value functions and optimal policies to be monotone for some special POMDPs, which are strengthened by Lovejoy (1987). Lovejoy (1991a) approximates the belief space by a finite grid of points and constructs upper- and lower-bound solutions.

Another important stream of research focuses on special application-driven models, especially machine maintenance and quality control problems. In such problems, the state of a machine is unobservable but is partially reflected in its output. The state can be revealed by a costly inspection or replacement (the latter resets the machine state). The problem is to find a minimum-cost maintenance policy. We refer the reader to Monahan (1982) for a detailed description of this problem and a summary of the earlier work by Eckles (1968), Ross (1971), Ehrenfeld (1976), Wang (1977), and White (1977, 1979). Grosfeld-Nir (1996, 2007) finds the optimal control-limit policies for a two-state replacement model, and Anily and Grosfeld-Nir (2006) study a combined production and inspection problem in which the second stage, inspection, is modeled as a POMDP. Lane (1989) presents an application of the POMDP problem for fishermen, and Treharne and Sox (2002) examine several suboptimal policies for an inventory control problem in which the demand process is generated by a hidden Markov process.

The computer science community also has a long tradition of studying this problem. Because a POMDP can be considered a special probabilistic automaton (Paz 1971), the study is related to automata theory. Since the 1990s, there has been a surge of research activity as the model found applications in artificial intelligence (Cassandra et al. 1994, Kaelbling et al. 1998). To date, the model has been applied to robot navigation


(Littman et al. 1995, Cassandra et al. 1996, Thrun 2000, Montemerlo et al. 2002), preference elicitation (Boutilier 2002), stochastic resource allocation (Meuleau et al. 1998, Marbach et al. 2000), and spoken-dialogue systems (Paek and Horvitz 2000, Zhang et al. 2001), among others.

A large part of the computer science literature on the POMDP problem focuses on the computational aspect of the problem, and numerous algorithms have been proposed, most of which take the value iteration approach. Sondik's (1971) algorithm is often referred to as the "one-pass" algorithm. Cheng (1988) proposes a "linear-support" algorithm that systematically builds up the value functions, and Cassandra et al. (1994) introduce a "witness" algorithm focusing on belief vectors that can identify a missing piece of a value function. Zhang and Liu (1996) present an "incremental pruning" algorithm that decomposes each value iteration into three steps and removes the redundancy in each step. Feng and Zilberstein (2004) propose a "region-based incremental pruning" algorithm that divides the belief space into smaller regions and performs independent pruning in each region. Some algorithms also take the policy iteration approach. Hansen (1998a, b) studies a special type of policy named the "finite-state controller," and Meuleau et al. (1999) study "finite-memory policies." Poupart and Boutilier (2004) present a "bounded policy-iteration" algorithm to tackle large-scale models.

This paper is organized as follows. Section 2 introduces the POMDP model and some basic facts. Section 3 focuses on the new framework for the problem, based on its geometric properties, and §4 reveals the connection of the key step of the formulation to the Minkowski-sum problem. Section 5 is dedicated to the basic properties of the infinite-horizon model and the policy iteration approach to solve it. We investigate the degree of observability in §6 and discuss some future research topics in the final section. Proofs of technical results are provided in the appendix.

2. Model Description and Decomposition

A partially observable Markov decision process (POMDP) describes the problem faced by a decision maker who controls an underlying Markov decision process but can only observe imperfect signals of the system state. The model can be formally described as follows. We largely follow the notation of Lovejoy (1991b). In addition, throughout this paper, a vector is in column format by default, and its transpose is marked by a prime (′).

(1) The state space is a finite set, denoted by X = {1, 2, ..., n}. (2) The action space is a finite set, denoted by A. (3) The observation space is also a finite set, Θ = {1, 2, ..., m}. (4) The state transition matrices are given by {P^a}_{a∈A}. For each action a, P^a is an n × n matrix (p^a_{ij})_{i,j∈X}, where each element p^a_{ij} = Pr(j | i, a) is the probability that the system will move from state i to state j following action a. (5) The observation matrices are given by {R^a}_{a∈A}. For each a, R^a is an n × m matrix (r^a_{jθ})_{j∈X, θ∈Θ}, where each element r^a_{jθ} = Pr(θ | j, a) gives the probability that signal θ will be observed after action a is taken and the system moves to the new state j. Each row of R^a sums to one. For convenience in the subsequent analysis, the diagonal matrix formed from the θth column of R^a will be denoted by R^a(θ) = diag(r^a_{1θ}, r^a_{2θ}, ..., r^a_{nθ}). (6) The reward vectors are given by {g^a}_{a∈A}. Each g^a is an n-dimensional vector (g^a_i)_{i∈X}, where each element g^a_i is the expected reward in a single period given state i and action a. (7) The terminal reward vector (for T < ∞) is g_{T+1}, whose element g_{T+1,i} is the final reward of the system in state i at the end of period T. (8) The distribution of the initial state is given by an n-dimensional vector π_0. (9) The horizon of the model is T, which can be finite or infinite, and the periods are indexed by t = 1, ..., T. (10) The discount factor is β ∈ [0, 1] if T < ∞ or β ∈ [0, 1) if T = ∞.
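To fix ideas, the following minimal sketch (Python; not part of the paper, and all names are illustrative) collects the primitives (1)–(10) in a single container and builds the diagonal matrices R^a(θ) used throughout.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class POMDP:
    """Finite POMDP primitives, following the notation of the model description."""
    P: dict                  # P[a]: n x n transition matrix, P[a][i, j] = Pr(j | i, a)
    R: dict                  # R[a]: n x m observation matrix, R[a][j, theta] = Pr(theta | j, a)
    g: dict                  # g[a]: length-n one-period expected reward vector
    g_terminal: np.ndarray   # terminal reward vector g_{T+1} (finite horizon only)
    pi0: np.ndarray          # initial state distribution pi_0
    beta: float              # discount factor

    def R_diag(self, a, theta):
        """Diagonal matrix R^a(theta) built from the theta-th column of R^a."""
        return np.diag(self.R[a][:, theta])
```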

It has been shown (Sondik 1971) that the problem has a dynamic programming formulation based on the belief of the underlying system state. A belief is a distribution of the system state, denoted by an n-dimensional vector π. Let V_t(π) be the expected future reward from period t onward given belief π. It satisfies the following equation:

V_t(π) = max_{a∈A} { π′g^a + β Σ_{θ∈Θ} Pr(θ | π, a) · V_{t+1}(τ(π, a, θ)) }.   (1)

In this expression, Pr(θ | π, a) = Σ_i Σ_j π_i p^a_{ij} r^a_{jθ} is the probability of observing θ after state distribution (belief) π and action a. The function τ(π, a, θ) gives the posterior belief after prior belief π, action a, and observation θ. According to Bayes' rule, τ(π, a, θ)_j = Σ_i π_i p^a_{ij} r^a_{jθ} / Pr(θ | π, a). Recall that R^a(θ) = diag(r^a_{1θ}, r^a_{2θ}, ..., r^a_{nθ}). Then, in matrix form,

τ(π, a, θ)′ = π′P^aR^a(θ) / Pr(θ | π, a).   (2)
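As a concrete illustration of Equations (1)–(2), the hedged sketch below (hypothetical helper functions built on the POMDP container above) computes Pr(θ | π, a), the Bayesian posterior τ(π, a, θ), and the value of a belief under a finite set of α-vectors.

```python
import numpy as np

def belief_update(model, pi, a, theta):
    """Return (Pr(theta | pi, a), tau(pi, a, theta)) as in Equation (2)."""
    unnormalized = pi @ model.P[a] @ model.R_diag(a, theta)   # row vector pi' P^a R^a(theta)
    prob = float(unnormalized.sum())                          # Pr(theta | pi, a)
    posterior = unnormalized / prob if prob > 0 else unnormalized
    return prob, posterior

def value_at_belief(alphas, pi):
    """Evaluate V(pi) = max over alpha of pi' alpha for a piecewise-linear convex value function."""
    return max(float(pi @ alpha) for alpha in alphas)
```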

Sondik (1971) proves that for any finite t, V_t(π) is piecewise linear and convex and thus admits the form V_t(π) = max_{α∈Ω_t} π′α, where Ω_t is a finite set of n-dimensional vectors. The point set Ω_t can be recursively determined as follows. Substituting V_{t+1}(π) = max_{α∈Ω_{t+1}} π′α into Equation (1), we obtain

V_t(π) = max_{a∈A} { π′g^a + β Σ_{θ∈Θ} Pr(θ | π, a) · V_{t+1}( π′P^aR^a(θ) / Pr(θ | π, a) ) }
       = max_{a∈A} { π′g^a + β Σ_{θ∈Θ} Pr(θ | π, a) · max_{α∈Ω_{t+1}} π′P^aR^a(θ)α / Pr(θ | π, a) }
       = max_{a∈A} { π′g^a + β Σ_{θ∈Θ} max_{α∈Ω_{t+1}} π′P^aR^a(θ)α },   t = 1, 2, ..., T,   (3)

V_{T+1}(π) = π′g_{T+1},   Ω_{T+1} = {g_{T+1}}.   (4)


Because V_t(π) can be expressed as max_{α∈Ω_t} π′α, the above expression determines Ω_t from Ω_{t+1} implicitly. However, the iteration is a difficult task: there seems to be no natural way of enumerating the members of Ω_t, and the size of Ω_t grows exponentially fast with iterations.

The value function V_t(π) can be decomposed, as in Zhang and Liu (1996):

V^a_t(π; θ) := max_{α∈Ω_{t+1}} π′P^aR^a(θ)α,   (5)

V^a_t(π) = Σ_{θ∈Θ} V^a_t(π; θ),   (6)

V_t(π) = max_{a∈A} { π′g^a + βV^a_t(π) }.   (7)

The intermediate functions V^a_t(π; θ) and V^a_t(π) are also piecewise linear and convex. Therefore, the problem becomes one of finding the minimum sets of α-vectors that describe V^a_t(π; θ), V^a_t(π), and V_t(π), respectively. We will refer to this three-step iteration (essentially any iteration derived from Equation (1)) as a belief-value iteration, indicating the role of belief vectors in the formulation. As noted in the literature, the first and third steps in the iteration are relatively easy, whereas the second step, finding the minimum α-set for V^a_t(π), is the most time consuming. Significant efforts have been devoted to this step, and various existing algorithms mainly differ in this step.

3. A Geometric Framework

Although many researchers have recognized the central role of the α-vectors, few have studied them independently from the belief vectors. In this section, we take a dual perspective and generate the α-vectors in a systematic way, free from the belief vectors. We show that an α-vector is essentially the expected value-to-go (or continuation-value) vector associated with a continuation policy (the part of a complete policy from any period t onward). We propose a new dynamic programming formulation based on this view. The development in this section leads the way to a reduction of the time complexity of the critical step, to be discussed in §4.

3.1. Duality Between Value Functions and Convex Hulls

Various dual relationships between hyperplanes and points have been investigated in the convex analysis literature (Rockafellar 1970) and the computational geometry literature (Berg et al. 2000, Boissonnat and Yvinec 1998). In a typical setting, a hyperplane in the primal ℝ^n space is an affine function ψ(u) = u′h + d, where u is an (n−1)-dimensional vector, h is a given vector, and d is a given scalar. Such a hyperplane corresponds to a point (h, d) in the dual ℝ^n space. The upper envelope (with respect to the ψ-axis) of a set of hyperplanes {ψ_k(u) = u′h_k + d_k}_{k∈K} defines a piecewise-linear and convex function ψ(u) = max_{k∈K} {u′h_k + d_k}. The minimum set of hyperplanes that completely determine the upper envelope are called supporting hyperplanes. One can show (Berg et al. 2000) that there is a one-to-one correspondence between the supporting hyperplanes of an upper envelope in the primal space and the vertices of an upper convex hull (with respect to the d-axis) in the dual space.

However, this result must be modified for the POMDP problem because the primal space is Π × ℝ, where Π = {π ∈ ℝ^n : Σ_{i=1}^n π_i = 1 and π_i ≥ 0 for all i} is an (n−1)-dimensional simplex, and a hyperplane in this space is of the form V(π) = π′α, given an n-dimensional vector α. We define the corresponding dual space as the ℝ^n space of those α-vectors. Figure 1 illustrates the dual relationship: panel (a) depicts a piecewise-linear and convex function V(π) = max{π_1 + 4π_2, 3π_1 + 3.5π_2, 4.5π_1 + 2.5π_2, 5π_1 + π_2} in the primal space, and panel (b) depicts the corresponding points (1, 4), (3, 3.5), (4.5, 2.5), (5, 1) in the dual space.

To provide a precise description of this duality, we introduce the following definitions.

Definition 1. Given a point set S ⊂ ℝ^n, the generated convex hull is the set Co(S) ≡ { Σ_{j=1}^{n+1} λ_j v_j : Σ_{j=1}^{n+1} λ_j = 1 and v_j ∈ S, λ_j ≥ 0 ∀ j }; the surface of the convex hull with positive outernormal directions, or simply the positive convex hull (PCO), is the set PCo(S) ≡ cl{ α ∈ Co(S) : ∃ π ∈ Π_+, π′α ≥ π′v ∀ v ∈ Co(S) }, where Π_+ = { π ∈ ℝ^n : Σ_{i=1}^n π_i = 1 and π_i > 0 ∀ i }, and cl(B) is the closure of B.

[Figure 1. Duality between (a) a piecewise-linear and convex function V(π) = max_{α∈Ω} π′α in the primal space, and (b) the positive convex hull generated by the point set Ω in the dual space. Panel (a) shows the four linear pieces V(π) = π_1 + 4π_2, 3π_1 + 3.5π_2, 4.5π_1 + 2.5π_2, and 5π_1 + π_2; panel (b) shows the corresponding dual points (1, 4), (3, 3.5), (4.5, 2.5), and (5, 1) in the (ω_1, ω_2) plane.]


Definition 2. The pairwise addition of two point sets Ω_1 and Ω_2 is the set Ω_1 + Ω_2 ≡ { α_1 + α_2 : α_1 ∈ Ω_1, α_2 ∈ Ω_2 }.

In the definition of PCO, the closure is redundant if S is a finite set. It may be needed when S has a smooth surface, e.g., a sphere. However, it is still an open question whether smooth surfaces can arise in the POMDP problem with finite states, finite actions, and finite observations over the infinite horizon. With the above definitions, the dual relationships pertinent to the POMDP problem can be formally stated as follows.

Lemma 1. Suppose that Ω ⊂ ℝ^n is closed and bounded. (1) A piecewise-linear and convex function V(π) = max_{α∈Ω} π′α, π ∈ Π, is dual to the set PCo(Ω). More precisely, for any π ∈ Π, there is an α ∈ PCo(Ω) such that V(π) = π′α, and conversely, for any α ∈ PCo(Ω), there is a π ∈ Π such that V(π) = π′α. (2) Given two piecewise-linear and convex functions V_1(π) = max_{α∈Ω_1} π′α and V_2(π) = max_{α∈Ω_2} π′α, the function V_1(π) + V_2(π) is dual to PCo(Ω_1 + Ω_2). (3) Given the above V_1(π) and V_2(π), the function max{V_1(π), V_2(π)} is dual to PCo(Ω_1 ∪ Ω_2).

The generality of the set Ω introduces some technical subtlety at the boundary of PCo(Ω), as can be seen in the proof (in the appendix). If Ω is a finite set, the duality can be obtained more directly. We can easily see that there is a one-to-one correspondence between the linear pieces of V(π), if any, and the vertices of PCo(Ω). Parts (1) and (2) of the lemma are illustrated in Figures 1 and 2, respectively.

3.2. POMDP Problem in the Dual Space

Lemma 1 implies that the α-vectors representing the functions V^a_t(π; θ), V^a_t(π), and V_t(π) in expressions (5)–(7) are the vertices of some positive convex hulls. Thus, the belief-value iteration (5)–(7) has the following counterpart in the dual space, obtaining the set Ω_t from the set Ω_{t+1}:

Continuation-Value Iteration
(1) Ω^a_t(θ) = PCoEx({ P^aR^a(θ)α : α ∈ Ω_{t+1} }), a ∈ A, θ ∈ Θ;
(2) Ω^a_t = PCoEx( Ω^a_t(1) + Ω^a_t(2) + ··· + Ω^a_t(m) ), a ∈ A;
(3) Ω_t = PCoEx( ⋃_{a∈A} { g^a + βα : α ∈ Ω^a_t } ).

We call this three-step iteration a continuation-value iteration because the α-vectors are the "continuation-value vectors" defined in the next subsection. The set PCoEx(Ω) contains the extreme points of the positive convex hull generated from the point set Ω.

[Figure 2. Duality between V_1(π) + V_2(π) and PCo(Ω_1 + Ω_2): panel (a) shows the primal functions V_1(π), V_2(π), and their sum V_1(π) + V_2(π); panel (b) shows the dual sets PCo(Ω_1), PCo(Ω_2), and PCo(Ω_1 + Ω_2) in the (ω_1, ω_2) plane.]

Viewed as an operator, PCoEx corresponds to a standard problem in computational geometry: removing redundant points from the convex hull of Ω. Recall that x is an extreme point of a convex set C if x ≠ λy + (1 − λ)z for any λ ∈ (0, 1) and y, z ∈ C\{x}, and that extreme points and vertices are the same in the special case of convex polytopes. Lemma 1 implies the following result immediately:

Theorem 1. The belief-value iteration and the continuation-value iteration are equivalent.

Like the belief-value iteration, the continuation-value iteration can be aggregated, as follows.

Lemma 2. The continuation-value iteration is equivalent to the following iteration:

Ω_t = PCoEx( ⋃_{a∈A} { g^a + β Σ_{θ∈Θ} P^aR^a(θ) α^a_{t+1}(θ) : α^a_{t+1}(θ) ∈ Ω_{t+1} ∀ θ ∈ Θ } ).   (8)
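A brute-force rendering of iteration (8) might look as follows. This is a sketch only, assuming a pcoex routine such as the one sketched above and the POMDP container from §2; it enumerates all |Ω_{t+1}|^m combinations before pruning, which is precisely the blow-up that the Minkowski-sum machinery of §4 is designed to avoid, and it prunes only once at the end rather than after each step.

```python
import numpy as np
from itertools import product

def continuation_value_iteration(model, Omega_next, actions, observations, pcoex):
    """One backward step, Omega_t from Omega_{t+1}, following expression (8)."""
    candidates = []
    for a in actions:
        # Transform each alpha in Omega_{t+1} by P^a R^a(theta), one set per observation.
        transformed = {th: [model.P[a] @ model.R_diag(a, th) @ alpha for alpha in Omega_next]
                       for th in observations}
        # Pairwise (Minkowski) addition over observations: one alpha chosen per theta.
        for choice in product(*(transformed[th] for th in observations)):
            candidates.append(model.g[a] + model.beta * sum(choice))
    return pcoex(candidates)
```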


In the above expression, the matrix P^a can be moved outside of the sum operator. Thus, the continuation-value iteration has an alternative form in which P^a appears only in the last step.

Alternative Continuation-Value Iteration
(1) Ω̃^a_t(θ) = PCoEx({ R^a(θ)α : α ∈ Ω_{t+1} }), a ∈ A, θ ∈ Θ;
(2) Ω̃^a_t = PCoEx( Ω̃^a_t(1) + Ω̃^a_t(2) + ··· + Ω̃^a_t(m) ), a ∈ A;
(3) Ω_t = PCoEx( ⋃_{a∈A} { g^a + βP^aα̃ : α̃ ∈ Ω̃^a_t } ).

We note that the first two steps of the alternative iteration are independent of the action if the observations are, which is an advantage of this procedure. Now, we briefly compare the first step of the two continuation-value iterations, because their last two steps are similar in terms of computation. Let Ω_{t+1} be the vertex set of some positive convex hull. In step 1 of the continuation-value iteration, after the linear transformation P^aR^a(θ), the set of points { P^aR^a(θ)α : α ∈ Ω_{t+1} } still forms a convex hull, but it may not have positive outernormal directions or even full dimension. In either case, the operator PCoEx removes the redundant points. In step 1 of the alternative continuation-value iteration, if R^a(θ) has full rank, the points { R^a(θ)α : α ∈ Ω_{t+1} } still form a positive convex hull with no redundancy; otherwise, the resulting point set has reduced dimension and typically contains redundant points.

Similar to the belief-value iteration, the most time-consuming step in the two continuation-value iterations is the second one. As illustrated in Figure 2(b), the pairwise addition of two point sets is exponentially large, i.e., |Ω_1 + Ω_2| = |Ω_1| · |Ω_2|. The corresponding situation in the primal space is illustrated in Figure 2(a): there are as many redundant points in Ω_1 + Ω_2 as there are redundant hyperplanes defining V_1(π) + V_2(π). Almost all algorithms that tackle a general POMDP problem strive to trim down the number of hyperplanes or dual points. Although there seems to be no difference between the belief- and continuation-value iterations regarding algorithm efficiency at this moment, we will see in §4 that it is more convenient to explore the geometric properties in the dual space.

3.3. Continuation Policies and Continuation-Value Vectors

In this subsection, we connect the α-vectors with policies. To perform backward induction, we define a policy recursively. Although recursive representations of POMDP policies exist in the literature, such as the "policy graph" in Cassandra et al. (1994), the "finite-state controller" in Hansen (1998a), and the "conditional plan" in Poupart (2005), a systematic treatment as provided here is useful.

Definition 3. We refer to the beginning of period t as time t. In a POMDP problem with T periods, the observation history up to time t is the sequence of observations (θ_1, ..., θ_{t−1}), denoted by θ^{t−1}, where θ^0 is the empty sequence by default. Let Θ^{t−1} be the set of θ^{t−1}. A deterministic policy σ is a sequence of mappings {a_t : Θ^{t−1} → A}_{t=1,...,T}, or {a_1(θ^0), ..., a_T(θ^{T−1})}_{θ^{T−1}∈Θ^{T−1}}. A deterministic continuation policy from time t is the part of σ from time t onward, with the recursive representation σ_t(θ^{t−1}) = ( a_t(θ^{t−1}), {σ_{t+1}(θ^{t−1}, θ_t)}_{θ_t∈Θ} ).

[Figure 3. Tree representation of a deterministic policy. The root node is labeled with the action a_1(θ^0); each observation branch θ ∈ {1, ..., m} leads to a node labeled a_2(θ^1), and so on (a_3(θ^2), ...). Each node, viewed from the bottom up, is the root of a continuation policy.]

A policy can be represented as a tree, as in Figure 3. Every node of the tree has two interpretations: viewed from the top down, it represents the history of observations θ^{t−1} and is associated with an action a_t(θ^{t−1}); viewed from the bottom up, it corresponds to a continuation policy.

The observation history is the only information needed to define a policy, because the state history is unavailable and the action history is jointly determined by the observation history and the policy itself. This definition deviates from the tradition of including actions in the history as well (Monahan 1982, Lovejoy 1991b, White 1991). The actions are needed to update the belief of the system state from the initial belief π_0. If the belief process is suppressed, however, the actions no longer need to be recorded. The detachment of the belief process from the core formulation is the key to creating a more parsimonious dynamic programming expression, as in Theorem 2 below.

Because the system state is unknown, the expected value-to-go under a continuation policy σ_t is not a scalar but a vector, namely, the continuation-value vector, denoted by u_t. Each component u_t(x) is the expected value-to-go when the system starts in state x at time t and the policy σ_t is followed from then on. The u_t vectors generated by legitimate continuation policies are denoted by a set U_t. To make U_t a convex set (if needed), randomization should be introduced into continuation policies. In a randomized policy, the action plan in period t is a function a_t : (A × Θ)^{t−1} → Δ(A), where (A × Θ)^{t−1} = { (a_1, θ_1, ..., a_{t−1}, θ_{t−1}) } is the set of action and observation histories up to time t and Δ(A) = { δ : Σ_{a∈A} δ(a) = 1 and δ(a) ≥ 0 for all a } is the set of distributions over A. In general, there is no u*_t ∈ U_t that dominates all others in all dimensions, so the entire frontier of U_t, as defined below, must be considered; it can be found by the backward induction given in the next theorem.


Definition 4. A continuation-value vector is feasible if it can be generated by a randomized continuation policy. Let U_t denote the set of feasible continuation-value vectors at time t. The continuation-value frontier at time t is the set PCo(U_t), denoted by Ū_t. The extreme points of the time-t continuation-value frontier are denoted by the set U*_t.

Theorem 2. The feasible continuation-value sets U_t, the continuation-value frontiers Ū_t, and the extreme-point sets U*_t can be determined recursively:

U_t = Co( ⋃_{a∈A} { g^a + βP^a Σ_{θ∈Θ} R^a(θ) u^a_{t+1}(θ) : u^a_{t+1}(θ) ∈ U_{t+1} ∀ θ ∈ Θ } ),   (9)

Ū_t = PCo( ⋃_{a∈A} { g^a + βP^a Σ_{θ∈Θ} R^a(θ) u^a_{t+1}(θ) : u^a_{t+1}(θ) ∈ Ū_{t+1} ∀ θ ∈ Θ } ),   (10)

U*_t = PCoEx( ⋃_{a∈A} { g^a + βP^a Σ_{θ∈Θ} R^a(θ) u^a_{t+1}(θ) : u^a_{t+1}(θ) ∈ U*_{t+1} ∀ θ ∈ Θ } ).   (11)

Furthermore, each u*_t ∈ U*_t can be generated by a deterministic continuation policy.

The theorem shows that randomization is unnecessary if we focus on the extreme-point sets U*_t. However, the sets U_t and Ū_t are still useful in our later developments when convex or connected sets are more convenient. We note that the above expressions can easily be generalized such that A, P^a, and R^a are time dependent and β depends upon both the time and the action.

Clearly, expression (11) coincides with iteration (8), and the u_t vectors coincide with the α_t vectors. The belief process is completely implicit in the picture, except that the initial state distribution π_0 is used to determine the optimal solution through max{ π_0′u_1 : u_1 ∈ U*_1 }. This "mystery" will be resolved by Theorem 4 in §5.
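For a finite-horizon instance, once U*_1 has been computed, reading off the optimal first-period vector for a given initial distribution π_0 is a one-line selection (an illustrative helper, using the notation above):

```python
import numpy as np

def optimal_initial_vector(U1_star, pi0):
    """Pick argmax over u in U*_1 of pi_0' u: the continuation-value vector
    that is optimal for the initial state distribution pi_0."""
    return max(U1_star, key=lambda u: float(np.dot(pi0, u)))
```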

4. Value Iteration Through Minkowski Sums

Value iteration is a general approach to solving the POMDP problem over both finite and infinite horizons. We have seen three value iteration procedures in previous sections. In this section, we show that the geometric interpretation presented in the last section connects the POMDP problem with an existing problem in computational geometry, the Minkowski-sum problem.

4.1. Minkowski Sum of Convex Polytopes

A polytope is the convex hull of a finite set of points. The Minkowski sum of two polytopes P_1 and P_2 is defined as P_1 ⊕ P_2 = { u + v : u ∈ P_1, v ∈ P_2 }, i.e., the pairwise addition of the convex sets P_1 and P_2. The result is also a polytope. For convenience, we call the positive convex hull of P_1 ⊕ P_2 the positive Minkowski sum. The second step of the two continuation-value iterations can be compactly written as Ω^a_t = ⊕_{θ=1}^m Ω^a_t(θ) and Ω̃^a_t = ⊕_{θ=1}^m Ω̃^a_t(θ), respectively. Fukuda (2004) presents an algorithm that finds the vertices of a Minkowski sum in time linear in the number of output vertices. The key idea is to arrange the vertices of the output polytope as a minimum spanning tree. The algorithm can be used directly in the value iterations to save computational time. In this subsection, we discuss some basic properties of the Minkowski sum and the main features of Fukuda's algorithm, which are valuable for understanding the geometry behind the POMDP problem and for developing more efficient algorithms in the future.

A subset F is a face of a polytope P if there exists a vector c ∈ ℝ^n such that F = argmax_{v∈P} c′v, denoted more precisely by F(P; c). An n-dimensional polytope has n types of proper faces, with dimensions 0, 1, ..., n − 1, respectively. The 0- and 1-dimensional faces are also called vertices and edges, respectively. For the Minkowski sum of a set of polytopes, it can easily be shown that the faces of the output polytope can be constructed from those of the input polytopes: F(⊕_{i=1}^k P_i; c) = ⊕_{i=1}^k F(P_i; c) (Gritzmann and Sturmfels 1993). Figure 4 illustrates the Minkowski sum of two polytopes in the positive orthant and the decomposition of one output vertex. Compared with Figure 2(b), the vertices of the output polytope are exactly the α-vectors on the value frontier.

The vertex/edge decomposition is the core of Fukuda's algorithm. Let P = ⊕_{i=1}^k P_i. It is shown that: (1) a point v ∈ P is a vertex of P if and only if v = Σ_{i=1}^k v_i for some vertex v_i of each P_i and there exists c ∈ ℝ^n with {v_i} = F(P_i; c) for all i; (2) a subset E of P is an edge of P if and only if E = ⊕_{i=1}^k F_i, where each F_i is a vertex or edge of P_i, all such edges are parallel, and there exists c ∈ ℝ^n such that F_i = F(P_i; c) for all i.
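The vertex decomposition is easy to see numerically: for a generic direction c, each summand contributes its unique maximizing vertex, and the sum of those vertices is the vertex of the Minkowski sum in direction c. A small hedged sketch (the example polytopes are made up for illustration):

```python
import numpy as np

def support_vertex(vertices, c):
    """Vertex of conv(vertices) maximizing c'v (assumes a unique maximizer for a generic c)."""
    V = np.asarray(vertices, dtype=float)
    return V[np.argmax(V @ c)]

def minkowski_vertex(polytopes, c):
    """Vertex of P_1 + ... + P_k in direction c: the sum of the per-summand maximizers."""
    return sum(support_vertex(V, c) for V in polytopes)

P1 = [(0.0, 0.0), (2.0, 0.0), (0.0, 1.0)]
P2 = [(1.0, 1.0), (3.0, 0.0), (2.0, 2.0)]
print(minkowski_vertex([P1, P2], np.array([1.0, 3.0])))   # (0,1) + (2,2) = [2. 3.]
```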

[Figure 4. Minkowski sum of two polytopes P_1 and P_2 in the positive orthant (axes ω_1, ω_2) and the decomposition of one output vertex of P_1 ⊕ P_2 along a direction c.]

cINFORMS

holds

copyrightto

this

article

and

distrib

uted

this

copy

asa

courtesy

tothe

author(s).

Add

ition

alinform

ation,

includ

ingrig

htsan

dpe

rmission

policies,

isav

ailableat

http://journa

ls.in

form

s.org/.

Page 8: Partially Observable Markov Decision Processes: A ...€¦ · Partially observable Markov decision processes (POMDPs) extend the MDPs by relaxing this assumption. The first explicit

Zhang: Partially Observable Markov Decision Processes: A Geometric Technique and AnalysisOperations Research 58(1), pp. 214–228, © 2010 INFORMS 221

Fukuda's algorithm also needs the adjacency information of the input polytopes, i.e., the pairs of vertices linked by edges. Let δ_i be the maximum degree of P_i, i.e., the maximum number of vertices adjacent to any vertex of P_i. Then, δ = Σ_{i=1}^k δ_i is an upper bound on the maximum degree of ⊕_{i=1}^k P_i, by the decomposition property. It is shown that there is a compact polynomial algorithm for the Minkowski sum of k polytopes that runs in time O(z · δ · LP(n, δ)) and in space linear in the input size, where z denotes the number of vertices of P, and LP(n, δ) denotes the time needed to solve a linear program with n variables and δ constraints.

By today's standards, the LPs in Fukuda's algorithm are small, and their sizes grow with δ rather than with the total number of vertices. The worst-case complexity of the algorithm is linear in the output size z, and it is polynomial in that sense. It is not necessarily polynomial with respect to the input size Σ_{i=1}^k z_i, where z_i is the number of vertices of P_i. In comparison, the same step in the state-of-the-art POMDP algorithms (Cassandra et al. 1997, Feng and Zilberstein 2004) needs O(z · Σ_{i=1}^k z_i · LP(n, z)) time. Because the maximum degree of a polytope is typically much smaller than its number of vertices, the advantage of Fukuda's algorithm seems clear.

It is noteworthy that Fukuda's algorithm requires the adjacency information of the input polytopes. For an input polytope P_i, a straightforward method to obtain the adjacency information can take O(z_i^2 · LP(n, z_i)) time, which is still better than the existing POMDP algorithms theoretically, because z may not be polynomially bounded by z_i. Further, because linear transformations do not alter adjacency relationships, in the entire iteration t we only need to acquire the adjacency information for a single input polytope, corresponding to the set Ω_{t+1}. This is especially beneficial for systems with large action and observation spaces.

4.2. Numerical Examples

In this subsection, we illustrate some properties of the Minkowski sum through two examples. The first example demonstrates that the output size of a Minkowski sum is much smaller than the number of combinations of the input vertices. In the context of the POMDP problem, we focus on the first two steps of the alternative continuation-value iteration for a single action.

Example 1. The parameters are n = 2, m = 3,

R^a = [ 0.3  0.2  0.5
        0.5  0.1  0.4 ]

for a given action a, and

Ω_{t+1} = [ 3    4    5    5.5
            6.5  6    5    4   ],

where each column of Ω_{t+1} represents a continuation-value vector.

[Figure 5. An alternative continuation-value iteration for a single action: the positive Minkowski sum Ω̃^a_t of the three polytopes Ω̃^a_t(1), Ω̃^a_t(2), Ω̃^a_t(3) determined by the three observations, together with the input set Ω_{t+1}, plotted in the (ω_1, ω_2) plane.]

The first two steps of the alternative continuation-value iteration generate the following point set, in matrix form:

Ω̃^a_t = [ 3.0  3.2   3.7   3.9   4.4   4.7  4.8  5.05  5.35  5.5
           6.5  6.45  6.25  6.15  5.75  5.5  5.4  5.0   4.5   4.0 ].

Although |Ω̃^a_t| = 10 is much larger than |Ω_{t+1}| = 4, it is still much smaller than |Ω_{t+1}|^3 = 64. The example is illustrated in Figure 5.
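The numbers in Example 1 can be reproduced by brute force (a hypothetical check, independent of the algorithms discussed in §4.1): form all 64 pairwise sums of the three transformed sets and keep the concave upper frontier, which is the positive convex hull in two dimensions.

```python
import numpy as np
from itertools import product

Ra = np.array([[0.3, 0.2, 0.5],
               [0.5, 0.1, 0.4]])
Omega_next = np.array([[3.0, 4.0, 5.0, 5.5],
                       [6.5, 6.0, 5.0, 4.0]])   # columns are continuation-value vectors

# Step 1 of the alternative iteration: transform by each diagonal matrix R^a(theta).
transformed = [np.diag(Ra[:, th]) @ Omega_next for th in range(3)]

# Step 2: all 4^3 = 64 pairwise sums, one column chosen per observation.
sums = [tuple(sum(cols)) for cols in product(*(T.T for T in transformed))]

def positive_hull_vertices_2d(points, tol=1e-9):
    """Vertices of the positive convex hull of a finite 2D point set:
    Pareto-efficient points lying on the concave upper frontier."""
    pts = list(set(points))
    pts = [p for p in pts if not any(q != p and q[0] >= p[0] and q[1] >= p[1] for q in pts)]
    pts.sort()
    hull = []
    for p in pts:
        # Drop a middle point lying on or below the chord of its neighbours.
        while len(hull) >= 2 and \
              (hull[-1][1] - hull[-2][1]) * (p[0] - hull[-2][0]) <= \
              (p[1] - hull[-2][1]) * (hull[-1][0] - hull[-2][0]) + tol:
            hull.pop()
        hull.append(p)
    return hull

frontier = positive_hull_vertices_2d(sums)
print(len(frontier))                 # 10 vertices survive out of 64 candidates
for v in frontier:
    print(round(v[0], 2), round(v[1], 2))
```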

In the second example, we also focus on a single action and perform the alternative continuation-value iteration repeatedly. From an initial point set Ω_T, the output Ω_{t+1} of iteration t + 1 is used as the input of iteration t. The example demonstrates that the maximum degree δ of a Minkowski sum is much smaller than the output size z and grows much more slowly.

Example 2. The parameters are n = m = 3,

R^a = [ 0.3  0.3  0.4
        0.2  0.2  0.6
        0.5  0.1  0.4 ]

for a given action a, and

Ω_T = [ 1  0  0
        0  1  0
        0  0  0.8 ]

(for demonstration purposes, we start with a representative Ω_T instead of the singleton set Ω_{T+1}). The following table lists the maximum degree δ and the number of output vertices z after each iteration; the output polytope of iteration T − 3 is illustrated in Figure 6.


t    T    T − 1    T − 2    T − 3    T − 4
δ    2    4        5        5        6
z    3    9        22       46       86

[Figure 6. The output polytope after three alternative continuation-value iterations for a single action (iteration T − 3 of Example 2).]

In last subsection, we argued the theoretical advantageof adopting Fukuda’s algorithm in the value iteration ofthe POMDP problem, whereas in this subsection we havepresented two simple numerical examples. A thoroughcomparison of the new POMDP algorithm (incorporat-ing Fukuda’s Minkowski-sum algorithm) with the existingPOMDP algorithms requires extensive numerical studies,which is beyond the scope of this paper and is a valuabletopic for future research.

5. Policy Iteration Over an Infinite Horizon

In the infinite-horizon setting, besides value iteration, the POMDP problem can also be tackled by the policy iteration approach, first introduced by Howard (1960) for the MDP problem. According to the limited results in the POMDP literature, the policy iteration approach appears to outperform the value iteration approach (Sondik 1978, Hansen 1998a). In this section, we provide a systematic treatment of the infinite-horizon POMDP problem based on the new framework. We first show some basic properties of the problem, then examine a special type of policy made up of finitely many components, and finally discuss a policy iteration algorithm proposed by Hansen (1998a). This section also unifies some existing concepts and algorithms in the literature.

5.1. Basic Properties of the POMDP Problem

It is convenient to express the continuation-value iteration (11) by an operator Γ*, as U*_t = Γ*(U*_{t+1}). Similarly, the recursive Equations (9) and (10) can be expressed as U_t = Γ(U_{t+1}) and Ū_t = Γ̄(Ū_{t+1}), through two operators Γ and Γ̄, respectively. These three operators are closely related and are useful in different contexts.

To study the convergence of continuation-value frontiers, we equip the space of continuation-value sets with the Hausdorff metric, which measures the distance between two compact sets U ⊂ ℝ^n and V ⊂ ℝ^n:

d_H(U, V) ≡ max{ sup_{u∈U} inf_{v∈V} ‖u − v‖, sup_{v∈V} inf_{u∈U} ‖u − v‖ }.

In this definition, ‖·‖ denotes the maximum norm (chosen for convenience), i.e., ‖u‖ = max{|u_1|, |u_2|, ..., |u_n|} for any u ∈ ℝ^n. It is known that the induced Hausdorff-metric set space is complete if the underlying metric space (ℝ^n) is complete. We can show the following result (the Γ̄ operator is more convenient here):

Lemma 3. The operator Γ̄ is a contraction mapping with modulus β on the continuation-value set space with respect to the Hausdorff metric.

The contraction-mapping theorem immediately implies the existence and uniqueness of the continuation-value frontier over an infinite horizon:

Theorem 3. If T = ∞, the POMDP problem has a unique continuation-value set U_∞, a unique continuation-value frontier Ū_∞ ⊂ U_∞, and a unique extreme-point set U*_∞ ⊂ Ū_∞.
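For finite sets of continuation-value vectors, the Hausdorff distance under the maximum norm can be computed directly from the definition; a small illustrative sketch:

```python
import numpy as np

def hausdorff_distance(U, V):
    """Hausdorff distance between two finite point sets under the maximum norm."""
    U = [np.asarray(u, dtype=float) for u in U]
    V = [np.asarray(v, dtype=float) for v in V]
    d = lambda u, v: float(np.max(np.abs(u - v)))                 # ||u - v|| (max norm)
    one_sided = lambda A, B: max(min(d(a, b) for b in B) for a in A)
    return max(one_sided(U, V), one_sided(V, U))
```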

Lemma 3 also guarantees that any sequence of continuation-value frontiers resulting from successive Γ̄ operations converges to a unique limit, which proves the convergence of value iteration. The next result unveils the coherence between the value frontier and the belief process that has been suppressed in our analysis.

Because the value frontier Ū_∞ is a positive convex hull, for any extreme point u* ∈ U*_∞, the unnormalized belief set Π̄(u*) ≡ { π ∈ ℝ^n_+ : π′u* ≥ π′u ∀ u ∈ U_∞ } is a nonempty cone, where ℝ^n_+ is the nonnegative orthant of ℝ^n. This set of belief vectors testifies that u* is on the continuation-value frontier. If Ū_∞ is smooth (differentiable) at u*, Π̄(u*) is a single ray along the outernormal direction of Ū_∞ at u*; if Ū_∞ is nonsmooth at u*, Π̄(u*) is a convex set. The collection of sets { Π̄(u*) : u* ∈ U*_∞ } constitutes a partition of the unnormalized belief space ℝ^n_+, except that the boundaries of these sets may overlap. According to expression (2), the unnormalized posterior belief following prior belief π, action a, and observation θ is (π′P^aR^a(θ))′. Let Π̄^a(u*; θ) ≡ { (π′P^aR^a(θ))′ : π ∈ Π̄(u*) } be the set of posterior beliefs updated from the set of prior beliefs Π̄(u*).

By expression (11), any extreme point of the continuation-value frontier can be generated from an action a* and a set of continuation-value vectors { u(θ) }_{θ∈Θ} that are also extreme points of the value frontier. We have the following result:


Theorem 4. If an extreme point on the continuation-value frontier, u* ∈ U*_∞, is generated from an action a* and extreme points { u(θ) ∈ U*_∞ }_{θ∈Θ}, then Π̄^{a*}(u*; θ) ⊂ Π̄(u(θ)) for any θ ∈ Θ.

The theorem suggests that if we start with a belief set that supports an optimal policy, the posterior-belief set shrinks (relatively) as time elapses and always supports a single continuation policy at any given point in time. Using the terminology of Sondik (1978), the collection { Π̄(u*) : u* ∈ U*_∞ } forms a Markov partition of the unnormalized belief space, although he states this property only for the so-called "finitely transient policies." In view of this result, the belief process can be kept implicit in the analysis, because there always exists a belief process consistent with any policy derived from the continuation-value frontier. This result also holds in the finite-horizon setting, which can be shown by modifying the proof of Theorem 4.

5.2. Stationary Policies: Finite-State Controllers

For the MDP problem with finite state and action sets, policy iteration can find an optimal policy in finite time because there always exists an optimal policy that is deterministic and stationary, and there are only finitely many such policies. This is an intriguing issue for the POMDP problem, however, because there are infinitely many history-dependent deterministic policies.

Sondik's algorithm is deemed impractical by computer scientists because of its complexity (Hansen 1998a). However, within the new framework, Sondik's and Hansen's algorithms are related. The key concept is the finite-state controller (also called a "plan graph" in Kaelbling et al. 1998), denoted by σ, which consists of a finite set K of control states, an action rule {a(k) ∈ A}_{k∈K}, and a transition rule {s(k, θ) ∈ K}_{k∈K, θ∈Θ}. In control state k, action a(k) is taken, and observation θ switches the controller to state s(k, θ). A finite-state controller simplifies an infinite policy tree, as shown in Figure 3, to a cyclic policy graph. An example is given next.

Example 3. The true status of a simplistic economy is unobservable. However, the performance of a representative market in the economy can be observed, as either "boom" or "bust," which may serve as an indicator of the economy. A typical firm can produce at three quantity levels: low, medium, and high. The firm's action affects the underlying economy and the representative market. Thus, this economy can be modeled as a POMDP with hidden states, two observations, and three actions. A production policy of the firm in general depends upon the entire observation history, but simple policies like the one below may be particularly interesting: if the indicator was "bust" in the last period, choose the low production level; if the indicator was "boom" in the last period but "bust" one period earlier, choose medium; if the indicator was "boom" in the last two periods or more, choose high. This policy can be described by a three-state controller, as illustrated in Figure 7.

[Figure 7. A three-state production policy in a simplistic economy, with control states Low, Medium, and High: a "bust" observation moves the controller to Low from any state; a "boom" observation moves it from Low to Medium and from Medium or High to High.]

A state of the controller k ∈ K (a node in the policy graph) corresponds to a continuation policy and is associated with a continuation-value vector, denoted by u*_σ(k), which can be computed from a system of equations:

u*_σ(k) = g^{a(k)} + βP^{a(k)} Σ_{θ∈Θ} R^{a(k)}(θ) u*_σ(s(k, θ)),   k ∈ K.   (12)

Any deterministic policy can be approximated arbitrarily closely by a finite-state controller by increasing the number of control states. A finite-state controller may sometimes be optimal, which can be verified directly as follows (a corollary of Theorem 3).

Corollary 1. If the set of continuation-value vectors generated by a finite-state controller is invariant under the Γ* operation, the controller defines an optimal policy.

Another type of policy determined by a finite number of objects is the finite-memory (or finite-history) policy, in which the current action depends only upon a finite number of recent observations and actions (Platzman 1977, White and Scherer 1994). Finite-memory policies and finite-state controllers have major differences that have not been fully elaborated in the literature. Conceptually, a control state in a finite-state controller represents a continuation policy extending into the infinite future, whereas a memory state in a finite-memory policy represents a history of the past. Mathematically, any finite-memory policy with finitely many memory states can be represented by a finite-state controller, but not vice versa. In fact, the finite-state controller in Example 3 is also a finite-memory policy. However, it is easy to find examples in which a finite-state controller cannot be converted into a finite-memory policy (Yu and Bertsekas 2008).
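Equation (12) is a linear system in the stacked vector (u*_σ(k))_{k∈K} and can be solved directly. The sketch below is one possible way to perform this policy-evaluation step (a hypothetical helper, reusing the POMDP container of §2; the action rule and transition rule are passed as dictionaries).

```python
import numpy as np

def evaluate_controller(model, n, K, action_of, successor_of, observations):
    """Solve Equation (12): u(k) = g^{a(k)} + beta * P^{a(k)} * sum_theta R^{a(k)}(theta) u(s(k, theta))."""
    index = {k: i for i, k in enumerate(K)}
    A = np.eye(len(K) * n)
    b = np.zeros(len(K) * n)
    for k in K:
        a = action_of[k]
        rows = slice(index[k] * n, (index[k] + 1) * n)
        b[rows] = model.g[a]
        for theta in observations:
            j = index[successor_of[(k, theta)]]
            cols = slice(j * n, (j + 1) * n)
            # Move the coupling term beta * P^a R^a(theta) u(s(k, theta)) to the left-hand side.
            A[rows, cols] -= model.beta * model.P[a] @ model.R_diag(a, theta)
    u = np.linalg.solve(A, b)
    return {k: u[index[k] * n:(index[k] + 1) * n] for k in K}
```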


5.3. Policy Iteration ThroughFinite-State Controllers

The policy-iteration algorithms by Sondik (1978) andHansen (1998a) both center on finite-state controllers.However, Sondik’s algorithm relies on both belief vec-tors and continuation-value vectors—essentially operatingin both primal and dual spaces—and is inevitably moreinvolved. For completeness, we present Hansen’s algorithmhere, with minor modifications in the description. Thisalgorithm is extended by Poupart and Boutilier (2004) toincorporate randomized policies. The control state corre-sponding to a continuation-value vector u is denoted byk u below.

Hansen's Policy Iteration Algorithm

Step 1. Initialization: define an initial finite-state controller σ, and select a precision level ε > 0.

Step 2. Policy evaluation: calculate σ's value vectors (Equation (12)), denoted by the set U.

Step 3. Policy improvement: perform one-step value iteration, obtaining U′ from U by Equation (11), and modify σ to σ′ as follows.

(a) For each ū ∈ U′: (i) If ū = u for some u ∈ U, keep k(u) unchanged. (ii) Else, if ū dominates u componentwise for some u ∈ U, replace the action and successor links of k(u) by those used to create ū. If there is more than one such u, the corresponding control states can be combined into a single one. (iii) Otherwise, add a new control state k(ū) with the action and successor links used to create ū.

(b) For each u ∈ U\U′, if it is not used to create any vector in U′, remove k(u); otherwise, keep k(u) unchanged.

Step 4. Termination test: if d_H(PCo(U′), PCo(U)) ≤ (1 − β)ε/β, exit with an ε-optimal policy. Otherwise, replace σ with σ′ and return to Step 2.
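A rough sketch of the bookkeeping in Step 3, under simplifying assumptions of our own: control states are integer ids, value vectors are NumPy arrays, each newly generated vector carries the action and successor links that created it, exact comparisons stand in for numerically tolerant ones, and the pruning in part (b) as well as the repair of links into removed states are omitted.

import numpy as np

def improve_controller(controller, U, U_new):
    """One pass of Step 3(a): controller maps id -> (action, links),
    U maps id -> value vector (from Equation (12)), and U_new is a list of
    (value vector, action, links) produced by one step of value iteration."""
    next_id = max(controller, default=-1) + 1
    for v_new, action, links in U_new:
        # (i) identical vector already present: keep that control state.
        if any(np.array_equal(v_new, v) for v in U.values()):
            continue
        # (ii) v_new dominates existing vectors: redirect one state, drop the rest.
        dominated = [k for k, v in U.items() if np.all(v_new >= v)]
        if dominated:
            controller[dominated[0]] = (action, links)
            for k in dominated[1:]:
                controller.pop(k, None)
            continue
        # (iii) otherwise: add a brand-new control state.
        controller[next_id] = (action, links)
        next_id += 1
    return controller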

It can be shown that in a policy improvement step, if σ is not optimal, σ′ generates a weakly improved value frontier that is strictly better at some control states, and the policy iteration algorithm converges to an ε-optimal policy after a finite number of iterations. Hansen tests the above algorithm on 10 POMDP problems and observes a convergence rate 40 to 50 times faster than that of the value iteration method on average. Because the one-step value-iteration operation in (11) is still needed in the policy improvement step, which is in fact the bottleneck of the algorithm, Fukuda's Minkowski-sum algorithm may still be used to improve the efficiency of this algorithm.
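The termination test in Step 4 compares two value frontiers in the Hausdorff metric. As a small numerical illustration (our own helper, applied to finite sets of extreme points with the Euclidean norm rather than to the full positive convex hulls), the distance can be computed as follows.

import numpy as np

def hausdorff(U, V):
    """Symmetric Hausdorff distance between finite point sets (rows of U, V)."""
    D = np.linalg.norm(U[:, None, :] - V[None, :, :], axis=2)  # pairwise distances
    return max(D.min(axis=1).max(), D.min(axis=0).max())

U = np.array([[0.0, 0.0], [1.0, 2.0]])
V = np.array([[0.0, 0.5], [1.0, 2.0], [3.0, 1.0]])
print(hausdorff(U, V))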

6. Partial Order of Observability

Partial observability is the defining characteristic of POMDPs, but except for the two extreme cases, perfect observability and no observability (Satia and Lave 1973, White 1980), little can be found in the literature that compares POMDPs from the perspective of observability. In this section, we extend Blackwell's (1953) notion of informativeness to define a partial order of POMDPs and show that it leads to a partial order of continuation-value frontiers. The same notion can also be found in White (1979), but it has not been pursued rigorously since then, to the best knowledge of the author. In what follows, we start from the two extreme cases and then proceed to the middle.

Case 1: Perfect Observability. If an observation matrix R_a is the identity matrix I, the action a perfectly reveals the system state, and each diagonal matrix R_a(θ) contains only one nonzero element, the θth main-diagonal element. Then, the positive Minkowski sum ⊕_{θ∈Θ} {R_a(θ)u : u ∈ U} given any point set U is a single point, max{u : u ∈ U}, where the maximum is taken componentwise. If all R_a, a ∈ A, are identity matrices, the value frontier U_t reduces to a singleton {u*_t}, and the recursive expression (11) reduces to the standard dynamic programming formulation: u*_t = max_{a∈A} {g_a + βP_a u*_{t+1}}. This MDP problem serves as an upper bound for the POMDP problem.

Case 2: No Observability. If a column of an observation matrix R_a is proportional to the vector e (consisting of all 1s), the corresponding observation reveals no information about the system state. If every column of R_a is proportional to e, i.e., R_a = [λ_{a,1}e, ..., λ_{a,m}e] for λ_{a,θ} ≥ 0 with ∑_{θ∈Θ} λ_{a,θ} = 1, no observation carries any information about the system state. In this case, we have R_a(θ) = λ_{a,θ}I for all θ. For any positive convex hull U, the positive Minkowski sum ⊕_{θ∈Θ} {R_a(θ)u : u ∈ U} equals U exactly. Thus, expression (11) reduces to U*_t = PCoEx(⋃_{a∈A} {g_a + βP_a u^a_{t+1} : u^a_{t+1} ∈ U*_{t+1}}). This hidden-state MDP problem is significantly simpler than the POMDP problem because the most time-consuming step, the Minkowski sum, is absent from the formulation. This problem provides a lower bound for the POMDP problem.

Case 3: Partial Observability. The above upper and lower bounds depend purely upon the underlying Markov decision process. If the observation matrices R_a satisfy neither of the above conditions, the model falls into the partial-observability case, and its continuation-value frontier falls between the two bounding frontiers. To compare POMDPs that share the same underlying MDP, we start with a partial order of observation matrices, following Blackwell (1953). Recall that an n-state observation matrix is a nonnegative stochastic matrix with n rows, all summing to one.
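The perfect-observability reduction in Case 1 is easy to verify numerically on a toy point set. In the sketch below (our own helper functions and made-up data), the positive Minkowski sum is approximated by the plain Minkowski sum of finite sets followed by discarding componentwise-dominated points; with one-hot diagonal matrices the result collapses to the single componentwise maximum, whereas with R_a(θ) = λ_θ I every summand would simply be a scaled copy of U, as stated in Case 2.

import numpy as np
from itertools import product

def minkowski_sum(point_sets):
    """Plain Minkowski sum of finitely many finite point sets."""
    return np.array([np.sum(combo, axis=0) for combo in product(*point_sets)])

def undominated(points):
    """Drop points that are componentwise dominated by another point."""
    keep = [p for p in points
            if not any(np.all(q >= p) and np.any(q > p) for q in points)]
    return np.unique(np.array(keep), axis=0)

# Toy set U in R^2; one-hot diagonals model the perfect-observability case.
U = np.array([[1.0, 4.0], [3.0, 1.0], [2.0, 2.0]])
R = [np.diag([1.0, 0.0]), np.diag([0.0, 1.0])]
images = [np.array([Rtheta @ u for u in U]) for Rtheta in R]
print(undominated(minkowski_sum(images)))  # a single point ...
print(U.max(axis=0))                       # ... equal to the componentwise max [3., 4.]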

Definition 5. Among the n-state observation matrices, A is more informative than B, denoted by A ⪰ B, if there exists a nonnegative stochastic matrix X such that AX = B. If A ⪰ B and B ⪰ A, A is equivalent to B, denoted by A ≈ B.

Notice that permuting the columns of any observation matrix creates an equivalent observation matrix. For a study of the equivalence classes, see Sulganik (1995). Next, we show that a more informative observation matrix generates greater continuation values.
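Definition 5 can also be checked numerically: A ⪰ B holds exactly when a nonnegative X with unit row sums and AX = B exists, which is a linear feasibility problem. The sketch below uses scipy.optimize.linprog with a zero objective; the helper and the example matrices are ours, not part of the paper.

import numpy as np
from scipy.optimize import linprog

def more_informative(A, B):
    """Is there a row-stochastic X >= 0 with A @ X == B (Blackwell's relation)?"""
    n, mA = A.shape
    mB = B.shape[1]
    num = mA * mB                       # variables: X flattened row-major
    A_eq = np.zeros((n * mB + mA, num))
    b_eq = np.zeros(n * mB + mA)
    for i in range(n):                  # constraints: A @ X == B
        for j in range(mB):
            row = i * mB + j
            for k in range(mA):
                A_eq[row, k * mB + j] = A[i, k]
            b_eq[row] = B[i, j]
    for k in range(mA):                 # constraints: each row of X sums to 1
        A_eq[n * mB + k, k * mB:(k + 1) * mB] = 1.0
        b_eq[n * mB + k] = 1.0
    res = linprog(np.zeros(num), A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * num, method="highs")
    return res.status == 0              # status 0 means a feasible X was found

A = np.eye(2)                           # perfectly informative
B = np.array([[0.7, 0.3], [0.4, 0.6]])  # a garbling of A
print(more_informative(A, B), more_informative(B, A))  # True, False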

Lemma 4. Consider two n-state observation matrices A and B, with m_A and m_B columns, respectively. Let A(θ) and B(θ) denote the diagonal matrices formed from the θth column of A and B, respectively. Then, if A ⪰ B, for any point set U ⊂ ℝ^n, the positive convex hull W_A(U) = ⊕_{θ=1}^{m_A} {A(θ)u : u ∈ U} dominates W_B(U) = ⊕_{θ=1}^{m_B} {B(θ)u : u ∈ U}. If A ≈ B, W_A(U) = W_B(U) for any U ⊂ ℝ^n.

The class of n-state observation matrices contains a least equivalence subclass and a greatest equivalence subclass, consistent with our previous discussion of the two extreme cases. The proof follows from the definition of the ⪰ relation and is omitted.

Theorem 5. (1) All n-state observation matrices with full row rank and one nonzero element in each column are equivalent, forming the perfect-observability class. (2) All n-state observation matrices of the form [λ_1e, ..., λ_me] for any λ_θ ≥ 0 with ∑_{θ=1}^m λ_θ = 1 are equivalent, forming the no-observability class. (3) If A and C belong to the perfect- and no-observability classes, respectively, and B is an arbitrary n-state observation matrix, we have A ⪰ B ⪰ C.

Finally, the informativeness relation of the observation matrices induces a partial order of the POMDPs, which in turn leads to a partial order of their value frontiers.

Definition 6. For two POMDPs sharing the same underlying MDP, one is more observable than the other if, for every action, the observation matrix in the former is more informative than that in the latter.

Theorem 6. Suppose that one POMDP is more observable than another. Then, (1) if T < ∞, starting with the same singleton set {g_{T+1}}, the continuation-value frontier of the former dominates that of the latter in every period; and (2) if T = ∞, the continuation-value frontier of the former dominates that of the latter.

7. Conclusion

In this paper, we have proposed a novel framework for the POMDP problem, based on continuation policies and continuation-value vectors, with natural geometric interpretations. The framework is more parsimonious than the traditional framework based on belief vectors. It unveils the relationship between the POMDP problem and two existing computational geometry problems, which can help solve the POMDP problem more efficiently. The framework can clarify some existing POMDP algorithms over both finite and infinite horizons and sheds new light on them. It also facilitates the comparison of POMDPs in terms of observability, which is a useful structural result.

We conclude the paper with a brief discussion of possible future research topics. An important topic not addressed in this paper is the structural properties of optimal policies for some special POMDPs. This is a well-known challenging task, even in the two-state case. For example, Ross (1971) shows that the optimal policy for a two-state machine replacement problem does not have the control-limit property. Other examples of structural results for optimal policies are White (1977, 1979) and Grosfeld-Nir (1996, 2007). The geometry underlying the new framework provides a handy tool for exploring policy structures, which can be a fruitful future research direction.

Another important topic absent from the paper is the approximation of optimal solutions. Because of the computational burden inherent to the POMDP problem, approximation may be unavoidable in solving practical problems. The finite-state controller discussed in §5 can generate lower bounds for the continuation-value frontier in the infinite-horizon case. The number of control states can be judiciously chosen to balance the quality of approximation and the cost of computation. To generate upper bounds, we can conduct successive value iterations starting with the perfect-observability solution. A common approximation approach in the primal space is to focus on belief vectors on a finite grid (Kakalik 1965, Eckles 1966, Cheng 1988, Lovejoy 1991a). The same idea can be applied to the dual space, which is another valuable topic for future studies.

Last but not least, the belief vectors can be reintroduced into the geometric framework. In this paper, we have implicitly aimed to solve the POMDP problem for all initial state distributions simultaneously. However, many applications of the problem only require the solution for a given initial state distribution and may not need the complete continuation-value frontiers. Some (approximation) algorithms in the literature are tailored to such solutions (e.g., Satia and Lave 1973, Hansen 1998b, Pineau et al. 2003). Starting from an initial belief vector and moving forward, in each period, we can directly compute a set of feasible beliefs that can be reached under an arbitrary policy. An extreme point of the continuation-value frontier at time t need not be created if it is not supported by any feasible belief at time t. A main implication from this paper is that the feasible belief sets and continuation-value frontiers can be disentangled and independently generated (forward and backward, respectively). Acquiring and utilizing both types of information prudently may save computational time substantially for problems stressing the initial state distribution. The details are left for future investigations.
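As a rough illustration of the forward computation just described, the sketch below enumerates the beliefs reachable from a given initial belief over a few periods, under every action and observation. The model data are made-up two-state placeholders (R[a][i, θ] is read here as the probability of observation θ when the next state is i), and no pruning or merging of duplicate beliefs is attempted.

import numpy as np
from itertools import product

def reachable_beliefs(pi0, P, R, horizon):
    """Layers of beliefs reachable from pi0 within `horizon` periods."""
    layers = [[np.asarray(pi0, dtype=float)]]
    for _ in range(horizon):
        nxt = []
        for pi, a in product(layers[-1], P):
            for theta in range(R[a].shape[1]):
                raw = (pi @ P[a]) * R[a][:, theta]   # Bayes-rule numerator
                if raw.sum() > 1e-12:
                    nxt.append(raw / raw.sum())
        layers.append(nxt)
    return layers

P = {"a1": np.array([[0.9, 0.1], [0.3, 0.7]]),
     "a2": np.array([[0.5, 0.5], [0.2, 0.8]])}
R = {"a1": np.array([[0.8, 0.2], [0.3, 0.7]]),
     "a2": np.array([[0.6, 0.4], [0.4, 0.6]])}
layers = reachable_beliefs([0.5, 0.5], P, R, horizon=2)
print([len(layer) for layer in layers])  # growth of the reachable-belief sets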

Appendix. Proofs of Lemmas and Theorems

Proof of Lemma 1. (1) We first show that for any π ∈ Π, there is a w̄ ∈ PCo(Ω) such that π′w̄ ≥ π′w for all w ∈ Ω. If π ∈ Π+, by the definition of PCO, such a w̄ must exist, and then we are done. Suppose that π ∈ Π\Π+. By the definitions of Π and Π+, there exists a sequence of belief vectors π_k ∈ (π + (1/k)B) ∩ Π+, where B is the unit open ball and π + (1/k)B is the open ball centered at π with radius 1/k. Clearly, lim_{k→∞} π_k = π. For each k, there exists a w̄_k ∈ PCo(Ω) such that π_k′w̄_k ≥ π_k′w for all w ∈ Ω. Because Ω ⊂ ℝ^n is bounded, by the Bolzano-Weierstrass theorem, the sequence {w̄_k} has a convergent subsequence {w̄_{k_j}}. Let w̄ = lim_{j→∞} w̄_{k_j}. Because PCo(Ω) is closed, w̄ ∈ PCo(Ω). We write

π′(w̄ − w) = π′(w̄ − w̄_{k_j}) + (π − π_{k_j})′(w̄_{k_j} − w) + π_{k_j}′(w̄_{k_j} − w).

Because (i) lim_{j→∞} π′(w̄ − w̄_{k_j}) = 0, (ii) lim_{j→∞} (π − π_{k_j})′(w̄_{k_j} − w) = 0 for all w ∈ Ω (by the boundedness of Ω), and (iii) π_{k_j}′(w̄_{k_j} − w) ≥ 0 for all w ∈ Ω and j, we have π′(w̄ − w) ≥ 0 for all w ∈ Ω.

Next, we show that for any w̄ ∈ PCo(Ω), there is a π ∈ Π such that π′w̄ ≥ π′w for all w ∈ Ω. If w̄ ∈ argmax{π′w : w ∈ Ω} for some π ∈ Π+, we are done. Otherwise, by the definition of PCO, w̄ must be the limit of a point sequence {w̄_k} ⊂ PCo(Ω) associated with a belief sequence {π_k} ⊂ Π+ such that π_k′w̄_k ≥ π_k′w for all w ∈ Ω. Again, by the Bolzano-Weierstrass theorem, {π_k} has a convergent subsequence {π_{k_j}}. Let π = lim_{j→∞} π_{k_j}. We have π ∈ Π by the closedness of Π. We write

π′(w̄ − w) = (π − π_{k_j})′(w̄ − w) + π_{k_j}′(w̄ − w̄_{k_j}) + π_{k_j}′(w̄_{k_j} − w).

Because (i) lim_{j→∞} π_{k_j}′(w̄ − w̄_{k_j}) = 0, (ii) lim_{j→∞} (π − π_{k_j})′(w̄ − w) = 0 for all w ∈ Ω (by the boundedness of Ω), and (iii) π_{k_j}′(w̄_{k_j} − w) ≥ 0 for all w ∈ Ω and j, we have π′(w̄ − w) ≥ 0 for all w ∈ Ω.

(2) Because

V_1(π) + V_2(π) = max_{w∈Ω_1} π′w + max_{w∈Ω_2} π′w = max_{w_1∈Ω_1, w_2∈Ω_2} π′(w_1 + w_2) = max_{w∈Ω_1+Ω_2} π′w,

the function is dual to PCo(Ω_1 + Ω_2), by part (1).

(3) Because

max{V_1(π), V_2(π)} = max{max_{w∈Ω_1} π′w, max_{w∈Ω_2} π′w} = max_{w∈Ω_1∪Ω_2} π′w,

the function is dual to PCo(Ω_1 ∪ Ω_2), by part (1).

Proof of Lemma 2. The proof is by induction. Consider period t and suppose that the two iterations start with the same Ω_{t+1} set. For the sake of clarity, we dedicate the label Ω_t to the Ω_t set that results from the three-step iteration and use the label Ω*_t for the one that results from the aggregated iteration. It suffices to show that the PCOs underlying Ω_t and Ω*_t are identical. For convenience, define W*_t = ⋃_{a∈A} {g_a + β∑_{θ∈Θ} P_aR_a(θ)u^a_{t+1}(θ) : u^a_{t+1}(θ) ∈ Ω_{t+1} ∀θ ∈ Θ}; hence, Ω*_t = PCoEx(W*_t) and PCo(Ω*_t) = PCo(W*_t). Clearly, Ω_t ⊂ Co(W*_t), so Ω_t is dominated by PCo(Ω*_t) (with respect to the directions in Π). It remains to show the reverse, i.e., that Ω*_t is dominated by PCo(Ω_t).

Consider any w̄ ∈ Ω*_t. By the definition of PCO, one of the following must be true: (a) there exists π ∈ Π+ such that π′w̄ ≥ π′w for all w ∈ W*_t, or (b) w̄ is the limit of a point sequence {w̄_k} ⊂ W*_t associated with a direction sequence {π_k} ⊂ Π+ such that π_k′w̄_k ≥ π_k′w for all w ∈ W*_t and for all k. In case (a), because w̄ is an extreme point of Co(W*_t), there must exist a ∈ A and {u^a_{t+1}(θ) ∈ Ω_{t+1}}_{θ∈Θ} such that w̄ = g_a + β∑_{θ∈Θ} P_aR_a(θ)u^a_{t+1}(θ). For any θ ∈ Θ, P_aR_a(θ)u^a_{t+1}(θ) is dominated by PCo(Ω^a_t(θ)), and hence ∑_{θ∈Θ} P_aR_a(θ)u^a_{t+1}(θ) is dominated by PCo(Ω^a_t). As a result, w̄ is dominated by PCo(Ω_t). In case (b), because PCo(Ω_t) is closed and every w̄_k is dominated by PCo(Ω_t), w̄ must be dominated by PCo(Ω_t) as well. Thus, Ω*_t is dominated by PCo(Ω_t). Therefore, we have PCo(Ω*_t) = PCo(Ω_t) and Ω*_t = Ω_t.

Proof of Theorem 2. (1) We show that Equation (9) is true. Given the time-(t+1) continuation-value set U_{t+1}, a time-t continuation-value vector

g_a + βP_a ∑_{θ∈Θ} R_a(θ)u^a_{t+1}(θ)

can be obtained by taking action a and selecting the time-(t+1) continuation-value vector u^a_{t+1}(θ) after observation θ. The convex hull of

⋃_{a∈A} {g_a + βP_a ∑_{θ∈Θ} R_a(θ)u^a_{t+1}(θ) : u^a_{t+1}(θ) ∈ U_{t+1} ∀θ ∈ Θ}

contains all time-t continuation-value vectors that can be obtained by randomization.

(2) We show that (10) is true. Note that: (a) for any π ∈ Π, a ∈ A, and θ ∈ Θ, if ‖π′P_aR_a(θ)‖_2 > 0, then π′P_aR_a(θ)/‖π′P_aR_a(θ)‖_2 ∈ Π, where ‖·‖_2 denotes the Euclidean norm; (b) by Lemma 1(1), for any compact convex set U_{t+1} ⊂ ℝ^n and any π ∈ Π, there exists ū ∈ PCo(U_{t+1}) such that π′ū ≥ π′u for any u ∈ U_{t+1}. These facts, combined with Equation (9) and the definition of PCO, imply that any point in PCo(U_t) can be created from PCo(U_{t+1}). Thus, Equation (10) follows.

(3) Now we show that (11) is true. Consider any extreme point u*_t ∈ U_t. By expression (10), there must exist a* ∈ A and {u^{a*}_{t+1}(θ) ∈ U_{t+1}}_{θ∈Θ} such that u*_t = g_{a*} + βP_{a*} ∑_{θ∈Θ} R_{a*}(θ)u^{a*}_{t+1}(θ). If u^{a*}_{t+1}(θ) is an extreme point of U_{t+1} for all θ ∈ Θ, we are done. For any θ ∈ Θ, if u^{a*}_{t+1}(θ) is not an extreme point of U_{t+1}, it must lie on a face F(θ) ⊂ U_{t+1} (with at least one dimension) or in the interior of U_{t+1} (define F(θ) = U_{t+1} in that case). It follows that P_{a*}R_{a*}(θ)v = P_{a*}R_{a*}(θ)u^{a*}_{t+1}(θ) for all v ∈ F(θ); otherwise, P_{a*}R_{a*}(θ)u^{a*}_{t+1}(θ) can be expressed as the convex combination of P_{a*}R_{a*}(θ)v′ and P_{a*}R_{a*}(θ)v″ for some v′ ≠ v″ ∈ F(θ), and hence u*_t can be expressed as the convex combination of two distinct points in U_t, a contradiction. Thus, u^{a*}_{t+1}(θ) can be replaced by any extreme point of U_{t+1} on F(θ) without altering u*_t. Therefore, u*_t can always be constructed from the extreme points of U_{t+1}, and expression (11) holds.


(4) From part (3) and by induction, it is clear that every u*_t ∈ U*_t can be generated by a deterministic continuation policy.

Proof of Lemma 3. We show that for any two sets U, V ⊂ ℝ^n such that d_H(U, V) = d, Φ(U) and Φ(V) are also in ℝ^n and d_H(Φ(U), Φ(V)) ≤ βd. Consider any u* ∈ Ex(Φ(U)) (i.e., an extreme point of Φ(U)). By expression (10), there exist an action a* and a set of vectors {u(θ) ∈ U}_{θ∈Θ} such that u* = g_{a*} + βP_{a*} ∑_{θ∈Θ} R_{a*}(θ)u(θ). Because d_H(U, V) = d, for each u(θ), there exists v(θ) ∈ V such that ‖u(θ) − v(θ)‖ ≤ d. Let e denote the vector of 1s with the proper dimension. We have

u* − βd·e = g_{a*} + βP_{a*} ∑_{θ∈Θ} R_{a*}(θ)(u(θ) − d·e) ≤ g_{a*} + βP_{a*} ∑_{θ∈Θ} R_{a*}(θ)v(θ) ≡ v* ≤ g_{a*} + βP_{a*} ∑_{θ∈Θ} R_{a*}(θ)(u(θ) + d·e) = u* + βd·e.

The first and last equations above follow from P_{a*} ∑_{θ∈Θ} R_{a*}(θ)e = e. Clearly, v* ∈ Φ(V) and ‖u* − v*‖ ≤ βd. The result can be generalized to nonextreme u* ∈ Φ(U) by convex combinations. Similarly, we can show that for any v* ∈ Φ(V), there exists u* ∈ Φ(U) such that ‖u* − v*‖ ≤ βd. Thus, d_H(Φ(U), Φ(V)) ≤ βd, and the operator Φ is a contraction mapping with modulus β.

Proof of Theorem 4. Consider any π* ∈ Π(u*). By definition, π*′u* ≥ π*′u for all u ∈ U. By expression (10), u* = g_{a*} + βP_{a*} ∑_{θ∈Θ} R_{a*}(θ)u(θ). Thus, for any θ ∈ Θ, we must have π*′P_{a*}R_{a*}(θ)u(θ) ≥ π*′P_{a*}R_{a*}(θ)u for all u ∈ U, i.e., π*′P_{a*}R_{a*}(θ) ∈ Π(u(θ)). By definition, Π^{a*}(u*, θ) ⊂ Π(u(θ)).

Proof of Lemma 4. Suppose that A ⪰ B, i.e., there exists a nonnegative matrix X with row sums 1 such that AX = B. Then, the θth column of B is given by b_{·θ} = Ax_{·θ}, where x_{·θ} is the θth column of X. Let x_{kθ} be the (k, θ)th element of X. For any point w_B ∈ W_B(U), suppose that w_B = ∑_{θ=1}^{m_B} B(θ)u_B(θ) for some {u_B(θ) ∈ U}_{θ=1,...,m_B}. Then, w_B = ∑_{θ=1}^{m_B} ∑_{k=1}^{m_A} x_{kθ}A(k)u_B(θ) = ∑_{k=1}^{m_A} ∑_{θ=1}^{m_B} x_{kθ}A(k)u_B(θ). For each k = 1, ..., m_A, ∑_{θ=1}^{m_B} x_{kθ}A(k)u_B(θ) is a convex combination of the set of points {A(k)u_B(θ)}_{θ=1,...,m_B} and is hence weakly dominated by the positive convex hull of {A(k)u_B(θ)}_{θ=1,...,m_B}. By the definition of the positive Minkowski sum, w_B = ∑_{k=1}^{m_A} ∑_{θ=1}^{m_B} x_{kθ}A(k)u_B(θ) is dominated by ⊕_{k=1}^{m_A} {A(k)u_B(θ)}_{θ=1,...,m_B}. The latter is in turn dominated by W_A(U) because {u_B(θ)}_{θ=1,...,m_B} ⊂ U. Therefore, the entire set W_B(U) is dominated by W_A(U). If A ≈ B, we have A ⪰ B and B ⪰ A, and the above result implies W_A(U) = W_B(U) for any U ⊂ ℝ^n.

Proof of Theorem 6. For clarity, we label the second POMDP by a "˜" symbol. Thus, R_a ⪰ R̃_a for all a ∈ A. By Lemma 4, for any set U and any action a, the positive Minkowski sum ⊕_{θ∈Θ} {R_a(θ)u_a : u_a ∈ U} dominates ⊕_{θ∈Θ̃} {R̃_a(θ)u_a : u_a ∈ U}. Thus, Φ(U) = PCo(⋃_{a∈A} {g_a + βP_a ∑_{θ∈Θ} R_a(θ)u_a(θ) : u_a(θ) ∈ U ∀θ ∈ Θ}) dominates Φ̃(U) = PCo(⋃_{a∈A} {g_a + βP_a ∑_{θ∈Θ̃} R̃_a(θ)u_a(θ) : u_a(θ) ∈ U ∀θ ∈ Θ̃}) for any point set U, where Φ and Φ̃ denote the one-step value-iteration operators of the two POMDPs. If T < ∞, Φ^k(U) dominates Φ̃^k(U) for any set U and any k = 1, 2, ..., T; if T = ∞, the sequences {Φ^k(U)}_{k=1,2,...} and {Φ̃^k(U)}_{k=1,2,...} converge to the continuation-value frontiers U and Ũ of the two POMDPs, respectively, and hence U dominates Ũ.

Acknowledgments

The author thanks the associate editor and two anonymous referees for their constructive suggestions that improved the exposition of this paper, and Mahesh Nagarajan for his helpful comments on this work.

References

Anily, S., A. Grosfeld-Nir. 2006. An optimal lot-sizing and offline inspection policy in the case of nonrigid demand. Oper. Res. 54(2) 311–323.
Aoki, M. 1965. Optimal control of partially observable Markovian systems. J. Franklin Inst. 280 367–386.
Astrom, K. J. 1965. Optimal control of Markov decision processes with incomplete state estimation. J. Math. Anal. Appl. 10 174–205.
Berg, M. de, M. van Kreveld, M. Overmars, O. Schwarzkopf. 2000. Computational Geometry: Algorithms and Applications. Springer-Verlag, Berlin.
Bertsekas, D. 1976. Dynamic Programming and Stochastic Control. Academic Press, New York.
Blackwell, D. 1953. Equivalent comparisons of experiments. Ann. Math. Statist. 24 265–272.
Boissonnat, J-D., M. Yvinec. 1998. Algorithmic Geometry. Cambridge University Press, Cambridge, UK.
Boutilier, C. 2002. A POMDP formulation of preference elicitation problems. Proc. Eighteenth National Conf. Artificial Intelligence (AAAI-02). AAAI Press, Menlo Park, CA, 239–246.
Cassandra, A. R., L. P. Kaelbling, J. A. Kurien. 1996. Acting under uncertainty: Discrete Bayesian models for mobile robot navigation. Proc. IEEE/RSJ Internat. Conf. Intelligent Robots and Systems. IEEE, Piscataway, NJ, 963–972.
Cassandra, A. R., L. P. Kaelbling, M. L. Littman. 1994. Acting optimally in partially observable stochastic domains. Proc. Twelfth National Conf. Artificial Intelligence (AAAI-94). AAAI Press, Menlo Park, CA, 1023–1028.
Cassandra, A. R., M. Littman, N. L. Zhang. 1997. Incremental pruning: A simple, fast, exact method for partially observable Markov decision processes. Proc. Thirteenth Conf. Uncertainty in Artificial Intelligence (UAI-97). Morgan Kaufmann, San Francisco, 54–61.
Cheng, H.-T. 1988. Algorithms for partially observable Markov decision processes. Ph.D. dissertation, University of British Columbia, Vancouver, British Columbia, Canada.
Drake, A. 1962. Observation of a Markov process through a noisy channel. Ph.D. dissertation, Massachusetts Institute of Technology, Cambridge, MA.
Eckles, J. E. 1966. Optimum replacement of stochastically failing systems. Ph.D. dissertation, Stanford University, Stanford, CA.
Eckles, J. E. 1968. Optimum maintenance with incomplete information. Oper. Res. 16 1058–1067.
Ehrenfeld, S. 1976. On a sequential Markovian decision procedure with incomplete information. Comput. Oper. Res. 3 39–48.
Feng, Z., S. Zilberstein. 2004. Region-based incremental pruning for POMDPs. Proc. 20th Conf. Uncertainty in Artificial Intelligence (UAI-04). Morgan Kaufmann, San Francisco, 146–153.


Fernández-Gaucherand, E., A. Arapostathis, S. I. Marcus. 1991. On the average cost optimality equation and the structure of optimal policies for partially observable Markov decision processes. Ann. Oper. Res. 29 439–470.
Fukuda, K. 2004. From the zonotope construction to the Minkowski addition of convex polytopes. J. Symbolic Comput. 38 1261–1272.
Gritzmann, P., B. Sturmfels. 1993. Minkowski addition of polytopes: Computational complexity and applications to Gröbner bases. SIAM J. Discrete Math. 6(2) 246–269.
Grosfeld-Nir, A. 1996. A two-state partially observable Markov decision process with uniformly distributed observations. Oper. Res. 44(3) 458–463.
Grosfeld-Nir, A. 2007. Control limits for two-state partially observable Markov decision processes. Eur. J. Oper. Res. 182 300–304.
Hansen, E. A. 1998a. An improved policy iteration algorithm for partially observable MDPs. Advances in Neural Inform. Processing Systems 10 (NIPS-97). MIT Press, Cambridge, MA, 1015–1021.
Hansen, E. A. 1998b. Solving POMDPs by searching in policy space. Proc. Fourteenth Conf. Uncertainty in Artificial Intelligence (UAI-98). Morgan Kaufmann, San Francisco, 211–219.
Howard, R. 1960. Dynamic Programming and Markov Processes. MIT Press, Cambridge, MA.
Hsu, S.-P., D.-M. Chuang, A. Arapostathis. 2006. On the existence of stationary optimal policies for partially observed MDPs under the long-run average cost criterion. Systems Control Lett. 55 165–173.
Kaelbling, L. P., M. Littman, A. R. Cassandra. 1998. Planning and acting in partially observable stochastic domains. Artificial Intelligence 101 99–134.
Kakalik, J. S. 1965. Optimum policies for partially observable Markov systems. Technical Report 18, Operations Research Center, Massachusetts Institute of Technology, Cambridge, MA.
Lane, D. 1989. A partially observable model of decision making by fishermen. Oper. Res. 37(2) 240–254.
Littman, M. L., A. R. Cassandra, L. P. Kaelbling. 1995. Learning policies for partially observable environments: Scaling up. Proc. Twelfth Internat. Conf. Machine Learning (ICML-95). Morgan Kaufmann, San Francisco, 362–370.
Lovejoy, W. S. 1987. Some monotonicity results for partially observed Markov decision processes. Oper. Res. 35(5) 736–743.
Lovejoy, W. S. 1991a. Computationally feasible bounds for partially observed Markov decision processes. Oper. Res. 39 162–175.
Lovejoy, W. S. 1991b. A survey of algorithmic methods for partially observed Markov decision processes. Ann. Oper. Res. 28 47–66.
Lusena, C., J. Goldsmith, M. Mundhenk. 2001. Nonapproximability results for partially observable Markov decision processes. J. Artificial Intelligence Res. 14 83–103.
Marbach, P., O. Mihatsch, J. N. Tsitsiklis. 2000. Call admission control and routing in integrated service networks using neuro-dynamic programming. IEEE J. Selected Areas Commun. 18(2) 197–208.
Meuleau, N., K.-E. Kim, L. P. Kaelbling, A. R. Cassandra. 1999. Solving POMDPs by searching the space of finite policies. Proc. Fifteenth Conf. Uncertainty in Artificial Intelligence (UAI-99). Morgan Kaufmann, San Francisco, 417–426.
Meuleau, N., M. Hauskrecht, K.-E. Kim, L. Peshkin, L. P. Kaelbling, T. Dean, C. Boutilier. 1998. Solving very large weakly coupled Markov decision processes. Proc. Fifteenth National Conf. Artificial Intelligence (AAAI-98). AAAI Press, Menlo Park, CA, 165–172.
Monahan, G. E. 1982. A survey of partially observable Markov decision processes: Theory, models and algorithms. Management Sci. 28 1–16.
Montemerlo, M., J. Pineau, N. Roy, S. Thrun, V. Verma. 2002. Experiences with a mobile robotic guide for the elderly. Proc. Eighteenth National Conf. Artificial Intelligence (AAAI-02). AAAI Press, Menlo Park, CA, 587–592.
Paek, T., E. Horvitz. 2000. Conversation as action under uncertainty. Proc. Sixteenth Conf. Uncertainty in Artificial Intelligence (UAI-2000). Morgan Kaufmann, San Francisco, 455–464.
Papadimitriou, C. H., J. N. Tsitsiklis. 1987. The complexity of Markov decision processes. Math. Oper. Res. 12(3) 441–450.
Paz, A. 1971. Introduction to Probabilistic Automata. Academic Press, New York.
Pineau, J., G. Gordon, S. Thrun. 2003. Point-based value iteration: An anytime algorithm for POMDPs. Proc. Eighteenth Internat. Joint Conf. Artificial Intelligence (IJCAI-03). Morgan Kaufmann, San Francisco, 1025–1030.
Platzman, L. K. 1977. Finite memory estimation and control of finite probabilistic systems. Ph.D. dissertation, Massachusetts Institute of Technology, Cambridge, MA.
Platzman, L. K. 1980. Optimal infinite-horizon undiscounted control of finite probabilistic systems. SIAM J. Control Optim. 18(4) 362–380.
Poupart, P. 2005. Exploiting structure to efficiently solve large scale partially observable Markov decision processes. Ph.D. dissertation, University of Toronto, Toronto, Ontario, Canada.
Poupart, P., C. Boutilier. 2004. Bounded finite state controllers. Advances in Neural Inform. Processing Systems 16 (NIPS-03). MIT Press, Cambridge, MA, 823–830.
Puterman, M. L. 1994. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, New York.
Rockafellar, R. T. 1970. Convex Analysis. Princeton University Press, Princeton, NJ.
Ross, S. 1971. Quality control under Markovian deterioration. Management Sci. 17 587–596.
Satia, J. K., R. E. Lave. 1973. Markovian decision processes with probabilistic observation of states. Management Sci. 20 1–13.
Smallwood, R. D., E. J. Sondik. 1973. The optimal control of partially observable Markov processes over a finite horizon. Oper. Res. 21 1071–1088.
Sondik, E. J. 1971. The optimal control of partially observable Markov processes. Ph.D. dissertation, Stanford University, Stanford, CA.
Sondik, E. J. 1978. The optimal control of partially observable Markov processes over the infinite horizon: Discounted costs. Oper. Res. 26 282–304.
Sulganik, E. 1995. On the structure of Blackwell's equivalence classes of information systems. Math. Soc. Sci. 29 213–223.
Thrun, S. 2000. Monte Carlo POMDPs. Advances in Neural Inform. Processing Systems 12 (NIPS-99). MIT Press, Cambridge, MA, 1064–1070.
Treharne, J. T., C. R. Sox. 2002. Adaptive inventory control for nonstationary demand and partial information. Management Sci. 48(5) 607–624.
Wang, R. 1977. Optimal replacement policy under unobservable states. J. Appl. Probab. 14 340–348.
White, C. C. 1977. A Markov quality control process subject to partial observation. Management Sci. 23 843–852.
White, C. C. 1979. Optimal control-limit strategies for a partially observed replacement problem. Internat. J. Systems Sci. 10 321–331.
White, C. C. 1980. Monotone control laws for noisy, countable-state Markov chains. Eur. J. Oper. Res. 5 124–132.
White, C. C. 1991. A survey of solution techniques for the partially observed decision process. Ann. Oper. Res. 32 215–230.
White, C. C., W. T. Scherer. 1989. Solution procedures for partially observed Markov decision processes. Oper. Res. 37 791–797.
White, C. C., W. T. Scherer. 1994. Finite-memory suboptimal design for partially observed Markov decision processes. Oper. Res. 42 439–455.
Yu, H., D. Bertsekas. 2008. On near optimality of the set of finite-state controllers for average cost POMDP. Math. Oper. Res. 33(1) 1–11.
Zhang, N. L., W. Liu. 1996. Planning in stochastic domains: Problem characteristics and approximation. Technical Report HKUST-CS96-31, Hong Kong University of Science and Technology, Hong Kong.
Zhang, B., Q. Cai, J. Mao, B. Guo. 2001. Planning and acting under uncertainty: A new model for spoken dialogue systems. Proc. Seventeenth Conf. Uncertainty in Artificial Intelligence (UAI-01). Morgan Kaufmann, San Francisco, 572–579.
