Control and Decision Making in Uncertain Multiagent Hierarchical Systems
June 10, 2002
H. Jin Kim and Shankar Sastry
University of California, Berkeley
2
Outline
Hierarchical architecture for multiagent operations
Confronting uncertainty
Partial observation Markov games (POMgame)
Incorporating human intervention in control and decision making
Model predictive techniques for dynamic replanning
3
Partial-observation Probabilistic Pursuit-Evasion Game (PEG) with 4 UGVs and 1 UAV
Fully autonomous operation
4
Uncertainty pervades every layer!
Hierarchy in Berkeley Platform
[Block diagram: the Strategy Planner and Map Builder exchange agent, target, and obstacle positions with the Tactical Planner & Regulation layer over the Communications Network; vehicle-level sensor fusion combines INS, GPS, ultrasonic altimeter, and vision data (inertial positions, height over terrain, detected obstacles and targets); the tactical planner, trajectory planner, and regulation loops send control signals to the UAV and UGV dynamics, which are subject to exogenous disturbances, terrain, and targets.]
Impossible to build autonomous agents that can cope with all contingencies.
5
Human Interface
[Diagram: the ground station sends commands to the vehicles and receives current position and vehicle status; the evader location detected by the vision system is relayed to the operator.]
A high degree of autonomy does not guarantee superior performance of the overall system.
6
Lessons Learned and UAV/UGV Objective
To design semi-autonomous teams that deliver missions reliably under uncertainty, and to evaluate their performance
Scalable/replicable system aided by computationally tractable algorithms
Hierarchical architecture design and analysis
– High-level decision making in a discrete space
– Physical-layer control in a continuous space
Hierarchical decomposition requires tight interaction between layers to achieve cooperative behavior, to deconflict and to support constraints.
Confronting uncertainty arising from partially observable, dynamically changing environments and intelligent adversaries
– Proper degree of autonomy to incorporate reliance on human intervention
– Observability and directability, not excessive functionality
7
Representing and Managing Uncertainty
Uncertainty is introduced through various channels:
– Sensing: unable to determine the current state of the world
– Prediction: unable to infer the future state of the world
– Actuation: unable to execute the desired action so as to properly affect the state of the world

Different types of uncertainty can be addressed by different approaches:
– Nondeterministic uncertainty: robust control
– Probabilistic uncertainty: (partially observable) Markov decision processes
– Adversarial uncertainty: game theory
POMGAME
9
Markov Games
Framework for sequential multiagent interaction in a Markov environment
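For reference, one standard way to write this framework down (our notation; the slide's own formulas did not survive extraction) is:

A Markov game with n agents is a tuple (X, A_1, ..., A_n, P, r_1, ..., r_n, \gamma), where
    P(x' | x, a_1, ..., a_n) is the state transition probability,
    r_i : X \times A_1 \times ... \times A_n \to \mathbb{R} is the reward of agent i, and
    \gamma \in [0,1) is the discount factor.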
10
Policy for Markov Games
The policy of agent i at time t is a mapping from the current state to a probability distribution over its action set.
Agent i wants to maximize
– the expected infinite sum of discounted rewards that the agent will gain by executing the optimal policy starting from that state,
– where \gamma \in [0,1) is the discount factor and r_i(t) is the reward received at time t.
Performance measure: V_i^\pi(x) = E[ \sum_{t=0}^{\infty} \gamma^t r_i(t) | x(0) = x ].
Every discounted Markov game has at least one stationary optimal policy, but not necessarily a deterministic one.
Special case: Markov decision processes (MDP)
– Can be solved by dynamic programming
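For the MDP special case, a minimal dynamic-programming (value iteration) sketch in Python is shown below; the array shapes and names are illustrative assumptions, not from the slides.

import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-6):
    """P: (n_actions, n_states, n_states) transition probabilities P[a, x, x'].
    R: (n_states, n_actions) immediate rewards. Returns optimal values and a greedy policy."""
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    while True:
        # Q[x, a] = R[x, a] + gamma * sum_x' P[a, x, x'] V[x']
        Q = R + gamma * np.einsum('axy,y->xa', P, V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new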
11
Partial Observation Markov Games (POMGame)
12
Policy for POMGames
Agent i wants to receive at least its security level, i.e. the payoff it can guarantee regardless of the other agents' policies.
Poorly understood: analysis exists only for very specially structured games, such as games with complete information on one side.
Special case: partially observable Markov decision processes (POMDP)
13
Acting under Partial Observations
Memory-free policies (mappings from observation to action, or to a probability distribution over the action set) are not satisfactory.
In order to behave truly effectively we need to use the memory of previous actions and observations to disambiguate the current state.
The state estimate, or belief state:
– the posterior probability distribution over states, i.e. the likelihood that the world is actually in state x at time t, given the agent's past experience (action and observation histories)
– a priori human input can provide the initial belief over the state of the world
14
Updating Belief State
– The belief state can be updated recursively using the estimated world model and Bayes' rule:
b_{t+1}(x') \propto P(y_{t+1} | x') \sum_x P(x' | x, a_t) b_t(x),
where the observation likelihood P(y_{t+1} | x') carries the new information on the state of the world, and the prediction term \sum_x P(x' | x, a_t) b_t(x) carries the new information from prediction.
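A minimal sketch of this recursive update for a finite state space, in Python; the array layout and names are illustrative assumptions, not the deck's implementation.

import numpy as np

def update_belief(belief, trans_prob_a, obs_prob_y):
    """One step of the recursive Bayes update for a discrete belief state.

    belief       : (n_states,) prior, b_t(x) = P(x(t) = x | history)
    trans_prob_a : (n_states, n_states), trans_prob_a[x, x2] = P(x2 | x, a_t) for the executed action a_t
    obs_prob_y   : (n_states,), obs_prob_y[x2] = P(y_{t+1} | x2) for the received observation
    """
    predicted = belief @ trans_prob_a     # prediction step: new information from prediction
    posterior = obs_prob_y * predicted    # correction step: new information on the state of the world
    return posterior / posterior.sum()    # normalize (assumes the observation has nonzero likelihood)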
16
BEAR Pursuit-Evasion Scenario
Evade! Evade!
18
Optimal Pursuit Policy
Performance measure: capture time
T^* := min{ t >= 1 : {y(1), ..., y(t)} \notin Y_fnd },
where Y_fnd is the set of all measurement histories {y(1), ..., y(t)} associated with the evader not being found up to time t.
The optimal pursuit policy minimizes the cost J(\pi) := E[T^*].
19
Optimal Pursuit Policy
Cost-to-go for policy \pi, when the pursuers start with Y_t = Y and a conditional distribution \pi_X for the state x(t):
V_\pi(Y, \pi_X) := E[ min{ \tau >= t : {y(1), ..., y(\tau)} \notin Y_fnd } | Y_t = Y ],
where \pi_X : X -> [0,1], \pi_X(x) := P(x(t) = x | Y_t = Y) is the probability in \pi_X that corresponds to the state x.
Cost of policy \pi:
J(\pi) := E[ min{ t >= 1 : Y_t \notin Y_fnd } ] = \sum_y V_\pi({y}, \pi_X({y})) P(y(1) = y).
20
Persistent pursuit policies
Optimization using dynamic programming is computationally intensive.
A pursuit policy g is persistent if there exists \epsilon > 0 such that, for every t,
P_g(T^* = t | T^* >= t) >= \epsilon,
which implies P_g(T^* < \infty) = 1 and E_g[T^*] < \infty.
21
Persistent pursuit policies
A pursuit policy g is persistent with period T if there exists \epsilon > 0 such that, for every t,
P_g(T^* \in {t, ..., t + T - 1} | T^* >= t) >= \epsilon,
which again implies that the expected capture time E_g[T^*] is finite.
22
Pursuit Policies
• Greedy Policy
– Pursuer moves to the cell with the highest probability of containing the evader at the next time instant
– The strategic planner assigns more importance to local or immediate considerations
– u(v): the list of cells that are reachable from the current pursuer position v in a single time step
g(Y_t) := v_j, where j = argmax_{k \in {1, ..., n_p}} p_e(v_k, t+1 | Y_t) and {v_1, v_2, ..., v_{n_p}} = u(v).
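A sketch of this greedy rule on a grid belief map; the cell representation and helper names are our assumptions, not the deck's implementation.

def greedy_pursuit_action(evader_prob, reachable_cells):
    """Pick the reachable cell with the highest predicted evader probability.

    evader_prob     : dict mapping cell -> p_e(cell, t+1 | Y_t)
    reachable_cells : u(v), cells reachable from the pursuer's current cell in one step
    """
    return max(reachable_cells, key=lambda cell: evader_prob.get(cell, 0.0))

# Example: a pursuer at (2, 3) can move to four neighbouring cells
belief = {(2, 4): 0.10, (3, 3): 0.25, (1, 3): 0.05, (2, 2): 0.02}
next_cell = greedy_pursuit_action(belief, [(2, 4), (3, 3), (1, 3), (2, 2)])  # -> (3, 3)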
23
Persistent Pursuit Policies for unconstrained motion
Theorem 1 (unconstrained motion). The greedy policy is persistent:
P_g(T^* = t+1 | T^* >= t) >= max_k p_e(v_k, t+1 | Y_t) > 0,
so the probability of the capture time being finite is equal to one, and the expected value of the capture time is finite, with a bound on E_g[T^*] that depends on the number of pursuers n_p and the number of cells n_c.
24
Persistent Pursuit Policies for constrained motion
Assumptions
1. There is a constant \rho \in (0,1] such that p_e(x, t+1 | Y_t) >= \rho p_e(x, t | Y_t) for every cell x with P(x(t) = x | Y_t) < 1.
2. For any v_i, v_f \in U there exists a finite sequence {v(0), v(1), ..., v(m)} in U with v(0) = v_i, v(m) = v_f, and v(\tau + 1) \in U(v(\tau)) for each \tau.

Theorem 2 (constrained motion). Under Assumptions 1 and 2, there is an admissible pursuit policy that is persistent on the average, with a period T_o determined by the number of cells and the maximum number of steps needed to travel between any two cells.
25
Experimental Results: Pursuit-Evasion Games with 4 UGVs (Spring '01)
26
Experimental Results: Pursuit-Evasion Games with 4 UGVs and 1 UAV (Spring '01)
27
Pursuit-Evasion Game Experiment
PEG with four UGVs
• Global-max pursuit policy
• Simulated camera view (radius 7.5 m with 50-degree conic view)
• Pursuer: 0.3 m/s; evader: 0.5 m/s max
28
Pursuit-Evasion Game Experiment
PEG with four UGVs
• Global-max pursuit policy
• Simulated camera view (radius 7.5 m with 50-degree conic view)
• Pursuer: 0.3 m/s; evader: 0.5 m/s max
29
Experimental Results: Evaluation of Policies for different visibility
The global-max policy performs better than the greedy policy, since the greedy policy selects movements based only on local considerations.
Both policies perform better with the trapezoidal view, since the camera rotates fast enough to compensate for the narrow field of view.
Capture time of the greedy and global-max policies for different regions of visibility of the pursuers
3 Pursuers with trapezoidal or omni-directional view
Randomly moving evader
30
Experimental Results: Evader’s Speed vs. Intelligence
• Having a more intelligent evader increases the capture time
• It is harder to capture an intelligent evader moving at a higher speed
• The capture time of a fast random evader is shorter than that of a slower random evader when the evader's speed is only slightly higher than that of the pursuers
Capture time for different speeds and levels of intelligence of the evader
3 Pursuers with trapezoidal view & global maximum policy
Max speed of pursuers: 0.3 m/s
31
Game-theoretic Policy Search Paradigm
Solving very small games with partial information, or games with full information, is sometimes computationally tractable.
Many interesting games, including pursuit-evasion, are large games with partial information, and finding optimal solutions is well outside the capability of current algorithms.
An approximate solution is not necessarily bad: there might be simple policies with satisfactory performance.
-> Choose a good policy from a restricted class of policies!
We can find approximately optimal solutions from restricted classes using sparse sampling and a provably convergent policy search algorithm.
32
Constructing A Policy Class
Given a mission with specific goals, we
– decompose the problem in terms of the functions that need to be achieved for success and the means that are available
– analyze how a human team would solve the problem
– determine a list of important factors that complicate task performance, such as safety or physical constraints:
Maximize aerial coverage, stay within communications range, penalize actions that lead an agent into a danger zone, maximize the explored region, minimize fuel usage, ...
33
Policy Representation
Quantize the above features and define a feature vector \phi(a) that consists of estimates of the above quantities for each action a, given the agents' history.
Estimate the 'goodness' of each action by constructing the score w^T \phi(a), where w is the weighting vector to be learned.
Choose the action that maximizes w^T \phi(a), or choose a randomized action according to a distribution that favors high-scoring actions, with a parameter controlling the degree of exploration.
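A minimal sketch of this feature-weighted scoring with a temperature-controlled randomized choice; the softmax form and all names are our assumptions (the slide only states that a weighting vector is learned and that a degree-of-exploration parameter shapes the randomized choice).

import numpy as np

def choose_action(feature_vectors, w, temperature=1.0, rng=np.random.default_rng()):
    """feature_vectors: (n_actions, n_features) matrix of per-action features.
    w: learned weighting vector. Returns (greedy_action, sampled_action)."""
    scores = feature_vectors @ w                       # 'goodness' of each action
    greedy = int(np.argmax(scores))                    # deterministic choice
    # Softmax over scores is an illustrative assumption; temperature sets the degree of exploration.
    probs = np.exp((scores - scores.max()) / temperature)
    probs /= probs.sum()
    sampled = int(rng.choice(len(scores), p=probs))
    return greedy, sampled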
34
Policy Learning
Policy parameters are learned using standard techniques, such as gradient descent, to maximize the long-term reward.
Given a POMDP, and assuming that we have a deterministic simulative model, we can approximate the value of a specific policy by building a set of m_s trajectory trees of finite depth.
m_s is independent of the size of the state space and of the complexity of the transition distribution [Ng & Jordan, 2000].
Computational tractability
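A sketch of this kind of policy evaluation under a deterministic simulative model (all names hypothetical): the same m_s random seeds are reused for every candidate policy, so value estimates are directly comparable and a standard optimizer such as gradient descent can be applied to the policy parameters.

import numpy as np

def estimate_value(policy, simulate_step, initial_states, depth, gamma, seeds):
    """Average discounted return over m_s = len(seeds) trajectories of fixed depth.
    policy(state, rng) -> action; simulate_step(state, action, rng) -> (next_state, reward)
    is the deterministic simulative model driven by externally supplied randomness."""
    total = 0.0
    for seed, state in zip(seeds, initial_states):
        rng = np.random.default_rng(seed)   # reuse the same randomness for every candidate policy
        ret, discount = 0.0, 1.0
        for _ in range(depth):
            action = policy(state, rng)
            state, reward = simulate_step(state, action, rng)
            ret += discount * reward
            discount *= gamma
        total += ret
    return total / len(seeds)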
36
Example: Policy Feature
Maximize collective aerial coverage -> maximize the distance between agents, where the feature uses the location at which the pursuer would land by taking each action from its current position.
Try to visit an unexplored region with a high possibility of detecting the evader, where the feature uses the position reached by the action that maximizes the evader-map value along the frontier.
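A sketch of the coverage feature above; the coordinates and names are illustrative assumptions.

import numpy as np

def coverage_feature(my_next_pos, other_pursuer_positions):
    """Distance to the nearest teammate after taking the action: larger values
    mean the action spreads the team out and increases collective coverage."""
    dists = [np.linalg.norm(np.subtract(my_next_pos, p)) for p in other_pursuer_positions]
    return min(dists) if dists else 0.0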
37
Example: Policy Feature (Continued)
Prioritize actions that are more compatible with the dynamics of the agents
Policy representation
38
Benchmarking Experiments
Performance of two pursuit policies compared in terms of capture time.
Experiment 1: two pursuers against an evader that moves greedily with respect to the pursuers' locations.

Grid size   1-greedy pursuers   Optimized pursuers
10 by 10    (7.3, 4.8)          (5.1, 2.7)
20 by 20    (42.3, 19.2)        (12.3, 4.3)

Experiment 2: when the position of the evader at each step is detected by the sensor network with only 10% accuracy, two optimized pursuers took 24.1 steps, while the one-step greedy pursuers took over 146 steps on average, to capture the evader in a 30 by 30 grid.
39
Incorporating Human Intervention
Given the POMDP formalism, informational inputs affect only the initialization or updating of the belief state; they do not affect the procedure of computing (approximately) optimal actions.
When a part of the system is commanded to take specific actions, it may overrule internally chosen actions and simultaneously communicate its modified status to the rest of the system, which then in turn adapts to coordinate their own actions as well.
A human command in the form of mission objectives can be expressed as a change to the reward function, which causes the system to modify or dynamically replan its actions to achieve it. The importance of a goal is specified by changing the magnitude of its reward.
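A toy sketch of this reward-function view of a human command; it is entirely illustrative and the goal names are hypothetical.

# Reward is a weighted sum over mission goals; a human command changes the weights,
# and the planner simply replans against the modified reward.
goal_rewards = {"capture_evader": 10.0, "explore_region": 2.0, "conserve_fuel": 1.0}

def apply_human_command(goal, importance):
    goal_rewards[goal] = importance    # e.g. the operator raises the priority of exploration

def reward(state_features):
    # state_features: dict mapping each goal to a progress indicator for the current state
    return sum(goal_rewards[g] * state_features[g] for g in goal_rewards)

apply_human_command("explore_region", 8.0)   # dynamic replanning now favors exploration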
40
Coordination under Multiple Sources of Commands
When different humans or layers specify multiple, possibly conflicting goals or actions, how can the system prioritize or resolve them?
Different entities are a priori assigned different degrees of authority
If there are enough resources to resolve an important conflict, we may give operators the option of explicitly coordinating their goals
Surge in coordination demand when the situation deviates from textbook cases: can the overall system adapt in real time?
Intermediate, cooperative modes of interaction (vs. the traditional human interrupt or fully manual form) are desirable
Transparent, event-based display to highlight changes (vs. current data-oriented display)
Anticipatory reasoning (not just information on history) should be supported.
41
Deconfliction between Layers
Each UAV is given a waypoint by the high-level planner
Shortest trajectories to the waypoints may lead to collision
How to dynamically replan the trajectories for the UAVs subject to input saturation and state constraints?
42
(Nonlinear) Model Predictive Control
Find the control input sequence over the prediction horizon that minimizes a finite-horizon cost.
A common choice is a quadratic cost on the tracking error and the control effort, e.g.
J = \sum_{k=t}^{t+N-1} ( ||x(k) - x_ref(k)||_Q^2 + ||u(k)||_R^2 ) + ||x(t+N) - x_ref(t+N)||_P^2.
43
Planning of Feasible Trajectories
State saturation
Collision avoidance
Magnitude of each cost element represents the priority of tasks/functionality, or the authority of layers
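A sketch of such a weighted receding-horizon cost; the weights, the potential shape, and all names are our assumptions (the deck only states that the relative magnitudes encode the priority of tasks or the authority of layers).

import numpy as np

def horizon_cost(states, inputs, reference, other_trajs,
                 w_track=1.0, w_input=0.1, w_collide=50.0, safe_dist=5.0):
    """states: (N, nx) predicted states; inputs: (N, nu); reference: (N, nx).
    other_trajs: list of (N, nx) predicted trajectories of higher-priority vehicles.
    Assumes the first three state components are position coordinates."""
    cost = w_track * np.sum((states - reference) ** 2)   # tracking performance
    cost += w_input * np.sum(inputs ** 2)                # control effort / saturation penalty
    for other in other_trajs:                            # collision-avoidance potential
        dist = np.linalg.norm(states[:, :3] - other[:, :3], axis=1)
        cost += w_collide * np.sum(np.maximum(0.0, safe_dist - dist) ** 2)
    return cost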
44
Hierarchy in Berkeley Platform
[Block diagram, repeated from the earlier 'Hierarchy in Berkeley Platform' slide: the Strategy Planner and Map Builder exchange agent, target, and obstacle positions with the Tactical Planner & Regulation layer over the Communications Network; vehicle-level sensor fusion combines INS, GPS, ultrasonic altimeter, and vision data; the tactical planner, trajectory planner, and regulation loops send control signals to the UAV and UGV dynamics.]
45
Cooperative Path Planning & Control
[Plot: trajectories followed by the 3 UAVs H0, H1, and H2; coordination based on priority.]
Example: three UAVs are given straight-line trajectories that would lead to collision.
Constraints supported: |lin. vel.| < 16.7 ft/s, |angles| < pi/6 rad, |control inputs| < 1.
NMPPC dynamically replans and tracks safe trajectories for H1 and H2 under these input/state constraints.
46
Summary
Decomposition of complex multiagent operation problems requires tighter interaction between subsystems and human intervention
Partial-observation Markov games provide a mathematical representation of a hierarchical multiagent system operating under adversarial and environmental uncertainty
Policy class framework provides a setup for including human experience
Policy search methods and sparse sampling produce computationally tractable algorithms to generate approximate solutions to partially observable Markov games.
Human input can and should be incorporated, either a priori or on the fly, into various factors such as reward functions, feature-vector elements, transition rules, and action priorities
Model predictive (receding horizon) techniques can be used for dynamic replanning to deconflict/coordinate between vehicles, layers or subtasks
47
Unifying Trajectory Generation and Tracking Control
Nonlinear Model Predictive Planning & Control (NMPPC) combines trajectory planning and control into a single problem, using ideas from
– potential-field based navigation (real-time path planning)
– nonlinear model predictive control (optimal control of nonlinear multi-input, multi-output systems with input/state constraints)
We incorporate tracking performance, a potential function, and state constraints into the cost function to be minimized, and use gradient descent for on-line optimization.
This removes feasibility issues by accounting for the UAV dynamics during trajectory planning.
Robust to parameter uncertainties.
Optimization can be done in real time.
48
Modeling and Control of UAVs
A single, computationally tractable model cannot capture nonlinear UAV dynamics throughout the large flight envelope.
Real control systems are partially observed (noise, hidden variables).
It is impossible to have data for all parts of the high-dimensional state-space.
-> The model and control algorithm must be robust to unmodeled dynamics and noise, and must handle MIMO nonlinearity.
Observation: Linear analysis and deterministic robust control techniques fail to do so.
49
Modeling RUAV Dynamics
[Block diagram: servo inputs (throttle, longitudinal and lateral flapping, main rotor collective pitch, tail rotor collective pitch) enter an aerodynamic analysis with augmented servodynamics; the resulting body velocities and angular rates pass through a coordinate transformation to give position, spatial velocities, angles, and angular rates, yielding a tractable nonlinear model.]
50
Benchmarking Trajectory
PD controller
Example: a PD controller fails to achieve nose-in-circle type trajectories.
Nonlinear, coupled dynamics are intrinsic to pirouette and nose-in-circle trajectories.
51
Reinforcement Learning Policy Search Control Design
1. Aerodynamic/kinematic analysis generates a model structure to identify.
2. Locally weighted Bayesian regression is used for nonlinear stochastic identification: we obtain the posterior distribution of the parameters and can easily simulate the posterior predictive distribution to check the fit and robustness.
3. A controller class is defined from the identification process and physical insight, and we apply a policy search algorithm.
4. We obtain approximately optimal controller parameters by reinforcement learning, i.e. training using the flight data and the reward function.
5. Considering the controller performance against a confidence interval from the identification process, we measure the safety and robustness of the control system.
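A minimal sketch of steps 3-4 above as a simple random-search policy improvement loop, assuming a simulator or reward evaluator built from the identified model (all names hypothetical):

import numpy as np

def policy_search(evaluate, theta0, n_iters=100, step=0.05, rng=np.random.default_rng(0)):
    """Simple random-search hill climbing over controller parameters theta.
    evaluate(theta) -> expected reward from simulated (or recorded) flights."""
    theta = np.asarray(theta0, dtype=float)
    best = evaluate(theta)
    for _ in range(n_iters):
        candidate = theta + step * rng.standard_normal(theta.shape)  # perturb the controller gains
        score = evaluate(candidate)
        if score > best:                 # keep parameters that improve the reward
            theta, best = candidate, score
    return theta, best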
52
Performance of RL Controller
Manual vs. Autonomous Hover; Ascent & 360° x2 Pirouette
53
Demo of RL controller doing acrobatic maneuvers (Spring 02)
54
Set of Maneuvers
• Any variation of the following maneuvers in the x-y direction
• Any combination of the following maneuvers
[Diagram: maneuvers 1-3, including a pirouette, nose-in during circling, and circling with the heading kept the same.]
55
Video tape of Maneuvers
56
Backup Slides
57
PEGASUS (Ng & Jordan, 2000)
Given a POMDP M,
assuming a deterministic simulator, we can construct an equivalent POMDP M' with deterministic transitions.
For each policy π ∈ Π for M we can construct an equivalent policy π' ∈ Π' for M' such that they have the same value function, i.e. V_M(π) = V_{M'}(π').
It therefore suffices to find a good policy for the transformed POMDP M'.
The value function can be approximated by a deterministic function, and m_s samples are taken and reused to compute the value for each candidate policy. Then we can use standard optimization techniques to search for an approximately optimal policy.
59
Performance Guarantee & Scalability
Theorem
We are guaranteed to obtain a policy whose value is close to the optimal value within the class
Note that
60
Markov Decision Process (MDP)
Framework for sequential decision making in a stationary environment