Design Lessons from an Unattended Ground Sensor System*
Lewis Girod†
CS 294-1
23 Sept 2003
†Center for Embedded Networked Sensing, [email protected]
*Work done at Sensoria Corp, supported by DARPA/ATO contract DAAE30-00-C-1055. Sensoria team: R. Costanza, J. Elson, L. Girod, W. Kaiser, D. McIntire, W. Merrill, F. Newberg, G. Rava, B. Schiffer, K. Sohrabi
Introduction
Networked embedded systems come in a variety of sizes
RAM is a primary tradeoff
– Costs power, and costs to shut down
– Enables greater complexity
– Enables development while postponing optimization
SPEC (J. Hill): 4 MHz/8-bit, 3K RAM / 0K flash
Mica2 (Berkeley/Xbow): 8 MHz/8-bit, 4K RAM / 128K flash
MK2 (UCLA/NESL): 40 MHz/16-bit, 136K RAM / 1M flash
SHM (Sensoria): 300 MIPS+FP/32-bit, 64M RAM / 32M flash
Motivation for EmStar
EmStar is a run-time environment designed for Linux-based distributed embedded systems
– Useful facilities (process health/respawn, logging, emulation; see the respawn sketch below)
– Common APIs (neighbor discovery, link interface, etc.)
– Designed for a larger memory footprint (avoids hard limits)
Many of the ideas and motivations for EmStar derived from our experience with SHM
– Modularity, robustness to module failure
– System transparency at low cost to developers
Some parts of EmStar are used in SHM & elsewhere
– Time Synch service
– Audio Server
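Below is a minimal sketch of the kind of process health/respawn facility listed above, assuming a plain fork/exec supervisor loop; it is an illustration, not EmStar's actual API.

```c
/* Minimal sketch of a process-respawn supervisor, in the spirit of the
 * "process health/respawn" facility described above. Illustration only,
 * not EmStar's implementation. */
#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

static pid_t spawn(char *const argv[])
{
    pid_t pid = fork();
    if (pid == 0) {              /* child: become the supervised process */
        execvp(argv[0], argv);
        _exit(127);              /* exec failed */
    }
    return pid;
}

int main(int argc, char *argv[])
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s cmd [args...]\n", argv[0]);
        return 1;
    }
    for (;;) {
        pid_t pid = spawn(&argv[1]);
        int status;
        waitpid(pid, &status, 0);        /* block until the child dies */
        fprintf(stderr, "child %d exited (status %d); respawning\n",
                (int)pid, status);
        sleep(1);                        /* damp crash loops */
    }
}
```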
System Objectives and Design
Objectives
Unattended Ground Sensor (UGS) System
Fully autonomous operation
Ad-hoc deployment
Scaling unit: 50 nodes
– All operations and protocols local to a 50-node region
– No global operations or context required
Adaptive and Self-Configuring
Self-localizing without GPS*: acoustic ranging
– Build a map of relative locations
– Adaptive/resilient to environmental conditions, e.g. wind, sunny days, background noise
Self-assembled data network
– TDMA MAC layer
– Typically 10-hop diameter with 100 nodes
– Adaptive/resilient to the RF environment
*In this application, GPS is avoided for security reasons. In other applications, obstructions and foliage can be an issue
Maintain coverage via Actuation
Vigilant units detect failed unit(s)
Remaining units autonomously move in to maintain coverage
Demonstration Requirements
200x50m outdoor field
100 nodes, 10m spacing
Sunny afternoon
– 85°F, 20 MPH wind
No preconfigured state
GPS-free relative geolocation to 0.25m
Detect downed nodes, move to maintain coverage within 1 min
SHM Project Design Choices: Optimize for rapid development
Concurrent HW/SW development
Compressed schedule
– Aggressive scaling milestones
– Logistical problems with debugging a system of 100 nodes in a 200x50m field
Complex software required
– 150K lines of C code
– 30 processes
– 100 IPC channels
Power is not the driving constraint
– Continuous vigilance and rapid response are project requirements
– System lifetime target: < 1 day
[Chart: node count vs. project milestones, Nov-01 through Jun-03, scaling from 0 toward 100+ nodes (y-axis 0–120).]
System Configuration
300 MIPS RISC processor with FPU
64M RAM / 32M Flash
2 50kbps 2.4GHz data radios: TDMA, frequency-hopping, star-topology MAC, 63 hopping patterns
4 channels full-duplex audio
3-axis magnetometer / accelerometer
2 mobility units, with integrated thrusters
Linux 2.4 kernel
Optional wired Ethernet (for devel/debug only)
Results: Acoustic Ranging
Ground truth was hand-surveyed, +/- 0.5m
Ranges not temperature compensated in demo
Ranges with angles are more accurate
– Angle from TDOA of two or more ranges, must be consistent
– A bug, discovered after the fact, caused large errors
Results: Radio Utilization
Graph shows traffic at three bases over a complete run
Initial spikes:
– Tree formation
– Lots of ranging
Quiescent rates:
– Heartbeats to detect down nodes
– Maintenance of trees and location
– Reaction to dynamics
Challenges in Implementation
Dealing with a dynamic environment
– Adapt to wind, weather, RF connectivity
Dealing with noise
– Rejecting outliers from timesync, ranges, angles
– Filtering neighbor connectivity, insignificant changes to range/angle
Dealing with failure
– Node failure
– Infrequent crashes (e.g. FP exceptions from transient bad data)
– Fault tolerance at process boundaries, avoiding a ripple effect
Dealing with complexity
– Cross-layer integration vs. modularity… or both?
– What is the right set of primitives for coordination?
Case Study: Acoustic Ranging
Basic TOF Ranging
Basic idea:
– Sender emits a characteristic acoustic signal
– Receiver correlates the received time series with time-offsets of a reference signal to find the “peak” offset
[Figure: degree of correlation as a function of time offset; the correlation peak at offset T marks the detection.]
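The correlation step can be sketched as a brute-force sliding dot product over the received samples; the function names, lack of normalization, and sample-rate handling below are illustrative assumptions, not the SHM detector.

```c
/* Sketch of the correlation step described above: slide a known reference
 * chirp over the received samples and return the offset with the highest
 * correlation. Illustrative only. */
#include <stddef.h>

/* Returns the sample offset of the correlation peak. */
size_t correlate_peak(const float *rx, size_t rx_len,
                      const float *ref, size_t ref_len)
{
    size_t best_off = 0;
    float best_corr = 0.0f;
    for (size_t off = 0; off + ref_len <= rx_len; off++) {
        float corr = 0.0f;
        for (size_t i = 0; i < ref_len; i++)
            corr += rx[off + i] * ref[i];
        if (corr > best_corr) {
            best_corr = corr;
            best_off = off;
        }
    }
    return best_off;
}

/* Range follows from time of flight: offset / sample rate gives seconds;
 * multiply by the speed of sound (~343 m/s at 20 °C, which is why the
 * uncompensated ranges in the demo carried a temperature bias). */
float offset_to_range_m(size_t offset, float sample_rate_hz)
{
    return ((float)offset / sample_rate_hz) * 343.0f;
}
```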
Basic AOA Estimation
• 16 possible paths
• First pick the best speaker
• Then estimate the angle from TDOA of one or more consistent ranges
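A hedged, far-field sketch of the TDOA-to-angle relation implied above, for two receivers a known baseline apart; the deck does not give the actual formula, so the names and geometry here are assumptions.

```c
/* Far-field TDOA sketch: for two receivers a baseline d apart, a plane
 * wave arriving at angle theta (from broadside) satisfies
 * sin(theta) = c * dt / d. Illustrative assumption, not the SHM code. */
#include <math.h>

#define SPEED_OF_SOUND_MPS 343.0

/* Returns angle of arrival in radians, or NAN if |c*dt| > d
 * (i.e. the measured TDOA is inconsistent with the geometry). */
double aoa_from_tdoa(double dt_s, double baseline_m)
{
    double s = SPEED_OF_SOUND_MPS * dt_s / baseline_m;
    if (fabs(s) > 1.0)
        return NAN;   /* inconsistent range pair: reject as an outlier */
    return asin(s);
}
```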
Acoustic Ranging, Version 1
First cut implemented an explicit cluster coordination protocol
– Lots of error cases to handle; hard to handle all efficiently
– Very timing sensitive (sync)
Did not scale past 20 nodes
– Can’t range across clusters
– Best acoustic neighbors may be in other clusters
– MLat merging algorithm is error prone
Overuse of flooding
– Soft-state reflood of cluster MLats and orientation data
[Block diagram: AR V1 modules — AR (acoustic signal), Cluster Mlat (CH only), Merging, Orientation, and Field Table, with mlat and orientation data reflooded via a Flooding module over Radio0 and Radio1, each radio with its own Neighbor Discovery.]
V2: Decomposing AR
[Block diagram: AR sits on top of four services — Reliable State Sync, the Audio Server, the Timesync Service, and Flooding with hop-by-hop time conversion.]
Audio Sample Server
Continuously samples audio, integrates with Timesync
– Eliminates the error-prone “synchronized start”
– Enables acquisition of overlapped sample sets
Buffers the past N seconds, exposes a buffered interface
– Data access can be triggered after the fact: relaxes timing constraints on the trigger message
– Can process overlapping chirps by requesting overlapping retrievals, rather than having to pick one and ignore the other
Enables access to the audio device from multiple apps
– Ranging can coexist with the acoustic comm subsystem*
*Acoustic comm was developed as a backup channel to be used in event of RF jamming
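A minimal sketch of the buffered-interface idea above, assuming a ring buffer indexed by absolute sample number; structure and names are illustrative, not the SHM audio server's interface. The 16 s depth matches the buffering figure quoted later in the deck.

```c
/* Toy audio ring buffer: clients retrieve samples "after the fact" by
 * absolute sample index, as long as they are within the last N seconds.
 * Illustration only. */
#define RING_SAMPLES (48000 * 16)   /* e.g. 16 s of 48 kHz audio */

typedef struct {
    short    buf[RING_SAMPLES];
    long long next;                 /* absolute index of next sample written */
} audio_ring;

void ring_push(audio_ring *r, const short *s, int n)
{
    for (int i = 0; i < n; i++)
        r->buf[(r->next++) % RING_SAMPLES] = s[i];
}

/* Copy n samples starting at absolute sample 'start'.
 * Returns 0 on success, -1 if the request is older than the buffer
 * (the "trigger delayed beyond audio buffering" failure mode). */
int ring_get(const audio_ring *r, long long start, short *out, int n)
{
    if (start < r->next - RING_SAMPLES || start + n > r->next)
        return -1;
    for (int i = 0; i < n; i++)
        out[i] = r->buf[(start + i) % RING_SAMPLES];
    return 0;
}
```

Because retrieval is by index rather than by "record now", two overlapping chirps simply become two overlapping ring_get() calls on the same stored data.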
Inter-node Timesync: RBS
Key idea:
– Receiver latency is more deterministic than sender latency
– Thus, common receivers of a sender can be synched by correlating the reception times of the sender’s broadcasts
– It’s your only option if you don’t control the MAC
[Figure: several senders* broadcast; receivers sync to the green nodes, to the purple nodes, or to both P&G nodes, according to which broadcasts they commonly receive.]
* For sender sync, senders must be in some other sender’s broadcast domain
TimeSync Service
Inter-node sync: implementation of “RBS”
– Computes conversion parameters among all nodes in each cluster
– CH does the computation, reports parameters to CMs
Intra-node sync: codec sample clocks
– Clock pairs reported by the audio server map the time of the DMA interrupt to a sample number
– Outlier rejection and a linear fit find the offset and skew estimate (see the sketch below)
– Yields a more consistent result than “synch start”
Multihop time conversion
– Graph of sync relations through the system
– Conversion from one element to another requires a path through the graph; Gaussian error at each step accumulates as ~sqrt(hops)
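The offset-and-skew step can be sketched as an ordinary least-squares line over (clock A, clock B) observation pairs; the outlier rejection mentioned above is omitted for brevity, and the names are illustrative.

```c
/* Least-squares fit t_b = skew * t_a + offset over n observation pairs
 * of the same events on two clocks (n >= 2). Sketch only. */
typedef struct { double skew, offset; } clock_fit;

clock_fit fit_clocks(const double *ta, const double *tb, int n)
{
    double sa = 0, sb = 0, saa = 0, sab = 0;
    for (int i = 0; i < n; i++) {
        sa  += ta[i];
        sb  += tb[i];
        saa += ta[i] * ta[i];
        sab += ta[i] * tb[i];
    }
    clock_fit f;
    f.skew   = (n * sab - sa * sb) / (n * saa - sa * sa);
    f.offset = (sb - f.skew * sa) / n;
    return f;
}

/* Convert a timestamp from clock A's timebase to clock B's. */
double convert(clock_fit f, double t_a)
{
    return f.skew * t_a + f.offset;
}
```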
Hop-by-hop Time Conversion
Problem:
– Nodes have the ability to convert within their cluster, but not outside it
– Could continually broadcast conversion parameters… BUT
They are continuously varying
Large amount of data to transmit across the network
Solution: integrate time conversion with routing
– Routing layer knows about packets that contain timestamps*
– Convert timestamps en route
At cluster boundaries
At the destination node
– Integrated with flooding
– Can fail if the sync graph does not cover the route (see the sketch below)
[Figure: a packet floods along nodes A, B, C, D, E; its timestamp is converted hop by hop, e.g. (Timebase: A, Time: 144) → (Timebase: C, Time: 732) → (Timebase: E, Time: 234).]
*Unclear what the right API is here: we simply added code to flooding.
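A sketch of the en-route conversion step, reusing the clock_fit/convert idea from the timesync sketch above; the packet layout and parameter lookup are my assumptions, not the flooding layer's actual code.

```c
/* Hop-by-hop timestamp conversion sketch: a forwarding node that has a
 * sync relation with the packet's current timebase rewrites the
 * timestamp into its own timebase before re-flooding. Illustrative. */
#include <stddef.h>

typedef struct { double skew, offset; } clock_fit;   /* as in the fit sketch */

static double convert(clock_fit f, double t) { return f.skew * t + f.offset; }

typedef struct {
    int    timebase;   /* id of the node whose clock the timestamp is in */
    double time;       /* timestamp value in that timebase */
} stamped;

/* fit_to_me: RBS-derived parameters converting from the packet's current
 * timebase into this node's clock, or NULL if no sync relation exists.
 * Returns 0 if converted, -1 if the packet must be dropped
 * ("for lack of sync relations along route"). */
int convert_en_route(stamped *s, int my_id, const clock_fit *fit_to_me)
{
    if (s->timebase == my_id)
        return 0;                  /* already in our timebase */
    if (fit_to_me == NULL)
        return -1;                 /* no conversion available: drop */
    s->time     = convert(*fit_to_me, s->time);
    s->timebase = my_id;
    return 0;
}
```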
Reliable State Synchronization
Problem:
– Need to reliably broadcast the latest range data to nodes N hops away, so they can build a consistent coordinate system
– Should have reasonable latency and low overhead
V1 addressed this problem with periodic refresh
– Cluster heads retransmit Mlat tables every 15 seconds
– Problems: traffic load from redundant sends, and latency on message loss
– The traffic load forced a new protocol:
Send a hash when there has been no change since the last refresh
If the hash has not been seen, request the full version
– But this still has 15-second latency on lost data
V2 introduced a “Reliable State Sync” protocol (RSS)
RSS Design
Semantics:
– Reliably converges on the latest published state
– Does not guarantee a client sees every transition
Robust and efficient, structurally similar to SRM/wb:
– Based on reliable transfer of a sequenced log of “diffs”
Pruning of the log is done with awareness of log semantics (replaced or deleted keys are pruned)
– Per-source forwarding trees (MST of the connectivity graph)
– Local repair, up to the complete history, from the upstream neighbor
New or restarted nodes will download all active flows from the upstream neighbor
RSS API
Node X publishes its current state as Key-Value Pairs
Diffs are reliably broadcast N hops away from X
Each node within N hops of X eventually sees the data X published
The API presents each node’s KVPs in its own namespace
Caveat: transmission latency, loss, and the edge of the hopcount can cause transient inconsistencies
[Figure: four nodes (1, 2, 3, 4) attached to a State Sync “bus”. Nodes 1, 2, and 3 each hold the full table {1: A=1; 1: B=2; 2: A=3; 2: C=4}; node 4 holds only {2: A=3; 2: C=4}. Note: the 2-hop publish from node 1 doesn’t reach node 4.]
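A toy sketch of the sequenced diff log with semantic pruning described on the previous slide; every structure and name here is illustrative, not the SHM protocol.

```c
/* Toy RSS-style diff log: state is (key -> value) pairs per source, and
 * updates travel as a sequenced log of diffs. Pruning keeps only the
 * latest diff per key. Illustration only. */
#include <string.h>

#define MAX_DIFFS 64

typedef struct {
    int  seq;         /* position in this source's diff log */
    char key[8];
    int  value;
    int  deleted;     /* tombstone: key was deleted */
} diff;

typedef struct {
    diff log[MAX_DIFFS];
    int  n, next_seq;
} diff_log;

/* Publish a new value for a key: prune superseded diffs for the same
 * key ("with awareness of log semantics"), then append. */
int publish(diff_log *l, const char *key, int value, int deleted)
{
    int w = 0;
    for (int i = 0; i < l->n; i++)
        if (strcmp(l->log[i].key, key) != 0)
            l->log[w++] = l->log[i];    /* keep diffs for other keys */
    l->n = w;
    if (l->n >= MAX_DIFFS)
        return -1;                      /* log full (sketch limitation) */
    diff d = { l->next_seq++, "", value, deleted };
    strncpy(d.key, key, sizeof d.key - 1);
    l->log[l->n++] = d;
    return 0;
}
```

Repair then falls out naturally: a downstream node that has seen diffs up to sequence s asks its upstream neighbor for everything after s, and a restarted node downloading the whole (pruned) log converges directly on the latest state.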
Putting it back together: AR V2
[Block diagram: AR V2 dataflow. Software modules mlatd, orientd, ar_send, ar_recv, audiod, and syncd are wired together through message/device files (ar/request_chirp, ar/mlat, types/ranges, types/orient, orient/status, orient/pub_seq, audio/[02]/sync_bin) and through distributed modules: the Reliable State Sync “bus” and flooding with hop-by-hop time conversion, which carry Chirp Notifications and orient_bcast. The acoustic signal, audio data, and time conversions flow between audiod, syncd, and the AR modules. Legend: Software Module, Output Device, Message/Device File, Distributed Module.]
AR V1 Event Diagram
[Event diagram: lanes for mlatd on the Cluster Head (Coordinator), AR on the CH, AR on the sending Cluster Member, and AR on the receiving Cluster Members.]
– mlatd (CH): if there is not enough data for a cluster mlat, request ranging to specific missing cluster members
– AR (CH): send a Range REQ to the first CM in round-robin order, checking whether it is busy
– AR (sender CM): ACK with a preferred start time, then broadcast Range Start, specifying the code, and emit the acoustic signal
– Nodes timestamp the message arrival; the sender delays before starting to ensure rough sync, and reports the exact time offset from the bcast to the codec start
– AR (receiver CMs): run the correlation, and report the time offset from the bcast to the detection in the data
– AR (CH): wait for stragglers, then report new ranges and notify mlat when the round robin through the CMs completes
But there are many error cases:
• REQ lost?
• ACK lost?
• Bcast lost to some receivers?
• Bcast delayed in queue?
• Bcast lost to sender?
• CMs join two clusters; may be busy ranging in the other cluster
• Inaccurate codec sync start?
• Interference in the acoustic channel?
• Reply from sender lost?
• Reply from receiver(s) lost?
• How long to wait for stragglers?
• CH failure loses all ranges for the cluster
Big complexity increase to…:
• Range across clusters
• Coordinate adjacent clusters
• Do regional mlats
• Average multiple sync bcasts
Reliability challenges:The sender is the linchpin: an error in sender sync affects all ranges to receivers, and replies from receivers can’t be interpreted without the sender reply. If connectivity to sender is bad and the broadcast is lost, all receivers waste CPU on a useless correlation. Implementing reliable reporting is made more complex because retx’d receiver replies must be matched to a past sender.
AR V2 Event Diagram
[Event diagram: lanes for mlatd, ar_send, ar_recv, syncd, and audiod; audiod runs continuous sampling and syncd runs continuous sync maintenance throughout.]
– mlatd (waiting for enough data to compute mlat): if there is not enough data for mlat, request a chirp and wait for a while
– ar_send (waiting for a chirp request): chirp audio (audiod on each remote node records the acoustic signal), and flood the chirp notification message, with hop-by-hop conversion at the flood layer
– ar_recv (waiting for a chirp notification): retrieve the samples from the buffer and correlate in a separate thread, then publish the new range to neighbors N hops away
– mlatd: try mlat again with the new data in a separate thread
What can go wrong here?
• Collisions in the acoustic channel
• Flooded message delayed beyond audio buffering (16 seconds)
• Flooded message dropped for lack of sync relations along the route
• Node restart causes ranges to/from that node to be dropped
Key design points:
• Encapsulate timing-critical parts; no timing constraints on reliability
• If a receiver can’t sync to the sender, it won’t attempt correlation
Key Observations
No coordination required
Simplifying transport abstractions
Continuous operation and service model
Key Observations
No coordination required
– If mlatd doesn’t have enough data, it triggers chirping to start generating more data
Exponential backoff on chirping, with reset when data is lost (see the sketch after this list)
The simplicity of the system lets the designer focus on these details
– ar_send & ar_recv are slaves to request and notify messages
Transparently, ar_recv can receive overlapping triggers and buffer the data for correlation
A priority scheme decides the best order to process queued correlations, based on past success/failure and RF hopcount
Simplifying transport abstractions
Continuous operation and service model
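A sketch of the chirp-scheduling policy named above, exponential backoff with reset; the constants and names are made up for illustration.

```c
/* Exponential backoff on chirp requests, reset when range data is lost
 * (e.g. a node restarts and ranging must resume promptly). Sketch only. */
typedef struct {
    double interval_s;   /* current wait between chirp requests */
} chirp_backoff;

#define CHIRP_MIN_S   5.0
#define CHIRP_MAX_S 300.0

void backoff_init(chirp_backoff *b)  { b->interval_s = CHIRP_MIN_S; }

/* Call after each chirp request that still left mlat short of data;
 * returns how long to wait before the next request. */
double backoff_next(chirp_backoff *b)
{
    double wait = b->interval_s;
    b->interval_s *= 2.0;
    if (b->interval_s > CHIRP_MAX_S)
        b->interval_s = CHIRP_MAX_S;
    return wait;
}

/* The "reset when data is lost" rule. */
void backoff_reset(chirp_backoff *b)  { b->interval_s = CHIRP_MIN_S; }
```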
Key Observations
No coordination required
Simplifying transport abstractions
– Flooding takes care of delivering a local time
– State Sync provides consistency for the data input to mlatd
Efficiently supports a potentially large number of keys (~1000), enabling a full regional mlat at each node (no “merging”; a toy sketch follows this list)
Mlat takes 10-15 min; sync is consistent on that timescale
Failure of one node loses only the range data for that node
Continuous operation and service model
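The deck doesn't show mlatd's solver, which computes relative coordinates from many pairwise ranges; as a toy stand-in, the sketch below trilaterates one 2D position from ranges to three already-localized points by linearizing the circle equations. All of it is illustrative, not mlatd's algorithm.

```c
/* Toy 2D trilateration: solve |p - a_i| = r_i for i = 0..2 by
 * subtracting the i=0 equation from i=1 and i=2, which yields two
 * linear equations 2(a_i - a_0) . p = r_0^2 - r_i^2 + |a_i|^2 - |a_0|^2.
 * Returns 0 on success, -1 if the anchors are collinear. */
typedef struct { double x, y; } pt;

int trilaterate(const pt a[3], const double r[3], pt *p)
{
    double A11 = 2*(a[1].x - a[0].x), A12 = 2*(a[1].y - a[0].y);
    double A21 = 2*(a[2].x - a[0].x), A22 = 2*(a[2].y - a[0].y);
    double b1 = r[0]*r[0] - r[1]*r[1]
              + a[1].x*a[1].x - a[0].x*a[0].x
              + a[1].y*a[1].y - a[0].y*a[0].y;
    double b2 = r[0]*r[0] - r[2]*r[2]
              + a[2].x*a[2].x - a[0].x*a[0].x
              + a[2].y*a[2].y - a[0].y*a[0].y;
    double det = A11*A22 - A12*A21;
    if (det == 0.0)
        return -1;                    /* degenerate geometry */
    p->x = (b1*A22 - b2*A12) / det;
    p->y = (A11*b2 - A21*b1) / det;
    return 0;
}
```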
Key Observations
No coordination required
Simplifying transport abstractions
Continuous operation and service model
– Eliminates many inconsistencies and corner cases
– Reduces the number of states or modes
Simplifies interfaces to services
– Recovery from faults without coordination: just wait for stuff to start working again
– Service model supports multiple apps concurrently
The Catch
Of course, the catch is power consumption
– Continuous operation can be wasteful
– Modularity can be less efficient than cross-layer integration
Interesting questions:
– How much is gained by fine-grained shutdown, plus the added coordination overhead, relative to more coarse-grained shutdown and periods of continuous operation?
For instance, the AR system could shut down after generating an initial map, and only wake up when something moves.
The End!
For more information on EmStar, see http://cvs.cens.ucla.edu/emstar/
Design Evolution
Initial design strategy: shortest path first
– Modular decomposition according to the best guess at the time
– Making a full-blown, generalized service is much more work than a one-off feature, so the tradeoff was considered case by case
– Problem: as more is learned, these tradeoffs fit more poorly
– Unmanageable complexity to address problems
Redesign:
– Factor out common components
– Plan for known scaling problems
– Remaining modules are of manageable complexity, yet usually achieve a more complete and correct implementation
– More sophisticated inter-module dependencies
Radios
Each node has two radios
TDMA, frequency-hopping radios
63 hopping patterns
– Each radio can lock to one pattern
– Patterns are independent “channels”
– Bases on the same pattern tend to be desynchronized
Base/Remote (star) topology
– Base synchronizes the TDMA cycle, remotes join
TDMA Slot Scheme
Each frame contains 1 transmit slot for the base and 1 transmit slot for each remote
– Slot size implies MTU
Frame size is a constant
– Base slot size is fixed: 70-byte MTU
– Number of remotes is inversely proportional to remote MTU
– Practical MTU (40 bytes) gives 8-node clusters (see the sketch below)
[Figure: a 14 ms TDMA frame — base slot (S) followed by transmit slots for Remote 1, Remote 2, and Remote 3.]
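A sketch of the slot arithmetic stated above: the frame payload budget below is a made-up constant, chosen so that 8 remotes get the practical 40-byte MTU; it only illustrates the inverse relation between remote count and remote MTU, not the real radio's framing.

```c
/* With a constant frame carrying one fixed 70-byte base slot plus one
 * slot per remote, remote MTU is inversely proportional to the number
 * of remotes. FRAME_BUDGET_BYTES is an assumed illustrative figure. */
#include <stdio.h>

#define BASE_MTU_BYTES     70
#define FRAME_BUDGET_BYTES (BASE_MTU_BYTES + 8 * 40)  /* assumed budget */

int remote_mtu(int n_remotes)
{
    return (FRAME_BUDGET_BYTES - BASE_MTU_BYTES) / n_remotes;
}

int main(void)
{
    /* Prints: 4 -> 80 bytes, 8 -> 40 bytes, 16 -> 20 bytes. */
    for (int n = 4; n <= 16; n *= 2)
        printf("%2d remotes -> %d-byte remote MTU\n", n, remote_mtu(n));
    return 0;
}
```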
Packet Transfer
Broadcast capability
– Base can use its slot to send a broadcast to all remotes, or a unicast to a single remote
– Remotes can send only unicasts to the base
Link layer retransmission
– MAC implements link-layer ACKs for unicast messages, and configurable retransmission
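A minimal sketch of configurable link-layer retransmission as described above; send_in_slot() and ack_received() are hypothetical stand-ins for the MAC's internals, not a real driver API.

```c
/* Stop-and-wait style retransmission sketch: send a unicast in our
 * slot, check for an ACK, retry up to a configured limit. */
#include <stdbool.h>

extern void send_in_slot(const void *pkt, int len, int dest);  /* stand-in */
extern bool ack_received(int dest);   /* stand-in: ACK seen for dest? */

/* Returns true if the packet was acknowledged within max_retx retries. */
bool unicast_reliable(const void *pkt, int len, int dest, int max_retx)
{
    for (int attempt = 0; attempt <= max_retx; attempt++) {
        send_in_slot(pkt, len, dest);
        if (ack_received(dest))
            return true;             /* delivered */
    }
    return false;                    /* give up after configured retries */
}
```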
Breach Healing
In this application, “healing” is intended only to address breaches created by dying nodes, not preexisting breaches. Other algorithms might also be useful, e.g. density maintenance, but were not implemented here.