SC2002 Tutorial
High Performance DoD DSP Applications
Robert Bond
Embedded Digital Systems Group
MIT Lincoln Laboratory
23 August 2003
Outline
• DoD High-Performance DSP Applications
• Middleware (with some streaming constructs)
• Future Directions
• Summary
Embedded Signal Processing Requirements
[Figure: embedded processing requirements (TFLOPS) vs. year, 1990-2010, for airborne radar, shipboard surveillance, UAV, missile seeker, SBR, and small unit operations SIGINT]

REQUIREMENTS INCREASING BY AN ORDER OF MAGNITUDE EVERY 5 YEARS

EMBEDDED PROCESSING REQUIREMENTS WILL REQUIRE TFLOPS IN THE 2005-2010 TIME FRAME
Embedded Processing Spectrum: Applications vs. Digital Technology

[Figure: GOPS/cu. ft. vs. MOPS/Watt for four technology classes — Programmable Processors, Reconfigurable Computing with FPGAs, ASIC Standard Cell (conventional packaging), and Full Custom VLSI Multi-Chip Modules — with application points for space, seeker, UAV, airborne, shipboard, and small unit operations SIGINT spanning the programmable, reconfigurable, and custom regimes]
• High performance Programmable Signal Processors (PSPs) address a wide range of applications
• Efficient stream processing is the challenge!
Radar Stream Signal Processing Algorithms

[Figure: SAR and GMTI functional pipelines and data flows. The GMTI pipeline runs Data Input Task → Filtering Task (PC & DF) → Beamformer Task → Detection Task (STAP & CFAR) → Data Output Task, producing target detections sorted by range and Doppler. The SAR pipeline runs Data Input Task → Motion Compensation / Pulse Compression → Polar Resampling → 2D FFT → Data Output Task, producing SAR image(s). In each pipeline a Scheduler Task receives RADAR control, timing, and platform state information, and corner turns connect the processing stages.]
Synthetic Aperture Radar (SAR): ~100-1000 GOPS, ~1-100 s latency
Ground Moving Target Indicator (GMTI): ~100-1000 GOPS, ~0.1-1 s latency
Space-Time Adaptive Processing (STAP): ~1-100 GOPS, ~0.01-0.1 s latency
Missile Seeker Stream Video Processing

[Figure: seeker processing chain at 60 FPS. A 256x256 focal plane array feeds an integrator and A/D converter. Stream video processing comprises adaptive nonuniformity correction, dither processing, background estimation, subpixel processing, multi-frame processing, and CFAR detection. Back-end processing comprises tracking, feature extraction, bulk filter (PH), aimpoint selection (PH), data fusion, target selection (discrimination), and guidance, with target updates from surface-based radar and inputs from the IMU and other sensors.]

Processing Block                         | Opcount (MFLOPS)
NUC                                      | 65 — 2000
Dither Subtraction                       | 15 — 1800
Dempster-Shafer Plausibility Computation | 4 — 100
Total                                    | 310 — 7300+

Latency ~1 ms
Outline
• DoD High-Performance DSP Applications
• Middleware (with some streaming constructs)
  – Some Key Features
• Future Directions
• Summary
Mappable Stream Processing
• Each compute stage can be mapped to different sets of hardware and timed
[Figure: a three-stage pipeline — Filter (XOUT = FIR(XIN)) → Beamform (XOUT = w*XIN) → Detect (XOUT = |XIN| > c) — decomposable into Tasks (computations) and Conduits (communications), mappable to different sets of hardware, with measurable resource usage for each mapping]
Parallel Vector Library Layered Architecture
[Figure: PVL layered architecture. An Application is expressed as Input → Stream Processing → Output, with streams F(s) flowing through Tasks and Conduits. The Parallel Vector Library layer exposes Map, Grid, and Distribution objects through the library "API", built on math primitives and communication primitives (the library "ISA"). The Hardware layer spans workstations, Intel clusters, embedded multi-computers, PowerPC clusters, and embedded boards.]
Parallel Vector Library Components
(Stream Processing & Control Mapping)

Class                | Description                                                                               | Parallelism
Map                  | Specifies how Tasks, Streams, and functions are distributed over processors               | Data, Task & Pipeline
Grid                 | Organizes processors into a 2D layout (virtual machine model); hierarchical               |
Task                 | Supports algorithm task decomposition (the nodes in a signal flow diagram); hierarchical  | Task & Pipeline
Conduit              | Supports stream movement between tasks (interconnection in stream flow graph)             | Task & Pipeline
Stream Vector/Matrix | Streams organized as matrices or vectors that span multiple processors                    | Data
Function             | Performs signal/image processing functions on matrices/vectors (e.g. FFT, FIR, QR)        | Data & Task
Some Key Features
• The application stream flow is defined by tasks interconnected by conduits.
  – Defines an application macro-architecture; presents opportunities for optimization at construction time
  – Tasks can contain other tasks and conduits; hierarchy helps to handle complex applications
  – Conduits can reorganize data and handle buffering: corner turns, multi-rates, different granularities
• Streams flow through conduits and are transformed in tasks by functions.
  – Streams are resizable, with fundamental epochs defined by vector or matrix dimensions. They can be produced and consumed piecemeal.
  – Function implementations specialize based on stream arguments (and maps).
• All objects are mappable and hence implicitly parallel.
  – Functionality is map invariant
  – Scalable and portable (virtual machines)
• Re-mapping gives support for fault recovery and for automated performance tuning.
Outline
• DoD High-Performance DSP Applications
• Middleware (stream language prototyping)
• Future Directions
• Summary
Future Directions
• Standardization of parallel signal/image processing middleware:
– HPEC-Software Initiative (VSIPL++)
• Generative programming techniques
  – Efficient implementation of multi-expression program blocks
  – Breaking the "abstraction barrier" without using streams
• Extensions of middleware to heterogeneous platforms– FPGAs, DSPs, Microprocessors
• Run-time optimal mapping– Managing latency, throughput, power,…
Generative Programming Approach: Expression Templates + Comma Operator
Problem: "Abstraction Boundary"
• Related expressions: Vector/Matrix expressions sharing common variables
    OUT = IN > THRESH
    THRESH = 0.9 * THRESH + 0.1 * IN
• Efficient implementation requires that data used in more than one expression remain in register/cache
• How to do this and maintain a high-level API?
  – A hand-coded "for" loop works, but is too low level
  – Expression templates (PETE) can build a compilable expression for a single statement, but cannot span multiple statements

Solution: PETE combined with the Comma Operator
• Approach: use the comma operator to build an expression list
• The PETE expression list destructor triggers evaluation when the semicolon is reached:
    out = in > thresh,
    thresh = 0.9 * thresh + 0.1 * in;
Optimal Mapping of Stream Algorithms
[Figure: stream flow graph — Input → Low Pass Filter (XIN with weights W1 → FIR1, W2 → FIR2 → XOUT) → Beamform (XIN with weights W3 → mult → XOUT) → Matched Filter (XIN with weights W4 → FFT → IFFT → XOUT) — annotated with measured latencies and throughputs for 1, 2, 3, and 4 PE mappings at 33 frames/sec (1.6 MHz BW) and 66 frames/sec (3.2 MHz BW)]

Example - Find: Min(latency | #CPU), Max(throughput | #CPU)
Example - Use: Genetic Algorithm, Decision Directed Learning
Predicted and Achieved Latency and Throughput
• Graph Solution selects correct optimal mapping
• Good agreement between predicted and achieved latencies and throughputs
[Figure: predicted vs. achieved latency (seconds) and throughput (frames/sec) as a function of #CPU (4-8), for a large (48x128K) and a small (48x4K) problem size, annotated with the selected task-to-CPU mappings (1-1-1-1 through 2-2-2-2)]
Summary
• High-performance stream processing for DoD applications requires 100's of GOPS in stringent form factors.
– Programmable solutions are parallel processors– Computational efficiency is paramount
• Middleware approaches emerging today have attributes of stream languages except:
  – Lack rigorous formal definitions (and are incomplete)
  – Do not have good compiler support (efficiency issues)
  – BUT: they provide high productivity compared to traditional approaches AND better lifecycle support
• MIT Lincoln Laboratory is in the business of evaluating new technologies for DoD signal and image processing applications
– Middleware, languages,…– FPGAs, VLSI, DSP, PCA, …
PETE Review
Step 1: Operators form the expression

    A = B + C*D

builds the compile-time type

    BinaryNode<OpAdd, float*, BinaryNode<OpMul, float*, float*>>

The user specifies what to store at the leaves.

Step 2: The '=' operator evaluates the expression

    it = begin();
    for (int i = 0; i < size(); i++) {
        *it = forEach(expr, DereferenceLeaf(), OpCombine());
        forEach(expr, IncrementLeaf(), NullCombine());
        it++;
    }

which, after template specialization and inlining at compile time, translates to

    *it = *bIt + *cIt * *dIt;
    bIt++; cIt++; dIt++;

PETE ForEach: recursive descent traversal of the expression
- The user defines the action performed at the leaves (DereferenceLeaf, IncrementLeaf)
- The user defines the action performed at internal nodes (OpAdd with OpCombine)
PETE and the Comma Operator
Step 1: Operators form the expression list

    out = in > thresh,
    thresh = 0.9*thresh + 0.1*in;

The comma orders evaluation from left to right.

Step 2: The expression list destructor triggers expression evaluation

    if (isBaseNode) {
        for (int i = 0; i < size; i++) {
            forEach(expr, DereferenceLeaf(), OpCombine());
            forEach(expr, IncrementLeaf(), NullCombine());
        }
    }

which translates to

    *out = *in1 > *thresh1;
    *thresh2 = 0.9 * *thresh3 + 0.1 * *in2;
    out++; in1++; thresh1++; thresh2++; thresh3++; in2++;

PETE ForEach changes:
- '=' operator behavior with OpCombine: assign from RHS to LHS
- ',' operator behavior: first evaluate the left expression, then evaluate the right expression
Scalable Approach
Single Processor Mapping / Multi Processor Mapping: A = B + C

    #include <Vector.h>
    #include <AddPvl.h>

    void addVectors(Map& aMap, Map& bMap, Map& cMap) {
        Vector< Complex<Float> > a('a', aMap, LENGTH);
        Vector< Complex<Float> > b('b', bMap, LENGTH);
        Vector< Complex<Float> > c('c', cMap, LENGTH);

        b = 1;
        c = 2;
        a = b + c;
    }

• Single processor and multi-processor code are the same
• Maps can be changed without changing software
• High level code is compact
Corner Turn Operation
All-to-all communication

[Figure: the original data matrix (channels x time samples, distributed across processors by channel after the Filter stage) is corner-turned into a matrix distributed by time sample for the Beamform stage; half the data moves across the bisection of the machine]
Anatomy of a PVL Map
[Figure: Vector/Matrix Stream, Function, Conduit, and Task objects each contain a Map; the Map contains a Grid, a list of nodes (e.g. {0,2,4,6,8,10}), a Distribution, and an Overlap]

• All PVL objects contain maps
• PVL Maps contain:
  – Grid
  – List of nodes
  – Distribution
  – Overlap