SC2002 Tutorial
High Performance DoD DSP Applications
Robert Bond
Embedded Digital Systems Group
MIT Lincoln Laboratory
23 August 2003
Outline
• DoD High-Performance DSP Applications
• Middleware (with some streaming constructs)
• Future Directions
• Summary
Embedded Signal Processing Requirements
[Figure: embedded processing requirements (TFLOPS) vs. year, 1990-2010, for airborne radar, shipboard surveillance, UAV, missile seeker, SBR, and small unit operations SIGINT]

REQUIREMENTS INCREASING BY AN ORDER OF MAGNITUDE EVERY 5 YEARS

EMBEDDED PROCESSING REQUIREMENTS WILL REQUIRE TFLOPS IN THE 2005-2010 TIME FRAME
Embedded Processing Spectrum: Applications vs. Digital Technology

[Figure: GOPS/cu. ft. vs. MOPS/Watt for four technology classes — Programmable Processors, Reconfigurable Computing with FPGAs, ASIC Standard Cell (conventional packaging), and Full Custom VLSI Multi-Chip Modules — with application points for space, seeker, UAV, airborne, shipboard, and small unit operations SIGINT spanning the programmable, reconfigurable, and custom regimes]
• High performance Programmable Signal Processors (PSPs) address a wide range of applications
• Efficient stream processing is the challenge!
Radar Stream Signal Processing Algorithms

[Figure: SAR and GMTI functional pipelines and data flows. The GMTI pipeline runs Data Input Task → Filtering Task (PC & DF) → Beamformer Task → Detection Task (STAP & CFAR) → Data Output Task, producing target detections sorted by range and Doppler. The SAR pipeline runs Data Input Task → Motion Compensation / Pulse Compression → Polar Resampling → 2D FFT → Data Output Task, producing SAR image(s). In each pipeline a Scheduler Task receives RADAR control, timing, and platform state information, and corner turns connect the processing stages.]
Synthetic Aperture Radar (SAR): ~100-1000 GOPS, ~1-100 s latency
Ground Moving Target Indicator (GMTI): ~100-1000 GOPS, ~0.1-1 s latency
Space-Time Adaptive Processing (STAP): ~1-100 GOPS, ~0.01-0.1 s latency
Missile Seeker Stream Video Processing

[Figure: seeker processing chain at 60 FPS. A 256x256 focal plane array feeds an integrator and A/D converter. Stream video processing comprises adaptive nonuniformity correction, dither processing, background estimation, subpixel processing, multi-frame processing, and CFAR detection. Back-end processing comprises tracking, feature extraction, bulk filter (PH), aimpoint selection (PH), data fusion, target selection (discrimination), and guidance, with target updates from surface-based radar and inputs from the IMU and other sensors.]

Processing Block                         | Opcount (MFLOPS)
NUC                                      | 65 — 2000
Dither Subtraction                       | 15 — 1800
Dempster-Shafer Plausibility Computation | 4 — 100
Total                                    | 310 — 7300+

Latency ~1 ms
Outline
• DoD High-Performance DSP Applications
• Middleware (with some streaming constructs)
  – Some Key Features
• Future Directions
• Summary
Mappable Stream Processing
• Each compute stage can be mapped to different sets of hardware and timed
[Figure: a three-stage pipeline — Filter (XOUT = FIR(XIN)) → Beamform (XOUT = w*XIN) → Detect (XOUT = |XIN| > c) — decomposable into Tasks (computations) and Conduits (communications), mappable to different sets of hardware, with measurable resource usage for each mapping]
Parallel Vector Library Layered Architecture
[Figure: PVL layered architecture. An Application is expressed as Input → Stream Processing → Output, with streams F(s) flowing through Tasks and Conduits. The Parallel Vector Library layer exposes Map, Grid, and Distribution objects through the library "API", built on math primitives and communication primitives (the library "ISA"). The Hardware layer spans workstations, Intel clusters, embedded multi-computers, PowerPC clusters, and embedded boards.]
Parallel Vector Library Components
(Stream Processing & Control Mapping)

Class                | Description                                                                               | Parallelism
Map                  | Specifies how Tasks, Streams, and functions are distributed over processors               | Data, Task & Pipeline
Grid                 | Organizes processors into a 2D layout (virtual machine model); hierarchical               |
Task                 | Supports algorithm task decomposition (the nodes in a signal flow diagram); hierarchical  | Task & Pipeline
Conduit              | Supports stream movement between tasks (interconnection in stream flow graph)             | Task & Pipeline
Stream Vector/Matrix | Streams organized as matrices or vectors that span multiple processors                    | Data
Function             | Performs signal/image processing functions on matrices/vectors (e.g. FFT, FIR, QR)        | Data & Task
Some Key Features
• The application stream flow is defined by tasks interconnected by conduits.
  – Defines an application macro-architecture; presents opportunities for optimization at construction time
  – Tasks can contain other tasks and conduits; hierarchy helps to handle complex applications
  – Conduits can reorganize data and handle buffering: corner turns, multi-rates, different granularities
• Streams flow through conduits and are transformed in tasks by functions.
  – Streams are resizable, with fundamental epochs defined by vector or matrix dimensions. They can be produced and consumed piecemeal.
  – Function implementations specialize based on stream arguments (and maps).
• All objects are mappable and hence implicitly parallel.
  – Functionality is map invariant
  – Scalable and portable (virtual machines)
• Re-mapping gives support for fault recovery and for automated performance tuning.
Outline
• DoD High-Performance DSP Applications
• Middleware (stream language prototyping)
• Future Directions
• Summary
Future Directions
• Standardization of parallel signal/image processing middleware:
– HPEC-Software Initiative (VSIPL++)
• Generative programming techniques
  – Efficient implementation of multi-expression program blocks
  – Breaking the "abstraction barrier" without using streams
• Extensions of middleware to heterogeneous platforms– FPGAs, DSPs, Microprocessors
• Run-time optimal mapping– Managing latency, throughput, power,…
Generative Programming Approach: Expression Templates + Comma Operator
Problem: "Abstraction Boundary"
• Related expressions: Vector/Matrix expressions sharing common variables
    OUT = IN > THRESH
    THRESH = 0.9 * THRESH + 0.1 * IN
• Efficient implementation requires that data used in more than one expression remain in register/cache
• How to do this and maintain a high-level API?
  – A hand-coded "for" loop works, but is too low level
  – Expression templates (PETE) can build a compilable expression for a single statement, but cannot span multiple statements

Solution: PETE combined with the Comma Operator
• Approach: use the comma operator to build an expression list
• The PETE expression list destructor triggers evaluation when the semicolon is reached:
    out = in > thresh,
    thresh = 0.9 * thresh + 0.1 * in;
Optimal Mapping of Stream Algorithms
[Figure: stream flow graph — Input → Low Pass Filter (XIN with weights W1 → FIR1, W2 → FIR2 → XOUT) → Beamform (XIN with weights W3 → mult → XOUT) → Matched Filter (XIN with weights W4 → FFT → IFFT → XOUT) — annotated with measured latencies and throughputs for 1, 2, 3, and 4 PE mappings at 33 frames/sec (1.6 MHz BW) and 66 frames/sec (3.2 MHz BW)]

Example - Find: Min(latency | #CPU), Max(throughput | #CPU)
Example - Use: Genetic Algorithm, Decision Directed Learning
Predicted and Achieved Latency and Throughput
• Graph Solution selects correct optimal mapping
• Good agreement between predicted and achieved latencies and throughputs
[Figure: predicted vs. achieved latency (seconds) and throughput (frames/sec) as a function of #CPU (4-8), for a large (48x128K) and a small (48x4K) problem size, annotated with the selected task-to-CPU mappings (1-1-1-1 through 2-2-2-2)]
Summary
• High-performance stream processing for DoD applications requires 100's of GOPS in stringent form factors.
– Programmable solutions are parallel processors– Computational efficiency is paramount
• Middleware approaches emerging today have attributes of stream languages except:
  – Lack rigorous formal definitions (and are incomplete)
  – Do not have good compiler support (efficiency issues)
  – BUT: they provide high productivity compared to traditional approaches AND better lifecycle support
• MIT Lincoln Laboratory is in the business of evaluating new technologies for DoD signal and image processing applications
– Middleware, languages,…– FPGAs, VLSI, DSP, PCA, …
PETE Review
Step 1: Operators form the expression

    A = B + C*D

builds the compile-time type

    BinaryNode<OpAdd, float*, BinaryNode<OpMul, float*, float*>>

The user specifies what to store at the leaves.

Step 2: The '=' operator evaluates the expression

    it = begin();
    for (int i = 0; i < size(); i++) {
        *it = forEach(expr, DereferenceLeaf(), OpCombine());
        forEach(expr, IncrementLeaf(), NullCombine());
        it++;
    }

which, after template specialization and inlining at compile time, translates to

    *it = *bIt + *cIt * *dIt;
    bIt++; cIt++; dIt++;

PETE ForEach: recursive descent traversal of the expression
- The user defines the action performed at the leaves (DereferenceLeaf, IncrementLeaf)
- The user defines the action performed at internal nodes (OpAdd with OpCombine)
PETE and the Comma Operator
Step 1: Operators form the expression list

    out = in > thresh,
    thresh = 0.9*thresh + 0.1*in;

The comma orders evaluation from left to right.

Step 2: The expression list destructor triggers expression evaluation

    if (isBaseNode) {
        for (int i = 0; i < size; i++) {
            forEach(expr, DereferenceLeaf(), OpCombine());
            forEach(expr, IncrementLeaf(), NullCombine());
        }
    }

which translates to

    *out = *in1 > *thresh1;
    *thresh2 = 0.9 * *thresh3 + 0.1 * *in2;
    out++; in1++; thresh1++; thresh2++; thresh3++; in2++;

PETE ForEach changes:
- '=' operator behavior with OpCombine: assign from RHS to LHS
- ',' operator behavior: first evaluate the left expression, then evaluate the right expression
Scalable Approach
Single Processor Mapping / Multi Processor Mapping: A = B + C

    #include <Vector.h>
    #include <AddPvl.h>

    void addVectors(Map& aMap, Map& bMap, Map& cMap) {
        Vector< Complex<Float> > a('a', aMap, LENGTH);
        Vector< Complex<Float> > b('b', bMap, LENGTH);
        Vector< Complex<Float> > c('c', cMap, LENGTH);

        b = 1;
        c = 2;
        a = b + c;
    }

• Single processor and multi-processor code are the same
• Maps can be changed without changing software
• High level code is compact
Corner Turn Operation
All-to-all communication

[Figure: the original data matrix (channels x time samples, distributed across processors by channel after the Filter stage) is corner-turned into a matrix distributed by time sample for the Beamform stage; half the data moves across the bisection of the machine]
Anatomy of a PVL Map
[Figure: Vector/Matrix Stream, Function, Conduit, and Task objects each contain a Map; the Map contains a Grid, a list of nodes (e.g. {0,2,4,6,8,10}), a Distribution, and an Overlap]

• All PVL objects contain maps
• PVL Maps contain:
  – Grid
  – List of nodes
  – Distribution
  – Overlap