Space-Time Adaptive Processing on the Mesh Synchronous Processor

Janice Onanian McMahon

■ This article examines the suitability of the Mesh synchronous processor (MeshSP) architecture for a class of radar signal processing algorithms known as space-time adaptive processing (STAP), which is an important but computationally demanding technique for mitigating clutter and jamming as seen by an airborne radar. The high computational requirements of STAP algorithms, combined with the need for programming flexibility, have motivated Lincoln Laboratory to investigate the application of commercially available massively parallel processors to the STAP problem. These processors must be sufficiently flexible to accommodate different STAP architectures and algorithms, and must be scalable over a wide parameter space to support the requirements of different radar systems.

The MeshSP offers high peak-signal-processing performance in a small form factor, which makes it attractive for airborne radar environments. An algorithm implementation must be reasonably efficient to take advantage of this performance. The implementation efficiency depends on four factors: the computational details of the algorithm, the chosen decomposition of the algorithm into constituent parts capable of parallel execution, the subsequent mapping of these parallel components onto the processor, and the underlying suitability of the architecture of massively parallel processors. This article describes performance models and methods of estimating MeshSP efficiency on representative STAP algorithms in order to assess the potential use of MeshSP in airborne STAP applications.

The performance of airborne surveillance radars can be severely degraded by environmental clutter, especially in overland or coastal regions, and by hostile jamming sources. Faced with a potentially formidable interference environment, radar engineers have developed advanced signal processing techniques to improve radar performance. One such technique, known as space-time adaptive processing, or STAP, is based on the idea of designing a two-dimensional (space and time) filter that maximizes the output signal-to-interference-plus-noise ratio, thereby selectively nulling clutter and jamming returns while at the same time retaining the target signal. The filter is formed by simultaneously combining the signals received on multiple elements of an antenna array and multiple pulse-repetition intervals of a coherent processing interval. The antenna elements provide the filter's spatial dimension and the pulse-repetition intervals provide the temporal dimension. This combination can also be seen as a beamforming operation, in which a beam is formed with chosen spatial and frequency directions matched to the target signal, and possessing nulls in the directions and frequencies of adaptively sensed interferers.


The beamforming weights are computed from radar returns containing information on the spatial and temporal characteristics of the interference.

STAP offers the potential to improve airborne radar performance in several ways. It can increase detection of low-velocity targets by better suppression of mainlobe clutter, improve detection of small cross-section targets that might otherwise be obscured by sidelobe clutter, and improve detection in combined clutter and jamming environments. Moreover, STAP is inherently robust to radar system errors, and provides a means to handle nonstationary interference.

The benefits of STAP, however, come with a high computational cost. A STAP signal processor computes a set of adaptive filter weights by solving in real time a system of linear equations of size NM, where N and M are the number of spatial and temporal degrees of freedom, respectively, in the filter. The solution for each set of weights requires on the order of (NM)^3 operations. For the fully adaptive approach, in which a separate set of weights is applied to all the Nc antenna elements and Nd pulse-repetition intervals per coherent processing interval, values of Nc and Nd in the range of ten to two hundred are typical. As the airborne radar scans a volume of space and searches for targets throughout a range of possible target velocities, coherent processing intervals from fifty milliseconds to a full second in duration are common. Given all these multiplicative factors, a STAP processor must be able to reformulate and solve the adaptive weight computation problem an enormous number of times. The net result of such high demand on the processor can be a requirement for computational throughputs on the order of hundreds of billions of floating-point operations per second, with execution speeds of fractions of a second.
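To put these multiplicative factors in perspective, the short calculation below sizes the fully adaptive weight problem for one illustrative parameter set; the values are assumptions loosely based on the Radar 1 column of Table 6, not figures quoted in the text.

# Rough, illustrative sizing of the fully adaptive weight problem (Python).
# Parameter values are assumptions loosely based on Table 6 (Radar 1).
Nc = 48        # spatial channels
Nd = 128       # pulse-repetition intervals per coherent processing interval
Tcpi = 0.5     # coherent processing interval, seconds

NM = Nc * Nd                 # fully adaptive problem size
ops_per_weight_set = NM**3   # order-of-magnitude operation count per weight solution

print(f"NM = {NM}")
print(f"~{ops_per_weight_set:.1e} operations per weight set")
print(f"~{ops_per_weight_set / Tcpi:.1e} operations/sec for one weight set per CPI")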

Although the fundamentals of STAP were first pioneered by L.E. Brennan and I.S. Reed in the early 1970s [1], STAP has become practical as a real-time technique only with the recent advent of high-performance digital signal processors. Even so, fully adaptive STAP is still beyond the reach of state-of-the-art processor technology. Consequently, much of the current research work on STAP has focused on the development of algorithms that decompose the fully adaptive problem into reduced-dimension adaptive problems capable of being implemented in real time on reasonably sized processors.

A radar processor operating in real time must reliably produce the correct output within a prescribed time limit, or latency. Consequently, sustained performance on key STAP kernel computations and communication patterns is a primary consideration in assessing processor suitability. We must also be able to install and operate a STAP processor on an aircraft, where the availability of space and other resources is limited. Thus processing performance per unit size, power, and weight are important metrics. In addition, algorithm development for STAP systems is an active area of research. To accommodate new algorithms and modes of operation, STAP processors must be programmable. Furthermore, STAP is applicable to a wide range of radars, so scalability of processor performance in terms of both machine size and problem size is an important consideration.

The substantial computational requirements of STAP, coupled with the need for future growth and scalability, have led Lincoln Laboratory to investigate the suitability of massively parallel processors (MPP) for this application. An MPP is a computer system with a large number of processing nodes (hundreds to thousands) interconnected by a dedicated communication network. MPPs are based on high-performance, low-cost commodity microprocessors; large commodity memory chips; and high-bandwidth, low-latency networks. They have supplanted vector supercomputers as the best high-performance machines commercially available.

Lincoln Laboratory has conducted a performance analysis of a representative STAP processing kernel on a commercially available programmable parallel architecture called the Mesh synchronous processor, or MeshSP. The MeshSP provides state-of-the-art performance, compact size, reduced weight, and low power consumption, which makes it a potentially suitable solution to our computational problem. It is capable of 7.7 billion floating-point operations per second (Gflops/sec) of computational throughput on a board 7 × 13 inches in size, while consuming only a hundred watts of power. These excellent attributes have motivated us to look closely at the MeshSP to determine its potential as a real-time STAP processor.


The approach we applied in the performance analysis was to develop a parameterized model for estimating latencies of computation and communication in terms of problem complexity and machine size. If we apply the model to various radar systems and processor configurations, we can determine a configuration that meets real-time requirements, and also analyze the scalability of the configuration for future changes in algorithm development [2].

In this article we review the fundamentals of STAP and describe a representative algorithm, the higher-order post-Doppler (HOPD) STAP algorithm, and its key computational kernels. We follow with an overview of MPPs and a description of the MeshSP architecture. Then we present mappings onto the MeshSP of the HOPD algorithm for two representative radars, and subsequently analyze the performance and scalability of the MeshSP. We discuss the results and show that the MeshSP has strong potential as an airborne STAP processor. We conclude by indicating future directions for STAP research.

Space-Time Adaptive Processing

Figure 1 illustrates the elements of the airborne early-warning (AEW) radar environment. The AEW surveillance radar must detect increasingly smaller targets in a background of heavy interference. The two main sources of interference are environmental clutter and hostile jamming. The motion of the airborne radar platform spreads the clutter in Doppler frequency; clutter from a specific point on the ground has a Doppler frequency that depends on the angle of the clutter position relative to the heading of the platform. Noise-like jamming (or barrage noise jamming) from discrete sources (both in the air and on the ground) is localized in angle but spread over all Doppler frequencies. Detecting a target by enhancing radar performance in this environment of interference is the ultimate problem we must solve.

FIGURE 1. The airborne early-warning (AEW) radar environment. The airborne radar must be able to provide long-range target detection in the presence of environmental clutter and hostile jamming.

The goal of STAP is straightforward: suppress the interference and detect the target [3]. Figure 2 illustrates an example of the signal-to-noise ratio (SNR) that results from clutter and a single jamming signal, as a function of angle and Doppler frequency. Clutter from all angles lies on the clutter ridge shown in the figure, whereas the jamming signal from one angle appears in all Doppler frequencies. To suppress the interference and detect the target, the AEW surveillance radar must have high gain at the target angle and Doppler frequency, as well as deep nulls along the clutter and jamming lines.

A space-time adaptive processor combines receive beamforming in the spatial dimension and Doppler filtering in the temporal dimension to achieve the specific filter response shown in Figure 3 for a particular target angle and Doppler frequency. The STAP processor applies many of these filters, each covering a different target angle and velocity, to detect targets within the range of interest.

FIGURE 2. The AEW interference scenario. The problem is to detect the target by enhancing radar performance in this environment of interference. This plot shows the signal-to-noise ratio (SNR) resulting from clutter and a single jamming signal, as a function of angle and Doppler frequency. The plot also shows the view of the clutter characteristic from the perspective of azimuth for a given Doppler frequency, and the view of the clutter from the perspective of Doppler frequency for a given azimuth. These views indicate that the problem is two-dimensional in nature because filtering must be performed in each dimension.

The process of electronically steering the radar receiver in different directions is called phased-array beamforming. Beamforming algorithms involve the application of weights to samples in a signal processing system. Weight application is computed as a dot product between weight vectors and sample vectors, where the vectors span the radar channels (these channels are either independent antenna receivers or elements within a single large antenna). In a conventional nonadaptive beamforming algorithm the weights are a fixed function of the look direction. In an adaptive beamforming algorithm the weights are computed from the input training data and the beam steering vectors. Figure 4 shows the STAP processing typically performed in one radar coherent processing interval, which consists of L range gates, M pulse-repetition intervals, and N antenna elements.

The optimal adaptive weight vector w for a given steering vector s is related to the interference covariance matrix R through the relationship Rw = s. The covariance matrix R, which is unknown a priori, is defined as R = E{x x^H}, where x is the space-time sample vector.


FIGURE 3. Space-time adaptive filter response. This filter combines beamforming in the spatial dimension and Doppler filtering in the temporal dimension to achieve a response for a particular target angle and Doppler frequency. These data show deep nulls along the clutter ridge and jamming line shown in Figure 2, with a high response at the target location.

FIGURE 4. The space-time adaptive processing (STAP) typically performed in one radar coherent processing interval, which consists of L range gates, M pulse-repetition intervals (PRI), and N antenna elements. A subset of samples is chosen from the input data cube according to a training strategy. The resulting training set is used to compute the weights that are applied to the entire data cube. The final detection phase identifies the location of the targets in the data cube.


We therefore perform the adaptation by forming an estimate of R from a set of radar data called the training set. Specifically, the covariance matrix R is estimated as R = X^H X, where the training-set matrix X is a subset of the input data. The process of computing the adaptive weight vector w from the estimated covariance matrix R is called sample matrix inversion. These methods include direct matrix inversion after explicitly forming R, as well as factorization approaches that compute the Cholesky decomposition of R via QR decomposition of X [4]. For reasons of numerical stability as well as computational complexity, the factorization approach is the more viable.

Specifically, we compute X = QA, where A is upper triangular. Since Q is an orthogonal, unitary matrix, Q^H Q = I, where I is the identity matrix. Therefore, R = X^H X is equivalent to R = A^H Q^H Q A = A^H A. Since A is triangular, w is easily computed from the two backsolve operations, A^H u = s and Aw = u, where s is the target response corresponding to a particular hypothesized angle and Doppler frequency, and u is an intermediate computation vector. A different s is used for each member of the STAP filter bank. The beamforming operation is a matrix-vector multiplication, z = w^H y, where y is the input data for a specific range gate and z is the output data in that beam.
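As a concrete, single-processor illustration of this factorization approach, the NumPy sketch below forms a synthetic training matrix, factors it, and solves the two backsolves for one steering vector. The matrix sizes and random data are assumptions, and numpy.linalg.qr stands in for the distributed Householder routine developed later in the article.

import numpy as np

rng = np.random.default_rng(0)
M, N = 288, 144    # training samples (rows) and degrees of freedom (columns); assumed sizes

# Synthetic complex training-set matrix X and steering vector s (illustrative data only).
X = rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N))
s = rng.standard_normal(N) + 1j * rng.standard_normal(N)

# Factor X = QA with A upper triangular, so that R = X^H X = A^H A.
Q, A = np.linalg.qr(X)

# Two backsolves: A^H u = s, then A w = u.  (np.linalg.solve does not exploit the
# triangular structure; a dedicated triangular solver would be used in practice.)
u = np.linalg.solve(A.conj().T, s)
w = np.linalg.solve(A, u)

# Sanity check against the sample-matrix-inversion relation R w = s.
R = X.conj().T @ X
print(np.allclose(R @ w, s))

# Beamforming: apply the weights to the data y of one range gate, z = w^H y.
y = rng.standard_normal(N) + 1j * rng.standard_normal(N)
z = np.vdot(w, y)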

In fully adaptive STAP algorithms, in which a separate adaptive weight is applied to all pulse-repetition intervals as well as all channels, the covariance matrix R has dimension NcNd × NcNd, where Nc is the number of array elements and Nd is the number of pulse-repetition intervals per coherent processing interval. For typical radar systems, the product NcNd can vary from several hundreds to tens of thousands. For a variety of reasons, not the least of which is the large amount of computational power required for fully adaptive STAP, partially adaptive STAP is a more attractive candidate for implementation in an actual processing system.

In a partially adaptive STAP algorithm, the prohibitively large problem of a fully adaptive STAP algorithm is broken down into a number of independent, smaller, and more computationally tractable adaptive problems while achieving near-optimum performance. Figure 5 shows the generic partially adaptive STAP architecture.

FIGURE 5. The generic partially adaptive STAP architecture. By transforming the original data cube into subarrays, we break down the prohibitively large problem of a fully adaptive STAP algorithm into a number of smaller and more computationally tractable partially adaptive problems.


The first step in the partially adaptive algorithm is nonadaptive filtering of the input signal data to reduce the dimensionality of the problem. The weight vector is computed by applying sample matrix inversion to the final reduced-dimension data. Once the input data are transformed, and bins and beams (or channels and pulse-repetition intervals) are selected to span the target and interference subspaces, multiple separate adaptive sample-matrix-inversion problems are solved, one for each Doppler frequency bin or pulse-repetition interval, across either antenna elements or beams, depending on the domain of the adaptation. Therefore, there is a natural inherent parallelism in partially adaptive STAP algorithms, which forms the first step in parallelizing the STAP problem.

The initial nonadaptive filtering can be either a transformation into the frequency domain (for example, by performing a fast Fourier transform, or FFT, over pulses in each channel) or a transformation into beam space (for example, by performing nonadaptive beamforming in each pulse). We can perform both spatial and temporal transformations, if desired, or we can eliminate nonadaptive filtering altogether. The nonadaptive filtering determines the domain (frequency or time, element or beam) in which adaptive weight computation occurs, and thereby serves as a convenient means of classifying STAP algorithms, as shown in Figure 6. Each quadrant in this figure shows a box representing the data domain for a single range gate after a different type of nonadaptive transform. For example, the upper right quadrant represents pulse-repetition-interval data that have been transformed into Doppler space. Thus each sample is a radar return for a specific Doppler frequency and receiver element. The lower-left quadrant represents element data that have been transformed into beam space. Thus each sample is a radar return for a specific pulse-repetition interval and look direction. Each quadrant has a variety of particular reduced-dimension algorithms based on filtering and subspace selection.

The STAP kernel used in this analysis is adaptive in the frequency domain, and therefore falls into the element-space post-Doppler quadrant of the taxonomy. A Doppler filter-bank FFT is applied to signals from each element. Low-sidelobe Doppler filtering effectively localizes competing clutter in angle, reducing the extent of clutter that must be adaptively canceled. The adaptivity occurs over all elements and a number of Doppler bins. The number of Doppler bins is a parameter of the element-space post-Doppler algorithm. When the number of Doppler bins is two or greater, the algorithm is in the class of HOPD variants of STAP. Factored post-Doppler algorithms perform spatial adaptation in a single bin and are therefore, in the strict sense, not adaptive in the temporal dimension.
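A minimal sketch of this first, nonadaptive stage is shown below: an FFT over the pulse dimension of a synthetic data cube converts each channel's pulse samples into Doppler bins. The cube dimensions and the Hamming taper are assumptions standing in for whatever low-sidelobe Doppler weighting a real system would use.

import numpy as np

rng = np.random.default_rng(1)
L, M, N = 250, 128, 48    # range gates, PRIs (pulses), channels; assumed sizes

# Synthetic complex data cube ordered as range gate x pulse x channel.
cube = rng.standard_normal((L, M, N)) + 1j * rng.standard_normal((L, M, N))

# Low-sidelobe taper over the pulse dimension (Hamming used here as a stand-in).
taper = np.hamming(M)[None, :, None]

# Doppler filter bank: FFT over the pulse axis for every range gate and channel.
doppler_cube = np.fft.fft(cube * taper, axis=1)   # range gate x Doppler bin x channel
print(doppler_cube.shape)                         # (250, 128, 48)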

Figure 7 illustrates the STAP algorithm functional flow. The input to the kernel is a three-dimensional data cube consisting of sample data (range gates) from some number of radar channels, for some number of Doppler bins. The range gates are divided into range segments. A sample matrix, or training set, is formed for each Doppler bin and range segment by selecting a subset of the range gates from all channels for that Doppler bin and the two adjacent Doppler bins. The two-dimensional data from the three Doppler bins are concatenated to form the two-dimensional matrix used for sample matrix inversion. The algorithm to upper-triangularize the sample matrix is based on Householder transformations [5].

FIGURE 6. The four types of STAP algorithm transformations. Each quadrant represents a different domain in which STAP can occur. Transitions from one domain to another require the specific discrete Fourier transforms (DFT) shown in blue.


Figure 8 illustrates the partitioning of the input data cube and the transformation of the input data into a two-dimensional sample matrix.

Performance of the adaptive beamforming algorithm varies with the size of each training-set sample matrix. Performance is particularly sensitive to the number of adaptive weights (i.e., the number of columns in the sample matrix). This parameter is called the degrees of freedom. Performance is also sensitive to the size of the training set (i.e., the number of samples selected for training). Training-set size is quantified via the sample ratio, which is the ratio of rows to columns in the sample matrix. Typically, the sample ratio must be two or greater to achieve decent algorithm performance, providing adequate target detection and nulling of clutter and jamming. These parameters, as well as radar-specific parameters, are used to define the scalability of the algorithm mapping.
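The sketch below assembles one such training set in NumPy, concatenating the channel data from a Doppler bin and its two neighbors and then checking the resulting sample ratio. The bin index, range segment, and gate-decimation choices are illustrative assumptions rather than the training strategy of any specific radar.

import numpy as np

rng = np.random.default_rng(2)
L, Md, N = 1250, 16, 48     # range gates, Doppler bins, channels; assumed (reduced) sizes
doppler_cube = rng.standard_normal((L, Md, N)) + 1j * rng.standard_normal((L, Md, N))

m = 10                      # Doppler bin under test (assumed)
segment = np.arange(0, 625) # first of two range segments (assumed)
gates = segment[::2]        # training strategy: keep every other gate (assumed)

# Degrees of freedom: all N channels from bins m-1, m, m+1, concatenated side by side.
training = np.concatenate(
    [doppler_cube[gates, (m + d) % Md, :] for d in (-1, 0, 1)], axis=1)

rows, cols = training.shape
print(training.shape, "sample ratio =", round(rows / cols, 2))  # (313, 144), ratio ~ 2.17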

FIGURE 7. Functional flow for the higher-order post-Doppler (HOPD) algorithm. The fast Fourier transform (FFT) is performed over the PRIs in each channel and range gate to produce frequency-domain data for each Doppler bin. The adaptive beamforming in each Doppler bin requires post-FFT data from the two adjacent Doppler bins for all range gates.

FIGURE 8. Transformation of a selected subset of data from the input data cube into a 2D training-set matrix used for sample matrix inversion. The 2D training-set matrix for each Doppler bin and range segment is formed by concatenating all the channels from adjacent Doppler bins and selecting a subset of the range gates.


MeshSP Architecture

The MeshSP is a single-instruction multiple-data (SIMD) architecture originally developed at Lincoln Laboratory and now marketed by Integrated Computing Engines, Inc., of Waltham, Massachusetts. It comprises an array of processors connected via a two-dimensional or three-dimensional nearest-neighbor mesh [6]. Its current realization incorporates a single monolithic processor element, the Analog Devices ADSP-21060 SHARC integrated circuit, which was developed jointly by Analog Devices and Lincoln Laboratory. The 21060 is based on a 21000 core but adds communication and memory to form a monolithic processor element that offers significant advantages in processing density per unit of size, weight, and power, if available data memory is sufficient. Each processor element permits 120 Mflops/sec peak performance and has 512 KB of internal memory (SRAM), six interprocessor communication ports, each capable of 40 MB/sec peak throughput, and two input/output ports, each capable of 5 MB/sec peak throughput. The 21060 does not have a cache; the on-chip internal memory contains the data for critical computations and is managed by the application programmer. This solution provides maximum performance in a deterministic manner. The 21060 is not an intrinsically SIMD processor in the strict sense, since it contains its own instruction-execution engine, and can be utilized in multiple-instruction multiple-data (MIMD) parallel processors as well.

The MeshSP includes an array of processor elements, a master processor, an input/output module, and a host computer. The program is executed by the master processor and the instructions are broadcast to the processor-element array, or slaves, for parallel execution. The hardware can support limited MIMD operation when code segments are loaded into the processor-element on-chip internal memory. Figure 9 illustrates the MeshSP architecture.

FIGURE 9. MeshSP architecture. Most of the processing power is in the array of processor elements, or slaves, which is a two-dimensional mesh of SHARC processors with four-way nearest-neighbor toroidal communication links. Each processor executes the same program, which is downloaded by the master SHARC processor from program memory. The host processor is a personal computer or workstation that controls the MeshSP and provides external communication for operators. Data input/output (I/O) for the array of slaves is achieved by the serial I/O module.


The MeshSP processing array has two-dimensional toroidal connectivity via its interprocessor communication links. Each processor has six input ports and six output ports. In the current configuration, four input ports and four output ports are devoted to four-way nearest-neighbor interconnections. The extra two ports are used for fault tolerance by connecting them to the next nearest neighbor to the right and left. Interprocessor communication in the MeshSP is asynchronous to computation and can occur in parallel as long as no data conflicts occur.

Latency of a data transfer between processors is proportional to the sum of the vertical and horizontal distances between them in processor coordinates. In mapping an algorithm, or equivalently, a dataset, onto the processor array, data locality is of primary importance. Locality in the context of massively parallel processing means that data used in a computation are mapped to the same or physically close processors in order to reduce the cost of interprocessor communication.
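A small helper like the one below expresses that cost model: the hop count between two processors is the sum of the wraparound (toroidal) distances in each dimension, and transfer latency is taken to be proportional to it. The function name and example coordinates are ours.

# Hop count between processors src and dst on a Px-by-Py torus (Python sketch).
# Transfer latency is modeled as proportional to this count.
def torus_hops(src, dst, Px, Py):
    dx = abs(src[0] - dst[0])
    dy = abs(src[1] - dst[1])
    return min(dx, Px - dx) + min(dy, Py - dy)

print(torus_hops((0, 0), (3, 7), Px=8, Py=8))   # 3 horizontal + 1 vertical (wrapped) = 4 hops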

Figure 10 shows a MeshSP processing board, which is 7 × 13 inches in size and dissipates approximately one hundred watts of power. The board contains sixty-four SHARC chips (thirty-two per side) and is capable of 7.7-Gflops/sec peak performance. The MeshSP architecture is scalable and can therefore reside on a multiple number of processing boards. The high processing density per unit of size, power, and weight of the MeshSP architecture is highly desirable in an airborne platform, provided acceptable levels of efficiency can be sustained. This high density is more readily achievable in SIMD processors than in MIMD processors because only one global program memory is needed, from which instructions are broadcast to each processor.

Algorithm Mapping

The first step in algorithm design is to map the problem topologically onto the processor architecture. The mapping problem involves assigning data and computations in the algorithm to one or more specific processors in the processor array. Computations assigned to multiple nodes in a parallel processor benefit from increased computational bandwidth. However, a price is often paid in the lower efficiencies that result from necessary communication between nodes.

FIGURE 10. MeshSP processing board. Each processor element is an Analog Devices SHARC integrated circuit developed jointly by Analog Devices and Lincoln Laboratory. Each processing board is two sided, and contains thirty-two SHARC processor elements per side; this board is capable of 7.7-Gflops/sec peak performance while dissipating only one hundred watts of power.

The exact trade-off between computational power and parallel efficiency depends on the balance of computation and communication that results from a particular mapping on a particular architecture. The end-to-end latency, or execution time, of the algorithm must meet real-time requirements of the radar; this fact establishes a minimum number of nodes on which the algorithm must be mapped. To maximize efficiency, the assignment of data to processors must take into account the advantages of data locality so that overall communication is minimized. In addition, assignment of computations to processors must ensure that there is adequate load balancing; i.e., the computational load is evenly distributed between individual processors so that overall processor utilization is maximized.

In addition to algorithm execution time, three metrics are used in benchmarking parallel-processor performance in this study. These metrics (execution speedup, parallel efficiency, and the ratio of communication latency to computation latency) each quantify the performance of a parallel algorithm. The metrics are used to evaluate the suitability of the MeshSP architecture for the HOPD algorithm, and are described in the section entitled "Analytical Results."

Because each Doppler bin and range segment has its own sample-matrix-inversion subproblem, the global mapping approach for the HOPD algorithm is to map each sample-matrix-inversion subproblem to a subarray of the full MeshSP processor-element array, where the size of the subarray is Px by Py processors. Figure 11 illustrates the global mapping approach. Because each subproblem after the initial data sharing is independent, scalability and performance of the HOPD algorithm as a whole are equivalent to those of an individual subproblem. For the remainder of this article we examine only subproblem performance; the number of subproblems affects only the total number of processors required.

Two mappings of training-set matrices to subarrays of the full MeshSP processor-element array were simulated: block and cyclic [7]. Each training set is an M × N matrix; therefore, each processor has R = M/Py rows and C = N/Px columns. In the block mapping, adjacent processors in the two-dimensional mesh contain adjacent blocks of multiple data elements in the two-dimensional matrix. In the cyclic mapping, each processor is assigned every Px-th element in the x dimension and every Py-th element in the y dimension. Adjacent processors in the two-dimensional mesh contain adjacent single data elements in the two-dimensional matrix, with wraparound at the processor boundaries. Figure 12 illustrates the differences between block mapping and cyclic mapping. In general, the cyclic mapping yields better load balancing between processors because the data are spread over more processors. For the same reason, the block mapping generally yields better (i.e., lower) communication-to-computation ratios.
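The two assignments can be written down directly. The sketch below maps a global matrix index (i, j) to processor coordinates under each scheme; the formulas follow the description above, and the example sizes are assumptions.

# Map element (i, j) of an M x N matrix onto a Px x Py processor grid (Python sketch).
def block_owner(i, j, M, N, Px, Py):
    R, C = M // Py, N // Px         # rows and columns held by each processor
    return (j // C, i // R)         # contiguous R x C tile per processor

def cyclic_owner(i, j, Px, Py):
    return (j % Px, i % Py)         # every Px-th column, every Py-th row, with wraparound

M, N, Px, Py = 288, 144, 4, 8       # assumed training-set and subarray sizes
print(block_owner(100, 50, M, N, Px, Py))   # (1, 2): tile containing row 100, column 50
print(cyclic_owner(100, 50, Px, Py))        # (2, 4)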

Distributed Algorithm Operation

In the section on space-time adaptive processing we stated that the computation of the adaptive weight vector w from the estimated covariance matrix R could be done most effectively through QR decomposition of the training-set matrix. In the HOPD algorithm, the method for computing A = QR is based on the technique of Householder reflections [4]. A two-by-two orthogonal matrix Q is a reflection if it has the form

Q = [ cos θ    sin θ
      sin θ   –cos θ ].

If y = Q^T x = Qx, then y is obtained by reflecting the vector x across the line defined by

S = span{ [cos(θ/2), sin(θ/2)]^T }.

Reflections can be used to introduce zeros in a vector by properly choosing the reflection plane. A Householder reflection is an n-by-n matrix P of the form

P = I – 2vv^T/(v^T v),

where v ∈ R^n is called the Householder vector. When a vector x is multiplied by P, it is reflected in the hyperplane span{v}⊥. Householder reflections are symmetric and orthogonal, and can be used to set selected components of a vector to zero.

FIGURE 11. The global mapping approach. The two-dimensional training-set matrix for each Doppler bin m is mapped to a region of multiple processors in the full two-dimensional processor array. By increasing the number of processors allocated to each Doppler bin, we can improve overall execution time.


FIGURE 12. Block mapping and cyclic mapping. Each M × N matrix of training-set data is mapped onto a Px × Py array of processors such that each processor is assigned R = M/Py rows and C = N/Px columns of the matrix. (a) In the block mapping, adjacent processors in the two-dimensional mesh contain adjacent blocks of multiple data elements in the two-dimensional matrix. (b) In the cyclic mapping, each processor is assigned every Px-th element in the x dimension and every Py-th element in the y dimension. Adjacent processors in the two-dimensional mesh contain adjacent single data elements in the two-dimensional matrix, with wraparound at the processor boundaries. In general, the block mapping yields better communication-to-computation ratios, while the cyclic mapping yields better load balancing between processors.


A succession of Householder reflections applied to the columns of matrix A can introduce zeros below the main diagonal, yielding an upper-triangular matrix R. A Householder update of a matrix A involves a matrix-vector multiplication followed by an outer-product update, and therefore never entails the explicit formulation of the Householder matrix. The algorithm for the kth Householder update of the sample matrix is

Qk = I – 2vv^H/(v^H v)  ⇒  QkA = (I – 2vv^H/(v^H v))A = A + vw^H,

where, if ak is the column vector of data below the diagonal of matrix A, then

v = ak + sgn[ak(1)] |ak| e1,
β = –2/(v^H v),
w = β Ak^H v.

In this algorithm the matrix Ak is the submatrix of A for the kth iteration. Table 1 lists the steps in the distributed Householder decomposition of the matrix A. Table 2 outlines the steps in the distributed backsolve and weight-application operation.
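For reference, the same update can be written compactly for a single processor; the NumPy sketch below applies the v, β, w formulas column by column and checks that A^H A reproduces X^H X. It is a plain serial rendering of the mathematics, not the distributed MeshSP code, and the test matrix is an assumption.

import numpy as np

def householder_qr(X):
    """Upper-triangularize a copy of X with the column-by-column update Ak <- Ak + v w^H."""
    A = X.astype(complex).copy()
    M, N = A.shape
    for k in range(N):
        a = A[k:, k]
        norm_a = np.linalg.norm(a)
        if norm_a == 0.0:
            continue
        sgn = a[0] / abs(a[0]) if a[0] != 0 else 1.0
        v = a.copy()
        v[0] += sgn * norm_a                          # v = ak + sgn[ak(1)] |ak| e1
        beta = -2.0 / np.vdot(v, v).real              # beta = -2/(v^H v)
        w = beta * (A[k:, k:].conj().T @ v)           # w = beta Ak^H v
        A[k:, k:] += np.outer(v, w.conj())            # Ak <- Ak + v w^H
    return np.triu(A)

X = np.random.default_rng(3).standard_normal((8, 5))
A = householder_qr(X)
print(np.allclose(A.conj().T @ A, X.conj().T @ X))    # R = X^H X = A^H A up to round-off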

As the Householder algorithm iterates over columns in the input matrix, the active part (i.e., the elements of the matrix that are involved in the current update) consists of the elements below and to the right of the main diagonal. Since only part of the matrix is active in each iteration, load balancing is an issue. Figure 13 illustrates the different load-balancing characteristics of the Householder algorithm for the two different mappings. In the block mapping some processors become inactive earlier in the sequence of iterations, whereas in the cyclic mapping all processors remain active until the final iteration. Because of this characteristic of the mappings, cyclic mapping has better load balancing.

Performance Model

Table 3 shows the equations for estimating the computation and communication costs for each step in the HOPD algorithm.


Table 1. Distributed Householder Decomposition

Given a complex matrix A, apply a sequence of Householder transformation matrices Qk to upper-triangularize A. Each Householder matrix Qk zeros a column of A below the diagonal.

For each column ak of A (k = 1 to N), compute the update of A that represents QkA:

Square the elements in that column, then perform a reduction (addition) over the appropriate column of processors, and take the square root to compute |ak|.

Update the appropriate value in the appropriate processor to compute v(1) = ak(1) + sgn[ak(1)] |ak|.

Multiply the elements in that column by their complex conjugates, then perform a reduction (addition), division, and multiplication to compute β = –2/(v^H v).

Broadcast β from the appropriate processor to the right in the processor array.

Broadcast v from all processors in the appropriate column to the right in the processor array.

In all processors to the right and down from the selected processor, compute w = βakv, then perform a reduction over those columns of processors to compute w = βak^Hv.

Update by setting ak = ak + vw* for all elements of A.


A reduction operation over P processors in which K numbers are added over the P processors involves an initial step to add the elements within each processor, then log2(P) stages to add partial sums between processors. The ith stage involves an addition and a communication of distance 2^(i–1). On the MeshSP, the result is repeated in each of the P processors.

The number of steps to broadcast K numbers over P processors is proportional to KP. If pipelining can be used, then the number of steps is proportional to K + P. On the MeshSP there is a three-cycle overhead penalty for pipelined broadcasts; therefore, the number of steps is proportional to K + 3P. The values Rj and Cj are the number of rows and columns, respectively, per processor during iteration j. The values Pjx and Pjy are the respective numbers of processors active during iteration j in the x and y dimensions. These four values vary with different mappings for the matrix decomposition. Table 4 lists the values for the two mappings in this study, and Table 5 lists the communication costs for the backsolve operations, where Py′ = N/R is the number of processors in the y dimension active in the backsolve operations, and R′ = N/Py is the number of backsolve rows per processor.
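The two communication primitives in this model can be tallied directly; the helper functions below follow the step counts just described (local additions plus log2 P exchange stages for a reduction, and roughly K + 3P steps for a pipelined broadcast). The function names and the interpretation of the local-add term are our assumptions.

import math

def reduction_steps(K, P):
    """Add K numbers held in each processor across P processors."""
    local = max(K - 1, 0)          # initial step: add the elements within each processor
    stages = int(math.log2(P))     # then log2(P) partial-sum exchanges between processors
    return local + stages

def broadcast_steps(K, P, pipelined=True):
    """Broadcast K numbers over P processors; pipelining reduces K*P to about K + 3P."""
    return K + 3 * P if pipelined else K * P

print(reduction_steps(K=64, P=16))    # 63 + 4 = 67
print(broadcast_steps(K=64, P=16))    # 64 + 48 = 112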

Analytical Results

Two different sets of parameter values based on two different radar systems were used in this study. Table 6 lists the parameters and their values for each case.

The resource requirements for memory and input/output are defined for a given application by using a data cube of size Nd × Nr × Ndof, where Nd is the number of Doppler bins, Nr is the number of range gates, and Ndof is the number of degrees of freedom, to allow for sharing of adjacent Doppler bins. The number of processors required for the entire processor array is Nd × Ns × Px × Py, where Ns is the number of range segments, since each Doppler bin and range segment has its own training set. The number of data elements per processor element is the data cube size divided by the total number of processors; the memory required per processor element is twice that value to reflect double buffering of the data cube (scaled by eight for eight-byte complex data). Input/output throughput required per processor element is the number of data elements per processor element (times eight) divided by the length of a coherent processing interval.
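As a worked example, the short script below evaluates these resource formulas for the Radar 1 parameters of Table 6, assuming a 2 × 2 processor subarray per subproblem and eight bytes per complex sample as stated above.

# Per-processor resource estimate following the formulas in the text (Python).
# Radar 1 parameters from Table 6; the 2 x 2 subarray (Px, Py) is an assumed configuration.
Nd, Nr, Ndof, Ns, Tcpi = 128, 1250, 144, 2, 0.5
Px, Py = 2, 2

cube_elements = Nd * Nr * Ndof             # complex samples in the data cube
total_processors = Nd * Ns * Px * Py       # one Px x Py subarray per Doppler bin and segment
elements_per_pe = cube_elements / total_processors

memory_per_pe = 2 * elements_per_pe * 8    # double buffered, eight bytes per complex sample
io_per_pe = elements_per_pe * 8 / Tcpi     # bytes per second into each processor element

print(f"{total_processors} processors, {elements_per_pe:.0f} samples per processor element")
print(f"memory per processor element ~ {memory_per_pe / 1024:.0f} KB")
print(f"I/O per processor element    ~ {io_per_pe / 1024:.0f} KB/sec")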


FIGURE 13. Householder-algorithm load-balancing characteristics for block mapping and cyclic mapping. (a) Block mapping produces better communication-to-computation ratios, while (b) cyclic mapping produces better load balancing between processors because the data are spread over more processors.


Table 7 lists the performance metrics that were derived in this study. The peak values for MeshSP computational throughput and communication bandwidth are inserted in the equations with an estimated efficiency of 25% in both processor-element performance and processor-element intercommunication. The single processor-element efficiency has a significant effect on overall processor efficiency and therefore must be measured on actual MeshSP hardware.

Table 2. Distributed Backsolve and Weight Application

Distributed backsolve.

First backsolve, for (j = 1 to N), to solve A^H u = s for vector u:

Divide to get u(j); broadcast u(j) right along the processor row.

Multiply u(j) by a(j, j) and subtract from v(j) for all values in all processors in the rest of the current row.

Send updated v(j) down to the next row.

Second backsolve, for (j = N to 1), to solve Aw = u for the weight vector w:

Divide to get w(j); broadcast w(j) up along the processor column.

Multiply w(j) by a(j, j) and subtract from u(j) for all values in all processors in the rest of the current column.

Send updated u(j) left to the next column.

Distributed weight application.

For each range gate per processor (i = 1 to Nr/Py):

Multiply and add all elements within each processor.

Perform a reduction (addition) over rows of processors.


Table 3. Total Computation and Communication Costs*

Function                        Computation                          Communication

Householder decomposition:
  Compute v                     4Rj + log2(Pjy) + 1                  4(Pjy – 1)
  Compute β                     4Rj + log2(Pjy) + 1                  4(Pjy – 1)
  Compute w                     Cj(8Rj + 2 log2(Pjy) + 2)            8(Pjy – 1)
  Compute a                     8RjCj                                8Rj + 12Pjy
Backsolve                       Nv(22 + 8Cj + 8Rj′)                  NvZ
Weight application              NvNr(8C + 2 log2(Px))/(PyNs)         NvNr × 8(Px – 1)/(PyNs)

* Rj is the number of rows per processor, Cj is the number of columns per processor, Pjx is the number of x processors active, and Pjy is the number of y processors active during iteration j of the Householder algorithm. See Tables 4 through 6 for the mapping-specific parameters (Rj′, Z) and the radar parameters (Nv, Nr, Ns).

Table 4. Mapping-Specific Parameters

Value*    Block                                    Cyclic

Rj        if j < M – R then R else M – j           R – floor(j/Py)
Cj        if j < N – C then C else N – j           C – floor(j/Px)
Pjx       Px – ceiling((j + 1)/C)                  if j < N – Px then Px – 1 else N – j – 1
Pjy       Py                                       Py
Rj′       if j < N – R then R else N – j           R – floor(j/Py)
Pjy′      Py – ceiling((j + 1)/R)                  if j < N – Py then Py – 1 else N – j – 1

* Rj′ is the number of rows per processor during iteration j of the backsolve algorithm; Pjy′ is the number of y processors active during iteration j of the backsolve algorithm.

Table 5. Backsolve Costs for Block Mapping and Cyclic Mapping*

Block:   Z = 8 [ Σj Pjx + Σj Pjy′ + (Py′ – 1) + (Px – 1) ]

Cyclic:  Z = 8 [ Σj Pjx + Σj Pjy′ + (2R′ – 1)(Py′ – 1) + (2C – 1)(Px – 1) ]

* Py′ = N/R is the number of y processors active in the backsolve; R′ = N/Py is the number of rows per processor.


Table 7. Performance Metrics

Definition                              Value

Processing latency                      Tp = Nops/(120 × 0.25)
Communication latency                   Tc = Nbytes/(40 × 0.25)
Total latency                           T = Tp + Tc
Communication-to-computation ratio      R = Tc/Tp
Speedup                                 S = T1/TN
Efficiency                              E = T1/(N × TN)
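A direct transcription of these definitions, with the 25% efficiency factor applied to the 120 Mflops/sec and 40 MB/sec peak rates, might look like the following; the operation and byte counts in the call are placeholders, not values taken from the article.

def stap_metrics(n_ops, n_bytes, T1, N):
    """Evaluate the Table 7 metrics for one mapping of a subproblem onto N processors."""
    Tp = n_ops / (120e6 * 0.25)      # processing latency at 25% of 120 Mflops/sec peak
    Tc = n_bytes / (40e6 * 0.25)     # communication latency at 25% of 40 MB/sec peak
    TN = Tp + Tc                     # total latency on N processors
    return {"Tp": Tp, "Tc": Tc, "T": TN,
            "comm_to_comp": Tc / Tp,       # R = Tc / Tp
            "speedup": T1 / TN,            # S = T1 / TN
            "efficiency": T1 / (N * TN)}   # E = T1 / (N * TN)

# Placeholder counts for a single subproblem on a 2 x 2 subarray (illustrative only).
print(stap_metrics(n_ops=5.0e6, n_bytes=2.0e5, T1=0.7, N=4))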

Table 6. Radar Application Definition

Parameter                                        Symbol    Radar 1    Radar 2

Number of channels                               Nc        48         18
Number of Doppler bins                           Nd        128        16
Number of range gates                            Nr        1250       16,384
Number of segments                               Ns        2          4
Degrees of freedom                               Ndof      144        54
Training sample ratio (M/N)                      k         2          3
Number of beams                                  Nv        2          3
Length of coherent processing interval (sec)     Tcpi      0.5        0.06

Architecture Performance Analysis

The performance model described in the previous section was exercised by using two different radar applications (Radar 1 and Radar 2) and two different mappings (block and cyclic). Radar 1 and Radar 2 parameter sets are based on actual radar systems with low and high instantaneous bandwidths, respectively. For each case, we derived performance measures for a variety of processor configurations. By plotting processor configuration on one axis and performance metric on the other axis, we can show the scalability of the algorithm mapping to the MeshSP.

Radar 1 Performance Results

Figure 14 shows the variation in estimated total latency for each mapping of the HOPD algorithm with the Radar 1 parameters. These graphs show that real-time requirements can be met with four processors per sample-matrix-inversion problem with each mapping. For 128 Doppler bins and two range segments, a total of 1024 processors for all Doppler bins and range segments is required. The cyclic mapping gives greater flexibility in mapping choice to meet real-time requirements for latency, since two points in processor space that do not meet real-time requirements for block mapping do meet real-time requirements for cyclic mapping. This fact occurs because of the better load balancing of the cyclic mapping.
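The processor count quoted here follows directly from the mapping; a quick check, using the 64-processor boards described earlier, is shown below.

# Radar 1 sizing check: 128 Doppler bins x 2 range segments, 4 processors per
# sample-matrix-inversion subproblem, 64 SHARCs per MeshSP board (values from the text).
bins, segments, per_problem, per_board = 128, 2, 4, 64
total = bins * segments * per_problem
print(total, "processors =", total // per_board, "boards")   # 1024 processors = 16 boards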

Figure 15 shows a breakdown of latency for each of the three operations in the algorithm. Algorithm execution time is clearly dominated by the Householder QR decomposition. Execution time for the cyclic mapping is approximately half that of the block mapping because of the better load balancing of the cyclic mapping.


In the Householder QR decomposition, algorithm activity progresses to the lower right corner of the matrix. Thus, with the block mapping, processors in the upper left corner of each processor subarray become inactive in the beginning stages of the algorithm. With the cyclic mapping, all processors stay active for more iterations of the algorithm.

Figure 16 shows speedup and efficiency of the entire Radar 1 application. Speedup is defined as the ratio of execution time with one processor (T1) to execution time with N processors (TN). Once again we see that the cyclic mapping is more efficient because of better load balancing between processors. We also determined that parallel efficiency, as defined in Table 7, is high for lower numbers of processors, i.e., four or eight processor elements per problem.

Figure 17 shows the communication-to-computation ratio for the entire Radar 1 application. These graphs show that the cyclic algorithm, although it has a better execution time because of better load balancing, has a worse communication-to-computation ratio than the block mapping. This result matches expectations, since better load balancing means spreading data over a higher average number of processors per iteration. The increased number of processors lowers latency, but increases the communication burden in the distributed algorithm.

FIGURE 15. Radar 1 mapping analysis. The majority of algorithm execution time is spent in the Householder QR decomposition. Execution time with the cyclic mapping is lower because of better load balancing.

FIGURE 16. Radar 1 speedup. Speedup is better with the cyclic mapping because of the better load balancing among processors.


FIGURE 14. Radar 1 real-time processing. Real-time requirements can be met with a minimum of four processors for either mapping.

FIGURE 17. Radar 1 communication-to-computation ratio. This ratio is higher with the cyclic mapping because data computation is distributed over more processors, which requires greater interprocessor communication.


Figure 18 shows the effect of adding processors in the x dimension versus adding processors in the y dimension. In each graph, as we move to the right along the x-axis, the aspect ratio of y processors to x processors increases; a ratio greater than one means there are more y processors than x processors. Execution time decreases slightly as we increase y processors relative to x processors, slightly favoring y-rectangular processor arrays. Because reduction operations scale in execution time as log(N) whereas broadcast operations scale as N, this result again meets expectations.
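A rough way to see the scaling difference is to compare step counts. The sketch below is illustrative only: it assumes a tree-structured reduction and a hop-by-hop broadcast along a mesh row, and it ignores message sizes and the actual costs of the MeshSP communication primitives.

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        /* Step counts only: ~log2(P) for a tree reduction versus ~P for a
         * hop-by-hop broadcast over the same P processors. */
        for (int P = 2; P <= 8; P *= 2)
            printf("P = %d: tree reduction ~ %.0f steps, row broadcast ~ %d steps\n",
                   P, ceil(log2((double)P)), P);
        return 0;
    }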

Figure 19 shows memory and input/output requirements for different processor configurations. We can see that the required memory just fits into that provided by the four processors per matrix decomposition. Input/output requirements are well within the capabilities of the MeshSP.

The performance results for Radar 1, on the basis of the performance model, indicate that we can meet real-time requirements for Radar 1 with approximately sixteen MeshSP processor boards. The results show that the Radar 1 application scales reasonably on the MeshSP architecture, and suggest a preference for the cyclic mapping and a y-rectangular processor aspect ratio.

Radar 2 Performance Results

Figure 20 shows the variation in estimated latency for each mapping of the HOPD algorithm with the Radar 2 parameters. As shown in Table 6, the “shape” of the Radar 2 application is different from that of the Radar 1 problem; Radar 2 has many range gates and smaller degrees of freedom, whereas Radar 1 has fewer range gates and larger degrees of freedom.

Figure 21 shows a breakdown of latency for each operation in the algorithm. Because the Radar 2 case has many more range gates, algorithm execution time is dominated by the weight application (the application of weights to the entire data cube).

FIGURE 19. (a) Radar 1 required memory per processor and (b) required input/output per processor. Radar 1 processing fits into available memory limits with a minimum of four processors, and within input/output limits for all processor sizes.

[Figure 18 data: total execution time (sec) versus aspect ratio (Py/Px) for P = 2, 4, 8, 16, and 32, block and cyclic mappings.]

FIGURE 18. Radar 1 aspect ratio. In general, execution time is lower when Py is greater than Px for both the block mapping and the cyclic mapping.

[Figure 19 data: (a) required memory (KB) and (b) required input/output (KB/sec) per processor versus processors in the x dimension (Px), for Py = 1, 2, 4, and 8, with available memory and I/O per processor indicated.]


Although the cyclic mapping still outperforms the block mapping for the Householder matrix decomposition, it is equivalent to the block mapping for the weight application. This result meets expectation, since the reduction operations used in the weight application simply add N numbers distributed over P processors, irrespective of the mapping.
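To see why this step is insensitive to the mapping, consider the arithmetic alone. The sketch below is a serial stand-in for the distributed reduction; the cyclic ownership rule and the data values are illustrative assumptions. Each processor accumulates its local share and the P partial sums are then combined, so the total work is the same whether the values were dealt out in blocks or cyclically.

    #include <stdio.h>

    int main(void)
    {
        const int N = 16, P = 4;
        double x[16], partial[4] = {0.0, 0.0, 0.0, 0.0}, total = 0.0;

        for (int i = 0; i < N; i++) x[i] = (double)(i + 1);

        /* Each "processor" sums its local share; cyclic ownership shown here,
         * but a block rule gives the same total work and the same result. */
        for (int i = 0; i < N; i++)
            partial[i % P] += x[i];

        /* Combine the P partial sums (the reduction across processors). */
        for (int p = 0; p < P; p++) total += partial[p];

        printf("sum = %.1f\n", total);   /* 136.0 for 1 + 2 + ... + 16 */
        return 0;
    }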

Figure 22 shows speedup and efficiency of the entire Radar 2 application. In this case, the cyclic mapping is only slightly more efficient than the block mapping, since the weight application dominates execution time and is mapping-insensitive. Again, we see higher efficiency for lower numbers of processors.

Figure 23 shows the communication-to-computation ratio for the entire Radar 2 application. Again, the cyclic algorithm has a communication burden that is only slightly higher because of the distribution over more processors, since the weight application dominates. We can also see that adding y processors does not increase communication significantly. This result also meets expectation, because the weight application communicates only in the x direction. We would therefore expect this algorithm to strongly favor distribution over processors in the y dimension rather than the x dimension, and this result can be seen in Figure 24.

[Figure 22 data: total speedup (T1/TN) versus processors in the x dimension (Px), for Py = 1, 2, 4, and 8, block and cyclic mappings.]

FIGURE 21. Radar 2 mapping analysis. Unlike Radar 1, the majority of algorithm execution time for Radar 2 is spent in the weight application, which is not influenced by the mapping.

FIGURE 22. Radar 2 speedup. The total speedup, which is approximately the same for both mappings, increases only marginally for values of Px greater than four.

FIGURE 23. Radar 2 communication-to-computation ratio. This ratio is higher with the cyclic mapping; it varies only slightly with changes in Py because the weight application communicates only in the x dimension.

FIGURE 20. Radar 2 real-time processing. Real-time requirements can be met with a minimum of eight processors for either mapping.

[Figure 20, 21, and 23 data: total execution time (sec) against real-time requirement lines, execution time (sec) for the QR, Apply, and Backsolve operations, and total comm/comp (Tc/Tp, sec/sec), versus processors in the x dimension (Px), for Py = 1, 2, 4, and 8, block and cyclic mappings.]


Lastly, Figure 25 shows the memory and input/output requirements for different processor configurations. We can see that the required memory just fits into that provided by the eight processors per matrix decomposition, but the fit is very tight. For the Radar 2 case, as with the Radar 1 case, input/output requirements are also within the capabilities of the MeshSP.

These graphs show that real-time requirements can be met with eight processors per sample-matrix-inversion problem with each mapping, for a total of 512 processors for all sixteen Doppler bins and four range segments. The cyclic mapping gives slightly more flexibility in mapping choice, but not as much as with the Radar 1 parameters. These graphs also show that adding processors in the x dimension beyond a certain point does not increase performance significantly, whereas adding processors in the y dimension does.

From the performance results for Radar 2 based on the performance model, we can meet real-time requirements for Radar 2 with eight MeshSP processor boards. The results show that the Radar 2 application scales reasonably on the MeshSP architecture, and suggest a preference for the cyclic mapping and a y-rectangular processor aspect ratio.

Future Work

With this analysis as a basis, the next step is to implement and optimize the algorithm in order to verify the performance model presented here and to derive the exact measured efficiency of a representative STAP algorithm on the MeshSP architecture.

FIGURE 25. (a) Radar 2 required memory per processor and (b) required input/output per processor. Radar 2 processing fits into available memory and input/output limits with a minimum of eight processors.

FIGURE 24. Radar 2 aspect ratio. Execution time is always lower when Py is greater than Px for both the block mapping and the cyclic mapping.

[Figure 24 and Figure 25 data: total execution time (sec) versus aspect ratio (Py/Px) for P = 2, 4, 8, 16, and 32, block and cyclic mappings; and (a) required memory (KB) and (b) required input/output (KB/sec × 10^4) per processor versus processors in the x dimension (Px), for Py = 1, 2, 4, and 8, with available memory and I/O per processor indicated.]


Another area for future work is algorithm optimization, which was not addressed here. Input/output and radar-interface issues must also be resolved for a feasible system solution.

In actuality, the overall processor efficiency is the product of the parallel efficiency and the single-processor efficiency, which was assumed in this study to be 25%. More accurate estimates of overall processor efficiency require single-processor efficiencies derived from measurements of execution times on actual processors.
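As a worked example of this product, the 25% single-processor figure below is the assumption used in this study, while the 80% parallel efficiency is a made-up illustrative value.

    #include <stdio.h>

    int main(void)
    {
        double parallel_eff    = 0.80;   /* hypothetical value for illustration   */
        double single_proc_eff = 0.25;   /* the 25% assumption used in this study */
        printf("overall efficiency = %.2f\n",
               parallel_eff * single_proc_eff);   /* prints 0.20 */
        return 0;
    }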

Other STAP processing stages, such as channel equalization, pulse compression, Doppler filtering, and detection, must be included in the processing stream for analysis. In these stages, the parallelism usually lies on a different dimension from that of the HOPD algorithm presented here. Since the optimal mapping for each stage is potentially different, the additional communication costs caused by interstage remappings must be analyzed. There is a trade-off between optimal mappings for each stage with interstage remappings, and suboptimal mappings for each stage with lower interstage remapping costs. The remapping step often involves a major rearrangement of the input data, commonly referred to as a corner turn. Corner-turn operations involve all nodes in the parallel processor and therefore require high levels of sustained interprocessor communication bandwidth. The ability of an architecture to implement remappings must be analyzed and understood in the context of STAP algorithms and their associated implementation trade-offs. Future versions of the performance model presented here will be enlarged to include this concept.
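A corner turn can be pictured as a global transpose of the distributed data set. The sketch below simulates it in a single address space; the array dimensions and the channel/range interpretation are illustrative assumptions. On the parallel machine, essentially every element would cross processor boundaries during this step, which is why sustained interprocessor bandwidth matters.

    #include <stdio.h>

    #define ROWS 4   /* e.g., the dimension spread across processors before the turn */
    #define COLS 6   /* e.g., the dimension held locally on each processor           */

    int main(void)
    {
        double a[ROWS][COLS], b[COLS][ROWS];

        for (int r = 0; r < ROWS; r++)
            for (int c = 0; c < COLS; c++)
                a[r][c] = 10.0 * r + c;

        /* The corner turn: after the remapping, the formerly distributed
         * dimension is local and the formerly local dimension is distributed. */
        for (int r = 0; r < ROWS; r++)
            for (int c = 0; c < COLS; c++)
                b[c][r] = a[r][c];

        printf("b[3][2] = %.0f (was a[2][3])\n", b[3][2]);   /* prints 23 */
        return 0;
    }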

Conclusion

This article has demonstrated an approach to measuring the performance and scalability of a demanding kernel in STAP algorithms on a commercially available massively parallel processor called the MeshSP. For each radar system evaluated, we can meet real-time requirements with 1024 processors or fewer (eight or sixteen individual processor boards), with available on-chip memory, and in a form factor that is not unreasonable for an airborne platform. The algorithms scale well on the architecture, and we have been able to compare different mappings of data to processors in order to determine which provides the best performance. We have also been able to suggest processor aspect ratios that best suit the algorithm.

On the basis of the parallel efficiencies attained in this study, SIMD seems to be a good architectural match for the single STAP computational kernel presented here, both in scalability and in delivered processing power per unit size, weight, and power. Follow-on work that must be performed to fully understand the match between the SIMD architecture and STAP algorithms includes enlarging the model to include the other stages of STAP computation, as well as obtaining measured efficiencies on actual MeshSP hardware.

Acknowledgments

This work was originally performed with Ken Teitelbaum, who brought valuable insight and contributed significant direction to this project [2]. The author is also indebted to Robert Bond for his important contributions and feedback. The STAP material presented in this article is summarized from an original report by James Ward [3], where a more detailed treatment can be found.

REFERENCES
1. L.E. Brennan and I.S. Reed, “Theory of Adaptive Radar,” IEEE Trans. Aerosp. Electron. Syst. 9 (2), 1973, pp. 237–252.
2. J.O. McMahon and K. Teitelbaum, “Space-Time Adaptive Processing on the Mesh Synchronous Processor,” 10th Int. Parallel Processing Symp., Honolulu, Hawaii, 15–19 Apr. 1996, pp. 734–740.
3. J. Ward, “Space-Time Adaptive Processing for Airborne Radar,” Technical Report 1015, MIT Lincoln Laboratory, DTIC #AD-A293032 (13 Dec. 1994).
4. G.H. Golub and C.F. Van Loan, Matrix Computations, 2nd ed. (Johns Hopkins, Baltimore, 1989).
5. C.M. Rader and A.O. Steinhardt, “Hyperbolic Householder Transformations,” IEEE Trans. Acoustics, Speech Signal Process. 34 (6), 1986, pp. 1589–1602.
6. I.H. Gilbert and W.S. Farmer, “The Mesh Synchronous Processor MeshSP™,” Technical Report 1004, MIT Lincoln Laboratory, DTIC #AD-A289875 (14 Dec. 1994).
7. C. Koelbel, D.B. Loveman, R.S. Schreiber, G.L. Steele, Jr., and M.E. Zosel, The High Performance Fortran Handbook (MIT Press, Cambridge, Mass., 1994).


Janice Onanian McMahon is a staff member in the Digital Radar Technology group. Her research interests are in the application of massively parallel processing technology to advanced signal processing applications. She received an S.B. degree in computer science and engineering, and an S.M. degree in electrical engineering and computer science, both from MIT. Before joining the staff of Lincoln Laboratory in 1995 she worked at MasPar Computer Corporation.