Upload
others
View
3
Download
0
Embed Size (px)
Citation preview
Partitioning of a Signal Detection Algorithm to a Heterogeneous
Multicomputing Platform
Partitioning of a Signal Detection Partitioning of a Signal Detection Algorithm to a Heterogeneous Algorithm to a Heterogeneous
Multicomputing Multicomputing PlatformPlatformMichael Vinskus
Mercury Computer Systems, Inc.
High Performance Embedded Computing (HPEC) ConferenceSeptember 23, 2003
PeakDetection
New EnergyAlarm
DelayMemory
Channelizer16K point real
4:1 Overlap
ADC100 MHz
14-Bit
DemodulatorBank
Tasking
? Typical acquisition systems place the delay memory after the analog to digital converter. For demodulation, a digital down converter(DDC) is used to heterodyne and filter the delayed data stream
? When the number of signals to demodulate becomes very large (>100), the typical DDC-based approach becomes cumbersome. FFT processing can allow the
? The goal of the acquisition system is to detect the presence of new signals in the environment in a timely manner so that they may be identified and exploited? A homogeneous system based on general purpose processors can be used to solve
the problem, but advances in FPGAs allow for a higher processing density, especially in the channelizer
? Current FPGAs can offer over a 10x computational density improvement over genepurpose processors for certain applications. Unfortunately, most communication fabrics have not scaled at the same rate
How can the I/O issue be managed?By partitioning across multiple processing elements!
Parameters:16384 point real DFT4:1 overlap and windowing (P = 4)16384 input sample maximum latency requirementOutput 8192 bins, complex, 24 bits per component (1200 MB/s)
? The channelizer decomposes the input sample stream into frequency channels by performing overlapped and windowed, short time Fourier transforms on the input data stream
ADC
100 MHz14 Bit
800MB/s
1200MB/s
200MB/s
4:1 Data Overlap and Windowing16384
4096
0
3
1
2
16384 point real FFT
??
?
?1
0
)()(N
n
kNWnxkX
The channelizer throughput is greater than most interconnect fabrics can support
Peak Findingand
Measurement
PeakAssociation
New EnergyAlarm
Log Magnitude
? ?)(log10 mX k
Bin ThresholdCalculation
Delay Memory
Alarm Report
Fc = 45.325 MHzBW = 30.21 kHzTup = 15.000
? The new energy alarm detects the presence of “new” signals and performs some rudimentary external measurements
? Partitioning strategies? Commutation? Time domain broadcast? Frequency division
? High-input bandwidth and computational requirements necessitate partitioning
Each processing partition generates the entire frequency sweep for a range of time slices
Each partition processes a contiguous segment of the input data stream By partitioning the problem in this manner, extra overhead to handle the segment boundaries is incurred
Shared
New Data
IPC
? ?P
PL 1?
IPC
IPC
? In this example, P-1 old input data blocks need to be received and P-1 blocks need to be sent
In this case, the channelizer is split into four partitions
Also, the system latency increases due to time expansion nature of the
To meet the one block latency requirement, only 1/4 of a block of new data is passed to each partition; 3/4 of the data comes from the other partitions. This results in the same data block being transferred an extra 3 times!
For this example, the input bandwidth increases from 200 MB/s to 800 MB/s. We’re going the wrong way!
Throughput Overhead for Time Commutation
50
100
150
200
250
300
350
Per
cen
t
New Data
SavedData
Each partition receives the entire data streamNo extra I/O overhead is incurredPartitioning does not add latency
No communication between channelizer partitionsStill need to send all of the channelizer output data to the detector processorsIf there are a sufficient number of detector processing elements allocated and they are located correctly in the fabric, then I/O bottlenecks can be avoided
200200200
AD
C
200 100150
200 200 20
0
200
250 250
200200
50 50
100/100
100
Det
ecto
r
Det
ecto
r
Det
ecto
r
150100
Ch
ann
eliz
er
Crossbar
CrossbarCrossbar
Ch
ann
eliz
er
Unu
sed
200 100150
200 200
250
250
200
200
50 50
100
Det
ecto
r
Det
ecto
r
150100
Ch
ann
eliz
er
Crossbar
CrossbarCrossbar
Ch
ann
eliz
er
Each fabric connection is a 266 MB/s half duplex link to the crossbar. Typical performance is around 250 MB/sTo accommodate the channelizer output rate, the output stream musplit across multiple connectionsThis complicates the data flow and fully utilizes the fabric I/O capacity in many places
? ?PNpk p 1??
Partition p
OutputInput? p kkG :1?
? ? pkNG)(ng
? ?1?pkX
? ?pp kkX :1?? ?1: 1 ??? ?pp kNkNX
Partition generates DFTpoints:
IPC
IPC
? Each node generates a subset of the frequency bins for all time slices
? Each partition needs all of the time series data for each sweep
Able to pipeline the channelizer and detection processing with minimal inter-
Each partition performs the front part of the FFT computations. This is inefficient, but the I/O issues are simplerUniform communication between partitions at partition boundaries only? Single element communication between channelizer partitions? Similar communication between detector partitions!
So why the unusual allocation of the frequency bins?
Partition p
Real DFT)(ng ? ??10log PeakDetection
Partition p-1
GR(k)
GI(k)
(k)
(N-k)
(k)
(N-k)
-
- ???
???
Nk
22
sin?
???
???
Nk
22sin ?
???
???
Nk
22cos ?
GR(N-k)
--
-
? ? ? ? ? ? ? ? ? ?? ?? ?kNXkXWjkNXkXkG kN ???????? *
2*
21
for k = 0, 1, ..., N
The algorithm exploits the efficient computation of the DFT of a2N-point, real-valued, input sequence with an N-point complex transformSince the input data is real, only one half of the output spectrneeds to be computed
By processing a range of X(k), X(N-k) frequency pairs, twice as many output points can be calculated with only two extra additions!
Unlike the time partitioning case, most of the I/O movement occurs locally in the fabric
200Inter-partition Comms
Ch
ann
eliz
er
200
200 200 20
0
200
Crossbar
250 20033+50 50
Ch
ann
eliz
er
AD
C
Det
ecto
r
Det
ecto
r
Det
ecto
r
200
Crossbar
Crossbar
Ch
ann
eliz
er
200
200
Crossbar
250 20033+50 50
Ch
ann
eliz
er
Unu
sed
Det
ecto
r
200
Crossbar
Crossbar
250
250
Redundant computations are performed in each channelizer partition to reduce the system I/O requirements
Exchanges processing capacity for reduced I/O complexity
The detection algorithms are different than the single antenna case. Typically, eigenspace methods are employed
The channelizer is similar between the single and multi-antenna casesDetection processing requires the time series data for each frequency binLog magnitude computation is not requiredEigenvalues and eigenvectors are computed for each bin, across all antennas
As shown in the previous example, frequency domain partitioning has advantages when I/O bound conditions occur
Eigen-analysis
Detector
FrequencyChannelizerA/D
FrequencyChannelizerA/D
FrequencyChannelizerA/D
? Adds an extra data dimension to the problem
? Typical antenna arrays consist of 4 - 8 antenna elements