Barto 127-MAPLD2005, Slide 1
Reconfigurable Computing: Current Status and Potential for Spacecraft Computing Systems

Rod Barto
NASA/GSFC Office of Logic Design
Spacecraft Digital Electronics
3312 Moonlight
El Paso, Texas 79904
Slide 2
Reconfigurable Computing is…

• A design methodology by which computational components can be arranged in several ways to perform various computing tasks
• Two types of reconfigurable computing:
  – Static, i.e., the computing system is configured before launch
  – Dynamic, i.e., the computing system can be reconfigured after launch
Slide 3
Static Reconfigurability

• Several examples exist, e.g., Cray
• Typically processing modules connected by an intercommunication mechanism, e.g., Ethernet
• Goals are
  – To reduce system development costs
  – To provide higher performance computing
Slide 4
Dynamic Reconfigurability (DR)

• Processing modules that can be reconfigured in flight
• Goal is to provide processing support for algorithms that do not map well onto general purpose computers, using reduced amounts of hardware
Slide 5
Outline of Paper

1. Discuss the computation of a series of algorithms on general purpose, special purpose, and DR computers
2. Calculate the execution time of an image processing algorithm on a concept DR computer
3. Compare the reconfiguration time of a Xilinx FPGA with the algorithm execution time calculated in Section 2
4. Obtain an extremely rough estimate of image processing algorithm execution time on a flight computer
5. Conclude that the DR computer described offers higher performance than does the flight computer
Slide 6
Section 1: Algorithm Execution on General Purpose (GP), Special Purpose (SP), and DR Computers
Slide 7
Processing Example

• A computing function is the composition of n algorithms executed serially
• Can be executed on a general purpose computer (GP) or a special purpose computer (SP)

[Diagram: Input → f1 → f2 → … → fn → Output]
Slide 8
Execution on a GP Computer

[Diagram: Input → f1 → f2 → … → fn → Output]

Processing time of each stage = t_i, i = 1..n
Total processing time = Σ t_i, i = 1..n
Latency time = Σ t_i, i = 1..n

GP computer must execute processing stages sequentially, and cannot exploit parallelism in the overall computing function
Slide 9
Processing on an SP Processor

[Diagram: Input → f1 → f2 → … → fn → Output]

Each stage is an independently operating processor designed specifically for the algorithm it executes.

Processing time of each stage = t_i, i = 1..n
Results appear at a rate of one per max(t_i), i = 1..n
Latency time = Σ t_i, i = 1..n

Performance increase comes from two factors:
• Pipelining of constituent algorithms, exploiting parallelism
• Processors being designed specifically for their algorithms
Slide 10
Processing on a DR Computer

• Two processing elements alternately process and reconfigure, i.e., f_odd executes one algorithm while f_even reconfigures for the next algorithm, etc.

[Diagram: Input → f_odd / f_even → Output]
Slide 11
DR Computer Processing Flow

[Timeline: f_odd executes f1 while f_even reconfigures (R); then f_even executes f2 while f_odd reconfigures; then f_odd executes f3, and so on through fn]

R = reconfiguration
Results appear at a rate of one per Σ max(t_i, R), i = 1..n
Latency = Σ max(t_i, R), i = 1..n

Performance increase comes from configuring processors specifically for the algorithm they are executing.
Do not get an increase from exploiting parallelism.
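The three timing models above can be sketched in Python. The stage times and reconfiguration time below are made-up illustrative values; the formulas are the ones on Slides 8, 9, and 11:

```python
# Timing models for a computing function of n serially composed algorithms.
# t is a list of per-stage processing times t_i; R is the reconfiguration time.

def gp_times(t):
    """General purpose: one processor executes all stages sequentially."""
    total = sum(t)
    return total, total              # (time per result, latency)

def sp_times(t):
    """Special purpose: a dedicated processor per stage, fully pipelined."""
    return max(t), sum(t)            # new result every max(t_i); latency = sum(t_i)

def dr_times(t, R):
    """Dynamically reconfigurable: two elements alternate processing and
    reconfiguring, so each stage effectively costs max(t_i, R), and there is
    no pipelining across successive inputs."""
    per_result = sum(max(ti, R) for ti in t)
    return per_result, per_result

# Illustrative (made-up) stage times and reconfiguration time, in msec:
t = [21, 10, 35, 18]
R = 81
print(gp_times(t))       # (84, 84)
print(sp_times(t))       # (35, 84)
print(dr_times(t, R))    # (324, 324)
```

Note how the DR model only pays off when R is small relative to the stage times; with R larger than every t_i, as here, reconfiguration dominates each stage.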
Slide 12
Section 2: Execution Time of an Image Processing Algorithm on a Concept DR Computer
Slide 13
DR Computer Concept

[Diagram: RAM0 ⇄ FPGA0, FPGA1 ⇄ RAM1]

• Processing elements are implemented in FPGAs
• RAM0 is the source for FPGA0 and the destination for FPGA1, etc.
• FPGA0 and FPGA1 alternately process and reconfigure, as previously discussed
• Input and output not shown
Slide 14
Algorithm Example: 3x3 Image Convolution

• Rows are shifted in from the source RAM one at a time, pixel-serial, and shifted in parallel up into the three row registers; the rows then circulate through the convolution processor. All the row registers and processing are inside the FPGA. The results are written to the destination RAM after a latency of 3 row reads.

[Diagram: source RAM feeds row i+2 one pixel at a time; rows i-1, i, i+1 (each the image width in pixels) parallel-shift up and circularly shift through the 3x3 convolution processor; results go to the destination RAM]
Slide 15
Convolution Operation

Pixel array (centered at row i, column j):      Convolution mask:

  (i-1,j-1)  (i-1,j)  (i-1,j+1)                  m11  m12  m13
  (i,j-1)    (i,j)    (i,j+1)                    m21  m22  m23
  (i+1,j-1)  (i+1,j)  (i+1,j+1)                  m31  m32  m33

Used, for example, to compute the intensity gradient (derivative) at pixel (i,j)

Result = P(i-1,j-1)*m11 + P(i-1,j)*m12 + P(i-1,j+1)*m13 + … + P(i+1,j+1)*m33
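As a cross-check of the formula, a minimal pure-Python sketch (the mask and image values are hypothetical; border pixels are simply skipped for brevity):

```python
def convolve3x3(image, mask):
    """3x3 convolution of a 2D list `image` with a 3x3 `mask`.
    Border pixels are omitted from the result for simplicity."""
    rows, cols = len(image), len(image[0])
    out = [[0] * (cols - 2) for _ in range(rows - 2)]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            acc = 0
            for di in (-1, 0, 1):        # mask row index = di + 1
                for dj in (-1, 0, 1):    # mask column index = dj + 1
                    acc += image[i + di][j + dj] * mask[di + 1][dj + 1]
            out[i - 1][j - 1] = acc
    return out

# A horizontal Sobel mask estimates the intensity gradient, as on the slide.
sobel_x = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]
img = [[0, 0, 10, 10],
       [0, 0, 10, 10],
       [0, 0, 10, 10],
       [0, 0, 10, 10]]
print(convolve3x3(img, sobel_x))   # [[40, 40], [40, 40]]
```

The vertical edge in `img` produces a uniform nonzero gradient, which is the kind of neighborhood result the FPGA pipeline on the next slide computes one pixel per clock.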
Slide 16
Convolution Calculation

• Arithmetic processing may require some pipelining

[Diagram: nine parallel multipliers, P(i-1,j-1)*m11 through P(i+1,j+1)*m33, feeding an adder tree that produces Result(i,j)]
Slide 17
Convolution Timing

• Total time = latency + processing = 20.971 msec
  – This assumes we can get pixels into the FPGA at a 20 nsec/pixel rate
  – Latency = time to read 3 rows:
    • 1024 pixels * 3 rows * 20 nsec/pixel = 61 usec
  – Processing = time to stream remaining 1021 rows through and process:
    • 1024 * 1021 * 20 nsec = 20.910 msec
• Larger convolutions (e.g., 7x7) have longer latencies, but the same computation time
• Calculation is for a mono image; a stereo image would take twice as long.
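The latency and processing figures above can be reproduced in a few lines of Python (1024x1024 mono image, 20 nsec/pixel, as the slide assumes):

```python
# Reproduce the slide's 3x3 convolution timing for a 1024x1024 image.
PIXEL_TIME = 20e-9   # 20 nsec per pixel into the FPGA (slide's assumption)
WIDTH = 1024         # image width in pixels
HEIGHT = 1024        # image height in rows

latency = WIDTH * 3 * PIXEL_TIME                 # read 3 rows before first result
processing = WIDTH * (HEIGHT - 3) * PIXEL_TIME   # stream the remaining 1021 rows
total = latency + processing

print(f"latency    = {latency * 1e6:.1f} usec")     # 61.4 usec
print(f"processing = {processing * 1e3:.3f} msec")  # 20.910 msec
print(f"total      = {total * 1e3:.3f} msec")       # 20.972 msec (slide rounds the latency first, giving 20.971)
```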
Slide 18
Section 3: Comparing the Reconfiguration Time of a Xilinx FPGA With the Algorithm Execution Time Calculated in Section 2
Slide 19
DR Computer Processing Element: Virtex-4 LX FPGA

• Eight versions:
  – XC4VLX15, -25, -40, -60, -80, -100, -160, -200
• Logic hierarchically arranged:
  – 2 flip-flops per slice
  – 4 slices per CLB

  Device   CLB array   Flip-flops
  -15      64x24       12,288
  -200     192x116     178,176
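The flip-flop counts in the table follow directly from the hierarchy above; a quick arithmetic check (CLB array sizes as given on the slide):

```python
# Flip-flop counts implied by the logic hierarchy: 2 FFs/slice, 4 slices/CLB.
FF_PER_SLICE = 2
SLICES_PER_CLB = 4

clb_arrays = {"XC4VLX15": (64, 24), "XC4VLX200": (192, 116)}

ff_counts = {name: rows * cols * SLICES_PER_CLB * FF_PER_SLICE
             for name, (rows, cols) in clb_arrays.items()}
print(ff_counts)   # {'XC4VLX15': 12288, 'XC4VLX200': 178176}
```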
Slide 20
Time to Configure FPGA

• FPGA Configuration Sequence

[Waveform: PROG_B, INIT_B, CCLK, DONE; the Tpl period is followed by the Tconfig period, and together they make up the total configuration time]
Slide 21
Configuration Timing: Tpl

• Tpl = 0.5 usec/frame
• A “frame” is a unit of configuration RAM
• The Tpl period clears the configuration RAM

  Device   Frames   Tpl
  -15      3,740    1.87 msec
  -200     40,108   20.054 msec
Slide 22
Configuration Timing: Tconfig

• FPGA programmed by bitstream
• CCLK (programming clock) can run at 100 MHz
• Parallel mode loads 8 bits per CCLK

  Device   Bitstream length, bits   Parallel loads   Tconfig
  -15      4,765,138                595,648          5.956 msec
  -200     48,722,432               6,090,304        60.903 msec
Slide 23
Total Configuration Time

• Plus some extra time amounting to a few CCLK cycles (@ 10 nsec each)

  Device   Tpl, msec   Tconfig, msec   Total configuration time, msec
  -15      1.87        5.956           7.826
  -200     20.054      60.903          80.957
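The configuration-time arithmetic on the last three slides can be reproduced directly (frame counts and bitstream lengths as given on the slides):

```python
# Virtex-4 total configuration time = Tpl (clear config RAM) + Tconfig (load bitstream).
TPL_PER_FRAME = 0.5e-6    # 0.5 usec per configuration frame
CCLK_HZ = 100e6           # programming clock rate
BITS_PER_CCLK = 8         # parallel mode loads 8 bits per CCLK

devices = {               # device: (frames, bitstream length in bits)
    "XC4VLX15":  (3_740,  4_765_138),
    "XC4VLX200": (40_108, 48_722_432),
}

totals = {}
for name, (frames, bits) in devices.items():
    tpl = frames * TPL_PER_FRAME
    tconfig = bits / BITS_PER_CCLK / CCLK_HZ
    totals[name] = tpl + tconfig
    print(f"{name}: Tpl = {tpl*1e3:.3f} msec, Tconfig = {tconfig*1e3:.3f} msec, "
          f"total = {totals[name]*1e3:.3f} msec")
```

The -15 bitstream length is not an exact multiple of 8, so the parallel-load count differs slightly from the slide's 595,648, but the resulting times agree to the slide's precision.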
Slide 24
Processing and Reconfiguration Time Comparison

• Convolution execution is faster than reconfiguration
  – Convolution = 21 msec mono, 42 msec stereo
  – Reconfiguration = 81 msec
  – Assuming -200 device
• Processing shown is well within the FPGA’s capabilities
• More complex algorithms may require use of FPGA performance features
  – Much higher internal clock rates
  – Large internal RAM
  – Dedicated arithmetic support in the -SX series
• What this shows is that it’s reasonable to consider alternating execution and reconfiguration of two FPGAs
Slide 25
Section 4: An Extremely Rough Estimate of Image Processing Algorithm Execution Time on a Flight Computer
ROUGH ESTIMATE
Slide 26
GP Computing Performance Estimate

• DANGER: really rough estimate!
• Based on data from this paper:
  – “Stereo Vision and Rover Navigation Software for Planetary Exploration”, Steven B. Goldberg, Indelible Systems; Mark Maimone, Larry Matthies, JPL; 2002 IEEE Aerospace Conference
  – Available at robotics.jpl.nasa.gov/people/mwm/visnavsw/aero.pdf
  – Describes processing and algorithms to be used on 2004 Rover missions, and Rover requirements
Slide 27
Published Vision Algorithm Timing

• Timed on Pentium III 700 MHz CPU, 32K L1 cache, 256K L2 cache, 512M RAM, Win2K
• Algorithms explicitly timed (names from paper):

  Algorithm                              C code      Vector
  Difference of Gaussian and Decimate    4229 msec   2047 msec
  Prepare Next Row                       5559 msec   1163 msec
  Inner Loop                             4360 msec   2928 msec
  Compute Sub-Pixel                      667 msec    871 msec

• The Gaussian and most vision algorithms involve neighborhood operations that are comparable to an image convolution of some size
Slide 28
Flight Computer Performance

• Flight processor is RAD6000
• GESTALT Navigation algorithm timed on 3 processors:

  Processor                      Execution Time Multiplier
  Pentium III, 500 MHz, Linux    1
  Sparc, 300 MHz, Solaris        3.0-3.5
  RAD6000, 20 MHz, VxWorks       7.7-8.7

• Assume that the RAD6000 takes 7 times as long as the 500 MHz Pentium
Slide 29
Final Performance Estimate

• Assume RAD6000 time = 7 times the 500 MHz Pentium time
• Assume 500 MHz Pentium time = 7/5 = 1.4 times the 700 MHz Pentium time
• Then, RAD6000 time is 1.4*7 = 9.8 times the 700 MHz Pentium time
• Vision algorithm timing can be estimated as follows:

  Algorithm                              Fastest time   RAD6000 time
  Difference of Gaussian and Decimate    2047 msec      20,060 msec
  Prepare Next Row                       1163 msec      11,397 msec
  Inner Loop                             2928 msec      28,694 msec
  Compute Sub-Pixel                      667 msec       6536 msec

Remember: This is a really rough estimate!!
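The scaling arithmetic behind the table is a one-liner per algorithm; a sketch using the fastest published timings and the slide's assumed multipliers:

```python
# Scale the fastest published 700 MHz Pentium III timings to a RAD6000 estimate.
# Assumptions from the slide: RAD6000 = 7x a 500 MHz Pentium,
# and a 500 MHz Pentium = 7/5 = 1.4x a 700 MHz Pentium.
multiplier = 7 * (700 / 500)       # 9.8x overall

fastest_ms = {                     # fastest of the (C code, vector) timings, msec
    "Difference of Gaussian and Decimate": 2047,
    "Prepare Next Row": 1163,
    "Inner Loop": 2928,
    "Compute Sub-Pixel": 667,
}

# Truncate to whole msec, matching the slide's table.
rad6000_ms = {name: int(t * multiplier) for name, t in fastest_ms.items()}
for name, est in rad6000_ms.items():
    print(f"{name}: {est:,} msec")
```

The same caution applies to this sketch as to the slide: the multipliers are coarse assumptions, not measurements.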
Slide 30
Section 5: Conclusions
Slide 31
What We Have Shown

• We have shown that the concept DR computer presented executes a 3x3 neighborhood-type algorithm “a lot” faster than it appears a RAD6000 executes what are probably a number of neighborhood algorithms.
• The reader is cautioned not to try to quantify what “a lot” means based on the data given here.
• But it’s a good enough estimate to tell us that this is worth looking into in more detail.
Slide 32
Conclusions

• A Xilinx-based DR computer shows promise for performance enhancement of a vision system
• By extension, the DR computer shows promise for the performance enhancement of other algorithms