Barto 127-MAPLD2005, Slide 1
Reconfigurable Computing: Current Status and Potential for Spacecraft Computing Systems

Rod Barto
NASA/GSFC Office of Logic Design
Spacecraft Digital Electronics
3312 Moonlight
El Paso, Texas 79904
Slide 2
Reconfigurable Computing is…

• A design methodology by which computational components can be arranged in several ways to perform various computing tasks
• Two types of reconfigurable computing:
  – Static, i.e., the computing system is configured before launch
  – Dynamic, i.e., the computing system can be reconfigured after launch
Slide 3
Static Reconfigurability

• Several examples exist, e.g., Cray
• Typically processing modules connected by an intercommunication mechanism, e.g., Ethernet
• Goals are
  – To reduce system development costs
  – To provide higher performance computing
Slide 4
Dynamic Reconfigurability (DR)

• Processing modules that can be reconfigured in flight
• Goal is to provide processing support for algorithms that do not map well onto general purpose computers, using reduced amounts of hardware
Slide 5
Outline of Paper

1. Discuss the computation of a series of algorithms on general purpose, special purpose, and DR computers
2. Calculate the execution time of an image processing algorithm on a concept DR computer
3. Compare the reconfiguration time of a Xilinx FPGA with the algorithm execution time calculated in Section 2
4. Obtain an extremely rough estimate of image processing algorithm execution time on a flight computer
5. Conclude that the DR computer described offers higher performance than does the flight computer
Slide 6
Section 1: Algorithm Execution on General Purpose (GP), Special Purpose (SP), and DR Computers
Slide 7
Processing Example

• A computing function is the composition of n algorithms executed serially
• Can be executed on a general purpose computer (GP) or a special purpose computer (SP)

[Diagram: Input → f1 → f2 → … → fn → Output]
Slide 8
Execution on a GP Computer

[Diagram: Input → f1 → f2 → … → fn → Output]

Processing time of each stage = t_i, i = 1..n
Total processing time = Σ t_i, i = 1..n
Latency time = Σ t_i, i = 1..n

GP computer must execute processing stages sequentially, and cannot exploit parallelism in the overall computing function
Slide 9
Processing on an SP Processor

[Diagram: Input → f1 → f2 → … → fn → Output]

Each stage is an independently operating processor designed specifically for the algorithm it executes.

Processing time of each stage = t_i, i = 1..n
Results appear at a rate of one per max(t_i), i = 1..n
Latency time = Σ t_i, i = 1..n

Performance increase comes from two factors:
• Pipelining of constituent algorithms, exploiting parallelism
• Processors being designed specifically for their algorithms
Slide 10
Processing on a DR Computer

• Two processing elements alternately process and reconfigure, i.e., f_odd executes one algorithm while f_even reconfigures for the next algorithm, etc.

[Diagram: Input → f_odd / f_even → Output]
Slide 11
DR Computer Processing Flow

[Timeline: f_odd executes f1 while f_even reconfigures (R); then f_even executes f2 while f_odd reconfigures; then f_odd executes f3, and so on through fn]

R = reconfiguration
Results appear at a rate of one per Σ max(t_i, R), i = 1..n
Latency = Σ max(t_i, R), i = 1..n

Performance increase comes from configuring processors specifically for the algorithm they are executing.
Do not get an increase from exploiting parallelism.
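The three timing models above can be sketched in Python. The stage times and reconfiguration time below are made-up illustrative values; the formulas are the ones on Slides 8, 9, and 11:

```python
# Timing models for a computing function of n serially composed algorithms.
# t is a list of per-stage processing times t_i; R is the reconfiguration time.

def gp_times(t):
    """General purpose: one processor executes all stages sequentially."""
    total = sum(t)
    return total, total              # (time per result, latency)

def sp_times(t):
    """Special purpose: a dedicated processor per stage, fully pipelined."""
    return max(t), sum(t)            # new result every max(t_i); latency = sum(t_i)

def dr_times(t, R):
    """Dynamically reconfigurable: two elements alternate processing and
    reconfiguring, so each stage effectively costs max(t_i, R), and there is
    no pipelining across successive inputs."""
    per_result = sum(max(ti, R) for ti in t)
    return per_result, per_result

# Illustrative (made-up) stage times and reconfiguration time, in msec:
t = [21, 10, 35, 18]
R = 81
print(gp_times(t))       # (84, 84)
print(sp_times(t))       # (35, 84)
print(dr_times(t, R))    # (324, 324)
```

Note how the DR model only pays off when R is small relative to the stage times; with R larger than every t_i, as here, reconfiguration dominates each stage.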
Slide 12
Section 2: Execution Time of an Image Processing Algorithm on a Concept DR Computer
Slide 13
DR Computer Concept

[Diagram: RAM0 ⇄ FPGA0, FPGA1 ⇄ RAM1]

• Processing elements are implemented in FPGAs
• RAM0 is the source for FPGA0 and the destination for FPGA1, etc.
• FPGA0 and FPGA1 alternately process and reconfigure, as previously discussed
• Input and output not shown
Slide 14
Algorithm Example: 3x3 Image Convolution

• Rows are shifted in from the source RAM one at a time, pixel-serial, and shifted in parallel up into the three row registers; the rows then circulate through the convolution processor. All the row registers and processing are inside the FPGA. The results are written to the destination RAM after a latency of 3 row reads.

[Diagram: source RAM feeds row i+2 one pixel at a time; rows i-1, i, i+1 (each the image width in pixels) parallel-shift up and circularly shift through the 3x3 convolution processor; results go to the destination RAM]
Slide 15
Convolution Operation

Pixel array (centered at row i, column j):      Convolution mask:

  (i-1,j-1)  (i-1,j)  (i-1,j+1)                  m11  m12  m13
  (i,j-1)    (i,j)    (i,j+1)                    m21  m22  m23
  (i+1,j-1)  (i+1,j)  (i+1,j+1)                  m31  m32  m33

Used, for example, to compute the intensity gradient (derivative) at pixel (i,j)

Result = P(i-1,j-1)*m11 + P(i-1,j)*m12 + P(i-1,j+1)*m13 + … + P(i+1,j+1)*m33
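As a cross-check of the formula, a minimal pure-Python sketch (the mask and image values are hypothetical; border pixels are simply skipped for brevity):

```python
def convolve3x3(image, mask):
    """3x3 convolution of a 2D list `image` with a 3x3 `mask`.
    Border pixels are omitted from the result for simplicity."""
    rows, cols = len(image), len(image[0])
    out = [[0] * (cols - 2) for _ in range(rows - 2)]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            acc = 0
            for di in (-1, 0, 1):        # mask row index = di + 1
                for dj in (-1, 0, 1):    # mask column index = dj + 1
                    acc += image[i + di][j + dj] * mask[di + 1][dj + 1]
            out[i - 1][j - 1] = acc
    return out

# A horizontal Sobel mask estimates the intensity gradient, as on the slide.
sobel_x = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]
img = [[0, 0, 10, 10],
       [0, 0, 10, 10],
       [0, 0, 10, 10],
       [0, 0, 10, 10]]
print(convolve3x3(img, sobel_x))   # [[40, 40], [40, 40]]
```

The vertical edge in `img` produces a uniform nonzero gradient, which is the kind of neighborhood result the FPGA pipeline on the next slide computes one pixel per clock.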
Slide 16
Convolution Calculation

• Arithmetic processing may require some pipelining

[Diagram: nine parallel multipliers, P(i-1,j-1)*m11 through P(i+1,j+1)*m33, feeding an adder tree that produces Result(i,j)]
Slide 17
Convolution Timing

• Total time = latency + processing = 20.971 msec
  – This assumes we can get pixels into the FPGA at a 20 nsec/pixel rate
  – Latency = time to read 3 rows:
    • 1024 pixels * 3 rows * 20 nsec/pixel = 61 usec
  – Processing = time to stream remaining 1021 rows through and process:
    • 1024 * 1021 * 20 nsec = 20.910 msec
• Larger convolutions (e.g., 7x7) have longer latencies, but the same computation time
• Calculation is for a mono image; a stereo image would take twice as long.
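The latency and processing figures above can be reproduced in a few lines of Python (1024x1024 mono image, 20 nsec/pixel, as the slide assumes):

```python
# Reproduce the slide's 3x3 convolution timing for a 1024x1024 image.
PIXEL_TIME = 20e-9   # 20 nsec per pixel into the FPGA (slide's assumption)
WIDTH = 1024         # image width in pixels
HEIGHT = 1024        # image height in rows

latency = WIDTH * 3 * PIXEL_TIME                 # read 3 rows before first result
processing = WIDTH * (HEIGHT - 3) * PIXEL_TIME   # stream the remaining 1021 rows
total = latency + processing

print(f"latency    = {latency * 1e6:.1f} usec")     # 61.4 usec
print(f"processing = {processing * 1e3:.3f} msec")  # 20.910 msec
print(f"total      = {total * 1e3:.3f} msec")       # 20.972 msec (slide rounds the latency first, giving 20.971)
```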
Slide 18
Section 3: Comparing the Reconfiguration Time of a Xilinx FPGA With the Algorithm Execution Time Calculated in Section 2
Slide 19
DR Computer Processing Element: Virtex-4 LX FPGA

• Eight versions:
  – XC4VLX15, -25, -40, -60, -80, -100, -160, -200
• Logic hierarchically arranged:
  – 2 flip-flops per slice
  – 4 slices per CLB

  Device   CLB array   Flip-flops
  -15      64x24       12,288
  -200     192x116     178,176
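The flip-flop counts in the table follow directly from the hierarchy above; a quick arithmetic check (CLB array sizes as given on the slide):

```python
# Flip-flop counts implied by the logic hierarchy: 2 FFs/slice, 4 slices/CLB.
FF_PER_SLICE = 2
SLICES_PER_CLB = 4

clb_arrays = {"XC4VLX15": (64, 24), "XC4VLX200": (192, 116)}

ff_counts = {name: rows * cols * SLICES_PER_CLB * FF_PER_SLICE
             for name, (rows, cols) in clb_arrays.items()}
print(ff_counts)   # {'XC4VLX15': 12288, 'XC4VLX200': 178176}
```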
Slide 20
Time to Configure FPGA

• FPGA Configuration Sequence

[Waveform: PROG_B, INIT_B, CCLK, DONE; the Tpl period is followed by the Tconfig period, and together they make up the total configuration time]
Slide 21
Configuration Timing: Tpl

• Tpl = 0.5 usec/frame
• A “frame” is a unit of configuration RAM
• The Tpl period clears the configuration RAM

  Device   Frames   Tpl
  -15      3,740    1.87 msec
  -200     40,108   20.054 msec
Slide 22
Configuration Timing: Tconfig

• FPGA programmed by bitstream
• CCLK (programming clock) can run at 100 MHz
• Parallel mode loads 8 bits per CCLK

  Device   Bitstream length, bits   Parallel loads   Tconfig
  -15      4,765,138                595,648          5.956 msec
  -200     48,722,432               6,090,304        60.903 msec
Slide 23
Total Configuration Time

• Plus some extra time amounting to a few CCLK cycles (@ 10 nsec each)

  Device   Tpl, msec   Tconfig, msec   Total configuration time, msec
  -15      1.87        5.956           7.826
  -200     20.054      60.903          80.957
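The configuration-time arithmetic on the last three slides can be reproduced directly (frame counts and bitstream lengths as given on the slides):

```python
# Virtex-4 total configuration time = Tpl (clear config RAM) + Tconfig (load bitstream).
TPL_PER_FRAME = 0.5e-6    # 0.5 usec per configuration frame
CCLK_HZ = 100e6           # programming clock rate
BITS_PER_CCLK = 8         # parallel mode loads 8 bits per CCLK

devices = {               # device: (frames, bitstream length in bits)
    "XC4VLX15":  (3_740,  4_765_138),
    "XC4VLX200": (40_108, 48_722_432),
}

totals = {}
for name, (frames, bits) in devices.items():
    tpl = frames * TPL_PER_FRAME
    tconfig = bits / BITS_PER_CCLK / CCLK_HZ
    totals[name] = tpl + tconfig
    print(f"{name}: Tpl = {tpl*1e3:.3f} msec, Tconfig = {tconfig*1e3:.3f} msec, "
          f"total = {totals[name]*1e3:.3f} msec")
```

The -15 bitstream length is not an exact multiple of 8, so the parallel-load count differs slightly from the slide's 595,648, but the resulting times agree to the slide's precision.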
Slide 24
Processing and Reconfiguration Time Comparison

• Convolution execution is faster than reconfiguration
  – Convolution = 21 msec mono, 42 msec stereo
  – Reconfiguration = 81 msec
  – Assuming -200 device
• Processing shown is well within the FPGA’s capabilities
• More complex algorithms may require use of FPGA performance features
  – Much higher internal clock rates
  – Large internal RAM
  – Dedicated arithmetic support in the -SX series
• What this shows is that it’s reasonable to consider alternating execution and reconfiguration of two FPGAs
Slide 25
Section 4: An Extremely Rough Estimate of Image Processing Algorithm Execution Time on a Flight Computer
ROUGH ESTIMATE
Slide 26
GP Computing Performance Estimate

• DANGER: really rough estimate!
• Based on data from this paper:
  – “Stereo Vision and Rover Navigation Software for Planetary Exploration”, Steven B. Goldberg, Indelible Systems; Mark Maimone, Larry Matthies, JPL; 2002 IEEE Aerospace Conference
  – Available at robotics.jpl.nasa.gov/people/mwm/visnavsw/aero.pdf
  – Describes processing and algorithms to be used on 2004 Rover missions, and Rover requirements
Slide 27
Published Vision Algorithm Timing

• Timed on Pentium III 700 MHz CPU, 32K L1 cache, 256K L2 cache, 512M RAM, Win2K
• Algorithms explicitly timed (names from paper):

  Algorithm                              C code      Vector
  Difference of Gaussian and Decimate    4229 msec   2047 msec
  Prepare Next Row                       5559 msec   1163 msec
  Inner Loop                             4360 msec   2928 msec
  Compute Sub-Pixel                      667 msec    871 msec

• The Gaussian and most vision algorithms involve neighborhood operations that are comparable to an image convolution of some size
Slide 28
Flight Computer Performance

• Flight processor is RAD6000
• GESTALT Navigation algorithm timed on 3 processors:

  Processor                      Execution Time Multiplier
  Pentium III, 500 MHz, Linux    1
  Sparc, 300 MHz, Solaris        3.0-3.5
  RAD6000, 20 MHz, VxWorks       7.7-8.7

• Assume that the RAD6000 takes 7 times as long as the 500 MHz Pentium
Slide 29
Final Performance Estimate

• Assume RAD6000 time = 7 times the 500 MHz Pentium time
• Assume 500 MHz Pentium time = 7/5 = 1.4 times the 700 MHz Pentium time
• Then, RAD6000 time is 1.4*7 = 9.8 times the 700 MHz Pentium time
• Vision algorithm timing can be estimated as follows:

  Algorithm                              Fastest time   RAD6000 time
  Difference of Gaussian and Decimate    2047 msec      20,060 msec
  Prepare Next Row                       1163 msec      11,397 msec
  Inner Loop                             2928 msec      28,694 msec
  Compute Sub-Pixel                      667 msec       6536 msec

Remember: This is a really rough estimate!!
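The scaling arithmetic behind the table is a one-liner per algorithm; a sketch using the fastest published timings and the slide's assumed multipliers:

```python
# Scale the fastest published 700 MHz Pentium III timings to a RAD6000 estimate.
# Assumptions from the slide: RAD6000 = 7x a 500 MHz Pentium,
# and a 500 MHz Pentium = 7/5 = 1.4x a 700 MHz Pentium.
multiplier = 7 * (700 / 500)       # 9.8x overall

fastest_ms = {                     # fastest of the (C code, vector) timings, msec
    "Difference of Gaussian and Decimate": 2047,
    "Prepare Next Row": 1163,
    "Inner Loop": 2928,
    "Compute Sub-Pixel": 667,
}

# Truncate to whole msec, matching the slide's table.
rad6000_ms = {name: int(t * multiplier) for name, t in fastest_ms.items()}
for name, est in rad6000_ms.items():
    print(f"{name}: {est:,} msec")
```

The same caution applies to this sketch as to the slide: the multipliers are coarse assumptions, not measurements.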
Slide 30
Section 5: Conclusions
Slide 31
What We Have Shown

• We have shown that the concept DR computer presented executes a 3x3 neighborhood-type algorithm “a lot” faster than it appears a RAD6000 executes what are probably a number of neighborhood algorithms.
• The reader is cautioned not to try to quantify what “a lot” means based on the data given here.
• But it’s a good enough estimate to tell us that this is worth looking into in more detail.
Slide 32
Conclusions

• A Xilinx-based DR computer shows promise for performance enhancement of a vision system
• By extension, the DR computer shows promise for the performance enhancement of other algorithms