www.xilinx.com/xcell/ SOLUTIONS FOR A PROGRAMMABLE WORLD Xilinx FPGAs Beam Up Advanced Radio Astronomy in Australia Xcell journal Xcell journal ISSUE 75, SECOND QUARTER 2011 High-Performance FPGAs Fly in U.K. Microsatellites Excerpt: FPGA-Based Prototyping Methodology Manual ISE 13.1 Rolls, with PlanAhead Enhancements High-Performance FPGAs Fly in U.K. Microsatellites Excerpt: FPGA-Based Prototyping Methodology Manual ISE 13.1 Rolls, with PlanAhead Enhancements page 30 Xilinx Pioneers New Class of Device in Zynq-7000 EPP Family Xilinx Pioneers New Class of Device in Zynq-7000 EPP Family
Time to market, engineering expense, complex fabrication, and board trouble-shooting all point to a ‘buy’ instead of ‘build’ decision. Proven FPGA boards from theDini Group will provide you with a hardware solution that works — on time, andunder budget. For eight generations of FPGAs we have created the biggest, fastest,and most versatile prototyping boards. You can see the benefits of this experience inour latest Vir tex-6 Prototyping Platform.
We started with seven of the newest, most powerful FPGAs for 37 Million ASICGates on a single board. We hooked them up with FPGA to FPGA busses that runat 650 MHz (1.3 Gb/s in DDR mode) and made sure that 100% of the boardresources are dedicated to your application. A Marvell MV78200 with Dual ARMCPUs provides any high speed interface you might want, and after FPGA configu-ration, these 1 GHz floating point processors are available for your use.
Stuffing options for this board are extensive. Useful configurations start below$25,000. You can spend six months plus building a board to your exact specifications,or start now with the board you need to get your design running at speed. Best ofall, you can troubleshoot your design, not the board. Buy your prototyping hardware,and we will save you time and money.
www.dinigroup.com • 7469 Draper Avenue • La Jolla, CA 92037 • (858) 454-3419 • e-mail: [email protected]
DN2076K10 ASIC Prototyping Platform
Why build your own ASIC prototyping hardware?
JTAG Mictor SATA
SMAs forHigh Speed
PCIe to MarvellCPU
All the gates and features you need are — off the shelf.
The articles, information, and other materials includedin this issue are provided solely for the convenience ofour readers. Xilinx makes no warranties, express,implied, statutory, or otherwise, and accepts no liabilitywith respect to any such articles, information, or othermaterials or their use, and any use thereof is solely atthe risk of the user. Any person or entity using suchinformation in any way releases and waives any claim itmight have against Xilinx for any loss, damage, orexpense caused thereby.
It’s springtime! The flowers are blooming, the bees are buzzing and the IC design com-munity is readying its latest wares for the beginning of its trade show season. TheEmbedded Systems Conference and Design Automation Conference are my two
favorites, as they display the latest embedded software and hardware design tools fromvendors across the world.
At this year’s ESC in San Jose (May 2-5), Xilinx will showcase its Zynq™-7000 ExtensibleProcessing Platform (EPP) in Booth #1332. I fully expect this exciting new technology—thesubject of this issue’s cover story (see page 8)—to be a game-changer for the electronicsindustry. From silicon decisions to the design software infrastructure, this extremely well-thought-out architecture will immediately snatch up sockets now held by two-chipsolutions (pairing an ARM SoC with an FPGA) as well as standalone ASICs, as process tech-nology complexity, cost and thus risk simply become more prohibitive for a growingnumber of applications. With the Zynq-7000 EPP, design teams get a high-end ARM MPUtightly coupled with programmable logic on the same IC—a powerful one-two punch thatsimply isn’t available with other SoCs.
In fact, I predict that with an architecture that boots the processor first, the Zynq-7000EPP will appeal to a much broader base of designers, opening up new design possibilitiesfor folks with software backgrounds, tinkerers as well as those traditionally designing andprogramming ICs. With a starting price at less than $15 (in volume) for the smallest of theserelatively immense devices, the Zynq-7000 EPP is simply too cool not to at least try.
In addition to the excitement of the Embedded Systems Conference, I always look for-ward to the Design Automation Conference (San Diego, June 5-10), at which the small butmighty EDA industry shows its latest wares. Over the last two years, I’ve served as a mem-ber of DAC’s Pavilion Panel organizing committee. This year I’m putting together threeintriguing panels (see http://www.dac.com/conference+program+panels.aspx).
On Monday, Herb Reiter, GSA consultant for 3-D IC programs, will moderate a panel called“3-D IC: Myth or Miracle?” Panelist Ivo Bolsens, Xilinx’s own CTO, will be discussing Xilinx’snew stacked silicon interconnect technology (see cover story, Xcell Journal Issue 74,http://www.xilinx.com/publications/archives/xcell/issue74/cover-story-stacked-loaded.pdf).Joining Ivo and Herb will be Riko Radojcic from Qualcomm and TSMC’s Suk Lee.
On Tuesday, my buddy over at EE Journal, Kevin Morris, is moderating a panel called“C-to-FPGA Tools: Ready for the Mass Market?” This event has a stellar panel, including JeffBier of BDTI, Mentor Graphics’ Simon Bloch and Synopsys’ Johannes Stahl. As the title sug-gests, they’ll be discussing the viability and market opportunities for C-to-FPGA tool flows.
Last but not least, venture capitalist Lucio Lanza will command a panel on Wednesdaycalled “What EDA Isn’t Doing Right.” This will be a no-holds-barred discussion betweenLucio and a group of power EDA users—Behrooz Abdi of NetLogic Microsystems, CharlesMatar of Qualcomm and PMC Sierra’s Alan Nakamoto—on the real pain points of designtoday and what EDA needs to do to address them and, in turn, grow the EDA market further.
To learn more, visit http://esc.eetimes.com/siliconvalley/ and http://www.dac.com/
dac+2011.aspx. There’s a lot to look forward to as the weather warms and the trade showseason begins.
X ilinx has just unveiled the first devicesin a new family built around itsExtensible Processing Platform (EPP),
a revolutionary architecture that mates a dualARM Cortex™-A9 MPCore processor with low-power programmable logic and hardenedperipheral IP all on the same device (see coverstory, spring 2010 issue of Xcell Journal,ht tp : / /www.xi l inx .com/pub l i ca t ions /
archives/xcell/Xcell71.pdf). In March of thisyear, Xilinx officially announced the first fourdevices of what it has now dubbed the Zynq™-7000 EPP family.
Implemented in 28-nanometer process tech-nology, each Zynq-7000 device is built with anARM dual-core Cortex-A9 MPCore processingsystem equipped with a NEON media engineand a double-precision floating-point unit, aswell as Level 1 and Level 2 caches, a multimem-ory controller and a slew of commonly usedperipherals (Figure 1). While FPGA vendorshave previously fielded devices with both hard-wired and soft onboard processors, the Zynq-7000 EPP is unique in that the ARM processorsystem, rather than the programmable logic,runs the show. That is, Xilinx designed the pro-cessing system to boot at power-up (before theFPGA logic) and to run a variety of operatingsystems independent of the programmable logicfabric. Designers then program the processingsystem to configure the programmable logic onan as-needed basis.
With this approach, the software program-ming model is exactly the same as in standard,fully featured ARM processor-based systems-on-chip (SoCs). Previous implementations requireddesigners to program the FPGA logic to get theonboard processor to work. That meant you hadto be an FPGA designer to use the devices. Thisis not the case with the Zynq-7000 EPP.
Xilinx’s Zynq-7000 Extensible Processing Platform family mates a dual ARM Cortex-A9 MPCore processor-based system on the same device with programmable logic and hardened IP peripherals, offering the ultimate mix of flexibility, configurability and performance.
The new product family eliminatesthe delay and risk of designing a chipfrom scratch, meaning system designteams can quickly create innovativeSoCs leveraging advanced hardwareand software programming versatilitysimply not achievable in any othersemiconductor device. As such, theZynq-7000 EPP stands poised to allowa broader number of innovators—whether they are professional hard-ware, software or systems designers,or simply “makers”—to explore thepossibilities of combining processingplus programmable logic to createapplications no one has yet imagined.
“At its most basic level, Zynq-7000EPP is an entirely new class of semi-conductor product,” said LarryGetman, vice president of processingplatforms at Xilinx. “It is not just aprocessor and it is not just an FPGA.We are combining the best of boththose worlds, and because of that wetake away many of the limitations youhave with existing solutions, especial-ly two-chip solutions and ASICs.”
Getman notes that many electronicsystems today pair an FPGA and eithera standalone processor or an ASIC withan onboard processor on the samePCB. Xilinx’s new offering will allow
companies using these types of two-chip solutions to build next-generationsystems with just one Zynq-7000 chip,saving bill-of-material costs and PCBspace, and reducing overall powerbudgets. And because the processorand FPGA are on the same fabric, theperformance increase is immense.
Zynq-7000 EPP will also speed upthe natural market migration fromASICs to FPGAs, Getman said.Implementing ASICs in the latestprocess technologies is too expen-sive and too risky for a growing num-ber of applications. As a result, moreand more companies are embracing
10 Xcell Journal Second Quarter 2011
C O V E R S T O R Y
Static Memory ControllerQuad-SPI, NAND, NOR
ARM CoreSight Multi-core and Trace Debug
Cortex-A9 MPCore32/32/KB I/O Caches
512 KB L2 Cache Snoop Control Unit (SCU)
Timer Counters 256 KB On-Chip Memory
General Interrupt Controller DMA Configuration
AMBA SwitchesAMBA Switches
System GatesDSP, RAM
Multi Standards IOs (3.3V & High Speed 1.8V) Multi Gigabit Transceivers
Cortex-A9 MPCore32/32/KB I/O Caches
Dynamic Memory ControllerDDR2, DDR3, LPDDR2
2x SDIOwith DMA
2x USBwith DMA
2x GigEwith DMA
Figure 1 – Unlike previous chips that combine MPUs in an FPGA fabric, Xilinx’s new Zynq-7000 EPP family lets the ARM processor, rather than the programmable logic, run the show.
FPGAs. Many of those attempting tohold onto their old ASIC ways areimplementing their designs in olderprocess geometries in what analystscall “value-minded SoC ASICs.” AnyASIC still requires lengthy designcycles and is at risk of multiplerespins—which can be expensive andcan delay products from going tomarket in a timely manner. “WithZynq-7000 EPP on 28 nm, the pro-grammable logic portion of thedevice has no size or performancepenalty vs. older technologies, andyou also get the added benefit of ahardened 28-nm SoC in the process-ing subsystem. With a starting pricepoint below $15, we are really mak-ing it hard for companies to justifythe cost and risk of designing anyASIC that is not extremely high vol-ume,” said Getman. “You can get yoursoftware and hardware teams up andrunning from day one. That alonemakes it hard for engineering teamsto justify staying with ASICs.”
Getman notes that ever since Xilinxannounced the architecture last year,interest in and requests for the Zynq-7000 EPP have been remarkable. “Aselect number of alpha customers arealready prototyping systems that willuse Zynq-7000 devices. The technologyis very exciting.”
SMART ARCHITECTURALDECISIONSThe Zynq-7000 EPP design team,under the direction of VidyaRajagopalan, vice president of pro-cessing solutions at Xilinx, designed avery well-thought-out architecture forthis new class of device. Above andbeyond choosing the ubiquitous and
immensely popular ARM processorsystem, a key architectural decisionwas to extensively use the high-band-width AMBA® Advanced ExtensibleInterface (AXI™) interconnectbetween the processing system andthe programmable logic. This enablesmultigigabit data transfers betweenthe ARM dual-core Cortex-A9MPCore processing subsystem andthe programmable logic at very lowpower, thereby eliminating commonperformance bottlenecks for control,data, I/O and memory.
In fact, Xilinx teamed with ARM tomake the ARM architecture an evenbetter fit for FPGA applications. “AXI4has a memory-mapped version and astreaming version,” said Rajagopalan.“Xilinx drove the streaming definitionfor ARM because a lot of IP that peo-ple build for applications such as high-bandwidth video is streaming IP. ARMdidn’t have a product that had thisstreaming interface so they partneredwith us to do it.”
Getman said another key aspect ofthe architecture is that Xilinx hard-ened a healthy mix of standard inter-face IP into Zynq-7000 EPP silicon.“We tried to choose peripherals thatwere more ubiquitous—things likeUSB, Ethernet, SDIO, UART, SPI, I2Cand GPIO are all pretty standard,” saidGetman. “The one exception is that wealso added CAN to the device. CAN isone of the more specialized hardenedcores, but it is heavily used in two ofour key target markets: industrial andautomotive. Having it hardened in thedevice is just one more comfort factorof the Zynq-7000 EPP.”
In terms of memory, Zynq-7000devices offer up to 512 kbytes of L2
cache that is shared by both proces-sors. “The Zynq-7000 EPP devices have256 kbytes of scratchpad, which is ashared memory that the processor andFPGA can both access,” said Getman.
A unique multistandard DDR con-troller supports three types of double-data-rate memory. “Where most ASSPstarget a particular segment of a market,we target LPDDR2, DDR2 and DDR3,so the user can make the trade-off ofwhether they want to go after power orperformance,” said Rajagopalan. “It is amultistandard DDR controller and weare one of the first companies to offer acontroller like this.”
In addition to being a new class ofdevice, Zynq-7000 EPP is also the lat-est Xilinx Targeted Design Platform,offered with base developmentboards, software, IP and documenta-tion to get customers up and runningquickly. Further, the company willroll out over the coming years verti-cal-market- and application-specificZynq-7000 EPP Targeted DesignPlatforms—boards or daughtercards,IP and documentation—all to helpdesign teams get products to marketfaster (see cover story, Xcell Journal
Issue 68, http://www.xilinx.com/pub-
lications/archives/xcell/Xcell68.pdf). Xilinx Alliance Program members
and the ARM Connected Communitywill also offer customers a wealth ofZynq-7000 EPP resources, includingpopular operating systems, debuggers,IP, reference designs and other learn-ing and development materials.
In addition to creating great siliconand the tools to go with it, Xilinx hasmeticulously put together user-friend-ly design and programming flows forthe Zynq-7000 EPP.
Second Quarter 2011 Xcell Journal 11
C O V E R S T O R Y
'With a starting price point below $15, we are really making it hard for companies to justify the cost and risk of designing any ASIC that is not extremely high volume.'
PROCESSOR-CENTRICDEVELOPMENT FLOWThe Zynq-7000 EPP relies on a famil-iar tool flow that allows embedded-software and hardware engineers toperform their respective develop-ment, debug and implementationtasks in much the same way as theydo now—using familiar embedded-design methodologies already deliv-ered through the Xilinx® ISE® DesignSuite and third-party tools (Figure 2).
Getman notes that software applica-tion engineers can use the same devel-opment tools they have employed forprevious designs. Xilinx provides theSoftware Development Kit (SDK), an
Eclipse-based tool suite, for embedded-software application projects. Engineerscan also use other third-party develop-ment environments, such as the ARMDevelopment Studio 5 (DS-5™), ARMRealView Development Suite (RVDS™)or any other development tools from theARM ecosystem
Linux application developers canfully leverage both the Cortex-A9 CPUcores in Zynq-7000 devices in a sym-metric-multiprocessor mode for thehighest performance. Alternatively,they can set up the CPU cores in auniprocessor or asymmetric-multi-processor mode running Linux, a real-time operating system (RTOS) like
VxWorks or both. To jump-start soft-ware development, Xilinx providescustomers with open-source Linuxdrivers as well as bare-metal driversfor all the processing peripherals(USB, Ethernet, SDIO, UART, CAN,SPI, I2C and GPIO). Fully supportedOS/RTOS board support packages withmiddleware and application softwarewill also be available from the Xilinxand ARM partner ecosystem.
Meanwhile, the hardware designflow is similar to the embedded-proces-sor design flow in the ISE Design Suite,with a few new steps for the ExtensibleProcessing Platform. The processingsubsystem is a complete dual-processor
C O V E R S T O R Y
One Processing System, Four Devices Each of the Zynq-7000 EPP family’s four devices has the exact same ARM processing system, but the programmablelogic resources vary for scalability and fit different applications (see figure).
The Cortex-A9 Multi-Processor core (MPCore) consists of two CPUs—each a Cortex A9 MPCore processor withdedicated NEON coprocessor (a media and signal-processing architecture that adds instructions targeted at audio,video, 3-D graphics, image and speech processing) and a double-precision floating-point unit. The Cortex-A9 proces-sor is a high-performance, low-power, ARM macrocell with a Level 1 cache subsystem that provides full virtual-mem-ory capabilities. The processor implements the ARMv7™ architecture and runs 32-bit ARM instructions, 16-bit and32-bit Thumb instructions, and 8-bit Java byte codes in Jazelle state. In addition, the processing system includes asnoop control unit, a Level 2 cache controller, on-chip SRAM, timers and counters, DMA, system control registers,device configuration and an ARM CoreSight™ system. For debug, it contains an embedded trace buffer, instrumen-tation trace macrocell and cross-trigger module from ARM, alongwith AXI monitor and fabric trace modules from Xilinx.
The two larger devices, the Zynq-7030 and Zynq-7040, includehigh-speed, low-power serial connectivity with built-in multigigabittransceivers operating at up to 10.3125 Gbits/second. These devicesoffer approximately 1.9 million and 3.5 million equivalent ASIC gates(125k and 235k logic cells) respectively, along with DSP resourcesthat deliver 480 GMACs and 912 GMACs respectively of peak per-formance. The two smaller devices, the Zynq-7010 and Zynq-7020,provide roughly 430,000 and 1.3 million ASIC-gate equivalents (30kand 85k logic cells) respectively, with 58 GMACs and 158 GMACs ofpeak DSP performance.
Each device contains a general-purpose analog-to-digital converter (XADC)interface, which features two 12-bit, 1-Msample/s ADCs, on-chip sensors andexternal analog input channels. The XADC offers enhanced functionality overthe system monitor found in previous generations of Virtex® FPGAs. The two12-bit ADCs, which can sample up to 17 external-input analog channels, sup-port a diverse range of applications that need to process analog signals withbandwidths of less than 500 kHz.
12 Xcell Journal Second Quarter 2011
The Zynq-7000 Extensible Processing Platform debuts with a family of four devices. All sport the same ARM processing system but vary in programmable logic resources. Gate counts range from 430,000 to 3.5 million equivalent ASIC gates.
system with an extensive set of com-monly used peripherals. Hardwaredesigners can extend the processingpower by attaching additional soft-IPperipherals in the programmable logicto the processing subsystem. Thehardware development tool XilinxPlatform Studio automates many ofthe common hardware developmentsteps and can also assist designerswith optimized device pinouts. “We’vealso added to ISE some abilities to co-debug for hardware breakpoints andcross-triggering,” said Getman. “Themost important thing for us was togive software developers and hard-ware designers their comfortabledesign environments.”
A MINDFUL PROGRAMMINGMETHODOLOGY In the Xilinx scheme of things, userscan configure the programmable logicand connect it to the ARM corethrough AXI “interconnect” blocks toextend the performance and capabili-ties of the processor system. TheXilinx and ARM partner ecosystemprovides a large set of soft AMBA inter-face IP cores for implementation in theFPGA programmable logic. Designerscan use them to build any custom func-tions their targeted applicationrequires. Because the device usesfamiliar programmable logic structuresfound in the 7 series FPGAs, designerscan load a single static programmablelogic configuration, multiple configura-tions or even employ partial reconfigu-ration techniques to allow the deviceto reprogram programmable logicfunctionality as needed on the fly.
The interconnect operationbetween the two regions of the deviceis largely transparent to the designers.
Access between master and slave isrouted through the AXI interconnectbased on the address range assigned toeach slave device. Multiple masterscan access multiple slaves simultane-ously, and each AXI interconnect usesa two-level arbitration scheme toresolve contention.
GET READY, GET IN EARLY…Customers can start evaluating theZynq-7000 EPP family today by joiningthe Early Access program. First sili-con devices are scheduled for the sec-ond half of 2011, with general engi-
neering samples available in the firsthalf of 2012. Designers can immedi-ately use tools and development kitsthat support ARM to familiarize them-selves with the Cortex-A9 MPCorearchitecture and begin porting code.
Pricing varies and depends on vol-ume and choice of device. Based onforward volume production pricing,the Zynq-7000 EPP family will have anentry point below $15 in high vol-umes. Interested customers shouldcontact their local Xilinx representa-tive. For more information, please visitwww.xilinx.com/zynq.
Second Quarter 2011 Xcell Journal 13
C O V E R S T O R Y
‘The most important thing for us was to give software developers and hardware designers their
comfortable design environments.’
TOOL DEVELOPMENT FLOW
Figure 2 – The Zynq-7000 EPP relies on a familiar tool flow for systemarchitects, software developers and hardware designers alike.
High-PerformanceFPGAs Take Flight in Microsatellites
XCELLENCE IN AEROSPACE
Artist’s rendition of the UKube1 CubeSat in flight
Second Quarter 2011 Xcell Journal 15
The UKube1 mission is the pilot mission for the U.K. SpaceAgency’s planned CubeSat program. CubeSats are a class ofnanosatellites that are scalable from the basic 1U satellite
(10 x 10 x 10 cm) up to 3U (30 x 10 x 10 cm) and beyond, and whichare flown in low-earth orbit. The typical development cost of aCubeSat payload is less than $100,000, and development time isshort. This combination makes CubeSats an ideal platform for veri-fying new and exciting technologies in orbit without the associatedoverhead or risks that would be present in flying these payloads ona larger mission. Of course, this class of satellites can present itsown series of design challenges for the engineers involved.
The EADS Astrium payload for the UKube1 mission comprisestwo experiments, both of which are FPGA based. The first experi-ment is the validation of a patent held by Astrium on random-num-ber generation. True random-number generation is an essential com-ponent of secure communications systems. The second experimentis the flight of a large, high-performance Xilinx® Virtex®-4 FPGAwith the aim of achieving additional in-flight experience with thistechnology while gaining an understanding of the device’s radiationperformance and capabilities in the low-earth orbit (LEO). Figure 1shows the architecture of the payload.
Utilizing XilinxVirtex-4 devices in a U.K. CubeSatmission presentssome interestingdesign challenges
REQUIREMENTS AND CHALLENGESDesigning for a CubeSat mission pro-vides the engineers with many chal-lenges, not least of which is the avail-able power—there is not much of it, at400 milliwatts on average in sunlightorbit for the UKube1. Weight restric-tions come in at a shade over 300grams for a 3U, 4.5-kg satellite.Combined with the space envelopeavailable for a payload, these limita-tions present the design team with an
interesting set of challenges to addressif they are to develop a successful pay-load. The engineers must also addresssingle-event upsets (SEUs) and otherradiation effects, which can affect theperformance of a device in orbitregardless of the class of satellite.
The power architecture in UKube1provides regulated 3.3-, 5- and 12-voltsupplies to each of the payloads. It ispermissible to take up to 600 mA fromeach of these rails. However, the sunlit-orbit average must be less than 400 mW.
Unfortunately, the voltages supplied arenot at levels suitable to supply today’shigh-performance space-grade FPGAs,which typically require 1.5 V or less tosupply the core voltage. For example, theVirtex-4 space-grade device we selectedfor this mission, the XQR4VSX55 FPGA,requires a core voltage of 1.2 V alongwith supporting voltages of 2.5 and 3.3 V.The configuration PROMs needed to sup-port the FPGA require 1.8 V.
We selected the XQR4VSX55 becauseit was the largest high-performance
16 Xcell Journal Second Quarter 2011
X C E L L E N C E I N A E R O S P A C E
FPGA PROM FPGA PROM RAM
3V33V3 & 1V5 Data to be checked
Address and Data
Figure 1 – The EADS Astrium payload for the UKube1 mission comprises two experiments, both of which are FPGA based.
FPGA that could be accommodatedwithin the UKube1 payload while stillachieving both the footprint and powerrequirements. The power engineer andthe FPGA engineer must give consider-able thought to the power architectureto ensure all of the power constraintsare achieved. Tools such as thePlanAhead™ and XPower Analyzersoftware are vital for the FPGA engi-neer to provide FPGA power budgets tothe power engineer. We chose high-effi-ciency switching regulators for this mis-sion to ensure we could achieve thecurrents required by the SX55 FPGAcould be achieved.
The space available to implement theFPGA and its supporting functions isvery limited, with the payload being con-strained to a PC104-size printed-circuitboard. However, mezzanine cards arepermitted provided they do not exceedthe height restrictions of 35 mm.Therefore, we developed the UKube1payload to include a mezzanine card thatcontained the FPGAs, SRAM and flashmemory, while the lower board con-tained all of the power management andconversion functionality (see Figure 2).
RADIATION EFFECTSOne of the most important aspects ofthis mission is to gain an understandingof the performance of the Virtex-4device in a LEO environment. WhileXilinx has hardened this FPGA forspaceflight, SEUs will still occur andaffect both the configuration data andthe FPGA registers, RAM and digitalclock managers (DCMs). This missiontherefore configures the FPGA to usemost of the internal logic, RAM andDCMs. We then monitor the perform-ance of this device using another one-time programmable hardened device,which passes performance statisticsback down to the ground for analysis.Interestingly, on this mission we areless concerned with SEU- and radia-tion-mitigation design techniques thanon ensuring that these events can becaptured such that they can be detect-ed by a monitoring FPGA, allowing for
the generation of real in-flight per-formance statistics.
This aspect of the design presentedsome interesting engineering chal-lenges. For example, an SEU couldaffect signals such as the DCM lockedsignal and incorrectly indicate that theDCM has been affected by an SEU,when in reality it has not. To counterthis potential problem, the FPGA designengineers came up with a method ofusing counters to monitor the DCMsand make sure they are locked to thecorrect frequency. Should the counterfreeze or increment at a different fre-quency, it is indicated by comparing itwith the other counters, allowing for anFPGA reconfiguration if required.
FUTURE DEVELOPMENTS The U.K. Space Agency will launchUKube1 in January 2012. The CubeSatwill provide data on both of our exper-iments for at least the mission lifetimeof one year. This mission will demon-strate the suitability of high-perform-ance FPGAs for use in LEO missions
and microsatellite architectures aswell as hopefully allowing the use ofhigh-performance FPGAs in other mis-sions of longer duration.
The UKube1 is expected to be justthe first of a national CubeSat pro-gram in the U.K. The architecturedeveloped for UKube1 lends itself toadaptation for use on future missionsto gain further understanding of, andexperience in, using large, high-per-formance FPGAs in orbit. Potentialadaptations for the next missionwould include a demonstration of in-orbit partial reconfiguration and in-orbit readback and verification of thedevice configuration. Unfortunately,due to the time scales involved in thedevelopment of UKube1, these exper-iments could not be incorporated onthis maiden voyage.
The UKube1 architecture using theXilinx Virtex-4 family of devices alsolends itself to evolution into a system-on-chip-based controller that couldserve as a micro- or nanosatellite mis-sion controller.
Second Quarter 2011 Xcell Journal 17
X C E L L E N C E I N A E R O S P A C E
Figure 2 – The FPGAs, SRAM and flash memory reside on a mezzanine card (top), while the power management and
conversion functionality occupies the lower board.
18 Xcell Journal Second Quarter 2011
I n many modern applications,embedded systems have to meetextremely tight timing specifica-
tions. One of these timing require-ments is the startup time—that is, thetime it takes for the electronic systemto be operative after power-up.Examples of electronic systems with atight startup-timing specification arePCI Express® products or CAN-basedelectronic control units (ECUs) inautomotive applications.
Just 100 milliseconds after poweringup a standard PCIe® system, the rootcomplex of the system starts scanningthe bus in order to discover the topolo-gy and, in so doing, initiate configura-
tion transactions. If a PCIe device is notready to respond to the configurationrequests, the root complex does notdiscover it and treats it as nonexistent.The device won’t be able to participateon the PCIe bus system. 
The automotive scenario is simi-lar. In CAN-based networks, ECUsgo into a sleep mode where they shutdown and disconnect from theirpower supply. Only a small bit of cir-cuitry remains alert to detect wake-up signals. Whenever a wake-upevent occurs, ECUs reconnect topower and start booting. While it isacceptable to miss any messagesduring the first 100 ms after the
wake-up event, afterward all ECUsmust be fully operational on the net-work (for example, CAN).
Advanced development workbetween Xilinx Automotive, XilinxResearch Labs and the KarlsruheInstitute of Technology in Karlsruhe,Germany is addressing this problemwith a two-step FPGA configurationmethodology.
The technology trends in the semi-conductor industry have enabledFPGA manufacturers to significantlyincrease the resources in their devices.But the bitstream size grows propor-tionally, and so does the time it takesto configure the device. Therefore,
Advanced research work demonstrates a new, two-step configuration method thatmeets the tight startup timing specs of automotive and PCIe applications.
even with midsize FPGAs, it is not pos-sible to meet demanding startup-tim-ing specifications using low-cost con-figuration solutions. Figure 1 showsthe configuration time for differentXilinx® Spartan®-6 FPGAs using thelow-cost SPI/Quad-SPI configurationinterface. Even when using a fast con-figuration solution (namely, Quad-SPIrunning at a 40-MHz configurationclock), only the small FPGAs meet the100-ms startup-timing specification.For Xilinx Virtex®-6 devices theresults look even more challenging,since these devices provide an abun-dance of FPGA resources.
To address this challenge, FastStartup configures the FPGA in twosteps, rather than a single (monolithic)full-device configuration. In this novelapproach, the strategy is to load onlythe timing-critical modules at power-upusing the first high-priority bitstream,and then load the non-timing-criticalmodules afterward. This techniqueminimizes the initial configuration dataand, thus, minimizes the FPGA startuptime for the timing-critical design.
FAST STARTUP VS. PARTIALRECONFIGURATIONFast Startup enables an FPGA designto start critical modules in the designas fast as possible, much faster thanin a standard full-configuration tech-nique.  Although Fast Startup fun-damentally makes use of partialreconfiguration, it differs from thetraditional concepts of this tech-nique. Partial reconfiguration intendsa full design to be used as an initialconfiguration that can be modifiedduring runtime. By contrast, FastStartup already uses an initial partialbitstream in order to configure justone specific (small) region of theFPGA at power-up. The first configu-ration contains only those parts ofthe full FPGA design that must be upand running quickly. The remainderis configured later, during runtime,using partial reconfiguration. Figure2 illustrates this sequential concept.
TOOL FLOW OVERVIEWThe tool flow for Fast Startup relieson the design preservation flow tocreate the partial bitstreams for boththe timing-critical and the non-timing-critical subsystems.
The design preservation flowdivides the FPGA design into logicalmodules called partitions. Partitionsform hierarchical boundaries, isolat-ing the modules inside from the othercomponents of your design. A parti-tion, once implemented (that is,
placed and routed), can be importedby other implementation runs in orderto implement the modules of the parti-tion in exactly the same way in eachinstance. 
Therefore, the first step in using theFast Startup technique is to separatethe complete FPGA design into twosections: a high-priority partition con-taining the timing-critical subsystemand a low-priority partition for the restof the components.
IMPLEMENTING THE HIGH-PRIORITY PARTITIONTo obtain a partial bitstream for thehigh-priority partition that is as smallas possible, there are some generaldesign considerations to take intoaccount. First of all, the partitionmust include only components thatare timing-critical or that the systemneeds to perform the partial reconfig-uration of the low-priority section(for example, the ICAP). The key toobtaining a small initial partial bit-
stream is to use an area as small aspossible to implement the high-priori-ty partition. That means you mustconstrain this partition to an appro-priate region in the FPGA.
In order to find a good physicallocation on the FPGA, this area shouldprovide just the right amount ofresources you need for the design.Accessing resources located outsideof this area is possible but discour-aged—though generally will be
X C E L L E N C E I N A U T O M O T I V E A P P L I C AT I O N S
SPI x 1 – cclk 2MHz
SPI x 1 – cclk 40MHz
SPI x 4 – cclk 2MHz
SPI x 4 – cclk 40MHz
LX9 LX16 LX25T LX45T LX75T LX100T LX150T
Figure 1 – Logarithmic illustration of calculated Spartan-6 configuration times (worst-case calculation)
Device Power-Up Timing-CriticalSubsystem Running
Figure 2 – Fast Startup concept: a sequential configuration
inevitable in the case of I/O pins. Whenlooking for an appropriate area, alsokeep in mind that this FPGA regionmight block resources for the non-tim-ing-critical part of the FPGA design.
When you have partitioned theFPGA and found appropriate areas forthe partitions, the next step is toimplement the high-priority partition,using an empty (black box) low-prior-ity partition. The resulting bitstreamcontains a lot of configuration framesfor resources that are not used. Youcan remove those frames in order toget a valid partial bitstream for the ini-tial configuration of the high-prioritypartition. 
IMPLEMENTING THE LOW-PRIORITY PARTITIONTo create the partial bitstream for thelow-priority partition, first you have tocreate an implementation of the fullFPGA design containing both parti-tions, the high and the low priority.Import the high-priority partition fromthe previous implementation in orderto make sure it is implemented in thesame way as before.
For Virtex-6, you will use the partialreconfiguration (PR) flow for all thedescribed implementations. As aresult, the partial bitstream for thelow-priority partition is obtained auto-matically. Since the Spartan-6 device
family does not support the PR flow, weused the BitGen option for difference-based partial reconfiguration to obtainthe partial bitstream for the low-prioritypartition when implementing FastStartup for our Spartan-6 design. Figure 3 gives a high-level overview ofthe tool flow.
EXPERIMENTS AND RESULTSFor verifying the Fast Startup configura-tion technique in hardware, our teamimplemented it on a Virtex-6 ML605 boardas well as on a Spartan-6 SP605 board.
The application scenario for a Virtex-6 implementation derives from thevideo domain. Whenever users power
20 Xcell Journal Second Quarter 2011
X C E L L E N C E I N A U T O M O T I V E A P P L I C AT I O N S
Build Full Bitstream Build Partial Bitstream
Build Partial Bitstream
Figure 3 – Fast Startup tool flow
AXI 2 AXI
Memory High-Priority Partition
CAN / TFTController
Clocks for P2design
Figure 4 – Basic block diagram of Virtex-6 and Spartan-6 demonstrator (the Virtex-6 includes the TFT block; the Spartan-6 includes the CAN block only)
up a video system they want to see animmediate response, and not waitseveral seconds. Therefore, in thesystem illustrated in Figure 4, a high-priority processor subsystemequipped with a TFT controller bringsup a TFT screen as quickly as possi-ble. For additional, low-priority appli-cations, the second design providesaccess to an Ethernet core, UART andhardware timer.
For this demonstrator we used anexternal flash memory with BPI as ourconfiguration interface. As soon as theinitial high-priority bitstream config-ures the processor subsystem, soft-ware running out of BRAM initializesthe TFT controller and writes data intothe frame buffer located in the DDRmemory. This ensures the screen willshow up on the TFT as fast as possibleat startup. Afterward, the second bit-stream is read from the BPI flashmemory and the low-priority partitionis configured to enable the processorsubsystem to run other applications,such as a Web server.
For easy expansion and a clean sep-aration of the two partitions, we usedan AXI-to-AXI bridge. This also mini-mized the nets crossing the border ofthe two design partitions. The low-pri-ority partition shares its system clockwith the high-priority partition.
Table 1 shows the FPGA resourceutilization while Table 2 presents con-figuration times for the traditionalstartup, the startup with a com-pressed bitstream of the high-prioritypartition only  and the Fast Startupconfiguration technique. In all cases,the configuration interface wasBPIx16, while the applied configura-tion rate—an option to determine thetarget configuration clock frequen-cy—was 2 and 10 MHz. We measuredthe data by using an oscilloscope tocapture the “init” and the “done” sig-nals of the FPGA.
The “Compressed” column in Table2 represents a compressed bitstreamof the high-priority partition only. Acompressed bitstream of the fullFPGA design containing both parti-tions would be 3.1 Mbytes.
SPARTAN-6 AUTOMOTIVE ECU DESIGN To verify the Fast Startup techniquefor Spartan-6, we chose the ECUapplication scenario from the automo-tive domain. Whenever you find anFPGA in an automotive electroniccontrol unit, usually it is used by the
Second Quarter 2011 Xcell Journal 21
X C E L L E N C E I N A U T O M O T I V E A P P L I C AT I O N S
Resource PartitionType High Priority % Low Priority %
Flip-Flops 8,849 2.9 1,968 0.7
Lookup Tables 7,039 4.7 2,197 1.5
I/Os 135 22.5 20 3.3
RAMB36s 34 8.2 2 0.5
XC6VLX240T Configuration Technique
Configuration Traditional Compressed Fast StartupInterface 8.9 MB 2.0 MB 1.4 MB
BPIx16 CR2 1,740 ms 389 ms 278 ms
BPIx16 CR10 450 ms 112 ms 84.4 ms
Table 1 – Occupied FPGA resources for XC6VLX240T
Table 2 – Measured configuration times for Virtex-6 video design
WAKEUP EVENTS(COMMS AND/OR DISCRETE)
Stage 1: Ultra LowCurrent Circuitry
V+ V+ V+
Main Power Supply Stage
Figure 5 – FPGA use in today’s automotive ECUs and with the processor incorporated into the FPGA (dotted lines)
main application processing unit of theECU only (see Figure 5). Our goal wasto implement a design that also put thesystem processor inside the FPGA. Bydoing so, we could eliminate the needfor an external processor, therebyreducing overall system cost, complex-ity, space and power consumption.
SYSTEM PARTITIONINGFor this scenario, the partitioning ofthe system is obvious. We separatedour ECU design into a system proces-sor part as our high-priority partitionand an application-processing part asthe low-priority partition.
This design had a lot of similaritiesto the Virtex-6 design, but instead ofBPI, we used SPI as an interface tothe external flash memory andinstead of a TFT controller, a CANcontroller was necessary. Uponpower-up, the system controller hasonly a limited amount of time to bootand be ready to process the first com-munication data. For ECUs using theCAN bus for communications thisboot time limit is typically 100 ms.With traditional configuration tech-niques it is hard to beat this tight tim-ing specification using a big Spartan-6with a low-cost configuration inter-face like SPI or Quad-SPI. But using afaster and therefore more expensiveconfiguration interface is unaccept-able in the automotive domain.
MEASUREMENT SETUPFor the SP605 automotive ECUdemonstrator we performed measure-ments in our labs; Figure 6 presents
the measurement setup. On the leftside is an X1500 automotive platformbased on a Spartan-3 implementing atraffic generator for the CAN bus,which is able to send and receive CANmessages and measure time betweenthem using hardware timers. On theright side is the target platform, whichis not connected directly to the CANbus but uses the CAN transceiver froman additional custom board. Besidesproviding a CAN PHY, this customboard also controls the power supplyof the target board.
The procedure to measure the con-figuration time starts with the traffic
generator in idle status, the CAN trans-ceiver on the CAN PHY board in sleepmode and therefore the SP605 discon-nected from power. In the next step,the traffic generator starts a hardwaretimer and sends a CAN message.Recognizing the activity on the CANbus, the CAN PHY awakens and recon-nects the SP605 to the power supply.The FPGA then starts to load the initialbitstream from SPI flash.
Because there is no receiveracknowledging the message sent bythe traffic generator, the message isresent immediately until the FPGAhas finished its configuration andalso configured the CAN core withthe valid baud rate. Whenever theCAN core of the Spartan-6 designacknowledges the message, the CANcore of the traffic generator triggersan interrupt, which stops the hard-ware timer. This timer is now holdingthe boot time for the SP605 design.Measurements, which included anadditional hardware timer inside theSP605 design, have shown that whenexecuting the software to configure
X C E L L E N C E I N A U T O M O T I V E A P P L I C AT I O N S
Figure 6 – Measurement setup for the automotive ECU
Resource PartitionType High Priority % Low Priority %
Flip-Flops 3,480 6% 1,941 4%
Lookup Tables 3,507 13% 1,843 7%
I/Os 58 20% 20 7%
RAMBs 12 10% 2 2%
Table 3 – Occupied FPGA resources in Spartan-6 design
Configuration TechniqueConfiguration Traditional Compressed Fast Startup
Interface 1,450 KB 920 KB 314 KB
SPIx1 CR2 5,297 ms 3,382 ms 1,157 ms
SPIx1 CR26 292 ms 196 ms 85 ms
SPIx2 CR2 2,671 ms 1,699 ms 596 ms
SPIx2 CR26 161 ms 113 ms 58 ms
SPIx4 CR2 1,348 ms 872 ms 311 ms
SPIx4 CR26 97 ms 73 ms 45 ms
Table 4 – Measured Spartan-6 configuration times
22 Xcell Journal Second Quarter 2011
the CAN core from internal BRAMmemory, the software startup timewas negligible.
Table 3 presents the FPGA resourceconsumption for each partition. Thepercentage information refers to thetotal amount of available resources ofthe used XC6S45LXT device.
Table 4 shows the results of theconfiguration time measurements.For these measurements, we imple-mented and compared a standard bit-stream of the full design, a com-pressed bitstream of the full designand the Fast Startup technique usinga partial initial bitstream. The tablelists the configuration times for dif-ferent SPI bus widths and differentconfiguration rate (CR) settings. Asexpected, the configuration times areproportional to the bitstream sizes.Because using a fast configurationclock does not affect the houseclean-ing process, the ratio in percentagechanges for high-CR settings.
VERIFIED IN HARDWAREThe advanced configuration techniquethat we developed might be called pri-oritized FPGA startup, since it config-ures the device in two steps. Such atechnique is essential to address thechallenge of increasing configurationtime in modern FPGAs and to enabletheir use in many modern applica-tions, like PCI Express or CAN-basedautomotive systems.
Besides proposing the high-priorityinitial configuration technique, we veri-fied it in hardware. We have used andtested the tool flow and methods forFast Startup to implement a CAN-basedautomotive ECU on a Spartan-6 evalua-tion board (the SP605) as well as avideo design on a Virtex-6 prototypingboard. By using this novel approach,we decreased the initial bitstream size,and hence, achieved a configurationtime improvement of up to 84 percentwhen compared with a standard full-configuration solution.
Xilinx will support the FastStartup concept for PCI Express
applications in the software for thenew 7 series FPGAs, with improvedimplementation techniques to simpli-fy its use. In the 7 series, the new two-stage bitstream method is the simplestand least expensive technique toimplement. The user directs theimplementation tools to create a two-stage bitstream via a simple softwareswitch when building the FPGAdesign. The first stage of the bitstreamcontains just the configuration framesnecessary to configure the timing-crit-ical modules. When configured, anFPGA STARTUP sequence occurs andthe critical blocks becomes active,thus easily satisfying the 100-ms tim-ing requirement. While the timing-crit-ical modules are working (e.g., PCIExpress enumeration/configurationsystem process is occurring), theremainder of the FPGA configurationgets loaded. The two-stage bitstreammethod can use an inexpensive flashdevice to hold the bitstreams.
 PCI Express Base Specification, Rev.
1.1, PCI-SIG, March 2005
 M. Huebner, J. Meyer, O. Sander, L.
Braun, J. Becker, J. Noguera and R.
Stewart, “Fast sequential FPGA startup
based on partial and dynamic reconfigu-
ration,” IEEE Computer Society Annual
Symposium on VLSI (ISVLSI), July 2010
 Hierarchical Design Methodology
Guide, UG748, v12.1, Xilinx, May 2010
 B. Sellers, J. Heiner, M. Wirthlin and
J. Kalb, “Bitstream compression through
frame removal and partial reconfigura-
tion,” International Conference on Field
Programmable Logic and Applications
(FPL), September 2009
 J. Meyer, J. Noguera, M. Huebner, L.
Braun, O. Sander, R. Mateos Gil, R.
Stewart, J. Becker, “Fast Startup for
Spartan-6 FPGAs using dynamic partial
reconfiguration,” Design, Automation
and Test in Europe (DATE ‘11), 2011
 “Fast Configuration of PCI Express
Technology through Partial
Reconfiguration,” XAPP883, v1.0, Xilinx,
November 2010, http://www.xilinx.com/
Second Quarter 2011 Xcell Journal 23
X C E L L E N C E I N A U T O M O T I V E A P P L I C AT I O N S
Over the last two decades, many supercom-puter architectures have included bothmicroprocessors, which serve as the main
brains of the system, and FPGAs, which typicallyoffload some of the compute tasks from thoseCPUs. In fact, this pairing of FPGAs with standalonemicroprocessors has proven to be a winning combi-nation for many generations of supercomputers.
But recently, our group of scientists at the KalyaevScientific Research Institute of MultiprocessorComputer Systems at Southern Federal University(Taganrog, Russia) designed a reconfigurable com-puter system (RCS) based largely on Xilinx® FPGAs.By interconnecting multiple FPGAs, we can createspecial-purpose computational structures that arewell suited to solving problems in a broad numberof application areas.
An FPGA-based RCS has several interestingdesign considerations and advantages. At its heartare several FPGAs that we’ve united to form a singlecomputational system. For our reconfigurable sys-tems we use an orthogonal communication system,placing FPGAs in the rectangular lattice points.Direct links interconnect adjacent FPGAs.
In addition, we create a parallel-pipeline compu-tational structure in the FPGA computation field byusing computer-aided design tools and tools forstructural programming. The architecture of thisparallel-pipeline computational structure is similarto the task structure, and this similarity ensures thehighest performance of the RCS.
If hardware limitations do not allow mapping ofthe entire informational graph of a task we are try-ing to perform or a problem we are attempting tosolve, then the system will automatically split thegraph into a number of subgraphs. The system’s
X C E L L E N C E I N H I G H - P E R F O R M A N C E C O M P U T I N G
By interconnecting multiple Xilinx FPGAs, designers
can build a new kind of supercomputer and tailor
it for jobs in a wide range of application areas.
computation field sees each of thesesubgraphs as individual computa-tional structures, organizes the sub-graphs sequentially and solves themsequentially. Figure 1 illustrates thecomputational structures flow of theRCS and the procedures of datainput and output. As you can see,the size of each subgraph is propor-tional to the size of an FPGA field.Tools we have developed for struc-tural programming automaticallysplit a problem into a smaller num-ber of larger subgraphs and mapthem into hardware for efficientsolving, thus dramatically speedingRCS performance.
In more detail, the system inter-prets the informational graph of agiven task or problem as a sequenceof isomorphic basic subgraphs, eachof which can be independent ordependent (that is, information-related) on one another. The systemtransforms each informational sub-graph into a special computationalstructure called “cadr.” Essentially, acadr is a hardwired task subgraphwith an operands flow runningthrough it. Each group of operands(or results) corresponds to input (oroutput) nodes of the specified sub-graph in the sequence. The specialparallel program (or “procedure,” aswe call it) executes the sequence ofcadrs. Only one parallel program isneeded for the whole RCS.
We designed our RCS for high-per-formance computing in a modulararchitecture. The RCS consists ofbasic modules that host fragments ofthe computation field and have con-nections with other modules andauxiliary subsystems. Figure 2shows several basic modulesdesigned at the Kalyaev ScientificResearch Institute of MultiprocessorComputer Systems at SouthernFederal University (SRI MCS SFU).
The FPGAs in the basic modulescommunicate via LVDS channels run-ning at a frequency of 1.2 GHz. Allmodules in an RCS communicate by
26 Xcell Journal Second Quarter 2011
X C E L L E N C E I N H I G H - P E R F O R M A N C E C O M P U T I N G
Information Graph of Task
Basic Module 16V5-75140 Gflops, Virtex-5
XC5VLX110-FFG1153,11*106 gates, 2008
Basic Module 16V5-5050 Gflops, Virtex-4
XC4VLX40-FF1148,11*106 gates, 2006
Basic Module 16V5-125OR250 Gflops, Virtex-5
XC5VLX110-FFG1153,11*106 gates, 2009
Figure 2 – Several basic RCS modules, all built around Virtex FPGAs
Figure 1 – Computational structures flow of the reconfigurable computer system (RCS)
means of these channels. Each basicmodule also includes discrete distrib-uted memory units along with a basicmodule controller to perform resourcemanagement, including loading frag-ments of parallel applications into thedistributed memory controllers andhandling data input and output.
Our 16V5-75 is the main module inour RCS family. The RCS series con-sists of systems with various per-formance levels ranging from 50gigaflops (using 16 Xilinx Virtex®-5FPGAs) to 6 teraflops (1,280 Virtex-5FGPAs). Our highest-end modelholds five ST-1R racks, each rackcontaining four 6U blocks of RCS-0.2-CB and each block consisting of fourbasic 16V5-75 modules. The peak per-formance of the basic module is up to200 gigaflops for single-precisiondata. So, the computation field of theST-1R rack contains 256 Virtex-5FPGAs (XC5VLX110) connected byhigh-speed LVDS (Figure 3).
Our Orion RCS (Figure 4), mean-while, boasts performance of 20 ter-aflops. It consists of 1,536 Virtex-5FPGAs placed in a 19-inch rack. Themain element of the system is a basicmodule 16V5-125OR with performanceof 250 gigaflops. Basic modules areconnected into a single computationalresource via high-speed LVDS. Fourbasic 16V5-125OR modules are con-nected within a 1U block. In contrastto the basic module 16V5-75, whichhas only four of 16 FPGAs that can beconnected with other basic modules,all 16 FPGAs of the basic module16V5-125OR have LVDS for computa-tion field expandability. This provideshigh data-exchange rates between thebasic modules within the Orion block.
Table 1 shows the real perform-ance of the block RCS-0.2-CB and
Orion at solving various tasks: digitalsignal processing, linear algebra andmathematical physics.
PROGRAMMING THE RCSRCS programming significantly differsfrom programming traditional cluster-
Second Quarter 2011 Xcell Journal 27
X C E L L E N C E I N H I G H - P E R F O R M A N C E C O M P U T I N G
Tasks Specific performance(Gflops/dm3)
Real performance at task solving(Gflops)
RCS-0.2-CB 16.7 647.7 423.2 535.2
Orion 64.9 809.6 528.7 669.0
Computational Field of Rack ST-1R256 FPGAs: Virtex-5 XC5VLX110
16 Basic Modules16V5-75
Figure 3 – Computation field of the ST-1R rack containing 256 Virtex-5 FPGAs connected by high-speed LVDS
Table 1 – Performance of the block RCS-0.2-CB and the computing block Orion
We designed our RCS for high-performance computing in a modular architecture. It consists of basic modules that host fragments of the computation field
and have connections with other modules and auxiliary subsystems.
based supercomputer architectures.Traditional supercomputer architec-tures have fixed hardware circuitryand can only be programmed at a soft-ware level, traditionally by softwareengineers. An FPGA-based RCS hasprogrammable hardware as well soft-
ware. Therefore, we divide the RCSinto two kinds of programming: struc-tural and procedural. Structural pro-gramming is the task of mapping theinformational graph into the RCS struc-ture’s computational field (program-ming the FPGAs). Procedural program-ming is traditional programming inwhich users describe the organizationof computational processes in the sys-tem. Typically, software engineers havestruggled when it comes to program-ming FPGAs, and so to competently do
the job they must have some hardwaredesign skills.
Knowing this, we sought to make theRCS easier to program by developing aspecial software suite. The suite allowsusers to program the system withoutrequiring any specialized knowledge of
FPGA and hardware design. Using thissuite, the software developer will getthe solution two to three times fasterthan an engineer who designs a circuitsolution of the task and uses tradition-al methods and tools.
Based on the development environ-ment Fire!Constructor for circuitdesign, the suite includes a high-levelprogramming language called COLAMOand the Argus integrated developmentenvironment. Users create programs inCOLAMO and a built-in compiler/trans-
lator/synthesizer automatically cre-ates code for both the structural andprocedural components, with the helpof an IP library of computational unitsand interfaces.
Libraries of basic module descrip-tions, unit descriptions and a descrip-
tion of the entire RCS (RCS passport)make it possible to port parallel applica-tions to the RCS from various architec-ture platforms. The COLAMO translatorbreaks the system description downinto four data categories: control, proce-dural, stream and structural.
The control data is translated intoPascal and is executed in master con-trollers, which, depending on the RCShardware platform, are contained inbasic modules or computing blocks.The control data configures the FPGAs
28 Xcell Journal Second Quarter 2011
X C E L L E N C E I N H I G H - P E R F O R M A N C E C O M P U T I N G
Figure 4 – The 20-Tflops Orion computing block
We sought to make the RCS easier to program by developing a special software suite. The suite allows users to program the system without
requiring any specialized knowledge of FPGA and hardware design.
for the computational tasks at hand,programs distributed memory andmaps the data exchange betweenblocks and basic modules.
Meanwhile, the procedural andstream components are translated intothe Argus assembler language. Theyare executed by distributed memorycontrollers, which specify the cadrs’execution and create parallel dataflows in the cadrs’ computationalstructures.
The most interesting feature of thesuite is the synthesis tool calledFire!Constructor that we developed tohelp users program the FPGAs that lieat the heart of the RCS. Using the IPlibrary, Fire!Constructor generates aVHDL description (*.vhd file) for eachFPGA of the RCS. The Xilinx ISE®
suite uses these files to create configu-ration files (*.bit) for all the FPGAs.
EFFECTIVE PROBLEM SOLVINGOwing to the principle of structuralprocedural calculations, we caneffectively solve various problems onthe RCS. The most suitable are theproblems that have an invariableinformation graph of the algorithmand variable sets of input data. Forinstance, problems of drug design,digital signal processing, mathemati-cal simulation and the like lend them-selves to RCS solution.
A spin-off company called ScientificResearch Centre of Supercomputersand Neurocomputers (Taganrog,Russia) delivers these reconfigurablesystems, manufacturing each RCSaccording to preorder specifications.The company can manufacture about100 product units per month.Performance of our RCS, based onVirtex-5 and Virtex-6 FPGAs, is from100 Gflops to 10 Tflops.
Overall, FPGA-based reconfig-urable computer systems offer uniqueadvantages for solving tasks of variousclasses. With an RCS, users can tailorcomputation to quickly tackle them.To learn more about our RCS, visithttp://parallel.ru/index_eng.html.
Second Quarter 2011 Xcell Journal 29
X C E L L E N C E I N H I G H - P E R F O R M A N C E C O M P U T I N G
Get on Target
Hit your target by advertising your product or service in the Xilinx Xcell Journal, you’ll reach thousands of qualified engineers, designers, and engineering managers worldwide.
The Xilinx Xcell Journal is an award-winning publication, dedicated specifically to helping programmable logic users – and it works.
We offer affordable advertising rates and a variety of advertisement sizes to meet any budget!
As the largest and most sensitive telescope everenvisioned, the Square Kilometre Array (SKA), aradio telescope under development by a consor-
tium of 20 countries, will revolutionize our knowledge ofthe universe through large-area imaging from 70 MHz to a10 GHz when it is deployed in the middle of the nextdecade. On the path to the development of the SKA, theAustralian SKA Pathfinder (ASKAP) is at a key SKA sci-ence frequency band of 700 MHz to 1.8 GHz. ASKAP seeksto, achieve instantaneous wide-area imaging through thedevelopment and deployment of phased-array feed sys-tems on parabolic reflectors. This large field of viewmakes ASKAP an unprecedented survey telescope poisedto achieve substantial advances in SKA key science.
The Commonwealth Scientific and Industrial ResearchOrganisation, or CSIRO, is an Australian government-fund-ed research agency. A division called CSIRO Astronomyand Space Science (CASS) is undertaking construction ofthe ASKAP (see www.csiro.au/org/CASS.html). The com-bination of location, technological innovation and scientif-ic program will ensure that ASKAP is a world-leading radioastronomy facility, closely aligned with the scientific andtechnical direction of the SKA. CSIRO is also involved inthe development of the SKA, which is anticipated to beoperational in 2024.
Currently under construction in outback WesternAustralia at the Murchison Radio Astronomy Observatory(MRO), ASKAP will consist of an array of thirty-six 12-meterdishes (Figure 1), each employing advanced phased-arrayfeeds (PAFs) to provide a wide field of view: 30 squaredegrees. As opposed to conventional antenna optics, PAFsallow for multiple configurable beam patterns using digital
Researchers in Australia are using Virtex-6 FPGAs to economically meet the demanding requirements of an advanced telescope called ASKAP.
XCELLENCE IN SCIENTIFIC APPLICATIONS
by Grant Hampson, PhDASKAP Digital Systems Lead EngineerCSIRO Astronomy and Space [email protected]
John Tuthill, PhDHardware and Firmware EngineerCSIRO Astronomy and Space [email protected]
Andrew BrownHardware and Firmware EngineerCSIRO Astronomy and Space [email protected]
electronics to implement beam-formers. ASKAP’s location in theSouthern Hemisphere is advanta-geous for astronomy. Combinedwith a very large radio quiet zone(an area where radio frequencyemissions are managed), theresult will be a radio telescopewith unprecedented performance.
ASKAP’s wide field of view andbroad bandwidth will enableastronomers to survey the skyvery quickly, but present a signal-processing challenge two ordersof magnitude greater than that fac-ing existing telescopes. To satisfyperformance demands of approxi-mately 2 peta operations/secondwith 100-Tbit/s communications,Xilinx® FPGAs deliver a well-bal-anced mix of these features.FPGAs provide an economicalsolution for both power consump-tion and capital cost comparedwith other alternative solutions,such as high-performance com-puting (HPC).
STATEMENT OF THE PROBLEMThere are two major problems tosolve with ASKAP: huge com-pute capacity and very high-band-width data communications. It iseasy to find a solution to one, but notto both challenges simultaneously.The Xilinx Virtex®-6 family of FPGAsprovides an excellent balance of I/Obandwidth to computational capacityand is well suited to the repetitiveparallel operations in filter banks andbeamformers.
The ASKAP design is also con-strained by a number of other factorsthat influence the implementation.Power costs at the remote locationand the climate provide extra chal-lenges for cooling instrumentation.Radio astronomy instrumentation alsorequires an operating environmentvery low in electromagnetic interfer-ence (EMI)—telescopes are sensitive
to extremely low levels of radiation.The technological advances in Virtex-6FPGAs that reduce operating voltagesand I/O swings help reduce EMI further.A second advantage in using Virtex-6FPGAs is that the resulting lower-powersystems require less cooling, a doublesaving for an astronomy instrumentthat operates 24/7.
SYSTEM DESCRIPTIONWe have taken the divide-and-conquerapproach to determine a cost-effec-tive on-time solution. Figure 2 illus-trates that there are two digital signal-processing locations—antenna-basedelectronics and a central processingsite. The antenna-based electronicssample and process the outputs of 188PAF receptors plus four calibration
signals. It transports the dataover optical fiber to the cen-tral processing site.
In the central site, beamform-ers for each antenna form up to36 dual-polarization beams onthe sky. The outputs of all theantenna beamformers then con-nect to a correlator, whichforms complex-valued visibili-ties that are used to createimages of the sky.
Within each antenna are anumber of receiving cards thatprocess a small number of PAFreceptors or, in the case of thebeamformer, a small frequencyslice of the overall input band-width. This technique dividesthe data communications prob-lem while also reducing the costof each processing card.
DIGITAL RECEIVER DESIGNThe digital receiver subsystemcontains hardware to samplethe intermediate-frequency(IF) signals from the PAF andfirmware to divide the IFbandwidth into 1-MHz bandsand transport the resulting
data to the beamformer at the centralsite. The subsystem is divided intofour 3U x 160-mm Eurocard sub-racks, each consisting of 12 digitalreceiver cards, a power card and acontrol interface card. The digitalreceiver system is shown in Figure 3.
The power card is responsible forconverting a 48-VDC input to the volt-ages required by the digital receivercards. The control card contains a 1-Gbit Ethernet port to allow local con-trol and monitoring over UDP packets.We implemented the relatively simplerequest/response control firmware ona low-cost Virtex-5 XC5VLX30T-1FF665C FPGA. Control and moni-toring links fan out from the controlcard to the digital receiver cards viathe backplane.
X C E L L E N C E I N S C I E N T I F I C A P P L I C AT I O N S
Figure 1 – The first ASKAP reflector surface is beinginstalled in Western Australia; researchers will later
add a phased-array feed at the focus of the antenna.
Each digital receiver card sam-ples four analog IF receptor signalsfrom the PAF. The card consists ofthree major sections: analog-to-digi-tal converters, a Virtex-6 LX130TFPGA and 10-Gbps optical datatransmission components. Sampling
is performed using two dual-input 8-bit ADCs from the semiconductorcompany e2v, sampling at a rate of768 Msamples/s and providing a totalbandwidth of 384 MHz. The XilinxFPGA receives data from the ADCsas four source-synchronous 384-MHzDDR data streams. IDELAYs onthese data lines are calibrated usingan ADC test pattern.
The ADC data is piped into a cus-tom Radix-3 384-channel polyphasefilter bank, resulting in 1-MHz-widefrequency channels. The polyphase fil-ter bank oversamples the channels ata rate of 32/27 and thus runs at a clockrate of 303.4 MHz. The oversamplingoperation of the filter bank ensuresthat we lose minimal information atthe channel edges prior to the down-
stream fine-filter-bank operation thatfollows the beamformer.
We chose the Xilinx Virtex-6XC6VLX130T-1FF1156C for the digitalreceiver because it had sufficient multi-pliers and memory to fit this particularsize of polyphase filter bank. There is
some spare capacity for a number ofmonitoring functions such as his-tograms and spectral integrators. Formany radio astronomy applicationsthe ratio of RAM to DSP is typically1:1, but the relationship must be pro-portional to I/O bandwidth. The low-est-speed-grade device achieves 384-MHz performance at the lowest cost,and the footprint is compatible withfive other larger devices.
ANTENNA TO CENTRAL-SITEDATA COMMUNICATIONSOne of the significant challenges forthe ASKAP project team is to transportand distribute the vast amounts of real-time data flowing from the PAFs on thearray of antennas through to the digitalsignal-processing chain at the central
site. An overview of the data transportand distribution architecture for one ofthe 36 PAFs is shown in Figure 4.
Data leaves the digital receiver’sVirtex-6 FPGA by means of 16 GTXtransceivers, each operating at a linerate of 3.1875 Gbps. A 159.375-MHz
reference clock for the serializers isgenerated locally on each digitizercard, so there is no common clock ref-erence that could correlate and causeartifacts in the data. Each link carries1/16 (19 MHz) of the total processingbandwidth for the four PAF ports thedigital receiver card processes. Sinceeach beamformer card also processes1/16 of the bandwidth, each outputlink from the digitizer FPGA is des-tined for a different beamformer card.The serial output data from the digitalreceiver FPGA is packetized andaggregated into a custom XAUI (10GAttachment Unit Interface) packetstructure. Since the 16 links are unidi-rectional, point-to-point intercon-nects, there is no need for a full pack-et-switched protocol.
32 Xcell Journal Second Quarter 2011
X C E L L E N C E I N S C I E N T I F I C A P P L I C AT I O N S
Central Processing Site
Optical Fiber BeamformingEngines
Figure 2 – Up to 8 km of fiber separates the antenna-based processing from the commoncentral processing site of the beamformers and correlators.
The 16 serial output links from theFPGA are grouped into four sets offour lanes, each with XAUI frameencoding. This passes to a NetlogicXAUI transceiver device onboard thedigital receiver card that reserializesthe four input streams to a 10.51875-
Each antenna has 192 (48 digitiz-ers/antenna x 4 fibers/digitizer) single-mode fibers connecting to the central
site, for a total data rate of 2 Tbps. Thefiber spans vary from hundreds ofmeters for the closest antennas up to 8km for the farthest.
BEAMFORMER DATA ROUTINGTwo key components of the data-rout-ing and distribution architecture arethe Vitesse 6.5-Gbps 72 x 72 cross-point switch and the industry-stan-dard Advanced TelecommunicationsComputing Architecture (ATCA)backplane. Together, these compo-nents make it possible to route indi-vidual 3.1875-Gbps serial data streamsfrom the digital receivers directly tothe appropriate beamformer card forfurther processing.
We use discrete ROSA (receiveroptical subassembly) devices to con-vert the 10-Gbps optical data signalsfrom the antennas back to electricalsignals, then deserialize them to four3.1875-Gbps signals via the same XAUItransceiver device used on the digitalreceiver. This circuitry resides on theATCA rear transition module, witheach RTM designed to accept 12 x 10-Gbps data streams.
The ATCA backplane supports up to16 cards, each with four duplex 3-Gbpsdata links to every other card, meaning
X C E L L E N C E I N S C I E N T I F I C A P P L I C AT I O N S
XAUI SFP+ ROSA XAUI
ATCA RTM Board ATCA Front Board
Figure 3 – The digital receiver card samples, filters and transmits PAF receptor data to the central processing site.
Figure 4 – Each antenna passes 2 Tbps of data to the beamformer, which distributes it using an ATCA backplane.
Second Quarter 2011 Xcell Journal 33
that on a single card there are (16-1) x4 = 60 duplex links. One of the chal-lenges of using the ATCA backplane isthat the full-mesh connections are dif-ferent on each slot, so that the signalrouting must be dynamically assignedat startup depending on which slot aparticular card is located. It is herethat we use the 72 x 72 crosspointswitch to set the routing matrix ofconnections to each card.
Once a 3.1875-Gbps stream reachesits destination beamformer card, it isrouted to a GTX receiver on one of thefour LX240T processing FPGAs. Herethe 3.1875-Gbps stream is deserializedand the data packet decoded. Sinceeach beamformer FPGA only process-es approximately 5 MHz of the total304-MHz bandwidth, three-quarters ofthe 19-MHz bandwidth arriving on the3.1875-Gbps streams is forwarded tothe other three beamformer FPGAs onthe same card. This final hop to thebeamformer subsystem also makesuse of packetized links, but in this casethe physical transport is via LVDS I/Ointerconnects between each of theDSP FPGAs.
The reference clock for the GTXdeserializers is generated locally oneach DSP card. Hence, the data trans-port design makes use of the clock-cor-rection and elastic-buffer features with-in the Virtex-6 GTX transceiver core.
BEAMFORMER PROCESSING The data communications systemresults in all PAF receptors for a com-mon frequency channel arriving at aparticular beamformer FPGA. Timingsignals injected in the digital receiverare used to synchronize all 192 datastreams within the beamformer.Although the main function of thebeamforming system is to form multi-ple beams, there are also a number of
PAF support functions that go with cal-ibration and monitoring of the system.
The beamforming operation takesthe serial streams of PAF data andbuffers them in a dual-port RAM.While data is flowing in from thecommunication links it is possible toform beams on the previously storedtime slice. A set of beamformerweights contains a list of receptors tobe included in the beamforming com-putation and an associated complexweight. DSP48 elements then multi-ply and accumulate the result foreach time slice.
Data from each of the beamform-ers (there can be up to 36 beams inparallel) is then stored in externalDDR3 memory. Each FPGA is direct-ly connected to one stick of DDR3with an interface operating at 64 bitsx 800 Mbps.
Before the correlator processes thebeam data, we apply a secondpolyphase filter bank to set the finalfrequency resolution of the beams.The relatively slow data input into thissecond filter bank and the large num-ber of beams necessitate a larger stor-age area. Xilinx FPGA internal blockRAM operates very differently thanDDR3 memory. BRAM has a very highbandwidth but limited depth, vs. a rel-atively low bandwidth and essentiallyunlimited depth of DDR3. Beam data
X C E L L E N C E I N S C I E N T I F I C A P P L I C AT I O N S
Figure 5 – A 16-slot ATCA shelf holds the beamforming hardware,
the heart of which is a Virtex-6 LX240T.
Figure 6 – Anchoring the beamforming hardware are the ATCA mesh, crosspoint switch, Virtex-6 FPGAs, DDR3 memory and SFP+ optics.
34 Xcell Journal Second Quarter 2011
is written into DDR3 memory and readout in an order appropriate for the fil-ter bank and correlator.
One of the support functions thebeamformer FPGA also performs isthe calculation of an array co-vari-ance matrix (ACM). The ACM is used
to calibrate the array, and the calcu-lation process takes each PAF recep-tor and cross-correlates it withevery other receptor. The result isintegrated for several seconds so asto allow the result to converge andalso to reduce the output data rate.Both the beamformer and the ACMprocess data at 318.75 MHz (twicethe basic communications clock of159.375 MHz). Once the ACM is cal-culated, it is output from each beam-former card over Gigabit Ethernetusing UDP packets.
The beamformer card (see Figure6) is a standard ATCA card (8U x 280mm) consisting of four signal-process-ing FPGAs (Virtex-6 XC6VLX240T-1FF1156C), one control FPGA (Virtex-5 LX50T), a crosspoint switch (VitesseVSC3172), four sticks of DDR3 PC3-8500 memory and eight SFP+ ports(four full duplex, four transmit only).
DEVELOPMENT SOFTWAREThe design flow that we used for alldigital processing blocks (test signalgenerators, polyphase filters, beam-formers) centers on Xilinx SystemGenerator for MATLAB®/Simulink®.We wrote data communications, con-trol and monitoring, and interfaces inVHDL, using System Generator tosimulate the signal-processing blocksand ModelSim to simulate the HDL-based blocks.
After simulation, we independentlyverified each block in hardware andthen integrated them with other com-pleted blocks to confirm that the com-plete data and processing path is operat-ing as expected. The ChipScope™ inte-grated logic analyzer tools are in heavy
use during this stage to track down bugsand also for easy visualization of controland data processes on-chip.
We stitch together the entire designby incorporating the DSP netlists fromSystem Generator (in NGC format) viaVHDL wrappers, along with all the otherHDL-based blocks. We perform synthe-sis and implementation using XST andthe ISE® Project Navigator.
The digital receiver and beamformerdesigns are very BRAM and DSP48Eintensive (see Table 1), which tends tomake it difficult to meet timing withsimple constraints. Typically we needto apply some floorplanning. The pri-mary goal of floorplanning is toimprove timing by reducing routedelay. The PlanAhead™ tool providesthe insight into our design that we needto be able to apply pipelining on criticalnets and guide the map and place-and-route tools to meet our timing require-ments. Usually this means optimallyplacing key sets of BRAMs in additionto block-based design area constraints.This kind of design analysis also pro-vides the ability to greatly reduceimplementation runtime.
Table 1 provides a summary of theclock speeds for each FPGA devicetogether with the utilization statisticsfor the FPGA resources. The totalnumber of FPGAs in the digitalreceiver and beamforming systems isalso listed for reference.
LOOKING AHEADXilinx Virtex-6 FPGAs have provento be a success for the ASKAP digitalback end, both in processing per-formance and I/O bandwidth fordata communications. The release ofthe Virtex-7 device family has our
research team excited about evenfaster data rates, higher processingdensity and lower power consump-tion. Larger Virtex-7/Kintex-7 deviceswith enhanced I/O will significantlychange system architectures and leadto boards that are less expensive todesign and manufacture.
With the rapid development andadoption of software-defined radio inthe commercial world, SKA prototypeimplementations leveraging this newtechnology also become a strong possi-bility as faster, lower-cost, lower-powerADCs are released in combination withXilinx’s newer FPGAs. This will lead toa complete digital solution for directsampling and processing of PAF data.All of which augurs well for the mas-sive digital systems required for theultimate radio telescope–the SKA (seewww.skatelescope.org).
The authors are part of the Digital
Systems group, an arm of a larger
team within CASS working on the
ASKAP instrument. We would like to
acknowledge the work of other mem-
bers of the Digital Systems group,
including Alan Ng, Ron Koenig,
Matt Shields and Evan Davis. We’d
also like to acknowledge the Wajarri
Yamatji people as the traditional
owners of the observatory site.
Second Quarter 2011 Xcell Journal 35
X C E L L E N C E I N S C I E N T I F I C A P P L I C AT I O N S
FPGA-based designs require a careful examinationof how much area a design will occupy and thetiming performance of the design when imple-
mented. The primary goal is to ensure the design fits inthe target device while meeting its timing or throughputrequirements. In commercial FPGA-based designs,designers generally cap the utilization at around 80 per-cent of the device’s lookup table (LUT) resources so asto ensure that resources exist on the device for futureupgrades and feature enhancements, and to facilitatetiming closure. The roughly 20 percent or so of freeLUTs—and, therefore, free routing resources—can helpin meeting tight timing constraints.
FPGA designers know that the more logic they buildinto the FPGA fabric, the more routing resources theywill use. Hence, it is quite possible that while the synthe-sis tool may succeed in mapping additional logic to LUTsand other resources, it may fail to route connectionsbetween them. Because the interconnect utilization isalready high due to the existing logic, there are no end-to-end paths to route the signal through for the addition-al connections. The emphasis is on end-to-end pathsbecause even though free routing segments may exist,the router can’t find a way to link them to make end-to-end connections. In this sense, although LUT resourcesare available, the design is “interconnect limited” andhence probably cannot be expanded.
The reduction in area has an economic impact as well.If the design could be made smaller in some way whilestill meeting the application requirements, a smallerFPGA device would suffice and this would translate intoa lower design and product cost. Of course, if the designhas special needs such as Ethernet modules, embeddedprocessor or transceivers, you need to select an FPGAthat supports these modules through hard or soft IP.Nevertheless, reduction in area utilization for theremaining parts of the design can still help in picking aparticular device in a specific device family that sup-ports such special modules.
Here is a design methodology to reduce
resource utilization in FPGAs, along
with an example implementation of the
RESOURCE SHARING Resource sharing has traditionallybeen one way of reducing area orresource utilization while maintain-ing functionality. It involves sharingof arithmetic operators such asadders, multipliers and the like bymapping more than one operation toone instance of an operator. Forexample, three adders may performsix rather than three addition opera-tions when shared, halving the num-ber of adders used and thus reducingresource utilization. Xilinx® ISE®
software lets you share resourceswhen you enable the correspondingswitch (-resource_sharing) in the syn-thesis-properties dialog box. Thetools can detect and implementresource sharing when you describemutually exclusive assignments in anif-else block (Figure 1) or a case-end-case block (Figure 2).
As you can see, the assignments aremutually exclusive. Therefore, eightaddition operations can share twoadders when resource sharing isenabled. This type of resource sharingis dependent on identification ofmutually exclusive assignments thatmay exist in the RTL design. However,how can resources be shared if mutu-ally exclusive assignments do notexist and what are the pros and consof doing so? To answer this question,let’s delve into resource sharing at ahigher level of abstraction.
A STEP ABOVE RTL“Higher level of abstraction” refers todesign description at a level aboveRTL. It is possible to analyze theapplication to be implemented before
RTL is made ready. If we analyze theapplication for inherent parallelism,we can exploit that in the RTL design.At the same time, an analysis of thedata flow graph of the application canhelp us in designing a resource-shared implementation.
This type of graph gives informa-tion about the flow of data in the appli-cation. Data flows from one operationto another, and hence there can bedata dependencies between opera-tions. Operations that are not depend-ent on each other could execute in
40 Xcell Journal Second Quarter 2011
F P G A 1 0 1
parallel. However, parallel executionalmost always results in high resourceutilization; because our goal is toreduce resource utilization, we willsidestep parallel execution models.
Instead, let’s take as an exampleour implementation of the sum-of-absolute differences (SAD) algorithmto show how to analyze resourcesharing at a higher level of abstrac-tion than RTL, reducing resource uti-lization compared with relying onmutually exclusive assignments inRTL description.
It is possible to analyze the application to be implemented before RTL is made ready. If we analyze the application for inherent parallelism, we can exploit that in the RTL design.
SAD is an important algorithm inthe motion-estimation block in theMPEG-4 codec. It can also be used forobject recognition. In this pixel-basedmethod, the value of each pixel in animage block is subtracted from thevalue of the corresponding pixel inanother image block to determinewhich parts of the image have moved
from one frame to another. If theimage block size is 4 x 4, then all the 16absolute-difference values aresummed at the end. You’ll find exam-ple code in the sad.c file of the open-source xvid codec.
Figure 3 shows the data flow graph(DFG) of this algorithm for process-ing a 4 x 4 image block. You can
F P G A 1 0 1
Second Quarter 2011 Xcell Journal 41
implement the subtraction operationsin parallel because no subtractionoperation depends on any other sub-traction operation. However, as wewere interested in reducing resourceutilization, we did not implement theparallel-execution model. Instead, weused the resource-allocation andbinding algorithm called ExtendedCompatibility Path Based (ECPB)hardware binding that my colleaguesand I have proposed. 
Resource allocation and bindingbasically refer to allocating resourceslike adders, multipliers and so on tooperations in the DFG, and then bind-ing these resources to operations toeither reduce device resource utiliza-tion, increase maximum clock fre-quency or both. We focused on reduc-ing area subject to the final designmeeting our functional constraints.The algorithm exploits interoperationflow dependency and analyzes intra-operation flow dependency in a sched-uled DFG—that is, one whose opera-tions have been scheduled in differenttime steps or clock cycles.
The algorithm schedules operationsthat can execute in parallel in thesame time step, while operations thatdepend on data from other operationsare scheduled in different time steps.The algorithm constructs a weightedand ordered compatibility graph(WOCG) for each kind of operation inthe scheduled DFG. Thus, there is oneWOCG for the subtraction operationand another for addition. The method-ology uses the weight relation wij = 1+ α*Fij + β*Nij + γ*Rij to assignweights to the edges in the WOCG.Here, wij is the weight value betweenoperations “i” and “j” of the same type.Fij is the measure of flow dependencyand Nij is the number of commoninputs between operations “i” and “j.”Rij equals 1 if the output of operations“i” and “j” can be stored in the sameregister, otherwise it equals 0. We setthe tuning parameters, α, β and γ, to 1,1 and 2 respectively for the case weare studying here.
RP P R 
Figure 3 – Data flow graph of the SAD algorithm
Figure 4 – Datapath of implemented SAD algorithm
The next step involves finding alongest path in the WOCG using thelongest-path algorithm. All operationsin that longest path are mapped to justone operator. After removing thebound operations from the WOCG, werepeat the longest-path and mappingprocedure, continuing until we haveprocessed all WOCGs. Since the algo-rithm maps more than one operationto the same resource/operator, theoperator size is such that it accommo-dates the biggest operation. In thecase of the SAD implementation, wehandled all subtraction operations on8-bit data (gray-scale images) andhence the subtractor operator had 8-bit-wide inputs. We sized the accumu-lator operator, which sums up SADvalue iteratively, to input bit widths of23 bits and 8 bits.
In Figure 4, which shows the data-path of the implementation, you willsee that we used only one subtractorand one adder/accumulator in thisdesign. There is a very high degree ofresource sharing among the opera-tions. This kind of resource sharingwould not have been possible if thedesign were described directly at thebehavioral RTL level, because mutu-ally exclusive assignments do notexist in the SAD algorithm. Once wegenerated this datapath, we imple-mented the design hierarchicallyusing adder, subtractor and multi-plexer modules. This implementationinvolving instantiation of modules ismore structural than behavioral andexactly resembles the datapath.
While some readers may find ourimplementation simple, it should benoted that in large data-dominatedapplications, this kind of preprocess-ing at a higher level can help reduceresource utilization. Moreover, it’svery easy to automate this method.
To compare the physical-synthesisresults obtained for the datapath thatthe ECPB algorithm generated, wealso implemented a completely paral-lel implementation and two otherswith datapaths the same as in Figure 4,
F P G A 1 0 1
42 Xcell Journal Second Quarter 2011
Figure 5 – Honeycomb-like technology layout of Design I(RS)
Figure 6 – Honeycomb-like technology layout of Design II(RS)
but where the final RTL descriptionwas not hierarchical. The latter twowere flat behavioral designs with“forced,” mutually exclusive assign-ments to account for the muxes. Wesay “forced” because we tried toimplement the same datapath as inFigure 4 but using a behavioraldescription with mutually exclusiveassignments. One design had an if-else construct and the other had case-endcase. Table 1 shows the post-place-and-route results obtainedusing Xilinx ISE 12.2 (M.63C) soft-ware with default settings, Virtex®-4XC4VFX140-11FF1517 as the targetdevice and no timing constraints.There was no reset in any of theimplementations and hence internalregisters were initialized suitably.
In the table, RS and NRS stand forresource sharing enabled and dis-abled, respectively, in Xilinx ISE soft-ware. We explored designs I and IIbecause different synthesis tools caninfer different types of muxes fromdifferent styles of HDL code (if-else;case-endcase).  The clock perioddoes not take jitter into account andshould be derated by an amount
according to the clock source specifi-cation. Also, we did not use condi-tioning registers in any implementa-tion for the inputs.
MAJOR RESOURCE SAVINGSAs the table shows, we were able tobring down LUT consumption signifi-cantly, at most 56 percent and at theleast, 20 percent. However, unlike acompletely parallel implementationwhere data samples can be made avail-able every clock cycle in general,resource sharing puts some restrictionson the way data samples are madeavailable for processing. Becauseresources are now shared, a new datasample can be processed only after theprevious data sample has been partiallyor fully processed depending on thefinal datapath. In our ECPB implemen-tation, a new set of P and R values canbe made available only after a mini-mum of 16 clock cycles.
While this may not always beacceptable, it is definitely a techniquesuitable for use in applications thatwill allow such sample rates. Forinstance, the ECPB design can easilyprocess a DV-PAL frame of size 720 x
576 in 1.016 milliseconds without anyimpact on the PAL frame rate of 25frames per second.
Digressing a bit from the technicalimplementation details, Figures 5 and6 show the honeycomb-like technolo-gy layouts for our two designs, I(RS)and II(RS). Some of you might beaware of the recent exhibition of sci-ence and technology art at theMuseum of Modern Art in New York,which highlighted images and artworkinspired by scientific work. This kindof art gives a different perspective onthe work done by engineers and scien-tists and can help us in appreciatingour field of endeavor even more.
 Dhawan, U., Sinha, S., et al.,
“Extended compatibility path based
hardware binding for area-time effi-
cient designs,” Quality Electronic
Design (ASQED) 2010, Second Asia
 Stephenson, J. and Metzgen, P.,
“Logic optimization techniques for
multiplexers,” CP-01003-1.0, Altera
Corp., March 2004.
F P G A 1 0 1
Second Quarter 2011 Xcell Journal 43
Design II (if-else)
ECPB (hierarchical with instantiations)
RS NRS RS NRS RS NRS RS NRS
Four-input LUTs 368 368 193 272 193 272 160 160
Slices 200 200 110 149 110 149 93 93
Slice flip-flops 391 391 54 54 54 54 54 54
Best achievableclock period (ns)
3.609 3.609 2.662 3.592 2.662 3.592 2.450 2.450
Table I – Comparison of results (RS and NRS stand for resource sharing enabled or disabled, respectively)
A s product designs increase in complexity,there is a need to use integrated components,such as application-specific standard prod-
ucts (ASSPs), to address design requirements. Yearsago, engineers chose individual components forprocessor, memory and peripherals, and then piecedthese elements together with discrete logic. Morerecently, they would search through catalogs of ASSPprocessing systems attempting to find the nearestmatch to meet system requirements. When they needadditional logic or peripherals, they often mate anFPGA with an ASSP to complete the solution. Indeed,surveys indicate that FPGAs are now a part of 50 to 70percent of all embedded systems.
Over the last few years, FPGA sizes have increased,providing sufficient space to accommodate completeprocessor and logic systems within a single device.Software engineers are now faced with developing anddebugging code targeting a processor inside of anFPGA—and in some cases they fear doing so. But agrasp of FPGA basics and an understanding of how tocreate and debug code for FPGA embedded processorswill go a long way to settle their nerves.
WHAT IS AN FPGA?A field-programmable gate array (FPGA) is an integrat-ed circuit containing logic that may be configured andconnected after manufacturing, or “in the field.”Where in the past engineers purchased a variety oflogic devices from a catalog and then assembled theminto a logic design via connections on a printed-circuitboard, today hardware designers can implement com-plete designs within a single device. In their simplestform FPGAs contain:
■ Configurable logic blocks consisting of AND, OR,Invert and many other logic functions
■ Configurable interconnect enabling logic blocksto be connected together
■ I/O interfaces
With these elements, users may create an arbitrarylogic design.
Here are some practical tips on how to develop software for FPGA embedded processors.
Second Quarter 2011 Xcell Journal 45
Hardware engineers usually write code in HDL (typicallyeither Verilog or VHDL) and then compile the design into anobject file that they load into the device for execution. Onthe surface the HDL programs can look very much like high-level languages such as C. Take, for example, the followingimplementation of an 8-bit counter written in Verilog (cour-tesy of www.asic-world.com). In it you can see many con-structs taken from today’s high-level languages:
//-----------------------------------------// Design Name : up_counter// File Name : up_counter.v// Function : Up counter// Coder : Deepak//-----------------------------------------module up_counter (out , // Output of the counterenable , // enable for counterclk , // clock Inputreset // reset Input);//----------Output Ports-------------------
reg [7:0] out;//-------------Code Starts Here------------always @(posedge clk)if (reset) begin
out <= 8'b0 ;end else if (enable) beginout <= out + 1;
TECHNICAL ADVANTAGES OF AN FPGAShort of using an ASIC with its associated high maskcharges, FPGAs are the most flexible, high-performanceand cost-effective method of implementing data-processingelements. FPGAs, due to their flexible architecture, enablehardware designers to implement processing systems con-sisting of elements that are both paralleled and pipelined.This allows designers to tune a system for both perform-ance and latency. Typically these data-processing systemsprovide higher levels of performance than is obtainableusing general-purpose processors, and at a lower cost.
You can couple an external microprocessor to a data-processing system in an FPGA and use it for control.However, having a processor inside the FPGA can presentseveral advantages. An internal processor dramaticallyreduces the control latency between the processor and the
46 Xcell Journal Second Quarter 2011
F P G A 1 0 1
data-processing system, to the point of eliminating manyprocessor cycles. The communication channel between theprocessor and data-processing system may be 32 or morebits, with additional wires for address and control. For anexternal processor, these additional wires may necessitate alarger package for both the processor and FPGA, driving upsystem cost. Alternatively, PCI Express® (PCIe®) can providea dramatic reduction in the number of pins. Unfortunately,being a relatively new interface, not all processors andFPGAs support PCIe. While inherently high performance,PCIe is a serial interface and thus adds latency between aprocessor and the data-processing system.
Implementation of a processor inside an FPGA along withthe data-processing elements reduces parts count, boardspace and sometimes power requirements. This can result ina much lower-cost solution. Hardened implementations ofprocessors such as the ARM® Cortex™-A9 processor, or softimplementations such as the Xilinx® MicroBlaze™ proces-sors, are available in FPGAs. FPGA-based processors mayalso be configured based upon application requirements.FPGA-based systems enable system-level tuning by provid-ing the ability to move decision and computational functionsbetween the processor and the FPGA logic.
IMPLEMENTATION TECHNIQUES Multiple methods exist for the implementation of FPGAembedded processing systems, but they generally fall intothree categories: assembling the system from scratch, put-ting it together by using wizards or getting there by modify-ing an existing design.
FPGA tools allow you to assemble a processing systemfrom scratch by selecting the necessary IP from a list, andthen connecting the IP via buses and wires. Such assembly,while effective, can be time-consuming.
To speed things up, FPGA tools also allow the rapid assem-bly of a microprocessor system via wizards. Using drop-downlists or checkboxes, you simply specify the targeted part andthe desired processor and peripherals. Figure 1 presents theintroductory window to start the wizard and shows the finalsystem the wizard has built. Similarly, you may use tools suchas MATLAB® software to rapidly assemble a data-processingsystem with interfaces to the processor bus for control. Youcan then connect the processor and data-processing systemby simply matching bus interfaces.
The third way of implementing an embedded processingsystem is to modify an existing reference design or add toan existing hard-processor system. FPGA reference designsand hard-processor systems continue to evolve and manyare becoming more market focused. In many cases thedesigns are so complete that the hardware designer neednot add any extra components. The software designer typi-cally will find complete drivers and prebuilt operating sys-tems for these reference designs.
The first two methods noted above are both effectiveways to create a processor system. However, the thirdmethod begins with an existing, verified reference designand significantly reduces the development time for bothhardware and software engineers.
DISPELLING MYTHS A number of myths have sprung up in the engineering com-munity about the presumed difficulty of developing codefor processors inside of FPGAs. We would like to dispelthese myths.
Myth: Coding for a processor in an FPGA is difficult. Truth: Most FPGA embedded processing development isdone in C or C++ within modern software developmentenvironments.
Most FPGA suppliers now support software develop-ment using Eclipse. Eclipse is a flexible software devel-opment environment supporting plug-ins and providingtext editors, compilers, linkers, debuggers, trace mod-ules and code management.
As an open environment, Eclipse enjoys a large com-munity of developers that are constantly adding newcapabilities. For example, if a coder does not like the
F P G A 1 0 1
Second Quarter 2011 Xcell Journal 47
provided editor they can installan editor that better meets theirneeds. Figure 2 shows theEclipse code editor with the“hello world” program.
Myth: There are no ASSP-likeprocessor systems for FPGAs. Truth: There are now prebuiltFPGA soft embedded processordesigns as well as hard-proces-sor designs with ASSP-likeperipheral sets.
FPGAs with soft or hardprocessors add an extra dimen-sion of capability. FPGA embed-ded soft-processor referencedesigns integrate 32-bit RISCprocessors, memory interfacesand industry-standard peripher-als. The flexible nature of suchprocessors allows users to tradeoff logic for additional perform-ance capabilities, such as theaddition of an MMU supporting amodern operating system. Abroad array of FPGA choicesallows users to select a proces-sor configuration, peripherals,
data-processing logic and logic performance levels tomeet their system requirements. Prebuilt ASSP-like refer-ence designs enable software designers to begin codingwithout a hardware engineer having to first implement aprocessor system. In many cases, the prebuilt design willmeet the embedded processor system requirements, elim-inating the need for the hardware engineer to do furtherprocessor system design. In other cases, the hardwareengineer has an excellent platform upon which to addperipherals and connect custom hardware accelerators.
Myth: Debugging code with a processor in an FPGA is hard. Truth: Software debug of an FPGA embedded processor isas easy as debugging a non-embedded processor. Debuggerssupport downloading of code, running programs, single-stepping at the source-code and object-code levels, settingbreakpoints, examining memory and registers. Additionaltools are available for code profiling and trace.
Myth: My favorite OS is not supported. Truth: The most popular operating systems are supportedfor the most popular embedded processors, and the list isgrowing. The Xilinx MicroBlaze supports Linux, ThreadX,MicroC/OS-II and eCos, among other operating systems.
Figure 1 – Wizard start screen and completed system
Myth: Drivers don’t exist. Truth: FPGA embedded processors have large libraries ofperipherals with included drivers. Table 1 is a representa-tive list of some of the soft peripherals available for anFPGA soft processor. Appropriate drivers exist for all ofthese devices.
Myth: Hardware engineers have to build it before I can code.Truth: Prebuilt and tested processor system designs areavailable, enabling immediate software development.
These prebuilt ASSP-like processor systems include theprocessor, memory controller and memory, flash memorycontrollers and peripherals such as UARTs, GPIOs andEthernet interfaces. These systems are delivered with refer-ence software design examples including demonstratingLinux boot.
F P G A 1 0 1
Myth: I can’t profile or trace with FPGAembedded processors.Truth: Profiling and trace tools are also avail-able. Profiling allows developers to see howmuch time the processor has spent in whatfunctions as well as how many calls it hasmade to any given function.
Myth: FPGA software development tools areexpensive.Truth: Both ASSP and FPGA suppliers pricetheir embedded-software development capa-bilities between $200 and $500. In addition,many vendors offer trial and free or limitedversions as well as discounted evaluation kits.
CREATING AND DEBUGGING CODEThe software development process for FPGAembedded processor systems follows a fewgeneral steps:
1. Create a software development workspace and importthe hardware platform.
2. Create the software project and board support package.
3. Create the software.
4. Run and debug the software project.
5. Optional: Profile the software project.
Steps 3, 4 and 5 are familiar to most developers. Steps 1and 2 may be new to some developers but they are straight-forward. We will use the Eclipse development environmentas an example in looking more closely at each of the steps.
Creating a Workspace and Importing the Hardware PlatformAfter starting Eclipse, the user is prompted for a workspaceto use. A workspace is simply a directory path where project
Ethernet MAC Lite Multiport Memory Controller (SDRAM, DDR, DDR2, DDR3) Clock generator
CAN External Memory Controller (Flash, SRAM) System reset block
FlexRay Compact Flash Central DMA
MOST On-Chip Memory
48 Xcell Journal Second Quarter 2011
Table 1 – Sample soft-processor peripheral list
Figure 2 – Eclipse IDE with code
files are to be stored. Next, the user spec-ifies the hardware platform (design). Thehardware development tools automatical-ly generate this file, which describes theprocessor system including memory inter-faces and peripherals along with memorymaps. The file is output from the hard-ware development tools and the hardwareengineer will typically supply this file tothe software developer. Once specified,the hardware platform is imported andthis step is complete.
Creating the Software Project and Board Support Package The board support package (BSP) con-tains the libraries and drivers for softwareapplications use. A software project is thesoftware application source and settings.
In versions of Eclipse customized forXilinx projects, you select File ➔ New ➔Xilinx C Project. For Xilinx C projects,Eclipse automatically creates Makefilesthat will compile the source files into object files and linkthose object files into an executable. During this step theuser defines the Project Name, associates it with the hard-ware platform by providing the hardware platform name cre-ated in Step 1 and then names the project.
Next, the system confirms the generation of the BSP andautomatically loads the applicable drivers based upon thedefined hardware platform and operating system. Thesedrivers are then compiled and the BSP has been generated.
Creating the SoftwareAt this point, you may either import a software example orcreate code from scratch. As you save code, Eclipse auto-matically compiles and links the code, reporting any com-piler or linker errors.
Running and Debugging the Software ProjectWith FPGAs there is one step that you must complete priorto executing code—programming the FPGA. In Eclipse,you simply selects Tools ➔ Program FPGA. This step takesthe hardware design that the hardware engineer has createdand downloads it to the FPGA. Once that job is completed,you may select the type of software to be built. “Debug”turns off code optimization and inserts debugging symbols,while “Release” turns on code optimization. For profiling,use the –pg compile option.
Finally, you may run the code by selecting Run and defin-ing the type of run configuration and compiler options. Ifyou have selected Release, the processor will immediatelybegin code execution. Otherwise, the processor will exe-
cute a few boot instructions and will stop at the first line ofsource code and the Debug perspective will appear in Eclipse.
The Debug perspective gives you a view of the source orobject code, registers, memory and variables. You may single-step code at either the source or object level and may setbreakpoints for code execution.
Profile the Software ProjectIf you wish, at this point you may profile code and view thenumber of function calls as well as see the percentage of timespent in any given function. Figure 3 shows an example of theEclipse Profile Perspective including profiler results.
THE FPGA ADVANTAGEFor cost, power, size and overall system efficiency, FPGAembedded processors are becoming primary design choices.The good news is that software engineers do not need to con-sider an FPGA embedded processor as a mystery or any moredifficult to program than an external processor.
FPGA vendors are providing industry-standard develop-ment environments such as Eclipse at competitive costsand customized for FPGA embedded processing. Withinthese environments users can create, compile, link anddownload code, and as necessary debug their designs in thesame manner as they have done in the past with externalprocessors. Prebuilt processor reference designs enablesoftware engineers to start coding and testing without thehardware engineer having to provide the final design.Finally, FPGA embedded processors have extensive IPlibraries, drivers and OS support.
F P G A 1 0 1
Second Quarter 2011 Xcell Journal 49
Figure 3 – Eclipse profile perspective
50 Xcell Journal Second Quarter 2011
What Can FPGA-BasedPrototyping Do for You?
The greatest benefit may be when first SoC silicon arrives in the lab and the embedded software is running that same day.
By Doug AmosBusiness Development ManagerSolutions MarketingSynopsys [email protected]
René RichterCorporate Application Engineering ManagerHardware Platforms GroupSynopsys [email protected]
Xcell Journal is pleased to run in its entirety Chapter 2 of an exciting new book called FPGA-Based Prototyping Methodology Manual (FPMM). Written by Synopsys’ Doug Amosand René Richter and Xilinx’s own Austin Lesea, the FPMM is a comprehensive and practical guide to Design-for-Prototyping. To learn more about the FPMM and to download itas an e-book, go to www.synopsys.com/fpmm. Print copies are available from Amazon.com, Synopsys Press and other bookstores. While the book presents examples using Xilinxand Synopsys products, the methodology and lessons are applicable to any prototyping projects, regardless of the tools or FPGAs used.
As advocates for FPGA-basedprototyping, we may be expect-ed to be biased toward its bene-
fits and blind to its deficiencies.However, that is not our intent. OurFPGA-Based Prototyping Methodology
Manual (FPMM) is intended to give abalanced view of the pros and cons ofFPGA-based prototyping because, atthe end of the day, we do not want peo-ple to embark on a long prototypingproject if their aims would be better metby other methods, for example by usinga SystemC-based virtual prototype.
Let’s take a deeper look at the aimsand limitations of FPGA-based proto-typing and its applicability to system-level verification and other goals. Bystaying focused on the aim of the pro-totype project, we can simplify ourdecisions regarding platform, IPusage, design porting, debug and otheraspects of design. In this way we canlearn from the experience others havehad in their projects by examiningsome examples from prototypingteams around the world.
FPGA-BASED PROTOTYPING FOR DIFFERENT AIMSPrototyping is not a pushbuttonprocess and requires a great deal ofcare and consideration at its differentstages. As well as explaining the effortand expertise involved in the process,we should also offer some incentive asto why we should (or maybe shouldnot) perform prototyping during ourSoC projects.
In conversation with prototypersover many years, one of the questionswe liked to ask was “why do you doit?” There are many answers, but weare able to group them into the gener-al reasons shown in Table 1. So forexample, “real-world data effects”might describe a team that is proto-typing in order to have an at-speedmodel of a system available to inter-connect with other systems orperipherals, perhaps to test compli-ance with a particular new interfacestandard. Their broad reason to pro-
totype is “interfacing with the realworld” and prototyping does indeedoffer the fastest and most accurateway to do that in advance of real sili-con becoming available.
A structured understanding of theseproject aims and why we should proto-type will help us to decide if FPGA-
based prototyping is going to benefitour next project.
Let us, therefore, explore each ofthe aims in Table 1 and how FPGA-based prototyping can help achievethem. In many cases we shall also giveexamples from the real world and theauthors wish to thank in advance thosewho have offered their own experi-ences as guides to others in this effort.
HIGH PERFORMANCE AND ACCURACYOnly FPGA-based prototyping pro-vides both the speed and accuracy nec-essary to properly test many aspects ofthe design. We put this reason at thetop of the list because it is the mostlikely underlying reason of all for ateam to be prototyping, despite themany given deliverable aims of theproject. For example, the team mayaim to validate some of the SoC’s
embedded software and see how itruns at speed on real hardware, but theunderlying reason to use a prototype isfor both high performance and accura-cy. We could validate the software ateven higher performance on a virtualsystem, but we lose the accuracy thatcomes from employing the real RTL.
REAL-TIME DATAFLOWPart of the reason that verifying anSoC is hard is because its statedepends upon many variables,including its previous state, thesequence of inputs and the wider sys-tem effects (and possible feedback)of the SoC outputs. Running the SoCdesign at real-time speed connectedinto the rest of the system allows usto see the immediate effect of real-time conditions, inputs and systemfeedback as they change.
A very good example of this is real-time dataflow in the HDMI prototypeperformed by the Synopsys IP groupin Porto, Portugal. Here, a high-defini-tion (HD) media data stream wasrouted through a prototype of a pro-cessing core and out to an HD display,as shown in the block diagram inFigure 1. Notice that across the bot-tom of the diagram there is audio and
X P E R T S C O R N E R
Project Aim Why Prototype?
Test real-time dataflow
Early hardware-software integration
Early software validation
Test real-world data effects
Test real-time human interface
Debug rare data dependencies
Feasibility (proof of concept)
Extended RTL test and debug
High performance and accuracy
Interfacing with the real world
Table 1 – General aims and reasons to use FPGA-based prototypes
HD video dataflow in real time fromthe receiver (from an externalsource) through the prototype andout to a real-time HDMI PHY connec-tion to an external monitor.
By using a presilicon prototype, wecan immediately see and hear the effectof different HD data upon our design,and vice versa. Only FPGA-based pro-totyping allows this real-time dataflow,giving great benefits not only to suchmultimedia applications but to manyother applications where real-timeresponse to input dataflow is required.
HARDWARE-SOFTWARE INTEGRATIONIn the above example, readers mayhave noticed that there is a smallMicroBlaze™ CPU in the prototypealong with peripherals and memories,so all the familiar blocks of an SoC arepresent. In this design the softwarerunning in the CPU is used mostly toload and control the A/V processing.However, in many SoC designs it isthe software that requires most of thedesign effort.
Given that software has alreadycome to dominate the SoC developmenteffort, it is increasingly common thatthe software effort is on the critical pathof the project schedule. It is softwaredevelopment and validation that governthe actual completion date when theSoC can usefully reach volume produc-tion. In that case, what can systemteams do to increase the productivity ofsoftware development and validation?To answer this question, we need to seewhere software teams spend their time.
MODELING AN SoC FOR SOFTWARE DEVELOPMENTSoftware is complex and hard to makeperfect. We are all familiar with thesoftware upgrades, service packs andbug fixes in our normal day-to-day useof computers. However, in the case ofsoftware embedded in an SoC, thisperpetual fine-tuning of software isless easily achieved. On the plus side,the system with which the embedded
software interacts, its intended usemodes and the environmental situa-tion are all usually easier to determinethan for more general-purpose com-puter software. Furthermore, embed-ded software for simpler systems canbe kept simple itself and so is easier tofully validate. For example, an SoCcontrolling a vehicle subsystem or anelectronic toy can be fully tested moreeasily than a smartphone runningmany apps and processes on a real-time operating system (RTOS).
If we look more closely at the soft-ware running in such a smartphone, forexample the Android software shownin Figure 2, then we see a multilayeredarrangement, called a software stack.(This diagram is based on an originalby software designer Frank Abelson inhis book Unlocking Android.)
Taking a look at the stack, weshould realize that the lowest levels—i.e., those closest to the hardware—aredominated by the need to map the soft-ware onto the SoC hardware. Thisrequires absolute knowledge of thehardware to an address and clock-cyclelevel of accuracy. Designers of the low-est level of a software stack, often call-ing themselves platform engineers,
have the task of describing the hard-ware in terms that the higher levels ofthe stack can recognize and reuse. Thisdescription is called a BSP (board sup-port package) by some RTOS vendorsand is also analogous to the BIOS(basic input/output system) layer in ourday-to-day PCs.
The next layer up from the bottom ofthe stack contains the kernel of theRTOS and the necessary drivers tointerface the described hardware withthe higher-level software. In these low-est levels of the stack, platform engi-neers and driver developers will needto validate their code on either the realSoC or a fully accurate model of theSoC. Software developers at this levelneed complete visibility of the behaviorof their software at every clock cycle.
At the other extreme for softwaredevelopers, at the top layer of thestack, we find the user space, whichmay be running multiple applicationsconcurrently. In the smartphoneexample these could be a contact man-ager, a video display, an Internetbrowser and, of course, the phone sub-system that actually makes calls. Eachof these applications does not havedirect access to SoC hardware and is
52 Xcell Journal Second Quarter 2011
X P E R T S C O R N E R
CoreConnect - PLB
HDMI TX PHY
External PHYTest Chip
Figure 1 – Block diagram of HDMI prototype
actually somewhat divorced from anyconsideration of the hardware. Theapplications rely on software runningon lower levels of the stack to commu-nicate with the SoC hardware and therest of the world on its behalf.
We can generalize that, at eachlayer of the stack, a software develop-er only needs a model with enoughaccuracy to fool his own code intothinking it is running in the target SoC.More accuracy than necessary willonly result in the model running moreslowly on the simulator. In effect, SoCmodeling at any level requires us torepresent the hardware and the stackup to the layer just below the currentlevel to be validated and optimally, weshould work with just enough accura-cy to allow maximum performance.
For example, application develop-ers at the top of the stack can test theircode on the real SoC or on a model. Inthis case the model need only be accu-rate enough to fool the application intothinking that it is running on the realSoC; it does not need cycle accuracy orfine-grained visibility of the hardware.However, speed is important becausemultiple applications will be runningconcurrently and interfacing with real-world data in many cases.
This approach of the model having“just enough accuracy” for the softwarelayer gives rise to a number of differentmodeling environments being used bydifferent software developers at differ-ent times during an SoC project. It ispossible to use transaction-level simu-lations, modeled in languages such asSystemC, to create a simulator modelthat runs with low accuracy but at highenough speed to run many applicationstogether. If handling of real-time, real-world data is not important, then wemight be better considering such a vir-tual prototyping approach.
However, FPGA-based prototypingbecomes most useful when the wholesoftware stack must run together orwhen real-world data must beprocessed.
EXAMPLE PROTOTYPE USAGE FOR SOFTWARE VALIDATIONOnly FPGA-based prototyping breaksthe inverse relationship between accu-racy and performance inherent in mod-eling methodologies. By using FPGAswe can achieve speeds up to real-timeand yet still be modeling at the full RTLcycle accuracy. This enables the sameprototype to be used not only for the
accurate models required by low-levelsoftware validation but also for thehigh-speed models needed by the high-level application developers. Indeed,the whole SoC software stack can bemodeled on a single FPGA-based pro-totype. A very good example of thissoftware validation using FPGAs isseen in a project performed by ScottConstable and his team at FreescaleSemiconductor’s Cellular ProductsGroup in Austin, Texas.
Freescale was very interested inaccelerating SoC development because
short product life cycles of the cellularmarket demand that products get tomarket quickly not only to beat thecompetition, but also to avoid quicklybecoming obsolete. Analyzing thebiggest time sinks in its flow, Freescaledecided that the greatest benefit wouldbe achieved by accelerating their cellu-lar 3G protocol testing. If this testingcould be performed presilicon, then
Freescale would save considerablemonths in a project schedule. Whencompared to a product lifetime that isonly one or two years, this is very sig-nificant indeed.
Protocol testing is a complexprocess that even at high real-timespeeds requires a day to complete.Using RTL simulation would takeyears and running on a faster emula-tor would take weeks, neither ofwhich was a practical solution.FPGAs were chosen because thatwas the only way to achieve the nec-
Second Quarter 2011 Xcell Journal 53
X P E R T S C O R N E R
BIOS or BSP
Contacts Phone Browser Other
Dalvik VirtualMachine (DVM)
OpenGL/ES Free Type WebKit
SGL SSL libc
Figure 2-- The Android software stack
essary clock speed to complete thetesting in a timely manner.
Protocol testing requires the devel-opment of various software aspects ofthe product, including hardware driv-ers, operating system and protocolstack code. While the main goal wasprotocol testing, as mentioned, byusing FPGAs all of these softwaredevelopments would be accomplishedpresilicon and greatly accelerate vari-ous end-product schedules.
Freescale prototyped a multichipsystem that included a dual-coreMXC2 baseband processor plus thedigital portion of an RF transceiverchip. The baseband processor includ-ed a Freescale StarCore DSP core formodem processing and an ARM®926core for user application processing,plus more than 60 peripherals.
A Synopsys HAPS-54 prototypeboard was used to implement the pro-totype, as shown in Figure 3. The base-band processor was more than 5 mil-lion ASIC gates and Scott’s team usedSynopsys Certify tools to partition thisinto three of the Xilinx® Virtex®-5FPGAs on the board, while the digitalRF design was placed in the fourthFPGA. Freescale decided not to proto-
type the analog section but insteaddelivered cellular network data in dig-ital form directly from an Anritsu pro-tocol test box.
Older cores use some design tech-niques that are very effective in anASIC, but they are not very FPGAfriendly. In addition, some of the RTLwas generated automatically from sys-tem-level design code, which can alsobe fairly unfriendly to FPGAs owing tooverly complicated clock networks.Therefore, some modifications had tobe made to the RTL to make it moreFPGA compatible, but the rewardswere significant.
Besides accelerating protocol test-ing, by the time Freescale engineersreceived first silicon they were able to:
• Release debugger software withno major revisions after silicon.
• Complete driver software.
• Boot up the SoC to the OS software prompt.
• Achieve modem camp and registration.
The Freescale team was able toreach the milestone of making a cellu-lar phone call through the system only
one month after receipt of first silicon,accelerating the product schedule bymore than six months.
As Scott Constable said, “In additionto our stated goals of protocol testing,our FPGA system prototype deliveredproject schedule acceleration in manyother areas, proving its worth manytimes over. And perhaps most impor-tant was the immeasurable human ben-efit of getting engineers involved earli-er in the project schedule, and havingall teams from design to software tovalidation to applications very familiarwith the product six months before sil-icon even arrived. The impact of thisaccelerated product expertise is hardto measure on a Gantt chart, but maybe the most beneficial.
“In light of these accomplishments,using an FPGA prototype solution toaccelerate ASIC schedules is a no-brainer. We have since spread thismethodology into the FreescaleNetwork and Microcontroller Groupsand also use prototypes for new IP val-idation, driver development, debuggerdevelopment and customer demos.”
This example shows how FPGA-based prototyping can be a valuableaddition to the software team’s toolbox and brings significant return oninvestment in terms of product qualityand project time scales.
INTERFACING BENEFIT: TEST REAL-WORLD DATA EFFECTSIt is hard to imagine an SoC designthat does not comply with the basicstructure of having input data uponwhich some processing is performedin order to produce output data.Indeed, if we push into the SoC designwe will find numerous sub-blocks thatfollow the same structure, and so ondown to the individual gate level.
Verifying the correct processing ateach of these levels requires us to pro-vide a complete set of input data and toobserve that the correct output data arecreated as a result of the processing.For an individual gate this is trivial, andfor small RTL blocks it is still possible.
54 Xcell Journal Second Quarter 2011
X P E R T S C O R N E R
MXC BasebandProcessor Chip
RF Front EndChip
Anritsu 8480External ProtocolTester
FPGA B:Baseband RF Interface
FPGA A:Baseband RF Interface
FPGA C:StarCoreSC140e +Peripherals
FPGA D:Digital Portionof RF Chip
Figure 3 -- The Freescale SoC design partitioned into a HAPS-54 board
However, as the complexity of a systemgrows it soon becomes statisticallyimpossible to ensure completeness ofthe input data and initial conditions,especially when there is software run-ning on more than one processor.
There has been huge research andinvestment in order to increase effi-ciency and coverage of traditional ver-ification methods and to overcome thechallenge of this complexity. At thecomplete SoC level, we need to use avariety of different verification meth-ods in order to cover all the likelycombinations of inputs and to guardagainst unlikely combinations.
This last point is importantbecause unpredictable input data canupset all but the most carefullydesigned critical SoC-based systems.The very many possible previousstates of the SoC coupled with newinput data, or with input data of anunusual combination or sequence,can put an SoC into a nonverifiedstate. Of course, that may not be aproblem and the SoC recovers with-
out any other part of the system, orindeed the user, becoming aware.
However, unverified states are to beavoided in final silicon and so we needways to test the design as thoroughly aspossible. Verification engineers usepowerful methods such as constrained-random stimulus and advanced testharnesses to perform a wide variety oftests during functional simulations ofthe design, aiming to reach an accept-able coverage. However, completenessis still governed by the direction andconstraints given by the verificationengineers and the time available to runthe simulations themselves. As a result,
constrained-random verification isnever fully exhaustive but it will greatlyincrease confidence that we have test-ed all combinations of inputs, both like-ly and unlikely.
In order to guard against corner-case combinations, we can comple-ment our verification results withobservations of the design running onan FPGA-based prototype running inthe real world. By placing the SoC
design into a prototype, we can run ata speed and accuracy that comparevery well with the final silicon, allow-ing “soak” testing within the finalambient data, much as would be donewith the final silicon.
One example of this immersion ofthe SoC design into a real-world sce-nario is the use made of FPGA-basedprototyping at DS2 in Valencia, Spain.
EXAMPLE: IMMERSION IN REAL-WORLD DATABroadband-over-power line (BPL)technology uses normally undetectablesignals to transmit and receive infor-
mation over electrical-mains powerlines. A typical use of BPL is to distrib-ute HD video around a home from areceiver to any display via the mainswiring, as shown in Figure 4.
At the heart of DS2’s BPL designslie sophisticated hardware and embed-ded software algorithms, whichencode and retrieve the high-speedtransmitted signal into and out of thepower lines. These power lines can be
Second Quarter 2011 Xcell Journal 55
X P E R T S C O R N E R
Room BRoom A
BPL – WiFiTx/Rx
Figure 4 – Broadband-over-power line (BPL) technology used in a WiFi range extender
very noisy electrical environments, soa crucial part of the development is toverify these algorithms in a wide vari-ety of real-world conditions.
Javier Jimenez, ASIC design manag-er at DS2, explains what FPGA-basedprototyping did for them: “It is neces-sary to use a robust verification tech-nology in order to develop reliable andhigh-speed communications. It requiresvery many trials using different channeland noise models, and only FPGA-based prototypes allow us to fully testthe algorithms and to run the design’sembedded software on the prototype.In addition, we can take the prototypesout of the lab for extensive field testing.We are able to place multiple proto-types in real home and workplace situ-ations, some of them harsh electricalenvironments indeed. We cannot con-sider emulator systems for this purposebecause they are simply too expensiveand are not portable.”
This usage of FPGA-based proto-typing outside of the lab is instructivebecause we see that making the plat-form reliable and portable is crucial tosuccess.
BENEFITS FOR FEASIBILITY LAB EXPERIMENTSAt the beginning of a project, funda-mental decisions are made about chiptopology, performance, power con-sumption and on-chip communicationstructures. Some of these are best per-formed using algorithmic or system-level modeling tools, but some extraexperiments could also be performedusing FPGAs. Is this really FPGA-basedprototyping? We are using FPGAs toprototype an idea but it is differentthan using algorithmic or mathemati-cal tools because we need some RTL,perhaps generated by those high-leveltools. Once in FPGA, however, earlyinformation can be gathered to helpdrive the optimization of the algorithmand the eventual SoC architecture. Theextra benefit that FPGA-based proto-types bring to this stage of a project isthat more accurate models can be
used, which can run fast enough tointeract with real-time inputs.
Experimental prototypes of thiskind are worth mentioning, as they areanother way to use FPGA-based proto-typing hardware and tools in betweenfull SoC projects, hence deriving fur-ther return on our investment.
PROTOTYPE USAGE OUT OF THE LABOne truly unique aspect of FPGA-based prototyping for validating SoCdesign is its ability to work stand-alone. This is because the FPGAs canbe configured, perhaps from a flashEEPROM card or other self-containedmedium, without supervision from ahost PC. The prototype can thereforerun standalone and be used for testingthe SoC design in situations quite dif-ferent than those provided by othermodeling techniques, such as emula-tion, which rely on host intervention.
In extreme cases, the prototypemight be taken completely out of thelab and into real-life environments inthe field. A good example of thismight be the ability to mount the pro-totype in a moving vehicle andexplore the dependency of a designto variations in external noise,motion, antenna field strength and soforth. For example, the authors areaware of mobile-phone basebandprototypes that have been placed invehicles and used to make on-the-move phone calls through a publicGSM network.
Chip architects and other productspecialists need to interact with early-adopter customers and demonstratekey features of their algorithms. FPGA-based prototyping can be a crucial ben-efit at this very early stage of a project,but the approach is slightly differentthan the mainstream SoC prototyping.
Another very popular use of FPGA-based prototypes out of the lab is forpreproduction demonstration of newproduct capabilities at trade shows.Let’s consider a use of FPGA-basedprototyping by the research-and-
development division of the BBC inEngland (yes, that BBC) that illus-trates both out-of-lab usage and useat a trade show.
EXAMPLE: A PROTOTYPE IN THE REAL WORLDThe powerful ability of FPGAs to oper-ate standalone is demonstrated by aBBC R&D project to launch DVB-T2 inthe United Kingdom. DVB-T2 is a new,state-of-the-art open standard thatallows HD television to be broadcastfrom terrestrial transmitters.
The reason for using FPGA-basedprototyping was that, like most inter-national standards, the DVB-T2 techni-cal specification took several years tocomplete, in fact 30,000 engineer-hours by researchers and technolo-gists from all over the world. OnlyFPGAs gave the flexibility required incase of changes along the way. Thespecification was frozen in March 2008and published three months later as aDVB Blue Book on June 26.
Because the BBC was using FPGA-based prototyping, in parallel with thespecification work, a BBC implementa-tion team, led by Justin Mitchell fromBBC Research & Development, wasable to develop a hardware-based mod-ulator and demodulator for DVB-T2.
The modulator is based on aSynopsys HAPS-51 card with a Virtex-5FPGA from Xilinx. The HAPS-51 cardwas connected to a daughtercard thatwas designed by BBC Research &Development. This daughtercard pro-vided an ASI interface to accept theincoming transport stream. Theincoming transport stream was thenpassed to the FPGA for encodingaccording to the DVB-T2 standard andpassed back to the daughtercard fordirect upconversion to UHF.
The modulator was used for theworld’s first DVB-T2 transmissionsfrom a live TV transmitter that wereable to start the same day that thespecification was published.
The demodulator, also using HAPSas a base for another FPGA-based pro-
56 Xcell Journal Second Quarter 2011
X P E R T S C O R N E R
totype, completed the working end-to-end chain and this was demonstratedat the IBC exhibition in Amsterdam inSeptember 2008, all within threemonths of the specification beingagreed upon. This was a remarkableachievement and helped to build confi-dence that the system was ready tolaunch in 2009.
BBC Research & Development alsocontributed to other essential strandsof the DVB-T2 project, including a verysuccessful plugfest in Turin in March2009, at which five different modula-tors and six demodulators were shownto work together in a variety of modes.The robust and portable constructionof the BBC’s prototype made it idealfor this kind of plugfest event.
Justin Mitchell had this to sayabout FPGA-based prototyping: “Oneof the biggest advantages of theFPGA was the ability to track latechanges to the specification in therun-up to the transmission launchdate. It was important to be able tomake quick changes to the modulatoras changes were made to the specifi-cation. It is difficult to think of anoth-er technology that would haveenabled such rapid development ofthe modulator and demodulator andthe portability to allow the modulatorand demodulator to be used stand-alone in both a live transmitter and ata public exhibition.”
WHAT CAN’T FPGA-BASED PROTOTYPING DO FOR US?We started this chapter with the aim ofgiving a balanced view of the benefitsand limitations of FPGA-based proto-typing, so it is only right that weshould highlight here some weakness-es to balance against the previouslystated strengths
First and foremost, an FPGA proto-type is not an RTL simulator. If our aimis to write some RTL and then imple-ment it in an FPGA as soon as possiblein order to see if it works, then weshould think again about what is beingbypassed. A simulator has two basic
components; think of them as theengine and the dashboard. The enginehas the job of stimulating the modeland recording the results. The dash-board allows us to examine thoseresults. We might run the simulator insmall increments and make adjust-ments via our dashboard; we mightuse some very sophisticated stimu-lus—but that’s pretty much what asimulator does. Can an FPGA-basedprototype do the same thing? Theanswer is no.
It is true that the FPGA is a muchfaster engine for running the RTL“model,” but when we add in the effortto set up that model, then the speedbenefit is soon swamped. On top ofthat, the dashboard part of the simula-tor offers complete control of the stim-ulus and visibility of the results. Weshall consider ways to instrument anFPGA in order to gain some visibilityinto the design’s functionality, buteven the most instrumented designoffers only a fraction of the informa-tion that is readily available in an RTLsimulator dashboard. The simulator istherefore a much better environmentfor repetitively writing and evaluatingRTL code and so we should alwayswait until the simulation is mostly fin-ished and the RTL is fairly maturebefore passing it over to the FPGA-based prototyping team.
AN FPGA-BASED PROTOTYPE IS NOT ESLElectronic system-level (ESL) or algo-rithmic tools such as Synopsys’Innovator or Synphony allow designsto be entered in SystemC or to be builtfrom a library of predefined models. Wethen simulate these designs in the sametools and explore their system-levelbehavior including running softwareand making hardware-software trade-offs at an early stage of the project.
To use FPGA-based prototyping weneed RTL, therefore it is not the bestplace to explore algorithms or archi-tectures, which are not oftenexpressed in RTL. The strength of
FPGA-based prototyping for softwareis when the RTL is mature enough toallow the hardware platform to bebuilt; then software can run in a moreaccurate and real-world environment.There are those who have blue-skyideas and write a small amount of RTLfor running in an FPGA for a feasibili-ty study. This is a minor but importantuse of FPGA-based prototyping, but isnot to be confused with running a sys-tem-level or algorithmic exploration ofa whole SoC.
CONTINUITY IS THE KEYGood engineers always choose theright tool for the job, but there shouldalways be a way to hand over work-in-progress for others to continue. Weshould be able to pass designs fromESL simulations into FPGA-based pro-totypes with as little work as possible.Some ESL tools also have an imple-mentation path to silicon using high-level synthesis, which generates RTLfor inclusion in the overall SoC proj-ect. An FPGA-based prototype cantake that RTL and run it on a boardwith cycle accuracy. But once again,we should wait until the RTL is rela-tively stable, which will be after com-pletion of the project’s hardware-soft-ware partitioning and architecturalexploration phase.
SO WHY USE FPGA-BASED PROTOTYPING?Today’s SoCs are a combination of thework of many experts, from algorithmresearchers to hardware designers, tosoftware engineers, to chip layoutteams—and each has its own needs asthe project progresses. The success ofan SoC project depends to a largedegree on the hardware verification,hardware-software co-verification andsoftware validation methodologiesused by each of the above experts.FPGA-based prototyping brings differ-ent benefits to each of them.
For the hardware team, the speedof verification tools plays a major rolein verification throughput. In most
Second Quarter 2011 Xcell Journal 57
X P E R T S C O R N E R
SoC developments it is necessary torun through many simulations andrepeated regression tests as the proj-ect matures. Emulators and simulatorsare the most common platforms usedfor that type of RTL verification.However, some interactions within theRTL or between the RTL and externalstimuli cannot be re-created in a simu-lation or emulation, owing to long run-time, even when TLM-based simula-tion and modeling are used. FPGA-based prototyping is therefore used bysome teams to provide a higher-per-formance platform for such hardwaretesting. For example, we can run awhole OS boot in relatively real-time,saving days of simulation time itwould take to achieve the same thing.
For the software team, FPGA-basedprototyping provides a unique presili-con model of the target silicon, fastand accurate enough to enable debugof the software in near-final conditions.
For the whole team, a criticalstage of the SoC project is when thesoftware and hardware are intro-duced to each other for the first time.The hardware will be exercised bythe final software in ways that werenot always envisaged or predicted bythe hardware verification plan in iso-lation, exposing new hardware issuesas a result. This is particularly preva-lent in multicore systems or thoserunning concurrent real-time applica-tions. If this hardware-software intro-duction were to happen only after
first silicon fabrication, then discov-ering new bugs at that time is notideal, to put it mildly.
An FPGA-based prototype allows thesoftware to be introduced to a cycle-accurate and fast model of the hard-ware as early as possible. SoC teamsoften tell us that the greatest benefit ofFPGA-based prototyping is that whenfirst silicon is available, the system andsoftware are up and running in a day.
The authors gratefully acknowledge signifi-
cant contribution to this chapter from Scott
Constable of Freescale Semiconductor, Austin,
Texas; Javier Jimenez of DS2, Valencia,
Spain; and Justin Mitchell of BBC Research &
Development, London. For further informa-
tion, see www.synopsys.com/fpmm.
58 Xcell Journal Second Quarter 2011
X P E R T S C O R N E R
Versatile FPGA Platform
The Versatile FPGA Platform provides a cost-effective
way of undertaking intensive calculations
and high speedcommunications in an
PCI Express 4x Short CardXilinx Virtex FamiliesI/0 enabled through an FMC site (VITA 57)Development kit and drivers optimized for Windows and Linux
Xilinx has developed an applicationcalled Documentation Navigator thatallows users to view and manageXilinx design documentation (soft-ware, hardware, IP and more) fromone place, with easy-to-use down-load, search and notification fea-tures. To try the new XilinxDocumentation Navigator, now inopen beta release, use the downloadlink at www.xilinx.com/support.
ISE DESIGN SUITE:LOGIC EDITION
Front-to-Back FPGA Logic DesignLatest version number: 13.1Date of latest release: March 2011Previous release: 12.4URL to download the latest patch:www.xilinx.com/download
Revision highlights:A newly redesigned user interface forPlanAhead™ and the IP suiteimproves productivity across SoCdesign teams and contributes to theprogression toward true plug-and-playIP that targets Spartan®-6, Virtex®-6and 7 series FPGAs, including theindustry-leading Virtex-7 2000Tdevice, with 2 million logic cells.
PlanAhead design and analysistool: Xilinx has further enhancedPlanAhead’s graphical user interface(GUI) to provide an intuitive environ-ment for both new and advancedusers. PlanAhead release 13 has powerestimation available on Virtex-5,Virtex-6 and Spartan-6 device families.The GUI’s new Clocking Resourceview aids in the visualization andassignment of clocking-related sites.Also, PlanAhead release 13 integratesthe Xilinx ISE Simulator (ISim) intothe design flow and adds the abilityto tag RTL nets using a new HDLDebug Probe feature. In addition,PlanAhead release 13 supports hier-archical design.
Team design: New to ISE DesignSuite 13 is a team design methodolo-gy. Using PlanAhead, it addresses thechallenge of multiple engineers work-ing on a single project by providing amethodology for groups of develop-ers to work in parallel. Building onthe Design Preservation capabilitymade available in ISE Design Suite 12,the team design flow provides addi-tional functionality in allowing earlyimplementation results on completedportions of the design to be lockeddown without having to wait for the
rest of the design team. This newcapability facilitates faster timing clo-sure and timing preservation for theremainder of the design, increasingoverall productivity and reducingdesign iterations.
ISE Simulator (ISim): PlanAheadrelease 13 has integrated the XilinxISE Simulator (ISim) into the designflow. The Flow Navigator providesaccess to this tool.
Project Navigator: Improvementsin Embedded Development Kit(EDK) integration include supportfor multiple ELF files and automaticdetection of ELF files referenced byEDK designs.
FPGA Editor: A new Lock Layers tool-bar button locks the current layer’svisibility settings for all zoom levels.
Xilinx Power Analyzer (XPA): Thistool boasts improved vectorless algo-rithms for activity propagation.
iMPACT: The tool now supportsSPI/BPI programming.
ISE DESIGN SUITE:EMBEDDED EDITION
Integrated Embedded Design SolutionLatest version number: 13.1Date of latest release: March 2011Previous release: 12.4URL to download the latest patch:www.xilinx.com/download
Revision highlights:All ISE Design Suite editions includethe enhancements listed above forthe Logic Edition. The following
60 Xcell Journal Second Quarter 2011
Xilinx Tool & IP UpdatesXilinx is continually improving its products, IP and design tools as it
strives to help designers work more effectively. Here we report on the
most current updates to the flagship FPGA development environment,
the ISE® Design Suite, as well as to Xilinx® IP, as of March 2011.
Product updates offer significant enhancements and new features
to three versions of the ISE Design Suite: the Logic, Embedded and
DSP editions. Keeping your installation of ISE up to date is an easy
way to ensure the best results for your design. Updates to the ISE
Design Suite are available from the Xilinx Download Center at
www.xilinx.com/download. For more information or to download
a free 30-day evaluation of ISE, visit www.xilinx.com/ise.
Project Navigator: Usabilityimprovements make it possible toselect a Xilinx board during projectcreation. Software ELF files forsimulation and implementation cannow be source files. Multipleprocessor instances are now recog-nized and displayed in the sourcewindow. Designers can also exporthardware design capability to theSDK and launch the SDK fromProject Navigator.
Xilinx Platform Studio (XPS): Thistool also has usability improve-ments, including a wizard for cus-tom AXI slave peripheral creation,testbench generation with AXI BFMfor ModelSim, ISim and NC Sim, andautomatic bus connections to easilycreate systems with multipleMicroBlaze™ processors. Moreover,AXI systems are now the default inBase System Builder for Spartan-6and Virtex-6 designs. Base SystemBuilder supports AXI systems onlyfor 7 series designs.
Configuration wizard for Zynq-7000: The wizard allows the user toselect and configure peripherals.
EDK overall enhancements: TheEmbedded Development Kit nowoffers consistent SDK workspaceselection behavior across ProjectNavigator, Xilinx Platform Studio(XPS) and the SDK, along with TDPdevice-based licensing support.
SDK enhancements: The SoftwareDevelopment Kit offers initial sup-port of the 7 series in the XilinxMicroprocessor Debugger (XMD).
Project Navigator/EDK integra-tion: Enhancements include recog-nition of processor instances in.xmp and separate ELF sources forimplementation and simulation.
vides functionality and performancesimilar to ASSPs, but with flexibilitynot possible with ASSPs.
Latest version number: 13.1Date of latest release: March 2011URL to access the latest version:www.xilinx.com/download
Listing of all IP in this release: www.xilinx.com/ipcenter/
Revision highlights: Starting with Release 13.1, all ISE COREGenerator™ IP supports Kintex™-7 andVirtex-7 devices.
Audio, video and image processing IP:Object segmentation v1.0 (AXI4-Lite),used in conjunction with the image-characterization LogiCORE™ IP, con-verts statistical data provided into alist of objects that meet a user-definedset of object characteristics. AXIvideo direct memory access v1.0(AXI4, AXI4-Stream, AXI4-Lite) pro-vides a flexible interface for control-ling and synchronizing video framestores from external memory.Designers can link together multipleVDMAs from different clock domainsto control frame store reads andwrites from multiple sources.
Communication DSP building blocks:The Linear Algebra Toolkit v1.0 (AXI4-Stream) implements basic matrix opera-tions including matrix-matrix addition,subtraction, matrix-scalar multiplicationand matrix-matrix multiplication. ThisIP provides flexible and optimized build-ing blocks for developing complex com-posite functions for various signal- anddata-processing applications.
MicroBlaze: The new version 8.10.a ofthe embedded processor supports 7series Virtex devices. AXI is now thedefault interface for 7 series designs.Also, the MicroBlaze configuration wiz-ard supports fault-tolerant features.
ISE DESIGN SUITE: DSP EDITION
For High-Performance DSP Systems Latest version number: 13.1Date of latest release: March 2011Previous release: 12.4URL to download the latest patch: www.xilinx.com/download
Revision highlights:All ISE Design Suite editions includethe enhancements listed above for theLogic Edition. Enhancements specificto the DSP Edition deliver a 33 percentimprovement in first-time initializationof a typical model and a 1.5x to 3ximprovement in simulation speed ontypical models. The DSP Edition nowalso supports the fast simulation modelof the AXI4 fast Fourier transform, for a42x speedup.
System Generator for DSP: This toolsupports MATLAB®/Simulink® 2011a.All System Generator blocks now sup-port Kintex-7 and Virtex-7 devices. Newblocks include the 7 series DSP48E1,Complex Multiply 5.0, DSP48 Macro2.1, FIR Compiler 6.2 and VDMAInterface 3.0. Also new in this release isSystem Generator support for the AXIPCore and for hardware co-simulation.
XILINX IP UPDATES Name of IP: ISE IP Update 13.1Type of IP: All
Targeted application: Xilinx developsIP cores and partners with third-partyIP providers to decrease customertime-to-market. The powerful combina-tion of Xilinx FPGAs with IP cores pro-
FPGA features and support: The7 series FPGA transceiver wizardv1.3 configures one or moreVirtex-7 and Kintex-7 FPGA GTXtransceiver, either from scratch orusing industry-standard templates,by means of a custom Verilog orVHDL wrapper. The wizard alsoprovides an example design, test-bench and scripts that allow youto observe the transceivers operat-ing in simulation and in hardware.The XADC wizard v1.2 generatesan HDL wrapper to configure asingle 7 series FPGA XADC primi-tive for user-specified channelsand alarms.
Standard bus interfaces and I/O:The 7 series integrated block forPCI Express® (PCIe®) v1.0 (AXI4-Stream) implements one-, two-,four- or eight-lane configurations.The IP uses the 7 series’ integratedhard-IP block for PCI Express inconjunction with flexible architec-tural features to implement a PCIExpress Base Specification v2.1-compliant PCIe endpoint or rootport. The LogiCORE IP for PCIExpress employs the high-per-formance AXI interface. It usesoptimal buffering for high-band-width applications, along withBAR checking and filtering.
Wireless IP: Triple-rate SDI v1.0(AXI4-Stream) IP provides receiverand transmitter interfaces for theSMPTE SD-SDI, HD-SDI and 3G-SDI standards. The triple-rate SDIreceiver and transmitter come asunencrypted source code in bothVerilog and VHDL, allowing you tofully customize these interfaces asrequired by your specific applica-tions. The 3GPP LTE PUCCHReceiver v1.0 (AXI4-Stream) pro-vides designers with an LTEPhysical Uplink Control Channelreceiver block for the 3GPP TS36.211 v9.0.0 Physical Channels
and Modulation (Release 9) specifi-cation. The receiver block supportschannel estimation, demodulationand decoding.
Additional IP supporting AXI4interfaces: Xilinx has updated thelatest versions of CORE GeneratorIP with production AXI4 interfacesupport. For more detailed supportinformation, see www.xilinx.com/
ipcenter/axi4_ip.htm. In general, theAXI4 interface will be supported bythe latest version of an IP block, forVirtex-7, Kintex-7, Virtex-6 andSpartan-6 device families. Older pro-duction versions of IP will continueto support the legacy interface for therespective core on Virtex-6, Spartan-6, Virtex-5, Virtex-4 and Spartan-3device families only. For generalinformation on Xilinx AXI4 supportsee www.xilinx.com/axi4.htm. Youcan view a comprehensive listing ofcores updated in this release atw w w . x i l i n x . c o m / i p c e n t e r /
coregen/13_1_datasheets.htm. Formore information on LogiCORE IPin this release, see www.xilinx.com/
CORE Generator and PlanAheaddesign flow enhancements: COREGenerator and PlanAhead now sup-port IP-XACT-based IP repositoriesfor Xilinx and Alliance Programmember IP. (Their use requires nochanges to existing CORE Generator,PlanAhead and Project Navigatoruser flows.) A new Manage IP pull-down menu provides a repositoryand IP management features. Displayof AXI4 support by each core in theIP catalog now places the variousAXI4 interfaces in separate sortablecolumns: AXI4, AXI4-Stream andAXI4-Lite. Individual ports in IP sym-bols can now be grouped into AXI4channels for simplified symbolviews. In this release, PlanAhead alsoadds support for an automatic IPupgrade flow.
Here we have MarcoPolo VX series, the market proven solution!
Have you ever dreamed of a more flexible FPGA design platform based on Xilinx Virtex-6 LX760 or Virtex-5 LX330? Have you ever experienced the shortcomings of most conventional platformsavailable today?
■ Redundancy FPGAs?
■ Restricted FPGA interconnections?
■ Not enough from the conventional platform support?
Meritech is proud to introduce MarcoPolo platform which solves these common issues, and is revolutionizing the way design engineers prototype FPGA based IP verification systems. It is already adopted by Samsung Electronics and world MP3 & SmartPhone OEM industry-leaders.MarcoPolo is quickly becoming industry standard development platform.
Suitable for projects ranging from commercial grade SoC through leading-edge high speed FPGAsystems. You will be impressed by the flexibility, convenience and power of MarcoPolo platform.If you are about to embark on the design of your next-generation system, we urge you to first learnmore about how MarcoPolo solution can save your time and money.
Life just got easier…. Just click on www.marcopolofpga.co.kr
FEATURED WHITE PAPER: LOWERING POWER AT 28 NM WITH XILINX 7 SERIES FPGAShttp://www.xilinx.com/support/documentation/white_
This new white paper by Jameel Hussein, Matt Klein andMichael Hartwhich discusses several aspects of powerrelated to the Xilinx® 28-nanometer 7 series FPGAs, includ-ing the TSMC 28-nm high-k metal gate (HKMG), high-per-formance, low-power (HPL) process choice. The paperdescribes the power benefits of the 28-nm HPL process andits usefulness across Xilinx’s full range of product offerings,as well as the architectural innovations and features forpower reduction across the dimensions of static power,dynamic power and I/O power.
Xilinx has achieved a dramatic power reduction in the 7series by applying a holistic approach to curbing FPGA andsystem power. The FPGAs offer up to 50 percent total powerreduction—and an even greater power reduction at maxi-mum (worst-case) process—compared with the previous gen-eration. Designers can further cut power by leveraging newI/O features and advanced clock- and logic-gating software,resulting in the industry’s lowest total FPGA power ratings.
XAPP883: FAST CONFIGURATION OF PCI EXPRESS TECHNOLOGY THROUGH PARTIAL RECONFIGURATION http://www.xilinx.com/support/documentation/
The PCI Express® specification requires ports to be readyfor link training at a minimum of 100 milliseconds after thepower supply is stable. This becomes a difficult task due tothe ever-increasing configuration memory size of each newgeneration of FPGA, such as the Xilinx Virtex®-6 family.One innovative approach to addressing this challenge isleveraging the advances made in the area of FPGA partialreconfiguration to split the overall configuration of a PCIe®
specification-based system in a large FPGA into twosequential steps: initial PCIe system link configuration andsubsequent user application reconfiguration.
As Simon Tam and Martin Kellermann explain in thisapplication note, it is feasible to configure only the FPGAPCIe system block and associated logic during the first stagewithin the 100-ms window before the fundamental reset isreleased. Using the Virtex-6 FPGA’s partial-reconfigurationcapability, the host can then reconfigure the FPGA to applythe user application via the now-active PCIe system link.
This methodology not only provides a solution for faster
PCIe system configuration, it enhances user applicationsecurity because the bitstream is accessible only by the hostand can be better encrypted. This approach also helps lowerthe system cost by reducing external configuration compo-nent costs and board space.
This application note describes the methodology for build-ing a Fast PCIe Configuration (FPC) module using this two-step configuration approach. A reference design is availableto help designers quick-launch a PlanAhead™ software par-tial-reconfiguration project. The reference design implementsan eight-lane PCIe technology generation-1 link and targetsthe Virtex-6 FPGA ML605 Evaluation board.
This application note by Kyle Locke describing a parameteriz-able content-addressable memory (CAM) is accompanied by areference design that replaces the CAM core previously deliv-ered through the CORE Generator™ software. Designersshould use this CAM reference design for all new FPGAdesigns targeting Virtex-6, Virtex-5, Virtex-4, Spartan®-6,Spartan-3, Spartan-3E, Spartan-3A and Spartan-3A DSPFPGAs, and newer architectures. All the features and inter-faces included in the reference design are backward compati-ble with the LogiCORE™ IP CAM v6.1 core. In addition,because the reference design is provided in plain-text VHDLformat, the implementation of the function is fully visible,allowing for easy debug and modification of the code.
Unlike standard memory cores, a CAM performs contentmatching rather than address matching. Using content val-ues as an index into a database of address values, the con-tent-matching approach enables faster data searches thanare possible by sequentially checking each address locationin a standard memory for a particular value. The additionalability to compare content in parallel enables even higher-speed searches. A set of scripts included with the CAM ref-erence design allows the customization of width, depth,memory type and optional features.
You can configure the CAM using one of two memory imple-mentations: SRL16E-based CAM with a 16-clock-cycle writeoperation and a one-clock-cycle search operation, or BRAM-based CAM with only a two-clock-cycle write operation and aone-clock-cycle search operation. The BRAM-based CAM alsosupports an optional additional output register that adds alatency of one clock cycle to all read operations.
XAMPLES. . .
Application NotesIf you want to do a bit more reading about how our FPGAs lend themselves to a broadnumber of applications, we recommend these application notes and the white paper.
This application note is written for logic designers who arenew to HDL verification flows, and who do not have exten-sive experience writing testbenches to verify HDL designs.Author Mujtaba Hamid provides guidelines for laying out andconstructing efficient testbenches, along with an algorithmto develop a self-checking testbench for any design.
Due to increases in design size and complexity, digitaldesign verification has become an increasingly difficult andlaborious task. To meet this challenge, verification engineersrely on several tools and methods. For large, multimillion-gate designs, engineers typically use a suite of formal verifi-cation tools. However, for smaller designs, design engineersusually find that HDL simulators with testbenches work best.Thus, testbenches have become the standard method to ver-ify high-level language designs. Typically, testbenches per-form the following tasks:
• Instantiate the design under test (DUT)
• Stimulate the DUT by applying test vectors to the model
• Output results to a terminal or waveform window for visual inspection
• Optionally compare actual results to expected results
Typically, designers write testbenches in the industry-standard VHDL or Verilog hardware description languages.Testbenches invoke the functional design, then stimulateit. Complex testbenches perform additional functions—forexample, they contain logic to determine the proper designstimulus for the design or to compare actual to expectedresults.
Testbenches provide engineers with a portable, upgrad-able verification flow. With the availability of mixed-languagesimulators, designers are free to use their HDL language ofchoice to verify both VHDL and Verilog designs. High-levelbehavioral languages facilitate the creation of testbenchesthat use simple constructs and require a minimum amount ofsource code.
This application note describes the structure of a well-composed testbench and provides an example of a self-checking testbench—one that automates the comparison ofactual to expected testbench results. Designs benefit fromself-checking testbenches that automate the verification ofcorrect design results during simulation.
The triple-rate serial digital interface (SDI) supporting theSMPTE SD-SDI, HD-SDI and 3G-SDI standards enjoys wide
use in professional broadcast video equipment.Broadcast studios and video production centers use itto carry uncompressed digital video, along with embed-ded ancillary data, such as multiple audio channels.
Spartan-6 FPGA GTP transceivers are well-suited forimplementing triple-rate SDI receivers and transmitters,providing a high degree of performance and reliabilityin a low-cost device. In this application note, ReedTidwell describes the triple-rate SDI receiver and trans-mitter reference designs for the GTP transceivers inSpartan-6 devices.
The Spartan-6 FPGA triple-rate SDI reference designsupports SD-SDI, HD-SDI and 3G-SDI (both level A andlevel B). The Spartan-6 FPGA GTP transceiver triple-rate SDI transmitter has a 20-bit GTP interface, support-ed on Spartan-3 and faster Spartan-6 devices. Only tworeference clock frequencies are required to support allSDI modes: 148.5 MHz (for SD-SDI at 270 Mbits/s, HD-SDI at 1.485 Gbits/s and 3G-SDI at 2.97 Gbits/s); and148.5/1.001 MHz (for HD-SDI at 1.485/1.001 Gbits/s and3G-SDI at 2.97/1.001 Gbits/s). The transmitter directlysupports 3G-SDI level A transmission of 1080p 50-Hz,59.94-Hz and 60-Hz video as well as preformatted dual-link HD-SDI streams via either dual-link HD-SDI or 3G-SDI level B formats.
The receiver has a bit rate detector that can distin-guish between the two HD-SDI and two 3G-SDI bitrates. An output signal from the receiver indicateswhich bit rate is being received. The receiver automati-cally detects the SDI standard of the input signal (3G-SDI, HD-SDI or SD-SDI) and reports the current SDImode on an output port. It detects and reports the videotransport format (e.g., 1080p 30 Hz, 1080i 50 Hz), andsupports both 3G-SDI level A and level B formats, auto-matically detecting whether 3G-SDI data streams arelevel A or B. The receiver performs CRC error checkingfor HD-SDI and 3G-SDI modes.
XAPP978: FPGA CONFIGURATION FROM FLASHPROMS ON THE SPARTAN-3E 1600E BOARhttp://www.xilinx.com/support/documentation/applica-
This application note describes three FPGA configura-tion modes using flash PROMs: BPI Up mode, BPI Downmode and SPI mode. Author Casey Cain details the step-by-step process to boot-load a software application inflash in each of the configuration modes. The processincludes creating the boot loader, programming the soft-ware application files and the system configuration intoflash memory, and setting which configuration mode theFPGA will use when the board is powered on. A refer-ence system targeted for the Spartan-3E 1600E develop-ment board is configured for programming the IntelStrataFlash parallel NOR flash PROM on the SP3E1600Eboard. Sample software applications illustrate the boot-loading process.
NO PURCHASE NECESSARY. You must be 18 or older and a resident of the fifty United States, the District of Columbia, or Canada (excluding Quebec) to enter. Entries must be entirely original and must bereceived by 5:00 pm Pacific Time (PT) on July 1, 2011. Official rules available online at www.xilinx.com/xcellcontest. Sponsored by Xilinx, Inc. 2100 Logic Drive, San Jose, CA 95124.
If you have a yen to Xercise your funny bone, here’s your opportunity.We invite readers to step up to our verbal challenge and submit an engi-
neering- or technology-related caption for this editorial cartoon by DanielGuidera, featuring a meeting room presentation by an elephant. Theimage might inspire a caption like “He can’t program worth a damn but hehas an incredible memory for circuit theory.”
Send your entries to [email protected]. Include your name, job title, companyaffiliation and location, and indicate that you have read the contest rules atwww.xilinx.com/xcellcontest. After due deliberation, we will print thesubmissions we like the best in the next issue of Xcell Journal and award thewinner the new Xilinx® SP601 Evaluation Kit, our entry-level developmentenvironment for evaluating the Spartan®-6 family of FPGAs (approximateretail value, $295; see http://www.xilinx.com/sp601). Runners-up will gainnotoriety, fame and a cool, Xilinx-branded gift from our SWAG closet.
The deadline for submitting entries is 5:00 pm Pacific Time (PT) onJuly 1, 2011. So, get writing!
Austin, Texas-based hardware engineer LARRY A. STELL won an SP601 Evaluation Kit with his
caption for the cartoon of the overactive office chair in Issue 74 of Xcell Journal.
Congratulations as well
to our two runners-up:
“HR just found out he turned 55 today.”
– Vinay Agarwala, MTS, SeaMicro Inc.
(Santa Clara, Calif.)
“While developing the DSP algorithm for the motor speed control of the
new massage-chair project, Rodney learns the hard way what can happen when an unexpected accumulator overflow occurs.”
– Steven Spain, hardware design
engineer, Kay Pentax Corp.
(Lincoln Park, NJ)
“Now, that’s the worst case ofground bounce I’ve ever seen!”