Xcell Journal issue 74

www.xilinx.com/xcell/

S O L U T I O N S F O R A P R O G R A M M A B L E W O R L D

FPGAs Speed theComputation ofComplex CreditDerivatives

Xcell journalXcell journalI S SUE 74 , F I RST QUAR TER 2011

Using MicroBlaze to Build Mil-grade Navigation Systems

FPGA Houses Single-BoardComputer with SATA

Old School: Using Xilinx Tools in Command-Line Mode

Stacked & Loaded Stacked & Loaded Xilinx SSI, 28-Gbps I/O Yield Amazing FPGAs

page18

http://www.xilinx.com/xcell

©Avnet, Inc. 2011. All rights reserved. AVNET is a registered trademark of Avnet, Inc. All other brands are the property of their respective owners.

Expanding Upon theIntel® Atom™ ProcessorAvnet Electronics Marketing has combined an

Emerson Network Power Nano-ITX motherboard

that provides a complete Intel® Atom™ processor

E640-based embedded system with a Xilinx®

Spartan®-6 FPGA PCI Express (PCIe) daughter card

for easy peripheral expansion. A reference design,

based on an included version of the Microsoft®

Windows® Embedded Standard 7 operating system,

shows developers how to easily add custom FPGA-

based peripherals to user applications.

Nano-ITX/Xilinx® Spartan®-6 FPGADevelopment Kit Features

Emerson Nano-ITX-315 motherboard Xilinx Spartan-6 FPGA PCIe boardFPGA Mezzanine Card (FMC)connector for expansion modulesGPIO connector for custom peripheral or LCD panel interfacesTwo independent memory banks of DDR3 SDRAMMaxim SHA-1 EEPROM for FPGA design securityDownloadable documentation and reference designs

To purchase this kit, visit www.em.avnet.com/spartan6atomor call 800.332.8638.

http://www.em.avnet.com/spartan6atom

Toggle among banks of internal signals for incremental real-time internal measurements without:

See how you can save time by downloading our free application note.

www.agilent.com/find/fpga_app_note

Quickly see inside your FPGA

u.s. 1-800-829-4444 canada 1-877-894-4414© Agilent Technologies, Inc. 2009

Also works with all InfiniiVisionand Infiniium MSO models.

http://www.agilent.com/find/fpga_app_note

L E T T E R F R O M T H E P U B L I S H E R

Xilinx, Inc.2100 Logic DriveSan Jose, CA 95124-3400Phone: 408-559-7778FAX: 408-879-4780www.xilinx.com/xcell/

© 2011 Xilinx, Inc. All rights reserved. XILINX, the Xilinx Logo, and other designated brands includedherein are trademarks of Xilinx, Inc. All other trade-marks are the property of their respective owners.

The articles, information, and other materials includedin this issue are provided solely for the convenience ofour readers. Xilinx makes no warranties, express,implied, statutory, or otherwise, and accepts no liabilitywith respect to any such articles, information, or othermaterials or their use, and any use thereof is solely atthe risk of the user. Any person or entity using suchinformation in any way releases and waives any claim itmight have against Xilinx for any loss, damage, orexpense caused thereby.

PUBLISHER Mike [email protected]

EDITOR Jacqueline Damian

ART DIRECTOR Scott Blair

DESIGN/PRODUCTION Teie, Gelwicks & Associates1-800-493-5551

ADVERTISING SALES Dan [email protected]

INTERNATIONAL Melissa Zhang, Asia [email protected]

Christelle Moraga, Europe/Middle East/[email protected]

Miyuki Takegoshi, [email protected]

REPRINT ORDERS 1-800-493-5551

Xcell journal

www.xilinx.com/xcell/

Xilinx Alliance Program Streamlines IP-Based FPGA Design

In the last quarter, Xilinx has made several major announcements, including our 28-nanometer 7 series FPGAs with their new “More than Moore” stacked-die architec-ture, and even a Xilinx® Virtex®-7 FPGA family member that enables 400-Gbps

bandwidth, as described in this issue’s cover story. One announcement that you may haveoverlooked in this flurry of news came from the Xilinx Alliance Program group. Under theleadership of Dave Tokic, the team has completely revamped Xilinx’s activities to strengthenthe ecosystem of design and integration services, embedded software, design tools, devel-opment and evaluation boards, complementary silicon device technology and, especially,IP providers. This massive effort marks quite an accomplishment—one that should yieldgreat benefits to Xilinx users as well as partners for years to come.

The two biggest issues that have plagued the IP industry since its emergence in the 1990sare how to know that the piece of IP you are using is of high quality, and which bus to useto connect all these different pieces of IP. Tokic’s group at Xilinx has made huge inroads inaddressing these problems. With the Extensible Processing Platform (EPP) architecture,Xilinx is currently creating a device that will marry an ARM® MPU core with FPGA fabricon the same device (see cover story, Xcell Journal, issue 71). We also worked with ARM tocreate the next version of the industry-standard AMBA® interconnect protocol, calledAXI4, that is optimized for use in FPGAs. All of the EPP products as well as the 7 seriesdevices will use the AXI4 interconnect. This companywide standardization on AXI4 allowscustomers and ecosystem providers that have already developed IP for AMBA to moreeasily migrate their IP into Xilinx FPGAs. It also makes it easier for Xilinx to build our ownIP and better maintain our existing IP libraries moving forward.

Xilinx has also adopted IP industry standards such as IP-XACT, or IEEE 1685 (formerlyknown as SPIRIT), which establishes a standard structure for packaging, integrating andreusing IP within tool flows. We have settled on a common set of quality and implementa-tion metrics (leveraging the efforts first undertaken by the VSIA) to create transparency forcustomers into IP quality and implementation details. Tokic’s group also helped establishthe industry’s “SignOnce” program to facilitate customer licensing. Many IP vendors with-in the Alliance Program offer licensing via these simplified contracts. And with the releaseof Virtex-6 devices over a year ago, Xilinx standardized on the VITA FMC PCB standard,making it easier for IP vendors to offer development boards and daughtercards that easilyplug into Targeted Design Platforms from Xilinx, Avnet and Tokyo Electron Devices.

What’s more, Tokic’s group took the guesswork out of choosing the right provider byestablishing a deeper qualification process and tiering these suppliers based on criteriaincluding business practices, technical and quality processes and track record. The pro-gram has three tiers: Alliance Program “Members” maintain credible customer satisfactionwith their products and services and a track record of success. “Certified Members” haveadditionally met specific technical qualification on FPGA implementation expertise. At thetop level, “Premier Partners” are audited to ensure they adhere to the highest standards.

For a reporter who has spent many years covering the wars and dramas involved inestablishing standards, it’s awesome to see them finalized and being put to fabulous use.For more information, visit http://www.xilinx.com/alliance/index.htm.

Mike SantariniPublisher



mailto:[email protected]





http://www.xilinx.com/alliance/index.htm

www.innovative-dsp.com

C O N T E N T S

VIEWPOINTS XCELLENCE BY DESIGN APPLICATION FEATURES 1818

3838

Letter From the PublisherXilinx Alliance Program

Streamlines IP-Based Design…4

Xpert OpinionBreaking Barriers with

FPGA-Based Design-for-Prototyping…58

Xcellence in A&DMicroBlaze Hosts

Mobile Multisensor

Navigation System…14

Xcellence in High-Performance ComputingFPGAs Speed the Computation of

Complex Credit Derivatives…18

Xcellence in IndustrialFPGA Houses Single-Board

Computer with SATA…26

Xcellence in University ResearchGeneral-Purpose SoC Platform

Handles Hardware/Software

Co-Design…32

Xcellence in WirelessMIMO Sphere Detector

Implemented in FPGA…38

Cover StoryStacked & Loaded: Xilinx SSI & 28-Gbps I/O Yield Novel FPGAs

88

1818

F I R S T Q U A R T E R 2 0 1 1 , I S S U E 7 4

Xperts CornerUsing Xilinx Tools in

Command-Line Mode…46

FPGA101When It Comes to Runtime Chatter,

Less Is Best…52

XTRA READING

THE XILINX XPERIENCE FEATURES

5252

Xamples… A mix of new and

popular application notes…62

Xclamations! Share your wit and

wisdom by supplying a caption for our

techy cartoon, and you may win

a Spartan®-6 development kit…66

5858

4646

Xcell Journal recently received 2010 APEX Awards of Excellence in the

categories “Magazine & Journal Writing” and “Magazine & Journal Design and Layout.”

8 Xcell Journal First Quarter 2011

COVER STORY

Stacked & Loaded: Xilinx SSI, 28-Gbps I/OYield Amazing FPGAs

by Mike SantariniPublisher, Xcell JournalXilinx, [email protected]

First Quarter 2011 Xcell Journal 9

X ilinx recently added to its line-up two innovations that willfurther expand the application

possibilities and market reach ofFPGAs. In late October, Xilinx®

announced it is adding stacked siliconinterconnect (SSI) FPGAs to the highend of its forthcoming 28-nanometerVirtex®-7 series (see Xcell Journal,Issue 72). The new, innovative archi-tecture connects several dice on a sin-gle silicon interposer, allowing Xilinxto field Virtex-7 FPGAs that pack asmany as 2 million logic cells—twicethe logic capacity of any otherannounced 28-nm FPGA—enablingnext-generation capabilities in the cur-rent generation of process technology.

Then, in late November, Xilinxtipped the Virtex-7 HT line of devices.Leveraging this SSI technology to com-bine FPGA and high-speed transceiverdice on a single IC, the Virtex-7 HTdevices are a giant technological leapforward for customers in the commu-nications sector and for the growingnumber of applications requiring high-speed I/O. These new FPGAs carrymany 28-Gbit/second transceiversalong with dozens of 13.1-Gbps trans-ceivers in the same device, facilitatingthe development of 100-Gbps commu-nications equipment today and 400-Gbps communications line cards wellin advance of established standardsfor equipment running at that speed.

MORE THAN MOOREEver since Intel co-founder GordonMoore published his seminal article“Cramming More Components ontoIntegrated Circuits” in the April 19,1965 issue of Electronics magazine, thesemiconductor industry has doubled

the transistor counts of new devicesevery 22 months, in lockstep with theintroduction of every new siliconprocess. Like other companies in thesemiconductor business, Xilinx haslearned over the years that to lead themarket, it must keep pace with Moore’sLaw and create silicon on each new gen-eration of process technology—or bet-ter yet, be the first company to do so.

Now, at a time when the complexity,cost and thus risk of designing on thelatest process geometries are becomingprohibitive for a greater number ofcompanies, Xilinx has devised a uniqueway to more than double the capacityof its next-generation devices, theVirtex-7 FPGAs. By introducing one ofthe semiconductor industry’s firststacked-die architectures, Xilinx willfield a line of the world’s largestFPGAs. The biggest of these, the 28-nmVirtex-7 XC7V2000T, offers 2 millionlogic cells along with 46,512 kbits ofblock RAM, 2,160 DSP slices and 36GTX 10.3125-Gbps transceivers. TheVirtex-7 family includes multiple SSIFPGAs as well as monolithic FPGAconfigurations. Virtex-7 is the high endof the 7 series, which also includes thenew low-cost, low-power Artix™FPGAs and the midrange Kintex™FPGAs—all implemented on a unifiedApplication Specific Modular BlockArchitecture (ASMBL) architecture.

The new SSI technology is more thanjust a windfall for customers itching touse the biggest FPGAs the industry canmuster. The successful deployment ofstacked dice in a mainstream logic chipmarks a huge semiconductor engineer-ing accomplishment. Xilinx is deliver-ing a stacked silicon chip at a timewhen most companies are just evaluat-

‘More than Moore’ stacked silicon interconnect technology and 28-Gbps transceivers lead new era of FPGA-driven innovations.

ing stacked-die architectures in hopesof reaping capacity, integration, PCBreal-estate and even yield benefits.Most of these companies are lookingto stacked-die technology to simplykeep up with Moore’s Law—Xilinx isleveraging it today as a way to exceedit and as a way to mix and match com-plementary types of dice on a single ICfootprint to offer vast leaps forward insystem performance, bill-of-materials(BOM) savings and power efficiency.

THE STACKED SILICON ARCHITECTURE“This new stacked silicon intercon-nect technology allows Xilinx to offernext-generation density in the currentgeneration of process technology,”said Liam Madden, corporate vicepresident of FPGA development andsilicon technology at Xilinx. “As diesize gets larger, the yield goes downexponentially, so building large diceis quite difficult and very costly. Thenew architecture allows us to build a

number of smaller dice and then usea silicon interposer to connect thosesmaller dice lying side-by-side on topof the interposer so they appear tobe, and function as, one integrateddie” (Figure 1).

Each of the dice is interconnectedvia layers in the silicon interposer inmuch in the same way that discretecomponents are interconnected on themany layers of a printed-circuit board(Figure 2). The die and silicon inter-poser layer connect by means of multi-ple microbumps. The architecture alsouses through-silicon vias (TSVs) thatrun through the passive silicon inter-poser to facilitate direct communica-tion between regions of each die onthe device and resources off-chip(Figure 3). Data flows between theadjacent FPGA die across more than10,000 routing connections.

Madden said using a passive sili-con interposer rather than going witha system-in-package or multichip-module configuration has huge

advantages. “We use regular siliconinterconnect or metallization to con-nect up the dice on the device,” saidMadden. “We can get many moreconnections within the silicon thanyou can with a system-in-package.But the biggest advantage of thisapproach is power savings. Becausewe are using chip interconnect toconnect the dice, it is much moreeconomical in power than connect-ing dice through big traces, throughpackages or through circuit boards.”

In fact, the SSI technology providesmore than 100 times the die-to-dieconnectivity bandwidth per watt, atone-fifth the latency, without consum-ing any high-speed serial or parallelI/O resources.

Madden also notes that themicrobumps are not directly con-nected to the package. Rather, theyare interconnected to the passiveinterposer, which in turn is linked tothe adjacent die. This setup offersgreat advantages by shielding the


C O V E R S T O R Y

Silicon Interposer

>10K routing connectionsbetween slices

~1ns latency

ASMBLOptimizedFPGASlice

FPGA SlicesSide-by-Side

SiliconInterposer

Figure 1 – The stacked silicon architecture places several dice (aka slices) side-by-sideon a silicon interposer.

microbumps from electrostatic dis-charge. By positioning dice next toeach other and interfaced to the ball-grid array, the device avoids the ther-mal flux, signal integrity and designtool flow issues that would haveaccompanied a purely vertical die-stacking approach.

As with the monolithic 7 seriesdevices, Xilinx implemented the SSImembers of the Virtex-7 family inTSMC’s 28-nm HPL (high-perform-ance, low-power) process technolo-gy, which Xilinx and TSMC devel-oped to create FPGAs with the rightmix of power efficiency and perform-ance (see cover story sidebar, Xcell

Journal, Issue 72).

NO NEW TOOLS REQUIREDWhile the SSI technology offers someradical leaps forward in terms ofcapacity, Madden said it will notforce a radical change in customerdesign methodologies. “One of thebeautiful aspects of this architectureis that we were able to establish theedges of each slice [individual die inthe device] along natural partitionswhere we would have traditionallyrun long wires had these structuresbeen in our monolithic FPGA archi-tecture,” said Madden. “This meantthat we didn’t have to do anythingradical in the tools to support thedevices.” As a result, “customersdon’t have to make any major adjust-

ments to their design methods orflows,” he said.

At the same time, Madden said thatcustomers will benefit from addingfloor-planning tools to their flowsbecause they now have so many logiccells to use.

A SUPPLY CHAIN FIRSTWhile the design is in and of itselfquite innovative, one of the biggestchallenges of fielding such a devicewas in putting together the supplychain to manufacture, assemble, testand distribute it. To create the endproduct, each of the individual dicemust first be tested extensively at thewafer level, binned and sorted, and


C O V E R S T O R Y

28nm FPGA Slice 28nm FPGA Slice

Package Substrate

28nm FPGA Slice 28nm FPGA Slice

Microbumps

• Access to power / ground / IOs

• Access to logic regions

• Leverages ubiquitous image sensor microbump technology

Through-Silicon Vias (TSVs)

• Only bridge power / ground / IOs to C4 bumps

• Coarse pitch, low density aid manufacturability

• Etch process (not laser drilled)

Passive Silicon Interposer

• 4 conventional metal layers connect microbumps & TSVs

• No transistors means low risk and no TSV-induced performance degradation

• Etch process (not laser drilled)

Side-by-Side Die Layout

• Minimal heat flux issues

• Minimal design tool flow impact

Microbumps

Silicon Interposer

Through-Silicon Vias

C4 Bumps

BGA Balls

Figure 2 – Xilinx’s stacked silicon technology uses passive silicon-basedinterposers, microbumps and TSVs.

then attached to the interposer. Thecombined structure then needs to bepackaged and given a final test toensure connectivity before the endproduct ships to customers.

Madden’s group worked with TSMCand other partners to build this supplychain. “This is another first in theindustry, as no other company has putin place a supply chain like this acrossa foundry and OSAT [outsourcedsemiconductor assembly and test],”said Madden.

“Another beautiful aspect of thisapproach is that we can use essential-ly the same test approach that we usein our current devices,” he wenton.“Our current test technology allowsus to produce known-good dice, andthat is a big advantage for us becausein general, one of the biggest barriersof doing stacked-die technology is howdo you test at the wafer level.”

Because the stacked silicon tech-nology integrates multiple XilinxFPGA dice on a single IC, it logicallyfollows that the architecture wouldalso lend itself to mixing and matchingFPGA and other dice to create entirelynew devices. And that’s exactly what

Xilinx did with its ultrafast Virtex-7 HTline, announced just weeks after theSSI technology rollout.

DRIVING COMMUNICATIONS TO 400 GBPSThe new Virtex-7 HT line of devices istargeted squarely at communicationscompanies that are developing 100- to400-Gbps equipment. The Virtex-7 HTcombines on a single IC multiple 28-nm FPGA dice, bearing dozens of13.1-Gbps transceivers, with 28-Gbpstransceiver dice. The result is toendow the final device with a formi-dable mix of logic cells as well as cut-ting-edge transceiver performanceand reliability.

The largest of the Virex-7 HT lineincludes sixteen GTZ 28-Gbps trans-ceivers, seventy-two 13.1-Gbps trans-ceivers plus logic and memory, offeringtransceiver performance and capacityfar greater than competing devices(see Video 1, http://www.youtube.com/

user/XilinxInc#p/c/71A9E924ED61

B8F9/1/eTHjt67ViK0).

C O V E R S T O R Y


Figure 3 –Actual cross-section of the 28-nm Virtex-7 device. TSVs can be seenconnecting the microbumps (dotted line, top) through the silicon interposer.

Video 1 – Dr. Howard Johnsonintroduces the 28-Gbps transceiver-laden Virtex-7 HT.http://www.youtube.com/user/XilinxInc#p/c/71A9E924ED61B8F9/1/eTHjt67ViK0.

http://www.youtube.com/user/XilinxInc#p/c/71A9E924ED61B8F9/1/eTHjt67ViK0

http://www.youtube.com/user/XilinxInc#p/c/71A9E924ED61B8F9/1/eTHjt67ViK0

“We leveraged stacked interconnecttechnology to offer Virtex-7 deviceswith a 28G capability,” said Madden. “Byoffering the transceivers on a separatedie, we can optimize our 28-Gbps trans-ceiver performance and electrically iso-late functions to offer an even higherdegree of reliability for applicationsrequiring cutting-edge transceiver per-formance and reliability.”

With the need for bandwidthexploding, the communications sectoris franticly racing to establish new net-works. The wireless industry is scram-bling to produce equipment support-ing 40-Gbps data transfer today, whilewired networking is approaching 100Gbps. FPGAs have played a key role injust about every generation of net-working equipment since their incep-tion (see cover stories in Xcell

Journal, Issues 65 and 67).

Communications equipment designteams traditionally have used FPGAsto receive signals sent to equipmentin multiple protocols, translate thosesignals to common protocols that theequipment and network use, and thenforward the data to the next destina-tion. Traditionally companies haveplaced a processor in between FPGAsmonitoring and translating incomingsignals and those FPGAs forwardingsignals to their destination. But asFPGAs advance and grow in capacityand functionality, a single FPGA canboth send and receive, while also per-forming processing, to add greaterintelligence and monitoring to thesystem. This lowers the BOM and,more important, reduces the powerand cooling costs of networkingequipment, which must run reliably24 hours a day, seven days a week.

In a white paper titled “Industry’sHighest-Bandwidth FPGA EnablesWorld’s First Single-FPGA Solution for400G Communications Line Cards,”Xilinx’s Greg Lara outlines several com-munications equipment applicationsthat can benefit from the Virtex-7 HTdevices (see http://www.xilinx.com/

support/documentation/white_papers/

wp385_V7_28G_for_400G_Comm_

Line_Cards.pdf).To name a few, Virtex-7 HT FPGAs

can find a home in 100-Gbps linecards supporting OTU-4 (OpticalTransfer Unit) transponders. Theycan be used as well in muxponders orservice aggregation routers, in lower-cost 120-Gbps packet-processing linecards for highly demanding data pro-cessing, in multiple 100G Ethernetports and bridges, and in 400-GbpsEthernet line cards. Other potentialapplications include base stationsand remote radio heads with 19.6-Gbps Common Public Radio Interfacerequirements, and 100-Gbps and 400-Gbps test equipment.

JITTER AND EYE DIAGRAMA key to playing in these markets isensuring the FPGA transceiver sig-nals are robust, reliable and resistantto jitter or interference and to fluctu-ations caused by power system noise.For example, the CEI-28G specifica-

tion calls for 28-Gbps networkingequipment to have extremely tightjitter budgets.

Signal integrity is an extremelycrucial factor for 28-Gbps operation,said Panch Chandrasekaran, seniormarketing manager of FPGA compo-nents at Xilinx. To meet the stringentCEI-28G jitter budgets, the trans-ceivers in the new Xilinx FPGAsemploy phase-locked loops (PLLs)based on an LC tank design andadvanced equalization circuits to off-set deterministic jitter.

“Noise isolation becomes a veryimportant parameter at 28-Gbps sig-naling speeds,” said Chandrasekaran.“Because the FPGA fabric and trans-ceivers are on separate dice, the sen-sitive 28-Gbps analog circuitry is iso-lated from the digital FPGA circuits,providing superior isolation com-pared to monolithic implementa-tions” (Figures 4a and 4b).

The FPGA design also includesfeatures that minimize lane-to-laneskew, allowing the devices to sup-port stringent optical standards suchas the Scalable Serdes FramerInterface standard (SFI-S).

Further, the GTZ transceiverdesign eliminates the need fordesigners to employ external refer-ence resistors, lowering the BOMcosts and simplifing the boarddesign. A built-in “eye scan” func-tion automatically measures theheight and width of the post-equal-ization data eye. Engineers can usethis diagnostic tool to perform jitterbudget analysis on an active chan-nel and optimize transceiver param-eters to get optimal signal integrity,all without the expense of special-ized equipment.

ISE® Design Suite software toolsupport for 7 series FPGAs is avail-able today. Virtex-7 FPGAs with mas-sive logic capacity, thanks to the SSItechnology, will be available thisyear. Samples of the first Virtex-7 HTdevices are scheduled to be availablein the first half of 2012. For moreinformation on Virtex-7 FPGAs andSSI technology, visit http://www.

xilinx.com/technology/roadmap/

7-series-fpgas.htm.


C O V E R S T O R Y

Figure 4a –Xilinx 28-Gbps transceiver displays an excellent eye opening

and jitter performance (using PRBS31 data pattern).

Figure 4b – This is a competing device’s 28-Gbps signal using a much simpler

PRBS7 pattern. The signal is extremelynoisy with a significantly smaller

eye opening. Eye size is shown close to relative scale.

http://www.xilinx.com/support/documentation/white_papers/wp385_V7_28G_for_400G_Comm_Line_Cards.pdf

http://www.xilinx.com/technology/roadmap/7-series-fpgas.htm


MicroBlaze Hosts Mobile Multisensor Navigation System

Researchers used Xilinx’s soft-core processor to develop an integrated navigation solution that works in places where GPS doesn’t.

XCELLENCE IN AEROSPACE & DEFENSE

by Walid Farid AbdelfatahResearch AssistantNavigation and Instrumentation Research Group Queen's University [email protected]

Jacques GeorgyAlgorithm Design EngineerTrusted Positioning Inc.

Aboelmagd NoureldinCross-Appointment Associate Professor in ECE Dept. Queen's University / Royal Military College of Canada



T he global positioning system(GPS) is a satellite-based navi-gation system that is widely

used for different navigation applica-tions. In an open sky, GPS can providean accurate navigation solution.However, in urban canyons, tunnelsand indoor environments that blockthe satellite signals, GPS fails to pro-vide continuous and reliable coverage.

In the search for a more accurate andlow-cost positioning solution in GPS-denied areas, researchers are develop-ing integrated navigation algorithmsthat utilize measurements from low-cost sensors such as accelerometers,gyroscopes, speedometers, barometersand others, and fuses them with themeasurements from the GPS receiver.The fusion is accomplished using eitherKalman filter (KF), particle filter or arti-ficial-intelligence techniques.

Once a navigation algorithm isdeveloped, verified and proven to beworthy, the ultimate goal is to put iton a low-cost, real-time embeddedsystem. Such a system must acquireand synchronize the measurements

from the different sensors, then applythe navigation algorithm, which willintegrate these aligned measure-ments, yielding a real-time solution ata defined rate.

The transition from algorithmresearch to realization is a crucial stepin assessing the practicality and effec-tiveness of a navigation algorithm and,consequently, the developed embeddedsystem, either for a proof of concept orto be accepted as a consumer product.In the transition process, there is nounique methodology that designers canfollow to create an embedded system.Depending on the platform of choice—such as microcontrollers, digital signalprocessors and field-programmablegate arrays—system designers use dif-ferent methodologies to develop thefinal product.

MICROBLAZE FRONT AND CENTERIn research at Queen’s University inKingston, Ontario, we utilized a Xilinx®

MicroBlaze® soft-core processor builton the ML402 evaluation kit, whichfeatures a Virtex®-4 SX35 FPGA. The

applied navigation algorithm is a 2DGPS/reduced inertial sensor system(RISS) integration algorithm thatincorporates measurements from agyroscope and a vehicle’s odometeror a robot’s wheel encoders, alongwith those from a GPS receiver. Inthis way, the algorithm computes anintegrated navigation solution that ismore accurate than standalone GPS.For a vehicle moving in a 2D plane, itcomputes five navigation states con-sisting of two position parameters(latitude and longitude), two velocityparameters (East and North) and oneorientation parameter (azimuth).

The 2D RISS/GPS integration algo-rithm involves two steps: RISS mecha-nization and RISS/GPS data fusionusing Kalman filtering. RISS mecha-nization is the process of transform-ing the measurements of an RISS sys-tem acquired in the vehicle frame intoposition, velocity and heading in ageneric frame. It is a recursiveprocess based on the initial condi-tions, or the previous output and thenew measurements. We fuse the posi-

TimerLocal

Memory Debugger

UART(1)

UART(2)

UART(3)

MicroBlazeCore

UART(4)

LEDs GPIO

PPS GPIOInterrupt

Controller

PushbuttonsGPIO

FPU

GPSMeasurements

PPS Signal

IMUMeasurements

EncoderMeasurements

Host Machine

Simple UserInterfaceCrossbow

IMU

Microcontrollersending the

robot’s encodersdata

Nov/Atel SPAN

Virtex-4

X C E L L E N C E I N A E R O S PA C E & D E F E N S E

Figure 1 – MicroBlaze connections with the mobile robot sensors and the host machine

tion and velocity measurements fromthe GPS with the RISS-computedposition and velocity using the con-ventional KF method in a closed-loop, loosely coupled fashion. [1, 2]

A soft-core processor such asMicroBlaze provides many advan-tages over off-the-shelf processors indeveloping a mobile multisensornavigation system. [3] They includecustomization, the ability to design amultiprocessor system and hard-ware acceleration.

With a soft-core device, designersof the FPGA-based embedded proces-sor system have the total flexibility toadd any custom combination ofperipherals and controllers. You canalso design a unique set of peripheralsfor specific applications, adding asmany peripherals as needed to meetthe system requirements.

Moreover, more-complex embed-ded systems can benefit from the exis-tence of multiple processors that canexecute the tasks in parallel. By usinga soft-core processor and accompany-ing tools, it’s an easy matter to createa multiprocessor-based system-on-chip. The only restriction to the num-

ber of processors that you can add isthe availability of FPGA resources.

Finally, one of the most compellingreasons for choosing a soft-coreprocessor is the ability to concurrentlydevelop hardware and software, andhave them coexist on a single chip.System designers need not worry if asegment of the algorithm is found to bea bottleneck; all you have to do to elim-inate such a problem is design a cus-tom coprocessor or a hardware circuitthat utilizes the FPGA’s parallelism.

INTERFACES AND ARITHMETICFor the developed navigation systemat Queen’s University, we mainly need-ed a system capable of interfacingwith three sensors and a personalcomputer via a total of four serialports, each of which requires a univer-sal asynchronous receiver transmitter(UART) component. We use the serialchannel between the embedded sys-tem and a PC to upload the sensor’smeasurements and the navigationsolution for analysis and debugging. Inaddition, the design also requires gen-eral-purpose input/output (GPIO)channels for user interfacing, a timer

for code profiling and an interruptcontroller for the event-based system.

We also needed a floating-pointunit (FPU), which can execute thenavigation algorithm once the meas-urements are available in the leasttime possible. Although the FPU usedin the MicroBlaze core is single-preci-sion, we developed our algorithmmaking use of double-precision arith-metic. Consequently, we used soft-ware libraries to emulate our double-precision operations on the single-precision FPU, resulting in a slightlydenser code but a far better naviga-tion solution.

We first tested the developed navi-gation system on a mobile robot toreveal system bugs and integrationproblems, and then on a standardautomobile for further testing. Wechose to start the process with itera-tive testing on the mobile robot due tothe flexibility and ease of systemdebugging compared with a vehicle.We compared the solution with an off-line navigation solution computed on aPC and with the results from a high-end, high-cost, real-time NovAtelSPAN system. The SPAN system com-bines a GPS receiver and high-end tac-tical-grade inertial measurement unit(IMU), which acts as a reference.Figure 1 shows the MicroBlaze proces-sor connections with the RISS/GPSsystem sensors on a mobile robot.

Figure 2 presents an open-sky tra-jectory which we acquired at theRoyal Military College of Canada for9.6 minutes; the solution of the devel-oped real-time system is compared tothe GPS navigation solution and theSPAN reference system. The differ-ence between the NovAtel SPAN ref-erence solution and our solution isshown in Figure 3. Both solutions arereal-time, and the difference in per-formance is due to our use of thelow-cost MEMS-based CrossbowIMU300CC-100 IMU in the developedsystem. The SPAN unit, by contrast,uses the high-cost, tactical-gradeHG1700 IMU.



Reference (SPAN)

GPS

MicroBlaze (real-time)

Figure 2 – Mobile robot trajectory, open-sky area

Figure 4 presents the differencebetween the positioning solutionsfrom the C double-precision off-linealgorithm and the C single-precisionoff-line algorithm computations,where the difference is in the range ofhalf a meter. The results show that thechoice of emulating the double-preci-sion operations on the MicroBlazecore is far more accurate than usingsingle-precision arithmetic, whichwould have consumed less memoryand less processing time.

MULTIPROCESSING ANDHARDWARE ACCELERATIONUtilizing a soft-core processor such asthe MicroBlaze in a mobile navigationsystem composed of various sensorsprovides the flexibility to communi-cate with any number of sensors thatuse various interfaces, without anyconcern with the availability of theperipherals beforehand, as would betrue if we had used off-the-shelfprocessors. In terms of computation,designers can build the navigation

algorithm using embedded softwarethat utilizes the single-precision FPUavailable within the MicroBlaze, as isthe case in this research. You can alsopartition a design between softwareand hardware by implementing thesegments of the algorithm that arefound to be a bottleneck as a customcoprocessor or a hardware circuit thatspeeds up the algorithm.

The future of this research canprogress in two directions. The first isto use a double-precision floating-point unit directly with the MicroBlazeprocessor instead of double-precisionemulation. The second is related toimplementing more powerful and chal-lenging navigation algorithms that uti-lize more sensors to be integrated viaparticle filtering or artificial-intelli-gence techniques, which can make useof the power of multiprocessing andhardware acceleration.

More information about the embed-ded navigation system design processand more results can be found in themaster’s thesis from which this workwas derived. [4]

REFERENCES

[1] U. Iqbal, A. F. Okou and A.

Noureldin, “An integrated reduced iner-

tial sensor system — RISS / GPS for land

vehicle,” Position, Location and

Navigation Symposium proceedings,

2008 IEEE/ION.

[2] U. Iqbal and A. Noureldin, Integrated

Reduced Inertial Sensor System/GPS for

Vehicle Navigation: Multisensor position-

ing system for land applications involving

single-axis gyroscope augmented with a

vehicle odometer and integrated with

GPS, VDM Verlag Dr. Müller, 2010.

[3] B. H. Fletcher, “FPGA Embedded

Processors—Revealing True System

Performance,” Embedded Systems

Conference, 2005.

[4] Walid Farid Abdelfatah, “Real-Time

Embedded System Design and Realization

for Integrated Navigation Systems” 2010.

http://hdl.handle.net/1974/6128



Figure 3 – Mobile robot trajectory: Difference between NovAtel SPAN reference and MicroBlaze real-time solutions

Figure 4 – Mobile robot trajectory: Difference between C double-precision and single-precision off-line solutions

http://hdl.handle.net/1974/6128


Innovation in the global creditderivatives markets rests on thedevelopment and intensive use of

complex mathematical models. Asportfolios of complicated credit instru-ments have expanded, the process ofvaluing them and managing the riskhas grown to a point where financialorganizations use thousands of CPUcores to calculate value and risk daily.This task in turn requires vast amounts

of electricity for bothpower and cooling.

In 2005, the world’s estimat-ed 27 million servers consumed

around 0.5 percent of all electricityproduced on the planet, and the figureedges closer to 1 percent when the ener-gy for associated cooling and auxiliaryequipment (for example, backup power,power conditioning, power distribution,air handling, lighting and chillers) isincluded. Any savings from falling hard-ware costs are increasingly offset byrapidly rising power-related indirectcosts. No wonder, then, that large finan-cial institutions are searching for a wayto add ever-greater computationalpower at a much lower operating cost.

FPGAs Speed theComputation of ComplexCredit Derivatives

XCELLENCE IN HIGH-PERFORMANCE COMPUTING

by Stephen WestonManaging DirectorJ.P. [email protected]

James SpoonerHead of Acceleration (Finance)Maxeler [email protected]

Jean-Tristan MarinVice PresidentJ.P. [email protected]

Oliver PellVice President of EngineeringMaxeler [email protected]

Oskar MencerCEOMaxeler [email protected]

Financial organizations can assess value and risk

30x faster on Xilinx-accelerated systems than those

using standard multicore processors.

© 2010 IEEE. Reprinted, with permission, from Stephen Weston, Jean-Tristan Marin, James Spooner, Oliver Pell, Oskar Mencer,“Accelerating thecomputation of portfolios of tranched credit derivatives,” IEEE Workshop on High Performance Computational Finance, November 2010.







Toward this end, late in 2008 theApplied Analytics group at J.P. Morganin London launched a collaborativeproject with the acceleration-solutionsprovider Maxeler Technologies thatremains ongoing. The core of the proj-ect was the design and delivery of aMaxRack hybrid (Xilinx® FPGA andIntel CPU) cluster solution byMaxeler. The system has demonstrat-ed more than 31x acceleration of thevaluation for a substantial portfolio ofcomplex credit derivatives comparedwith an identically sized system that

uses only eight-core Intel CPUs.The project reducesoperational expendi-

tures more thanthirtyfold by build-

ing a customizedhigh-efficiency high-performance com-

puting (HPC) system.At the same time, it deliv-

ers a disk-to-disk speedup ofmore than an order of magnitude

over comparable systems, enablingthe computation of an order-of-mag-nitude more scenarios, with directimpact on the credit derivatives busi-ness at J.P. Morgan. The Maxeler sys-tem clearly demonstrates the feasibil-ity of using FPGA technology to sig-nificantly accelerate computationsfor the finance industry and, in par-ticular, complex credit derivatives.

RELATED WORKThere is a considerable body of pub-lished work in the area of accelerationfor applications in finance. A sharedtheme is adapting technology to accel-erate the performance of computation-ally demanding valuation models. Ourwork shares this thrust and is distin-guished in two ways.

The first feature is the computa-tional method we used to calculatefair value. A common method is asingle-factor Gaussian copula, amathematical model used in finance,to model the underlying credits,along with Monte Carlo simulation

to evaluate the expected value of theportfolio. Our model employs stan-dard base correlation methodology,with a Gaussian copula for defaultcorrelation and a stochastic recoveryprocess. We compute the expected(fair) value of a portfolio using a con-volution approach. The combinationof these methods presents a distinctcomputational, algorithmic and per-formance challenge.

In addition, our work encompassesthe full population of live complexcredit trades for a major investmentbank, as opposed to relying on theusual approach of using synthetic datasets. To fully grasp the computingchallenges, it’s helpful to take a closerlook at the financial products knownas credit derivatives.

WHAT ARE CREDIT DERIVATIVES?Companies and governments issuebonds which provide the ability to fundthemselves for a predefined period,rather like taking a bank loan for afixed term. During their life, bonds typ-ically pay a return to the holder, whichis referred to as a “coupon,” and atmaturity return the borrowed amount.The holder of the bond, therefore, faces

a number of risks, the most obviousbeing that the value of the bond may fallor rise in response to changes in thefinancial markets. However, the risk weare concerned with here is that when thebond reaches maturity, the issuer maydefault and fail to pay the coupon, thefull redemption value or both.

Credit derivatives began life by offer-ing bond holders a way of insuringagainst loss on default. They haveexpanded to the point where they nowoffer considerably greater functionalityacross a range of assets beyond simplebonds, to both issuers and holders, byenabling the transfer of default risk inexchange for regular payments. The sim-plest example of such a contract is acredit default swap (CDS), where thedefault risk of a single issuer isexchanged for regular payments. The keydifference between a bond and a CDS isthat the CDS only involves payment of aredemption amount (minus any recov-ered amount) should default occur.

Building on CDS contracts, creditdefault swap indexes (CDSIs) allow thetrading of risk using a portfolio ofunderlying assets. A collateralizeddefault obligation (CDO) is an exten-sion of a CDSI in which default losses

X C E L L E N C E I N H I G H - P E R F O R M A N C E C O M P U T I N G

ReferencePortfolio

100 EquallyWeighted

Credits100%

6%attachment

Super

Mezzanine

Equity

5%attachment

5-6%tranche

Tranche1%

5%subordination

Source: J.P. Morgan

Figure 1 –Tranched credit exposure

on the portfolio of assets are dividedinto different layers or “tranches,”according to the risk of default. Atranche allows an investor to buy orsell protection for losses in a certainpart of the pool.

Figure 1 shows a tranche between 5percent and 6 percent of the losses ina pool. If an investor sells protectionon this tranche, he or she will not beexposed to any losses until 5 percentof the total pool has been wiped out.The lower (less senior) tranches havehigher risk and those selling protec-tion on these tranches will receivehigher compensation than thoseinvesting in upper (more senior, hence

less likely to default) tranches. CDOs can also be defined on non-

standard portfolios. This variety ofCDO, referred to as “bespoke,” pres-ents additional computational chal-lenges, since the behavior of a givenbespoke portfolio is not directlyobservable in the market.

The modeling of behavior for a poolof credits becomes complex, as it isimportant to model the likelihood ofdefault among issuers. Corporatedefaults, for example, tend to be sig-nificantly correlated because all firmsare exposed to a common or correlat-ed set of economic risk factors, so thatthe failure of a single company tends

to weaken the remaining firms.The market for credit derivatives

has grown rapidly over the past 10years, reaching a peak of $62 trillionin January 2008, in response to a widerange of influences on both demandand supply. CDSes account forapproximately 57 percent of thenotional outstanding, with theremainder accounted for by productswith multiple underlying credits.“Notional” refers to the amount ofmoney underlying a credit derivativecontract; “notional outstanding” isthe total amount of notional in cur-rent contracts in the market.

A CREDIT DERIVATIVES MODELThe standard way of pricing CDOtranches where the underlying is abespoke portfolio of reference creditsis to use the “base correlation”approach, coupled with a convolutionalgorithm that sums conditionallyindependent loss-random variables,with the addition of a “coupon model”to price exotic coupon features. Sincemid-2007 the standard approach in thecredit derivatives markets has movedaway from the original Gaussian cop-ula model. Two alternative approach-es to valuation now dominate: the ran-dom-factor loading (RFL) model andthe stochastic recovery model.

Our project has adopted the latterapproach and uses the followingmethodology as a basis for accelerat-ing pricing: In the first step, we dis-cretize and compute the loss distri-bution using convolution, given theconditional survival probabilitiesand losses from the copula model.Discretization is a method of chunk-ing a continious space into a discretespace and separating the space intodiscrete “bins,” allowing algorithms(like integration) to be done numeri-



for i in 0 ... markets - 1

for j in 0 ... names - 1

prob = cum_norm ((inv_norm(Q[j])- sqrt(rho)*M[i])/sqrt(1 - rho);

loss=calc_loss(prob,Q2[j],RR[j],RM[j])*notional[j];

n = integer(loss);

L = fractional(loss);

for k in 0 ... bins - 1

if j == 0

dist[k] = k == 0 ? 1.0 : 0.0;

dist[k] = dist[k]*(1- prob) +

dist[k - n]* prob *(1 - L) +

dist[k - n - 1]* prob*L;

if j == credits - 1

final_dist [k] += weight[ i ] * dist[k];

end # for k

end # for j

end # for i

The modeling of behavior for a pool of credits becomes complex, as it is important to modelthe likelihood of default among issuers. Corporate defaults, for example, tend to be

significantly correlated—the failure of a single company tends to weaken the remaining firms.

Figure 2 – Pseudo code for the bespoke CDO tranche pricing

cally. We then use the standardmethod of discretizing over the twoclosest bins with a weighting suchthat the expected loss is conserved.We compute the final loss distribu-tion using a weighted sum over all ofthe market factors evaluated, usingthe copula model.

Pseudo code for this algorithm isshown in Figure 2. For brevity, we’veremoved the edge cases of the convolu-tion and the detail of the copula andrecovery model.

ACCELERATED CDO PRICINGEach day the credit hybrids businesswithin J.P. Morgan needs to evaluatehundreds of thousands of credit deriv-ative instruments. A large proportionof these are standard, single-nameCDSes that need very little computa-tional time. However, a substantialminority of the instruments aretranched credit derivatives thatrequire the use of complex models likethe one discussed above. These dailyruns are so computationally intensivethat, without the application ofMaxeler’s acceleration techniques,they could only be meaningfully car-ried out overnight. To complete thetask, we used approximately 2,000standard Intel cores. Even with suchresources available, the calculationtime is around four and a half hoursand the total end-to-end runtime isclose to seven hours, when batchpreparation and results writeback aretaken into consideration.

With this information, the acceler-ation stage of the project focused ondesigning a solution capable of deal-ing with only the complex bespoketranche products, with a specific goalof exploring the HPC architecturedesign space in order to maximizethe acceleration.

MAXELER’S ACCELERATION PROCESSThe key to maximizing the speed ofthe final design is a systematic anddisciplined end-to-end acceleration

methodology. Maxeler follows the fourstages of acceleration shown in Figure3, from the initial C++ design to thefinal implementation, which ensureswe arrive at the optimal solution.

In the analysis stage, we conduct-ed a detailed examination of the algo-rithms contained in the original C++model code. Through extensive codeand data profiling with the MaxelerParton profiling tool suite, we wereable to clearly understand and mapthe relationships between the com-putation and the input and outputdata. Part of this analysis involvedacquiring a full understanding of howthe core algorithm performs in prac-tice, which allowed us to identify themajor computational and data move-ments, as well as storage costs.Dynamic analysis using call graphs ofthe running software, combined withdetailed analysis of data values and



Original Application

Analysis

Transformation

Partitioning

Implementation

AcceleratedApplication

Achievesperformance

Sets theoreticalperformancebounds

// Shared library and call overhead 1.2 %

for d in 0 ... dates - 1

// Curve Interpolation 19.7%


Q[j] = Interpolate(d, curve)

end # for j



// Copula Section 22.0%

prob = cum_norm (( inv_norm (Q[j]) - sqrt(rho)*M[i])/sqrt (1 - rho);

loss= calc_loss (prob,Q2[j],RR[j],RM[j])*notional[j];

n = integer(loss);

lower = fractional(loss);


if j == 0

dist[k] = k == 0 ? 1.0 : 0.0 ;

// Convolution 51.4%

dist[k] = dist[k]*(1- prob) +

dist[k- n]* prob *(1 - L) +

dist[k- n - 1]* prob *L;

// Reintegration, 5.1%

if j == credits - 1


end # for k

end # for j

end # for i

end # for d

// Code outside main loop 0.5%

Figure 3 – Iterative Maxeler process for accelerating software

Figure 4 – Profiled version of original pricing algorithm in pseudo-code form

runtime performance, were neces-sary steps to identifying bottlenecksin execution as well as memory uti-lization patterns.

Profiling the core 898 lines of origi-nal source code (see Figure 4) focusedattention on the need to acceleratetwo main areas of the computation:calculation of the conditional-survivalprobabilities (Copula evaluation) andcalculation of the probability distribu-tion (convolution).

The next stage, transformation,extracted and modified loops and pro-gram control structure from the existingC++ code, leading to code and data lay-out transformations that in turn enabledthe acceleration of the core algorithm.We removed data storage abstractionusing object orientation, allowing datato pass efficiently to the accelerator,with low CPU overhead. Critical trans-formations included loop unrolling,reordering, tiling and operating on vec-tors of data rather than single objects.

PARTITIONING AND IMPLEMENTATIONThe aim for the partitioning stage ofthe acceleration process was to cre-ate a contiguous block of operations.This block needed to be tractable toaccelerate, and had to achieve themaximum possible runtime coverage,

when balanced with the overall CPUperformance and data input and out-put considerations.

The profiling process we identifiedduring the analysis stage provided thenecessary insight to make partitioningdecisions.

The implementation of the parti-tioned design led to migrating the loopscontaining the Copula evaluation andconvolution onto the FPGA accelera-tor. The remaining loops and associat-

ed computation stayed on the CPU.Within the FPGA design, we split theCopula and convoluter into separatecode implementations that could exe-cute and be parallelized independentlyof each other and could be sizedaccording to the bounds of the loopsdescribed in the pseudo code.

In the past, some design teams haveshied away from FPGA-based accelera-tion because of the complexity of theprogramming task. Maxeler’s program-

ming environment, called MaxCompiler,raises the level of abstraction of FPGAdesign to enable rapid development andmodification of streaming applications,even when faced with frequent designupdates and fixes.

IMPLEMENTATION DETAILSMaxCompiler builds FPGA designs in asimple, modular fashion. A design hasone or more “kernels,” which are high-ly parallel pipelined blocks for execut-

ing a specific computation. The “man-ager” dynamically oversees the data-flow I/O between these kernels.Separating computation and communi-cation into kernels and managerenables a high degree of pipeline paral-lelism within the kernels. This paral-lelism is pivotal to achieving the per-formance our application enjoys. Weachieved a second level of parallelismby replicating the compute pipelinemany times within the kernel itself, fur-ther multiplying speedup.

The number of pipelines that can bemapped to the accelerator is limitedonly by the size of the FPGAs used in theMaxNode and available parallelizationin the application. MaxCompiler lets youimplement all of the FPGA design effi-ciently in Java, without resorting tolower-level languages such as VHDL.

For our accelerator, we used theJ.P. Morgan 10-node MaxRack config-ured with MaxNode-1821 computenodes. Figure 5 sketches the systemarchitecture of a MaxNode-1821. Eachnode has eight Intel Xeon cores andtwo Xilinx FPGAs connected to theCPU via PCI Express®. A MaxRinghigh-speed interconnect is also avail-able, providing a dedicated high-band-



MaxRing

FPGA FPGAPCIe PCIe

XeonCores

Some design teams have shied away fromFPGA-based acceleration because of the

complexity of programming. MaxCompiler lets you implement the FPGA design in Java, without

resorting to lower-level languages such as VHDL.

Figure 5 – MaxNode-1821 architecture diagram containing eight Intel Xeon cores and two Xilinx FPGAs

width communication channel directlybetween the FPGAs.

One of the main focus points dur-ing this stage was finding a designthat balanced arithmetic optimiza-tions, desired accuracy, power con-sumption and reliability. We decidedto implement two variations of thedesign: a “full-precision” design forfair-value (FV) calculations and a“reduced-precision” version for sce-nario analysis. We designed the full-precision variant for an average rela-tive accuracy of 10-8 and the reduced-precision for an average relativeaccuracy of 10-4. These design pointsshare identical implementations,varying only in the compile-timeparameters of precision, parallelismand clock frequency.

We built the Copula and convolutionkernels to implement one or morepipelines that effectively parallelize

loops within the computation. Since theconvolution uses each value the Copulamodel generates many times, the kerneland manager components were scala-ble to enable us to tailor the exact ratioof Copula or convolution resources tothe workload for the design. Figure 6shows the structure of the convolutionkernel in the design.

When implemented in MaxCompiler,the code for the convolution in theinner loop resembles the originalstructure. The first difference is thatthere is an implied loop, as datastreams through the design, ratherthan the design iterating over the data.Another important distinction is thatthe code now operates in terms of adata stream (rather than on an array),such that the code is now describing astreaming graph, with offsets forwardand backward in the stream as neces-sary, to perform the convolution.

Figure 7 shows the core of the codefor the convoluter kernel.

IMPLEMENTATION EFFORTIn general, software programming ina high-level language such as C++ ismuch easier than interacting directlywith FPGA hardware using a low-level hardware description language.Therefore, in addition to performance,development and support time areincreasingly significant components ofoverall effort from the software-engi-neering perspective. For our project,we measured programming effort asone aspect in the examination of pro-gramming models. Because it is diffi-cult to get accurate development-timestatistics and to measure the quality ofcode, we use lines of code (LOC) as ourmetric to estimate programming effort.

Due to the lower-level nature ofcoding for the FPGA architecture



Conditional Survival Probabilitiesand Discretized Loss Inputs

Market Factors Unrolled

Market Weight Inputs

Accumulated Loss Distribution(weighted sum)

Loss Distribution Output

Convoluter

Buffer

Nam

es U

nro

lled

Figure 6 – Convoluter architecture diagram

when compared with standard C++,the original 898 lines of code generat-ed 3,696 lines of code for the FPGA(or a growth factor of just over four).

Starting with the program that runson a CPU and iterating in time, wetransformed the program into the spa-tial domain running on the MaxNode,creating a structure on the FPGA thatmatched the data-flow structure ofthe program (at least the computa-tionally intensive parts). Thus, weoptimized the computer based on theprogram, rather than optimizing theprogram based on the computer.Obtaining the results of the computa-tion then became a simple matter ofstreaming the data through theMaxeler MaxRack system.

In particular, the fact we can usefixed-point representation for manyof the variables in our application isa big advantage, because FPGAsoffer the opportunity to optimizeprograms on the bit level, allowingus to pick the optimal representationfor the internal variables of an algo-rithm, choosing precision and rangefor different encodins such as float-ing-point, fixed-point and logarith-mic numbers.

We used Maxeler Parton to meas-ure the amount of real time taken forthe computation calls in the originalimplementation and in the acceleratedsoftware. We divided the total soft-ware time running on a single core byeight to attain the upper bound on themulticore performance, and com-pared it to the real time it took to per-form the equivalent computationusing eight cores and two FPGAs withthe accelerated software.

For power measurements, we usedan Electrocorder AL-2VA from AcksenLtd. With the averaging window set toone second, we recorded a 220-sec-ond window of current and voltagemeasurements while the software wasin its core computation routine.

As a benchmark for demonstrationand performance evaluation, we used afixed population of 29,250 CDO tranch-



HWVar d = io.input("inputDist", _distType);

HWVar p = io.input("probNonzeroLoss", _probType);

HWVar L = io.input("lowerProportion", _propType);

HWVar n = io.input("discretisedLoss", _operType);

HWVar lower = stream.offset(-n-1,-maxBins,0,d);

HWVar upper = stream.offset(-n,-maxBins,0,d);

HWVar o = ((1-p)*d + L*p*lower + (1-L)*p*upper);

io.output("outputDist", _distType, o);

// Shared library and call overhead 5%

for d in 0 ... dates - 1

// Curve Interpolation 54.5%


Q[j] = Interpolate(d, curve)

end # for j



// Copula Section 9%

prob = cum_norm (( inv_norm (Q[j]) - sqrt (rho)*M[i])/sqrt(1- rho);

loss= calc_loss (prob,Q2[j],RR[j],RM[j])*notional[j];

// (FPGA Data preparation and post processing) 11.2%

n = integer(loss);

lower = fractional(loss);


if j == 0

dist[k] = k == 0 ? 1.0 : 0.0 ;

dist[k] = dist[k]*(1 - prob) +

dist[k - n]* prob *(1 -L) +

dist[k - n - 1]* prob *L;

if j == credits - 1


end # for k

end # for j

end # for i

end # for d

// Other overhead (object construction, etc) 19.9%

Figure 7 – MaxCompiler code for adding a name to a basket

Figure 8 – Profiled version of FPGA take on pricing

es. This population comprised realbespoke tranche trades of varying matu-rity, coupon, portfolio composition andattachment/detachment points.

The MaxNode-1821 delivered a 31xspeedup over an eight-core (Intel XeonE5430 2.66-GHz) server in full-preci-sion mode and a 37x speedup atreduced precision, both nodes usingmultiple processes to price up to eighttranches in parallel.

The significant differences betweenthe two hardware configurationsinclude the addition of the MAX2-4412C card with two Xilinx FPGAs and24 Gbytes of additional DDR DRAM.Figure 8 shows the CPU profile of thecode running with FPGA acceleration.

As Table 1 shows, the power usageper node decreases by 6 percent in thehybrid FPGA solution, even with a 31xincrease in computational performance.It follows that the speedup per wattwhen computing is actually greater thanthe speedup per cubic foot.

Table 2 shows a breakdown of CPUtime of the Copula vs. convolution

computations and their equivalentresource utilization on the FPGA. Aswe have exchanged time for spacewhen moving to the FPGA design,this gives a representative indicationof the relative speedup between thedifferent pieces of computation.

THREE KEY BENEFITSThe credit hybrids trading groupwithin J.P. Morgan is reaping sub-stantial benefits from applyingacceleration technology. The order-of-magnitude increase in availablecomputation has led to three keybenefits: computations run muchfaster; additional applications andalgorithms, or those that were onceimpossible to resource, are nowpossible; and operational costsresulting from given computationsdrop dramatically.

The design features of the Maxelerhybrid CPU/FPGA computer meanthat its low consumption of electrici-ty, physically small footprint and lowheat output make it an extremely

attractive alternative to the traditionalcluster of standard cores.

One of the key insights of the proj-ect has been the substantial benefits tobe gained from changing the computerto fit the algorithm, not changing thealgorithm to fit the computer (whichwould be the usual approach usingstandard CPU cores). We have foundthat executing complex calculations incustomizable hardware with Maxelerinfrastructure is much faster than exe-cuting them in software.

FUTURE WORKThe project has expanded to includethe delivery of a 40-node Maxelerhybrid computer designed to provideportfolio-level risk analysis for thecredit hybrids trading desk in nearreal time. For credit derivatives, a sec-ond (more general) CDO model is cur-rently undergoing migration to run onthe architecture.

In addition, we have also applied theapproach to two interest-rate models.The first was a four-factor Monte Carlomodel covering equities, interest rates,credit and foreign exchange. So far,we’ve achieved an accelerationspeedup of 284x over a Xeon core. Thesecond is a general tree-based model,for which acceleration results are pro-jected to be of a similar order of mag-nitude. We are currently evaluating fur-ther, more wide-ranging applicationsof the acceleration approach, and areexpecting similar gains across a widerrange of computational challenges.


COMPUTATIONPERCENT TIMEIN SOFTWARE

6-INPUTLOOKUP TABLES

FLIP-FLOPS36-KBIT BLOCK RAMS

18X25-BITMULTIPLIERS

Copula Kernel 22.0% 30.05% 35.12% 11.85% 15.79%

Convolutionand Integration

56.1% 46.54% 52.84% 67.54% 84.21%

PLATFORM IDLE PROCESSING

Dual Xeon L5430 2.66 GHzQuad-Core 48-GB DDR DRAM

185W 255W

(as above) with MAX2-4412CDual Xilinx SX240T FPGAs24-GB DDR DRAM

210W 240W


Table 1 – Power usage for 1U compute nodes when idle and while processing

Table 2 – Computational breakdown vs. FPGA resource usage of total used


Look Ma, No Motherboard!

How one design team put a full single-board computer with SATA into a Xilinx FPGA.

XCELLENCE IN INDUSTRIAL


Embedded systems for industri-al, scientific and medical (ISM)applications must support a

plethora of interfaces. That’s whymany design teams choose FPGA-based daughtercards that plug rightinto a PC’s motherboard to add thosespecial I/Os. Given the huge capacityof modern FPGAs, entire systems-on-chips can be implemented inside aXilinx® device. These systems includehardware, operating system and soft-ware, and can provide almost the com-plete functionality of a PC, diminish-

ing the need for a PC motherboard.The result is a more compact, lesspower-hungry, configurable single-board computer system.

Many of those ISM applications relyon fast, dependable mass storage tohold and store the results of dataacquisition, for example. Solid-statedrives have become the de facto stan-dard in this application because oftheir high reliability and fast perform-ance. These SSDs almost always con-nect via a Serial ATA (SATA) interface.

Let’s examine the steps that wetook to extend a single-board comput-er system, built around a Xilinx chip,with high-speed SATA connectivity toadd SSD RAID functionality. For thistask, the Xilinx Alliance Programecosystem brought together ASICS

World Service’s (ASICS ws) expertise inhigh-quality IP cores and Missing LinkElectronics’ (MLE) expertise in pro-grammable-systems design.

But before delving into the detailsof the project, it’s useful to take adeeper look at SATA itself. As shownin Figure 1, multiple layers areinvolved for full SATA host controllerfunctionality. Therefore, when itcomes to implementing a completeSATA solution for an FPGA-basedprogrammable system, designersneed much more than just a high-quality intellectual-property (IP)core. Some aspects of the designoften get overlooked.

First, it makes sense to implementonly the Physical (PHY), Link andsome portions of the Transport Layerin FPGA hardware; that’s why IP ven-dors provide these layers in the IPthey sell. The SATA Host IP core fromASICS World Service utilizes the so-called MultiGigabit Transceivers, orMGT, [1] to implement the PHYlayer—which comprises an out-of-band signaling block similar to theone described in Xilinx applicationnote 870 [2]—completely within theFPGA. The higher levels of theTransport Layer, along with theApplications, Device and UserProgram layers, are better implement-ed in software and, thus, typically IPvendors do not provide these layers tocustomers. This, however, places theburden of creating the layers on thesystem design team and can add unan-ticipated cost to the design project.

The reason vendors do not includethese layers in their IP is becauseeach architecture is different andeach will be used in a different man-ner. Therefore, to deliver a completesolution that ties together the IP corewith the user programs, you mustimplement, test and integrate compo-nents such as scatter-gather DMA(SGDMA) engines, which consist ofhardware and software.

In addition, communication at theTransport Layer is done via so-called

by Lorenz KolbMember of the Technical StaffMissing Link Electronics, [email protected]

Endric SchubertCo-founder Missing Link Electronics, [email protected]

Rudolf UsselmannFounder ASICS World Service [email protected]

mkf

sfs

ck

fdis

k

md

adm

dd

hd

par

m

Block Device Layer(/dev/sdX)

libATA(SMART, hot swap, NCQ,TRIM, PATA/SATA/ATAPI)

SATA HCI Driver

SATA HCI

Transport

Link

PHY

ShadowRegister

FISConstruct

FISDecomp

CRCChecker

CRCGenerate

Scrambler Descrambler

8b/10bEncoder

8b/10bDecoder

OOB SpeedNegotiate

Figure 1 – Serial ATA function layers

X C E L L E N C E I N I N D U S T R I A L




frame information structures (FIS). TheSATA standard [3] defines the set of FIStypes and it is instructive to look at thedetailed FIS flow between host anddevice for read and write operations.

As illustrated in Figure 2, a hostinforms the device about a new opera-tion via a Register FIS, which holds a

standard ATA command. In case of aread DMA operation, the device sendsone (or more) Data FIS as soon as it isready. The device completes the trans-action via a Register FIS, from deviceto host. This FIS can inform of either asuccessful or a failed operation.

Figure 2 also shows the FIS flow

between host and device for a writeDMA operation. Again, the hostinforms the device of the operation viaa Register FIS. When the device isready to receive data, it sends a DMAActivate FIS and the host will starttransmitting a single Data FIS. Whenthe device has processed this FIS and



HO

ST

Read DMA Write DMA

DEV

ICE

DEV

ICEH

OST

RegisterRegister

Data

Data

Data

Data

DMA Activate

DMA Activate

Data

Data

RegisterRegister

HO

ST

HO

ST

Read FPDMA Queued Write FPDMA Queued

DEV

ICE

DEV

ICE

DataData

Data

Data

Register

Register

Register RegisterData

DataDMA SETUP (TAG=4) DMA SETUP (TAG=4)

Set Device Bits (Busy=0)

Register (TAG=1)

Register (TAG=4)



Register (TAG=1)

Register (TAG=4)


DMA SETUP (TAG=1)

DMA SETUP (TAG=1)

DMA Activate

DMA Activate

Figure 2 – FIS flow between host and device during a DMA operation

Figure 3 – FIS flow between host and device during first-party DMA queued operation

it still expects data, it again sends aDMA Activate FIS. The process iscompleted in the same way as the readDMA operation.

A new feature introduced withSATA and not found in parallel ATA isthe so-called first-party DMA. Thisfeature transfers some control overthe DMA engine to the device. In thisway the device can cache a list ofcommands and reorder them for opti-mized performance, a techniquecalled native command queuing. NewATA commands are used for first-party DMA transfers. Because thedevice does not necessarily completethese commands instantaneously, but

rather queues them, the FIS flow is abit different for this mode of opera-tion. The flow for a read first-partyDMA queued command is shown onthe left side of Figure 3.

Communication on the ApplicationLayer, meanwhile, uses ATA com-mands. [4] While you can certainlyimplement a limited number of thesecommands as a finite state machine inFPGA hardware, a software imple-mentation is much more efficient andflexible. Here, the open-source Linuxkernel provides a known-good imple-mentation that almost exactly followsthe ATA standard and is proven inmore than a billion devices shipped.

The Linux ATA library, libATA,copes with more than 100 differentATA commands to communicate withATA devices. These commands includedata transfers but also provide func-tionality for SMART (Self-MonitoringAnalysis and Reporting Technology)and for security features such assecure erase and device locking.

The ability to utilize this code base,however, requires the extra work ofimplementing hardware-dependentsoftware in the form of Linux devicedrivers as so-called Linux KernelModules. As Figure 4 shows, theMissing Link Electronics “Soft”Hardware Platform comes with a full



MLE Storage Test Suite

File Systems

Linux

Programming/Scripting

mkt

sfs

ck

fdis

k

md

adm

dd

hd

par

m

bo

nn

ie

gcc

g+

+

mak

e

pyt

ho

n

BA

SH ... X11

vfat

Linux Kernel Drivers

ext2 ext3 btrfs

RAID Devices(/dev/mdx) Block Device Layer (/dev/sdx)

libATA(SMART, hot swap, NCOTRIM, PATA/SATA/ATAPI)

SATA HCI Driver

AC’97

AC’97

DDR2

Flash

DVI RS232 USB Ethernet WLAN Bluetooth SPI GPIO SATA1 SATA2

RS232 WLAN

AC’97 RS232 WLAN

GPIO

GPIO

DVI USB Ethernet

DVIUSB Ethernet

SPIBluetooth

SPIBluetooth

DDR2Ctrl

FlashCtrl

DMAEngine

CPUASICS ws

FPGA

MLEApplication

Operating System

System-on-Chip

I/O Connectivity

ShadowRegister

CRCChecker

8b/10bEncoder

8b/10bDecoder

SpeedNegotiateODB

Scrambler Descrambler

CRCGenerate

FISConstruct

FISDecomp

SATA HCITransport

Link

PHY

Figure 4 – Complete SATA solution

GNU/Linux software stack prein-stalled, along with a tested and opti-mized device driver for the SATA hostIP core from ASICS World Service.

When integrating a SATA IP core intoan FPGA-based system, there are manydegrees of freedom. So, pushing the lim-its of the whole system requires knowl-edge of not just software or hardware,but both. Integration must proceed intandem for software and hardware.

Figure 5 shows examples of how toaccomplish system integration of a

SATA IP core. The most obvious way isto add the IP core as a slave to the bus(A) and let the CPU do the transfersbetween memory and the IP. To be sure,data will pass twice over the systembus, but if high data rates are notrequired, this easy-to-implementapproach may be sufficient. In this case,however, you can use the CPU only fora small application layer, since most ofthe time it will be busy copying data.

The moment the CPU has to run afull operating system, the impact on

performance will be really dramatic.In this case, you will have to considerreducing the CPU load by adding adedicated copy engine, the XilinxCentral DMA (option B in the figure).This way, you are still transferring datatwice over the bus, but the CPU doesnot spend all of its time copying data.

Still, the performance of a systemwith a full operating system is faraway from a standalone application,and both are far from the theoreticalperformance limits. The third architec-ture option (C in the figure) changesthis picture by reducing the load of thesystem bus and using simple dedicatedcopy engines via Xilinx’s streamingNPI port and Multiport MemoryController (MPMC). This boosts theperformance of the standalone appli-cation up to the theoretical limit.However, the Linux performance ofsuch a system is still limited.

From the standalone application,we know that the bottleneck is notwithin the interconnection. This timethe bottleneck is the memory manage-ment in Linux. Linux handles memoryin blocks of a page size. This page sizeis 4,096 bytes for typical systems. Witha simple DMA engine and free memoryscattered all over the RAM in 4,096-byte blocks, you may move only 4,096bytes with each transfer. The finalarchitectural option (D in the figure)tackles this problem.

For example, the PowerPC® PPC440core included in the Virtex®-5 FXTFPGA has dedicated engines that arecapable of SGDMA. This way, the DMAengine gets passed a pointer to a list ofmemory entries and scatters/gathersdata to and from this list. This results in



Mem

ory

A

PLB

1

2

CPU

Mem

Ctr

l

SATA0

SATA1

Mem

ory

C

PLB

CPU

Mem

Ctr

l

Mem

ory

B

PLB

1

2

CPU

Mem

Ctr

l

SATA0

SATA1

DMA

NPI

NPI

Mem

ory

D

Mem

Ctr

lDMA DMADMA

PPC440

X-B

AR

DMA

MC

I

Loca

lLin

k

LL

SATA0

SATA1

SATA0

SATA1

DM

AD

MA

When integrating a SATA IP core into an FPGA-based system,

there are many degrees of freedom. So, pushing the limits of the whole

system requires knowledge of not just software or hardware, but both.

Integration must proceed in tandem for software and hardware.

Figure 5 – Four architectural choices for integrating a SATA IP core

larger transfer sizes and brings the sys-tem very close to the standalone per-formance. Figure 6 summarizes the per-formance results of these differentarchitectural choices.

Today, the decision whether tomake or buy a SATA host controllercore is obvious: Very few design teamsare capable of implementing a func-tioning SATA host controller for thecost of licensing one. At the sametime, it is common for design teams tospend significant time and money in-house to integrate this core into a pro-grammable system-on-chip, developdevice drivers for this core and imple-ment application software for operat-ing (and testing) the IP.

The joint solution our team craftedwould not have been possible with-out short turnaround times betweentwo Xilinx Alliance ProgramPartners: ASICS World Service Ltd.and Missing Link Electronics, Inc. Tolearn more about our complete SATA

solution, please visit the MLE LiveOnline Evaluation site at http://www.

missinglinkelectronics.com/LOE.There, you will get more technicalinsight along with an invitation totest-drive a complete SATA systemvia the Internet.

References:

1. Xilinx, “Virtex-5 FPGA RocketIO™ GTXTransceiver User Guide,” October 2009.http://www.xilinx.com/bvdocs/userguides/ug198.pdf.

2. Xilinx, Inc., “Serial ATA Physical LinkInitialization with the GTP Transceiver ofVirtex-5 LXT FPGAs,” 2008 applicationnote. http://www.xilinx.com/support/docu-mentation/application_notes/xapp870.pdf.

3. Serial ATA International Organization,Serial ATA Revision 2.6, February 2007.http://www.sata-io.org/

4. International Committee forInformation Technology Standards, ATAttachment 8 - ATA/ATAPI Command Set,September 2008. http://www.t13.org/



MB/s

225

150

75

PIO Central DMA

DMA over NPI

LocalLink

SGDMA

Gen 1/2 only

Gen 2 Limit

with FullLinux System

Standalone

Figure 6 – Performance of complete SATA solution

http://www

http://www.xilinx.com/bvdocs/userguides/ug198.pdf

http://www.xilinx.com/support/docu-mentation/application_notes/xapp870.pdf



http://www.sata-io.org

http://www.t13.org

http://www.missinglinkelectronics.com/LOE

http://www.xilinx.com/support/documentation/application_notes/xapp870.pdf.

http://www.enclustra.com


Built on a pair of Spartan FPGAs,

the fourth-generation Gecko

system serves industrial and

research applications as well

as educational projects.

XCELLENCE IN UNIVERSITY RESEARCH

By Theo Kluter, PhD LecturerBern University of Applied Sciences (BFH)[email protected]

Marcel Jacomet, PhDProfessorBern University of Applied Sciences (BFH)[email protected]

General-Purpose SoC Platform HandlesHardware/Software Co-Design




he Gecko system is ag e n e r a l - p u r p o s ehardware/software

co-design environment for real-timeinformation processing in system-on-chip (SoC) solutions. The systemsupports the co-design of software,fast hardware and dedicated real-time signal-processing hardware. Thelatest member of the Gecko family,the Gecko4main module, is an exper-imental platform that offers the nec-essary computing power to speedintensive real-time algorithms andthe necessary flexibility for control-intensive software tasks. Addingextension boards equipped with sen-sors, motors and a mechanical hous-ing suits the Gecko4main module forall manner of industrial, research andeducational projects.

Developed at the Bern University ofApplied Sciences (BFH) in Biel/Bienne,Switzerland, the Gecko program got itsstart more than 10 years ago in aneffort to find a better way to handlecomplex SoC design. The goal of thelatest, fourth-generation Gecko proj-ect was to develop a new state-of-the-art version of the general-purpose sys-tem-on-chip hardware/software co-design development boards, using thecredit-card-size form factor (54 x 85mm). With Gecko4, we have realized acost-effective teaching platform whilestill preserving all the requirements forindustrial-grade projects.

The Gecko4main module containsa large, central FPGA—a 5 million-gate Xilinx® Spartan®-3—and asmaller, secondary Spartan for boardconfiguration and interfacing purpos-es. This powerful, general-purposeSoC platform for hardware/softwareco-designs allows you to integrate aMicroBlaze™ 32-bit RISC processorwith closely coupled application-spe-

cific hardware blocks for high-speed hardware algorithms withinthe single multimillion-gate FPGAchip on the Gecko4main module.To increase design flexibility, theboard also is equipped with severallarge static and flash memorybanks, along with RS232 and USB2.0 interfaces.

In addition, you can stack a set ofextension boards on top of theGecko4main module. The currentextension boards—which featuresensors, power supply and amechanical housing with motors—can help in developing system-on-chip research and industrial proj-ects, or a complete robot platformready to use for student exercises indifferent subjects and fields.

FEATURES AND EXTENSION BOARDSThe heart of the Gecko4 system is theGecko4main module, shown in Figure1. The board can be divided into threedistinct parts:

• Configuration and IP abstraction(shown on the right half of Figure 1)

• Board powering

• Processing and interfacing(shown on the left half of Figure 1)

One of the main goals of theGecko4main module is the abstractionof the hardware from designers or stu-dent users, allowing them to concen-trate on the topics of interest without

Storage Flash2 Gbit (64M x 32)

SDRAM128 Mbit(8M x 16)



FPGA2XC3S1000XC3S1500XC3S2000XC3S4000XC3S5000

3.3V

I/O

an

d S

up

ply

Clocks25 MHz16 MHz

Bit file Flash16 Mbit

USB 2.0 IFCypress FX2

Dual Function:– JTAG (PC USB)– VGA and RS232

DC/DCand

PowerSwitch

Mu

lti-

Stan

dar

d I/

O56

IO/2

GN

D/2

Vio

Mu

lti-

Stan

dar

d I/

O56

IO/2

GN

D/2

Vio

FPGA1XC3S200AN

Du

al C

olo

rLE

Ds

Figure 1 – The Gecko4main module is an eight-layer PCB with the components all on the top layer, reducing production cost.

T

having to deal too much with the hard-ware details. The onboard Spartan-3ANcontains an internal flash memory,allowing autonomous operation. ThisFPGA performs the abstraction by pro-viding memory-mapped FIFOs andinterfaces to a second FPGA.

Users can accomplish the dynamicor autonomous configuration of thatsecond Spartan in three ways: via abit file that is stored in the bit-fileflash for autonomous operation, bythe PC through the USB 2.0 and theUSBTMC protocol, or by connecting a

Xilinx JTAG programmer to the dual-function connecter.

The main Spartan-3AN handles thefirst two options using a parallel-slaveconfiguration protocol. The JTAG con-figuration method involves the directintervention of the Xilinx design tools.

NEW TWISTS ON CHIP DESIGN

The inherently interdisciplinary character of modern systems has revealed the limits of the electronicsindustry’s classical chip-design approach. The capture-and-simulate method of design, for example, hasgiven way to the more efficient describe-and-synthesize approach.

More sophisticated hardware/software co-design methodologies have emerged in the past several years.Techniques such as the specify-explore-refine method give designers the opportunity of a high-level approachto system design, addressing lower-level system-implementation considerations separately in subsequent designsteps. Speed-sensitive tasks are best implemented with dedicated hardware, whereas software programs run-ning on microprocessor cores are the best choice for control-intensive tasks.

A certain amount of risk is always present when designing SoCs to solve new problems for which only limitedknowledge and coarse models of the analog signals to be processed are available. The key problem in suchsituations is the lack of experience with the developedsolution. This is especially true where real-time informa-tion processing in a closed control loop is needed. Pure-software approaches limit the risk of nonoptimal solu-tions, simply because it’s possible to improve softwarealgorithms in iterative design cycles. Monitoring the effi-ciency and quality of hardware algorithms is, however,no easy task. In addition, hardware algorithms cannoteasily be improved on SoC solutions in the same itera-tive way as software approaches.

In control-oriented reactive applications, the systemreacts continuously to the environment. Therefore, themonitoring of different tasks is crucial, especially ifthere are real-time constraints. As long as precise mod-els of the information to be processed are not available,as often occurs in analog signal processing, a straightfor-ward design approach is impossible or at least risky. Theflexibility of iteratively improving analog/digital hard-ware-implemented algorithms is thus necessary in orderto explore the architecture and to refine its implementa-tion. Again, this involves monitoring capabilities of theSoC in its real-time environment.

Thus, an important problem in the classical ASIC/FPGAhardware/software co-design approach is decoupling ofthe design steps with the prototyping hardware in thedesign flow, which inhibits the necessary monitoring task in an early design phase. Adding an extra prototypingstep and an analysis step opens the possibility to iteratively improve the SoC design. The refine-prototype-analysisdesign loop best matches the needs of SoC designs with real-time constraints (see figure). If precise models of theplant to be controlled are missing, thorough data analysis and subsequent refinement of the data-processing algo-rithms are needed for designing a quality SoC. — Theo Kluter and Marcel Jacomet

specify

explore

refine

analyze

prototype

implement

System Specification

System Co-Design

Algorithm Improvement

Signal & Data Analysis

Real-Time Verification

Final SoC Design


The specify-explore-refine design flow transforms intoa specify-explore-refine-prototype-analyze design flowfor SoC designs with real-time constraints.

X C E L L E N C E I N U N I V E R S I T Y R E S E A R C H

While the latter option allows for on-chip debugging with, for example,ChipScope™, the former two providedynamic reconfiguration of even a 5 million-gate-equivalent second FPGAin only a fraction of a second.

Many of our teaching and researchtargets require a fast communicationlink for features like hardware-in-the-loop, regression testing and semi-real-time sensing, among others. TheGecko4 system realizes this fast linkby interfacing a USB 2.0 slave con-troller (the Cypress Fx2) with theUSBTMC protocol. To both abstractthe protocol and provide easy interfac-ing for users, the Spartan-3AN doesprotocol interpretation and places thepayload into memory-mapped FIFOs.

Designers who do not need a largeFPGA can use the Gecko4main mod-ule without mounting the left half ofthe PCB, reducing board costs signifi-cantly. In this setting, designers canuse the Spartan-3AN directly to housesimple systems—indeed, they caneven implement a simple CPU-basedsystem in this 200 million-gate-equiva-lent FPGA. For debugging and displaypurposes, the device provides eightdual-color LEDs, three buttons, anunbuffered RS232 connection and atext-display VGA connection at a reso-lution of 1,024 x 768 pixels (60 Hz).These interfaces are also availablememory-mapped to the second FPGA.

BOARD POWERINGDue to the various usages of theGecko4main module, powering must bevery flexible. We provided this flexibili-ty by means of the onboard generationof all required voltages. Furthermore,the module offers two possibilities ofsupplying power: the USB cable or anadded power board. The attachedpower board is the preferred supply.

According to the USB standard, adevice is allowed to use a maximum of500 milliamps. The Gecko4main mod-ule requests this maximum current. Ifthe design requires more than theavailable 2.5 watts, the board turns off

automatically. The second way of sup-plying the unit is by attaching a powerboard. If these power boards are bat-tery-powered, the Gecko4main mod-ule can function autonomously (as, forexample, in a robot application).

If both a USB cable and a powerboard are connected (or disconnected),the Gecko4main module automaticallyswitches seamlessly between the dif-ferent supplies without a power glitch.

PROCESSING AND INTERFACINGThe processing heart of theGecko4main module is the secondFPGA, also a Spartan-3. The board canaccommodate a Spartan-3 ranging fromthe “small” million-gate-equivalentXC3S1000 up to the high-end 5 million-gate-equivalent XC3S5000. The boardshown in Figure 1 has an XC3S5000mounted that was donated by theXilinx University Program.

Except for the XC3S1000, allSpartan-3 devices have three inde-pendent banks of 16-Mbyte embeddedSDRAM connected, along with 2 Gbitsof NAND flash for dynamic and staticstorage. Furthermore, to interface to

extension boards, the Gecko4mainmodule provides two “stack-through”connectors. The one shown on the leftin Figure 1 provides a standard 3.3-V-compliant bus including power supplypins and external clocking. This stan-dard connector is backward-compatiblewith our Gecko3 system.

One of the advantages of the XilinxFPGA family is its support for variousI/O standards on each of the I/Obanks of a given FPGA. To be able toprovide flexibility in the I/O standardused on an extension board, the I/Oconnector shown in the middle ofFigure 1 is divided into halves, eachcompletely connected to one I/Obank of the FPGA. Furthermore, eachhalf provides its own I/O supply volt-age pin. If there is no attached exten-sion board providing an I/O supplyvoltage, the I/O banks are automati-cally supplied by 3.3 V, placing themin the LVTTL standard.

For some applications, even thePCB trace lengths are important. Toalso provide the flexibility of imple-menting this kind of application, theGecko4main module provides 38 outof the 54 available I/O lines that arelength-compensated (+/- 0.1 mm).

Figure 2 – For an industrial project fora long-range RFID reader, a Gecko2mainboard and an RF extension boardinhabit a housing that also serves as acooling element.



EXTENSIONS ARE AVAILABLEThe strength of the Gecko4main mod-ule lies in its reduction to the minimalrequired in terms of storage and periph-erals for SoCs, and also in its stack-through principle. Both aspects allowfor simple digital logic experiments (bysimply attaching one board to the USBconnector of a PC) or complex multi-processor SoC emulation (by stackingseveral boards in combination withextension boards). The Bern Universityof Applied Sciences and several otherresearch institutes have developed avariety of extension boards rangingfrom sensor boards up to completemainboards. A full overview of ouropen-source Gecko system can befound at our Gecko Web page(http://gecko.microlab.ch) and on theOpenCores Web page (http://www.

opencores.org/project,gecko4).

EDUCATION PLATFORMThe Gecko system is extensivelyused in the electronic and communi-cation engineering curriculum atBFH. Different professors use thesystem in their laboratories, includ-ing the Institute of Human-CenteredEngineering-microLab, in both theundergraduate and the master’s cur-riculum. First-year exercises startwith simple digital hardware designsusing the Geckorobot platform, asmall robot frame on which theGecko3main can be attached. Theframe provides two motors, someinfrared sensors, two speed sensorsand a distance sensor.

Robot control exercises in the sig-nal-processing courses develop con-trol algorithms for the two independ-ent robot motors using the well-knownMATLAB®/Simulink® tools for designentry, simulation and verification. Areal-time environment allows the stu-dents to directly download theirSimulink models into the Geckorobotplatform and supports a real-time visu-alization of the control algorithmbehavior of the running robot in theMATLAB/Simulink tools.

For this purpose, the students haveaccess to the necessary Simulink S-functions and a MicroBlaze-based plat-form with the software and hardwareIP interfacing the sensors and motorsof the Geckorobot platform. Immediatefeedback of the robot’s behavior in real-world situations as well as in the MAT-LAB/Simulink design environment hasa high learning value for the students.For more advanced student projects,we are planning drive-by-wire andcybernetic behavior of the robots in arobot cluster. Moreover, we are alsodeveloping a compressive-samplingGecko extension board for audio sig-nals through which students willlearn—by listening—the impressivepossibilities of this new theory as wellas experience its limitations.

RESEARCH AND INDUSTRIAL PROJECTSOutside the university, the Gecko plat-form is finding a home in appliedresearch and industrial projects. Inmost cases researchers pair theGecko4main module, for the computa-

tion-intensive tasks, together withapplication-specific extension boards.In an industrial project developinghardware algorithms for long-rangeradio frequency identification (RFID)readers, one group designed an RFpower board as an extension board forthe Gecko2main board (see Figure 2).

Currently, hardware algorithms forhigh-speed optical-coherence tomog-raphy (OCT) signal processing areunder investigation. OCT is an opticalsignal-acquisition technology thatmakes it possible to capture high-reso-lution 3-D images in biological tissueusing near-infrared light. In an ongoingapplied-research project, a Geckoextension board with the optical inter-face is under development (see Figure3). The Gecko4main module serves asa platform to develop high-speed OCThardware algorithms.

TOOL CHAIN FOR GECKOOur applied SoC design method forreal-time information-processing prob-lems is based on an optimal combina-tion of approved design technologies


Figure 3 – An applied-research project is investigating high-speed hardware algorithmsfor optical-coherence tomography image processing. The image shows a depth scan (A-scan) illustrating various reflections of the sample as an intermediate result.


http://gecko.microlab.ch

http://www.opencores.org/project.gecko4

http://www.opencores.org/project.gecko4

that are widely used in the industry(see sidebar). The presented designflow (Figure 4) is exclusively basedon Xilinx and some third-party tools,and on adding dedicated interfacesand IP. Control algorithms developedin the MATLAB/Simulink environmentcan be compiled, downloaded andimplemented in the Gecko4 platformfor the real-time analyses by a simpleand fast pushbutton tool environ-ment. The tool chain represents astep toward the upcoming specify-explore-refine design methodology byadding a fast-prototyping feature. Fastprototyping is a crucial feature for aniterative improvement of control orsignal-processing algorithms withfeedback loops where precise modelsare not available.

Experience over the last 10 years hasshown that the Gecko approach pro-vides a good platform for educational,research and industrial projects alike.We believe that the capability of pro-ducing system prototypes very early inan SoC design flow is a major key tosuccess in today’s industry. It helps tosecure the system architecture work,develop usable prototypes that can bedistributed to internal and external cus-tomers and, last but not least, ensurethat designers can start functional veri-fication at a very early design phase.

Rapid prototyping meets therequirements for hardware/softwareco-design as an early function integra-tion. It enables system validation andvarious forms of design verification.Very important is the real-time algo-

rithm verification of analog/digitalhardware and software realizations.We firmly believe that the Gecko4 plat-form enables all of these aspects of thenew breed of complex SoC designs.

Acknowledgements

We want to thank the Xilinx University

Program (XUP) for its generous donation

of the FPGAs used for the production of

both the third- and fourth-generation

Gecko mainboards. Furthermore, we want

to thank the OpenCores community for

hosting and certifying our project.

Finally, we thank all the students and

research assistants who helped build the

Gecko platform, in particular Christoph

Zimmermann, who realized and was

responsible for the third generation of the

Gecko mainboard.


Fast Prototyping(signal processing, control)

MATLAB/Simulink Code

Hardware Algorithms(signal processing)

MATLAB/Simulink CodeVHDL Code

Platform-Based Design(HW/SW co-design)

C codeVHDL Code

EDKIPCores

MATLAB/SimulinkHDL Coder

System Generator(SynplifyPro)

MATLAB/SimulinkModelLibrary

ModelLibrary

IP CoresPlatform

GCC(C compiler)

ISE– Synthesis– Place & Route

Download

GCC(C compiler)

SE(Synthesis,

Place & Route)

Figure 4 – General-purposehardware/software co-design toolchain of the Gecko design platformincludes a fast-prototyping environ-ment for real-time analysis.



Wireless MIMO Sphere Detector Implemented in FPGA

XCELLENCE IN WIRELESS COMMUNICATIONS

by Juanjo NogueraSenior Research EngineerXilinx, Inc. [email protected]

Stephen NeuendorfferSenior Research EngineerXilinx, [email protected]

Kees VissersDistinguished EngineerXilinx, [email protected]

Chris DickDistinguished EngineerXilinx, [email protected]

High-level synthesis tools from

AutoESL made it possible to

build a complex receiver for

broadband wireless systems

in a Xilinx Virtex-5 device.

On Jan. 31, Xilinx announced it had acquired AutoESL.As such, you’ll be hearing more about this exciting new technology in future issues of Xcell Journal.






S patial-division multiplexingMIMO processing significantlyincreases the spectral efficien-

cy, and hence capacity, of a wirelesscommunication system. For that rea-son, it is a core component of next-generation WiMAX and other OFDM-based wireless communications. Thisis a computationally intensive applica-tion that implements highly demand-ing signal-processing algorithms.

A specific example of spatial multi-plexing in MIMO systems is spheredecoding, an efficient method of solv-ing the MIMO detection problemwhile maintaining a bit-error rate(BER) performance comparable tothe optimal maximum-likelihooddetection algorithm. However, DSPprocessors don’t have enough com-pute power to cope with the require-ments of sphere decoding in real time.

Field-programmable gate arrays arean attractive target platform for theimplementation of complex DSP-inten-sive algorithms like the sphere decoder.Modern FPGAs are high-performanceparallel-computing platforms that pro-vide the dedicated hardware needed,while retaining the flexibility of pro-grammable DSP processors. There areseveral studies showing that FPGAscould achieve 100x higher performanceand 30x better cost/performance thantraditional DSP processors in a numberof signal-processing applications. [1]

Despite this tremendous perform-ance advantage, FPGAs are not gener-ally used in wireless signal processing,largely because traditional DSP pro-grammers believe they are hard tohandle. Indeed, the key barrier towidespread adoption of FPGAs inwireless applications is the traditionalhardware-centric design flow andtools. Currently, the use of FPGAsrequires significant hardware designexperience, including familiarity withhardware description languages likeVHDL and Verilog.

Recently, new high-level synthesistools [2] have become available asdesign aids for FPGAs. These design

tools take a high-level algorithmdescription as input, and generate anRTL that can be used with standardFPGA implementation tools (forexample, the Xilinx® ISE® design suiteand Embedded Development Kit). Thetools increase design productivity andreduce development time, while pro-ducing good quality of results. [3] Weused such tools to design an FPGAimplementation of a complex wirelessalgorithm—namely, a sphere detectorfor spatial-multiplexing MIMO in802.16e systems. Specifically, wechose AutoESL’s AutoPilot high-levelsynthesis tool to target a XilinxVirtex®-5 running at 225 MHz.

SPHERE DECODINGSphere detection, a part of the decod-ing process, is a prominent method ofsimplifying the detection complexity inspatial-multiplexing systems whilemaintaining BER performance compa-rable to that of optimum maximum-likelihood (ML) detection, a more com-plex algorithm.

In the block diagram of the MIMO802.16e wireless receiver shown inFigure 1, it is assumed that the channelmatrix is perfectly known to thereceiver, which can be accomplishedby classical means of channel estima-tion. The implementation consists of apipeline with three building blocks:channel reordering, QR decompositionand sphere detector (SD). In prepara-tion for engaging a soft-input-soft-out-put channel decoder (for example, aturbo decoder), we produced soft out-

puts by computing the log-likelihoodratio of the detected bits. [4]

CHANNEL MATRIX REORDERINGThe order in which the sphere detectorprocesses the antennas has a profoundimpact on the BER performance.Channel reordering comes first, priorto sphere detection. By utilizing a chan-nel matrix preprocessor that realizes atype of successive-interference cancel-lation similar in concept to thatemployed in BLAST (Bell Labs LayeredSpace Time) processing, the detectorachieves close-to-ML performance.

The method implemented by thechannel-reordering process determinesthe optimum detection order ofcolumns of the complex channel matrixover several iterations. The algorithmselects the row with the maximum orminimum norm depending on the itera-tion count. The row with the minimumEuclidean norm represents the influ-ence of the strongest antenna while therow with the maximum Euclidean normrepresents the influence of the weakest

antenna. This novel approach firstprocesses the weakest stream. All sub-sequent iterations process the streamsfrom highest to lowest power.

To meet the application’s high-data-rate requirements, we realized thechannel-ordering block using thepipelined architecture shown in Figure2, which processes five channels in atime-division multiplexing (TDM)approach. This scheme provided moreprocessing time between the matrixelements of the same channel while

ChannelEstimation

V-BLASTChannel

Reordering

ModifiedReal-Valued

QRD

SphereDetector

Soft-OutputGeneration

H

H sorted

Figure 1 – Block diagram for sphere decoder

sustaining high data throughput. Thecalculation of the G matrix is the mostdemanding component in Figure 2.The heart of the process is matrixinversion, which we realized usingQR decomposition (QRD). A commonmethod for realizing QRD is based onGivens rotations. The proposedimplementation performs the com-plex rotations in the diagonal and off-diagonal cells, which are the funda-mental computation units in the sys-tolic array we are using.

MODIFIED REAL-VALUED QRDAfter obtaining the optimal orderingof the channel matrix columns, thenext step is to apply QR decomposi-tion on the real-valued matrix coeffi-cients. The functional unit used forthis QRD processing is similar to theQRD engine designed to compute theinverse matrix, but with some modifi-cations. The input data in this casehas real values, and the systolic arraystructure has correspondingly higherdimensions (i.e., 8x8 real-valuedinstead of 4x4 complex-valued).

To meet the desired timing con-straints, the input data consumptionrate had to be one input sample perclock cycle. This requirement intro-duced challenges around processing-latency problems that we couldn’taddress with a five-channel TDMstructure. Therefore, we increasedthe number of channels in a TDMgroup to 15 to provide more timebetween the successive elements ofthe same channel matrix.

SPHERE DETECTOR DESIGNYou can view the iterative spheredetection algorithm as a tree travers-al, with each level of the tree i corre-sponding to processing symbols fromthe ith antenna. The tree traversal canbe performed using several differentmethods. The one we selected was abreadth-first search due to the attrac-tive, hardware-friendly nature of theapproach. At each level only the Knodes with the smallest partial-Euclidean distance (Ti) are chosen forexpansion. This type of detector iscalled a K-best detector.

The norm computation is done inthe partial Euclidean distance (PED)blocks of the sphere detector.Depending on the level of the tree,three different PED blocks are used.The root-node PED block calculatesall possible PEDs (tree-level index isi = M = 8). The second-level PEDblock computes eight possible PEDsfor each of the eight survivor pathsgenerated in the previous level. Thiswill give us 64 generated PEDs forthe tree-level index i = 7. The thirdtype of PED block is used for allother tree levels that compute theclosest-node PED for all PEDs com-puted on the previous level. This willfix the number of branches on eachlevel to K = 64, thus propagating tothe last level i = 1 and producing 64final PEDs along with their detectedsymbol sequences.

The pipeline architecture of theSD allows data processing on everyclock cycle. Thus, only one PEDblock is necessary at every tree level.The total number of PED units isequal to the number of tree levels,which for 4x4 64-QAM modulation iseight. Figure 3 illustrates the blockdiagram of the SD.

FPGA PERFORMANCEIMPLEMENTATION TARGETS The target FPGA device is a XilinxVirtex-5, with a target clock frequen-cy of 225 MHz. The channel matrix isestimated for every data subcarrier,which limits the available processingtime for every channel matrix. Forthe selected clock frequency and acommunication bandwidth of 5 MHz(corresponding to 360 data subcarri-ers in a WiMAX system), we calculat-ed the available number of process-ing clock cycles per channel matrixinterval as follows:

As mentioned earlier, we designedthe most computationally demandingconfiguration with 4x4 antennas and a


X C E L L E N C E I N W I R E L E S S C O M M U N I C A T I O N S

360*32x18b

360*32x18b

G matrixcalc

Normsearch

UpdateHsorted

360*32x18b

360*32x18b

G matrixcalc

Normsearch

UpdateHsorted

360*32x18b

360*32x18b

G matrixcalc

Normsearch

UpdateHsorted

Figure 2 – Iterative channel matrix reordering algorithm

64-QAM modulation scheme. Theachievable raw data rate in this case is83.965 Mbits per second.

HIGH-LEVEL SYNTHESIS FOR FPGASHigh-level synthesis tools take as theirinput a high-level description of thespecific algorithm to implement andgenerate an RTL description for FPGAimplementation, as shown in Figure 4.This RTL description can be integratedwith a reference design, IP core orexisting RTL code to create a com-plete FPGA implementation using tra-ditional Xilinx ISE/EDK tools.

Modern high-level synthesis toolsaccept untimed C/C++ descriptions asinput specifications. These tools givetwo interpretations to the same C/C++code: sequential semantics forinput/output behavior, and architec-ture specification based on C/C++code and compiler directives. Basedon the C/C++ code, compiler direc-tives and target throughput require-ments, these high-level synthesis toolsgenerate high-performance pipelinedarchitectures. Among other features,high-level synthesis tools enable auto-matic insertion of pipeline stages andresource sharing to reduce FPGAresource utilization. Essentially, high-level synthesis tools raise the level ofabstraction for FPGA design, andmake transparent the time-consumingand error-prone RTL design tasks.

We have focused on using C++descriptions, with the goal of leverag-ing C++ template classes to representarbitrary precision integer types and

template functions to represent para-meterized blocks in the architecture.

The overall design approach is seenin Figure 5, where the starting point isa reference C/C++ code that couldhave been derived from a MATLAB®

functional description. As the figureshows, the first step in implementingan application on any hardware targetis often to restructure the referenceC/C++ code. By “restructuring,” wemean rewriting the initial C/C++ code(which is typically coded for clarityand ease of conceptual understandingrather than for optimized perform-ance) into a format more suitable forthe target processing engine. Forexample, on a DSP processor it may benecessary to rearrange an applica-tion’s code so that the algorithmmakes efficient use of the cache mem-ories. When targeting FPGAs, thisrestructuring might involve, for exam-ple, rewriting the code so it representsan architecture specification that canachieve the desired throughput, orrewriting the code to make efficientuse of the specific FPGA features, likeembedded DSP macros.

We achieved the functional verifica-tion of this implementation C/C++ codeusing traditional C/C++ compilers (forexample, gcc) and reusing C/C++ leveltestbenches developed for the verifica-tion of the reference C/C++ code. Theimplementation C/C++ code is the maininput to the high-level synthesis tools.However, there are additional inputsthat significantly influence the generat-ed hardware, its performance and thenumber of FPGA resources used. Two

essential constraints are the targetFPGA family and target clock frequen-cy, both of which affect the number ofpipeline stages in the generated archi-tecture. Additionally, high-level synthe-sis tools accept compiler directives(e.g., pragmas inserted in the C/C++code). The designer can apply differenttypes of directives to different sectionsof the C/C++ code. For example, thereare directives that are applied to loops(e.g., loop unrolling), and others toarrays (for example, to specify whichFPGA resource must be used to theimplementation of the array).

Based on all these inputs, the high-level synthesis tools generate an outputarchitecture (RTL) and report itsthroughput. Depending on this through-put, the designer can then modify thedirectives, the implementation C/C++code or both. If the generated architec-ture meets the required throughput,then the output RTL is used as the inputto the FPGA implementation tools(ISE/EDK). The final achievable clockfrequency and number of FPGAresources used are reported only afterrunning logic synthesis and place-and-route. If the design does not meet tim-ing or the FPGA resources are not theexpected ones, the designer shouldmodify the implementation C/C++ codeor the compiler directives.

HIGH-LEVEL SYNTHESISIMPLEMENTATION OF SD We have implemented the three keybuilding blocks of the WiMAX spheredecoder shown in Figure 1 using theAutoPilot 2010.07.ft tool from AutoESL.It is important to emphasize that thealgorithm is exactly the algorithmdescribed in a recent SDR Conferencepaper, [4] and hence has exactly thesame BER. In this section we give spe-cific examples of code rewriting andcompiler directives that we used forthis particular implementation.

The original reference C code,derived from a MATLAB functionaldescription, contained approximately2,000 lines of code, including synthe-



RootPED

8 PED metrics carried on TDM bus

8 PED metrics carried on TDM bus

PEDSortFreePED

SortFreePED

MinSearch

Figure 3 – Sphere detector processing pipeline

sizable and verification C code. It con-tains only fixed-point arithmetic usingC built-in data types. An FPGA-friend-ly implementation approximated allthe required floating-point operations(for example, sqrt).

In addition to the reference C codedescribing the functions to synthesizein the FPGA, there is a complete C-level verification testbench. We gener-ated the input test vectors as well asthe golden output reference files from

the MATLAB description. The originalC/C++ reference code is bit-accuratewith the MATLAB specification, andpasses the entire regression suite con-sisting of multiple data sets.

This reference C/C++ code has gonethrough different types of code restruc-turing. As examples, Figure 5 showsthree instances of code restructuringthat we have implemented. We reusedthe C-level verification infrastructure toverify any change to the implementa-

tion C/C++ code. Moreover, we carriedout all verification at the C level, not atthe register-transfer level, avoidingtime-consuming RTL simulations andhence, contributing to the reduction inthe overall development time.

MACROARCHITECTURESPECIFICATION Probably the most important part ofcode refactoring is to rewrite the C/C++code to describe a macroarchitecturethat would efficiently implement a spe-cific functionality. In other words, thedesigner is accountable for themacroarchitecture specification, whilethe high-level synthesis tools are incharge of the microarchitecture genera-tion. This type of code restructuring hasa major impact on the obtainedthroughput and quality of results.

In the case of our sphere decoder,there are several instances of this typeof code restructuring. For example, tomeet the high throughput of the chan-nel-ordering block, the designer shoulddescribe in C/C++ the macroarchitec-ture shown in Figure 2. Such C/C++code would consist of several functioncalls communicating using arrays. Thehigh-level synthesis tools might auto-matically translate these arrays in ping-pong buffers to allow parallel execu-tion of the several matrix calculationblocks in the pipeline.

Another example of code restruc-turing at this level would be to decidehow many channels to employ in theTDM structure of a specific block(e.g., five channels in the channelmatrix reordering block, or 15 chan-nels in the modified real-valued QRdecomposition block).

Figure 6 is one example of macroar-chitecture specification. This snippetof C++ code describes the spheredetector block diagram shown inFigure 3. We can observe a pipeline ofnine function calls, each one repre-senting a block as shown in Figure 3.The communication between func-tions takes place through arrays,which are mapped to streaming inter-



RTL

IP

ReferenceDesigns

High-LevelSynthesis

Tool

RTL System Design

C/C++ Agorithm

XilinxISE/EDK

Tools

Bitstream/Netlist

DirectivesHigh-LevelSynthesis

Tool

RTL Design

Implementation C/C++

Code Restructuring

Reference C/C++

C/C

++

Ver

ifica

tion

(C/C

++

Com

pile

r, M

atla

b)

Figure 4 – High-level synthesis for FPGAs

Figure 5 -- Iterative C/C++ refinement design approach

faces (not embedded BRAM memoriesin the FPGA) by using the appropriatedirectives (pragmas) in lines 5 and 7.

IMPORTANCE OFPARAMETERIZATION Parameterization is another key exam-ple of code rewriting. We have exten-sively leveraged C++ template func-tions to represent parameterized mod-ules in the architecture.

In the implementation of the spheredecoder, there are several cases of thistype of code rewriting. A specificexample would be the different matrixoperations used in the channel-reordering block. The matrix calcula-tions blocks (4x4, 3x3 and 2x2) shownin Figure 2 use different types ofmatrix operations, such as MatrixInverse or Matrix Multiply. Theseblocks are coded as C++ templatefunctions with the dimensions of thematrix as template parameters.

Figure 7 shows the C++ templatefunction for Matrix Multiply. In additionto the matrix dimension, this templatefunction has a third parameter, MM_II(Initiation Interval for Matrix Multiply),which is used to specify the number ofclock cycles between two consecutiveloop iterations. The directive (pragma)in line 9 is used to parameterize therequired throughput for a specificinstance. This is a really important fea-ture, since it has a major impact on thegenerated microarchitecture—that is,the ability of the high-level synthesistools to exploit resource sharing, andhence, to reduce the FPGA resourcesthe implementation will use. For exam-ple, just by modifying this InitiationInterval parameter and using exactlythe same C++ code, the high-level syn-thesis tools automatically achieve dif-ferent levels of resource sharing in theimplementation of the different MatrixInverse (4x4, 3x3, 2x2) blocks.

FPGA OPTIMIZATIONSFPGA optimization is the last exam-ple of code rewriting. The designercan rewrite the C/C++ code to more



Figure 6 – Sphere detector macroarchitecture description

Figure 7 – Example of code parameterization

Figure 8 – FPGA optimization for DSP48 utilization

efficiently utilize specific FPGAresources, and hence, improve tim-ing and reduce area. Two very specif-ic examples of this type of optimiza-tion are bit-widths optimizations andefficient use of embedded DSPblocks (DSP48s). Efficient use ofDSP48s improves timing and FPGAresource utilization.

We wrote our reference C/C++code using built-in C/C++ data types(e.g., short, int), while the design uses18-bit fixed-point data types to repre-sent the matrix elements. We have

leveraged C++ template classes torepresent arbitrary precision fixed-point data types, hence reducingFPGA resources and minimizing theimpact on timing.

Figure 8 is a C++ template functionthat implements a multiplication fol-lowed by a subtraction, where thewidth of the input operands is parame-terized. You can map these two arith-metic operations into a single embed-ded DSP48 block. In Figure 8, we canalso observe two directives thatinstruct the high-level synthesis tool to

use a maximum of two cycles toschedule these operations and use aregister for the output return value.

PRODUCTIVITY METRICSIn Figure 9 we plot how the size of thedesign (that is, the FPGA resources)generated using AutoESL’s AutoPilotevolves over time, and compare it witha traditional SystemGenerator (RTL)implementation. With high-level synthe-sis tools we are able to implement manyvalid solutions that differ in size overtime. Depending on the amount of coderestructuring, the designer can trade offhow fast to get a solution vs. the size ofthat solution. On the other hand, thereis only one RTL solution, and it requiresa long development time.

We have observed that it takes rela-tively little time to obtain several syn-thesis solutions that use significantlymore FPGA resources (i.e., area) thanthe traditional RTL solution. On theother hand, the designer might decideto work at the tools’ Expert level andgenerate many more solutions byimplementing more advanced C/C++code-restructuring techniques (suchas FPGA-specific optimizations) toreduce FPGA resource utilization.

Finally, since we carried out all veri-fication at the C/C++ level, we avoidedthe time-consuming RTL simulations.We found that doing the design verifica-tion at the C/C++ level significantlycontributed to the reduction in theoverall development time.

QUALITY OF RESULTS In Figure 10, we compare final FPGAresource utilization and overall devel-opment time for the complete spheredecoder implemented using high-levelsynthesis tools and the referenceSystem Generator implementation,which is basically a structural RTLdesign, explicitly instantiating FPGAprimitives such as DSP48 blocks. Thedevelopment time for AutoESLincludes learning the tool, producingresults, design-space exploration anddetailed verification.



AutoPilot: 8x8 RVD-QRD design aimed at 225 MHz - LUT/FF Usage

108

107

106

105

104

0 5 10 15 200

5000

10000

15000

20000

25000

30000

35000

40000

45000

Effort (days)

Pro

cess

ing

time

for

360

chan

nels

(ns

)

Pos

t-P

AR

res

ourc

e us

age

(#LU

Ts, #

FF

s)

Execution timeFF usage

LUT usageRequired ex time

SysGen FF usageSysGen LUT usage

Metric SysGenExpert

AutoESLExpert

% Diff

DevelopmentTime (weeks)

16.5 15 -9%

LUTs 27,870 29,060 +4%

Registers 42,035 31, 000 -26%

DSP48s 237 201 -15%

18K BRAM 138 99 -28%

Figure 9 – Reduction of FPGA resources over development time

Figure 10 – Quality-of-results metrics give AutoESL the edge.

To have accurate comparisons, wehave reimplemented the referenceRTL design using the latest Xilinx ISE12.1 tools targeting a Virtex-5 FPGA.Likewise, we implemented the RTLgenerated by AutoESL’s AutoPilotusing ISE 12.1 targeting the sameFPGA. Figure 10 shows thatAutoESL’s AutoPilot achieves signifi-cant savings in FPGA resources,mainly through resource sharing inthe implementation of the matrixinverse blocks. We can also observe asignificant reduction in the number ofregisters and a slightly higher utiliza-tion of lookup tables (LUTs). Thisresult is partially due to the fact thatdelay lines are mapped onto SRL16s(i.e., LUTs) in the AutoESL implemen-tation, while the SystemGeneratorversion implements them using regis-

ters. In other modules we traded offBRAMs for LUTRAM, resulting inlower BRAM usage in the channelpreprocessor.

AutoESL’s AutoPilot achieves sig-nificant abstractions from low-levelFPGA implementation details (e.g.,timing and pipeline design), while pro-ducing a quality of results highly com-petitive with those obtained using atraditional RTL design approach.C/C++ level verification contributes tothe reduction in the overall develop-ment time by avoiding time-consum-ing RTL simulations. However, obtain-ing excellent results for complex andchallenging designs requires goodmacroarchitecture definition and asolid knowledge of FPGA design tools,including the ability to understand andinterpret FPGA tool reports.

REFERENCES

[1] Berkeley Design Technology Inc.,

“FPGAs for DSP,” white paper, 2007.

[2] Grant Martin, Gary Smith, “High-

Level Synthesis: Past, Present and

Future,” IEEE Design and Test ofComputers, July/August 2009.

[3] Berkeley Design Technology Inc.,

“High-Level Synthesis Tools for Xilinx

FPGAs,” white paper, 2010:

http://www.xilinx.com/technology/dsp/BDTI_techpaper.pdf.

[4] Chris Dick et al., “FPGA

Implementation of a Near-ML Sphere

Detector for 802.16E Broadband

Wireless Systems,” SDR Conference’09,

December 2009.

[5] K. Denolf, S. Neuendorffer, K.

Vissers, “Using C-to-Gates to Program

Streaming Image Processing Kernels

Efficiently on FPGAs,” FPL’09 confer-

ence, September 2009.



Versatile FPGA Platform

www.techway.eu

The Versatile FPGA Platform provides a cost-effective

way of undertaking intensive calculations

and high speedcommunications in an

industrial environment.

PCI Express 4x Short CardXilinx Virtex FamiliesI/0 enabled through an FMC site (VITA 57)Development kit and drivers optimized for Windows and Linux

Optical-Mez

http://www.xilinx.com/technology/dsp

http://www.techway.eu

http://www.xilinx.com/technology/dsp/BDTI_techpaper.pdf

The majority of designers working with Xilinx®

FPGAs use the primary design tools—ISE® ProjectNavigator and PlanAhead™ software—in graphical

user interface mode. The GUI approach provides a push-button flow, which is convenient for small projects.

However, as FPGAs become larger, so do thedesigns built around them and the design teams them-selves. In many cases GUI tools can become a limit-ing factor that hinders team productivity. GUI toolsdon’t provide sufficient flexibility and control overthe build process, and they don’t allow easy inte-gration with other tools. A good example is inte-gration with popular build-management and con-tinuous-integration solutions, such as TeamCity,Hudson CI and CruiseControl, which manydesign teams use ubiquitously for automatedsoftware builds.

Nor do GUI tools provide good support fora distributed computing environment. It cantake several hours or even a day to build alarge FPGA design. To improve the runtime,users do builds on dedicated servers, typi-cally 64-bit multicore Linux machines thathave a lot of memory but in many cases lacka GUI. Third-party job-scheduling solutionsexist, such as Platform LSF, to provide flex-ible build scheduling, load balancing andfine-grained control.

A less obvious reason to avoid GUI toolsis memory and CPU resource utilization.ISE Project Navigator and PlanAhead arememory-hungry applications—each GUItool instance uses more than 100 Mbytes ofRAM. On a typical workstation that has 2 Gbytes of memory, that’s 5 percent—a sub-

stantial amount of a commodity that can beput to a better use.

Finally, many Xilinx users come to FPGAswith an ASIC design background. These engi-

neers are accustomed to using command-linetool flows, and want to use similar flows in their

FPGA designs. These are some of the reasons why designers are

looking to switch to command-line mode for some ofthe tasks in a design cycle.

XILINX ENVIRONMENT VARIABLESThe very first task a user encounters while working with

command-line tools is setting up environment variables.The procedure varies depending on the Xilinx ISE version. In

versions before 12.x, all environment variables are set duringthe tool installation. Users can run command-line tools without

any further action.


Using Xilinx Tools in Command-Line Mode

XPERTS CORNER

Uncover new ISE design tools and methodologies to improveyour team’s productivity.

by Evgeni StavinovHardware ArchitectSerialTek [email protected]



That changed in ISE 12.x. Users now need to set all theenvironment variables every time before running the tools.The main reason for the change is to allow multiple ISE ver-sions installed on the same machine to coexist by limitingthe scope of the environmental variables to the local shell orcommand-line terminal, depending on the operating system.

There are two ways of setting up environment vari-ables in ISE 12.x. The simpler method is to call a scriptsettings32.{bat,sh,csh} or settings64.{bat,sh,csh}, depend-ing on the platform. The default location on Windowsmachines is C:\Xilinx\12.1\ISE_DS\; on Linux, it’s/opt/Xilinx/12.1/ISE_DS/.

Here is a Linux bash example of calling the script:

$ source /opt/Xilinx/12.1/ISE_DS/settings64.sh

The other option is to set the environment variables directly,as shown in the following chart:

Xilinx command-line tools can run on both Windows andLinux operating systems, and can be invoked using a broadrange of scripting languages, as Table 1 illustrates. Asidefrom these choices, other scripts known to run Xilinx toolsare Ruby and Python.

MANY CHOICES IN SCRIPTING LANGUAGES

Perl Perl is a popular scripting languageused by a wide range of EDA andother tools. Xilinx ISE installationcontains a customized Perl distribu-tion. Users can start the Perl shell byrunning xilperl command:

$ xilperl –v # display Perl version

TCL Tool Command Language (TCL) is a defacto standard scripting language ofASIC and FPGA design tools. TCL isvery different syntactically from otherscripting languages, and many develop-ers find it difficult to get used to. Thismight be one of the reasons TCL is lesspopular than Perl.

TCL is good for what it’s designed for—namely, writing tool command scripts.TCL is widely available, has excellentdocumentation and enjoys good com-munity support.

Xilinx ISE installation comes with acustomized TCL distribution. To startthe TCL shell use xtclsh command:

$ xtclsh –v # display TCL version

Unix bashand csh

There are several Linux and Unix shellflavors. The most popular are bashand csh.

Unlike other Unix/Linux shells, cshboasts a C-like scripting language.However, bash scripting language offersmore features, such as shell functions,command-line editing, signal traps andprocess handling. Bash is a default shellin most modern Linux distributions,and is gradually replacing csh.

Windowsbatch andPowerShell

Windows users have two scripting-lan-guage choices: DOS command line andthe more flexible PowerShell. An exam-ple of the XST invocation from DOScommand line is:

>xst -intstyle ise -ifn "crc.xst"

–ofn "crc.syr"

Windows >set XILINX= C:\Xilinx\12.1\

ISE_DS\ISE

>set XILINX_DSP=%XILINX%

>set PATH=%XILINX%\bin\nt;%XILINX%\

lib\nt;%PATH%

Linux bash $ export XILINX=/opt/Xilinx/12.1/

ISE_DS/ISE

$export XILINX_DSP=$XILINX

$ export PATH=${XILINX}/bin/lin64:

${XILINX}/sysgen/util:${PATH}

Linux csh % setenv XILINX/opt/Xilinx/12.1/

ISE_DS/ISE

% setenv XILINX_DSP $XILINX

% setenv PATH ${XILINX}/bin/lin64:

${XILINX}/sysgen/util:${PATH}

Table 1 – The most popular scripting languages are Perl, TCL,Unix/Linux and Windows batch and PowerShell.

X P E R T S C O R N E R

BUILD FLOWSXilinx provides several options to build a design usingcommand-line tools. Let’s take a closer look at the fourmost popular ones: direct invocation, xflow, xtclsh andPlanAhead. We’ll use a simple CRC generator project asan example.

The direct-invocation method calls tools in the followingsequence: XST (or other synthesis tool), NGDBuild, map,place-and-route (PAR), TRACE (optional) and BitGen.

You can autogenerate the sequence script from the ISEProject Navigator by choosing the Design -> View command

line log file, as shown in Figure 1.

The following script is an example of building a CRCgenerator project:

xst -intstyle ise -ifn "/proj/crc.xst" -ofn

"/proj/crc.syr"

ngdbuild -intstyle ise -dd _ngo

-nt timestamp -uc /ucf/crc.ucf

-p xc6slx9-csg225-3 crc.ngc crc.ngd

map -intstyle ise -p xc6slx9-csg225-3 -w -ol high

-t 1 -xt 0

-global_opt off -lc off -o crc_map.ncd crc.ngd

crc.pcf

par -w -intstyle ise -ol high crc_map.ncd crc.ncd

crc.pcf

trce -intstyle ise -v 3 -s 3 -n 3 -fastpaths -xml

crc.twx crc.ncd

-o crc.twr crc.pcf

bitgen -intstyle ise -f crc.ut crc.ncd

EASIER TO USEXilinx’s XFLOW utility provides another way to build adesign. It’s more integrated than direct invocation, doesn’trequire as much tool knowledge and is easier to use. Forexample, XFLOW does not require exit-code checking todetermine a pass/fail condition.

The XFLOW utility accepts a script that describes buildoptions. Depending on those options, XFLOW can run syn-thesis, implementation, BitGen or a combination of all three.

Synthesis only:xflow -p xc6slx9-csg225-3 -synth synth.opt

../src/crc.v

Implementation:xflow -p xc6slx9-csg225-3 -implement impl.opt

../crc.ngc

Implementation and BitGen:xflow -p xc6slx9-csg225-3 -implement impl.opt -config

bitgen.opt ../crc.ngc

You can generate the XFLOW script (see Figure 2) manual-ly or by using one of the templates located in the ISE instal-lation at the following location: $XILINX\ISE_DS\ISE\xil-inx\data\. The latest ISE software doesn’t provide an optionto autogenerate XFLOW scripts.



Figure 1 – Direct tool invocation sequence

synth

impl impl

bitstream

XFLOW options

Figure 2 – The XFLOW utility is more integrated than direct invocation.

HOW TO USE XTCLSHYou can also build a design using TCL script invoked fromthe Xilinx xtclsh, as follows:

xtclsh crc.tcl rebuild_project

You can either write the TCL script, which passes as aparameter to xtclsh, manually or autogenerate it from theISE ProjectNavigator by choosing Project->Generate

TCL Script (see Figure 3).

Xtclsh is the only build flow that accepts the original ISEproject in .xise format as an input. All other flows requireprojects and file lists, such as .xst and .prj, derived from theoriginal .xise project.

Each of the XST, NGDBuild, map, PAR, TRACE andBitGen tool options has its TCL-equivalent property. Forexample, the XST command-line equivalent of the Boolean“Add I/O Buffers” is –iobuf, while –fsm_style is the com-mand-line version of “FSM Style” (list).

Each tool is invoked using the TCL “process run” com-mand, as shown in Figure 4.

THE PLANAHEAD ADVANTAGEAn increasing number of Xilinx users are migrating fromISE Project Navigator and adopting PlanAhead as a maindesign tool. PlanAhead offers more build-related features,such as scheduling multiple concurrent builds, along withmore flexible build options and project manipulation.

PlanAhead uses TCL as its main scripting language. TCLclosely follows Synopsys’ SDC semantics for tool-specificcommands.

PlanAhead has two command-line modes: interactiveshell and batch mode. To enter an interactive shell, type thefollowing command:

PlanAhead –mode tcl

To run the entire TCL script in batch mode, source thescript as shown below:

PlanAhead –mode tcl –source <script_name.tcl>

The PlanAhead build flow (or run) consists of threesteps: synthesis, implementation and bitstream generation,as illustrated in Figure 5.

The PlanAhead software maintains its log of the opera-tions in the PlanAhead.jou file. The file is in TCL format,and located in C:\Documents and Settings\<username>\Application Data\HDI\ on Windows, and ~/.HDI/ onLinux machines.

Users can run the build in GUI mode first, and then copysome of the log commands in their command-line buildscripts.

Below is a PlanAhead TCL script used to build an exam-ple CRC project.

create_project pa_proj {crc_example/pa_proj} -part

xc6slx9csg225-3



-process “Synthesize - XST”

-process “Translate”

-process “Map”

-process “Place & Route”

-process “Generate Post - Place & Route Static Timing”

-process “Generate Programming File”

Xtclsh processes

Figure 3 – TCL script generation

Figure 4 – Xtclsh is the only flow that accepts .xise input.

set_property design_mode RTL

[get_property srcset

[current_run]]

add_files -norecurse

{crc_example/proj/.. /src/crc.v}

set_property library work

[get_files -of_objects

[get_property srcset [current_run]]

{src/crc.v}]

add_files -fileset [get_property

constrset [current_run]]

-norecurse {ucf/crc.ucf}

set_property top crc [get_property

srcset [current_run]]

set_property verilog_2001 true

[get_property srcset

[current_run]]

launch_runs -runs synth_1 -jobs 1

-scripts_only -dir

{crc_example/pa_proj/pa_proj.runs}

launch_runs -runs synth_1 -jobs 1

launch_runs -runs impl_1 -jobs 1

set_property add_step Bitgen [get_runs impl_1]

launch_runs -runs impl_1 -jobs 1 -dir

{crc_example/pa_proj/pa_proj.runs}

In addition to providing standard TCL scriptingcapabilities, PlanAhead offers other powerful fea-tures. It allows query and manipulation of thedesign database, settings and states, all from a TCLscript. This simple script illustrates some of theadvanced PlanAhead features that you can use in acommand-line mode:

set my_port [get_ports rst]

report_property $my_port

get_property PULLUP $my_port



pre-build steps

synthesis

ngbuild

map

place-and-route

static timing analysis

bitstreamgeneration

post-build steps

synthesis

implementation

bitstreamgeneration

PlanAhead TCL Mode

THE XILINX FPGA BUILD PROCESSBefore delving into the details of command-line flows, it’s useful to briefly review theXilinx FPGA build process (see figure), or the sequence of steps involved in building anFPGA design, from RTL to bitstream.

Xilinx provides two GUI tools with which to do a build: ISE Project Navigator andPlanAhead. There are several third-party tools that can do Xilinx FPGA builds as well, forexample, Synopsys’ Synplify product.

The exact build sequence will differ depending on the tool used. However, any XilinxFPGA build will contain eight fundamental steps: pre-build, synthesis, NGDBuild, map,place-and-route, static timing analysis, BitGen and post-build.

The main pre-build tasks are getting the latest code from a source control repository,assembling project file lists, setting all tool environment variables, acquiring tool licenses,incrementing a build number and preprocessing the RTL.

During the synthesis stage, designers may use the Xilinx XST tool or a third-party offer-ing. The major third-party synthesis tools for general-purpose FPGA design are Synopsys’Synplify and Mentor Graphics’ Precision.

The NGDBuild phase involves netlist translation using the Xilinx NGDBuild tool, whilethe next step, map, involves mapping the netlist into FPGA-specific resources, such asslices, RAMs and I/Os, using the Xilinx MAP tool. Xilinx’s PAR tool handles placementand routing, and the TRACE tool performs static timing analysis. Finally, the BitGen stageis the point in the design cycle at which FPGA bitstreams are generated.

Wrapping up the cycle, the post-build period involves tasks such as parsing the buildreports, processing and archiving build results and sending e-mail notifications to usersthat own the build. – Evgeni Stavinov

Figure 5 – PlanAhead offers many build-related features.

The Xilinx FPGAbuild process

involves a numberof sequenced steps.

set_property PULLUP 1 $my_port

get_property PULLUP $my_port

The simple script adds a pull-up resistor to one of thedesign ports. Of course, you can do the same by changingUCF constraints and rebuilding the project, or by usingthe FPGA Editor. But PlanAhead can do it with just threelines of code.

Finally, many developers prefer using the make utilitywhen doing FPGA builds. The two most important advan-tages of make are built-in checking of the return code andautomatic dependency tracking. Any of the flowsdescribed above can work with make.

LESSER-KNOWN COMMAND-LINE TOOLSThe Xilinx ISE installation contains several lesser-knowncommand-line tools that FPGA designers might find useful.ISE Project Navigator and PlanAhead invoke these toolsbehind the scenes as part of the build flow. The tools areeither poorly documented or not documented at all, whichlimits their exposure to users.

Ten of these tools especially are worth investigating.

• data2mem: This utility, which is used to initialize thecontents of a BRAM, doesn’t require rerunning Xilinximplementation. It inserts the new BRAM data directlyinto a bitstream. Another data2mem usage is to split theinitialization data of a large RAM consisting of severalBRAM primitives into individual BRAMs.

• fpga_edline: A command-line version of the FPGA Editor,fpga_edline can be useful for applying changes to the post-PAR .ngc file from a script as part of the build process.Some of the use cases include adding ChipScope™ probes,minor routing changes, changing LUT or IOB properties.

• mem_edit: This is not a tool per se, but a script thatopens Java applications. You can use it for simple memo-ry content editing.

• netgen: Here is a tool you can use in many situations. Itgets a Xilinx design file in .ncd format as an input, andproduces a Xilinx-independent netlist for use in simula-tion, equivalence checking and static timing analysis,depending on the command-line options.

• ngcbuild: This is a tool that consolidates several designnetlist files into one. It’s mainly used for the convenience ofworking with a single design file rather than multipledesigns and IP cores scattered around different directories.

• obngc: This is a utility that obfuscates .ngc files in orderto hide confidential design features, and prevent design

analysis and reverse-engineering. You can explore NGCfiles in FPGA Editor.

• pin2ucf: You can use this utility to generate pin-lockingconstraints from an NCD file in UCF format.

• XDL: An essential tool for power users, XDL has threefundamental modes: report device resource information,convert NCD to XDL and convert XDL to NCD. XDL is atext format that third-party tools can process. This is theonly way to access post-place-and-route designs.

• xreport: This is a utility that lets you view build reportsoutside of the ISE Project Navigator software. It has tableviews of design summary, resource usage, hyperlinkedwarnings and errors, and message filtering.

• xinfo: A utility that reports system information, xinfoalso details installed Xilinx software and IP cores, licens-es, ISE preferences, relevant environment variables andother useful information.

The tools are installed in $XILINX/ISE/bin/{nt, nt64, lin,lin64} and $XILINX/common/bin/{nt, nt64, lin, lin64}directories. More inquisitive readers might want to fur-ther explore those directories to discover other interest-ing tools that can boost design productivity.

ASSESSING THE OPTIONSIn this article we’ve discussed advantages of using Xilinxtools in command-line mode, explored several Xilinx buildflows, examined different scripting languages and taken acloser look at lesser-known tools and utilities. The goalwas not to identify the best build flow or script language,but rather to discuss the available options and parse theunique advantages and drawbacks of each to help XilinxFPGA designers make an informed decision.

You can base your selection process on a number of cri-teria, such as existing project settings, design team experi-ence and familiarity with the tools and scripts, ease of use,flexibility, expected level of customization and integrationwith other tools, among others. Users can find all scripts andexample projects used in this article on the author’s website: http://outputlogic.com/xcell_using_xilinx_tools/.

Evgeni Stavinov is a longtime Xilinx user with more

than 10 years of diverse design experience. Before

becoming a hardware architect at SerialTek LLC, he

held different engineering positions at Xilinx, LeCroy

and CATC. Evgeni is a creator of OutputLogic.com, an

online productivity-tools portal.



http://outputlogic.com/xcell_using_xilinx_tools/

http://www.outputlogic.com


FPGA101

When It Comes toRuntime Chatter,Less Is Best


The end goal of a testbench is to show that your design achieves itspurpose without having any bugs. Getting to that point, however,takes work. You will have to find, diagnose and fix bugs, in both the

testbench and the design. Anything that makes the debugging process eas-ier and achieves closure on the testbench and design under test (DUT) isvaluable—both in terms of time and money.

Controlling verbosity on a component level can save considerable timeand effort when debugging failing tests by presenting only information rel-evant to the area where a failure is occurring. Controlling verbosity at run-time saves considerably more time by allowing engineers to focus on aproblem area without having to edit and recompile the testbench code orsift through excessively verbose log files. While there is always the optionof using “grep” or other command-line tools to post-process logs, havingthe right information in the log files to begin with saves effort.

Some testbench methodologies do a great job of controlling verbosity ina granular way, but they lack flexibility at runtime. This is the case with theOpen Verification Methodology (OVM), the de facto industry testbenchmethodology standard.

The OVM provides a great framework for reporting debugging informa-tion. You can specify severity and verbosity levels for each reported mes-sage and implement different actions for various combinations of the two,specifying different verbosity thresholds and actions down to the ID.Moreover, during debug, OVM allows you to increase or decrease the ver-bosity threshold for the testbench as a whole. What OVM doesn’t do is pro-vide the ability to change verbosity for individual components at runtime.

To solve this runtime issue, we developed an add-on package at Xilinx,with additional direction from Mentor Graphics. When we employed it in arecent testbench environment, it saved us considerable time. Because of theadvantages it delivers, we have adopted this package for several currentdesigns at Xilinx. In fact, we almost always use runtime verbosity controls tozero in on problem areas. We have found that not having to recompile or siftthrough excessively verbose log files makes the designer’s life a lot easier.

In this article, we will show the basic steps required to implement theseverbosity controls. You can download the full methodology (including thestandalone package, which you can easily add to existing testbenches) fromhttp://www.xilinx.com/publications/xcellonline/downloads/runtime-

verbosity-package.zip. Everything said here about OVM should apply equal-ly to the Universal Verification Methodology, the next evolutionary step fortestbench methodology. The initial UVM version is derived from the OVMwith a few additions.

Xilinx add-on package simplifies testbench debug by controlling verbosity on a component level.

by Russ NelsonStaff Design EngineerXilinx, [email protected]

Michael HornPrincipal Verification ArchitectMentor Graphics [email protected]

http://www.xilinx.com/publications/xcellonline/downloads/runtime-verbosity-package.zip





A FEW CHOICE WORDS MAKE THE POINTSelectively increasing verbosity for a component or set ofcomponents is an easy and effective way of acceleratingthe time it takes to find bugs in a particular area of a veri-fication testbench. Passing in directives via plusargs turnsup the verbosity for components in the area of interest.

Take, for example, the following:

<sim_command> +OVM_VERBOSITY=OVM_LOW+VERB_OVM_HIGH=TX_SCOREBOARD

In this command, OVM_VERBOSITY=OVM_LOW is thebuilt-in OVM method for reducing the overall verbosity toa minimum, while VERB_OVM_HIGH=TX_SCOREBOARDis our command for boosting the TX scoreboard to a highlevel of verbosity.

Before adding any code to a testbench, it is importantto define the methodology that you will use to identifyeach component. It is also helpful (but not necessary) todefine what kind of information will be presented at eachverbosity level—consistency makes the debugging

process easier. (See the table on the final page for anexample of how to plan the verbosity levels.) Once youhave formulated a plan, there are two key testbench fea-tures required for enabling runtime verbosity controls.First, you must parse the command-line directives.Second, you must carry out the directives to change ver-bosity on a component-level basis.

The first step is to create a procedure for distinguishingone component from another; in other words, how youwill identify the areas of interest. We recommend using aunique tag (we call it report_id) for each component, witha default value that you can override via the OVM configu-ration database. The following code snippet illustrates oneway of implementing this functionality:

class example_agent extends ovm_agent;

string report_id = "EXAMPLE_AGENT";

// Other code not shown

endclass

function void example_agent::build();

super.build();

// Void casting this call, since

// it’s fine to use the

// default if not overridden

void'(get_config_string("report_id",

report_id));


endfunction

If you will use a particular component in several loca-tions within a testbench, it can be helpful to override itsdefault tag (report_id) with a new tag in such a way thateach component’s tag is unique within the testbench. Whenusing the tag to control verbosity, the level of control is onlyas precise as the tag—if two components share a tag, theycannot be controlled independently.

Another method for identifying a component is by its hier-archical name. The hierarchical name is an OVM built-in prop-

erty of each component in a testbench. While there is nothingwrong with using the hierarchical name to identify a compo-nent, it takes a lot more typing to specify components of inter-est, as the names can be quite long. However, OVM requiresthat the names be unique, which can be an advantage. Theexample code provided with this article supports identifyingcomponents either by tag or by hierarchical name.

The next step is to determine the runtime changes to ver-bosity. The OVM takes care of default verbosity via theOVM_VERBOSITY plusarg (OVM_VERBOSITY=OVM_LOW,etc.). We recommend implementing a similar method, witha plusarg for each OVM-defined verbosity level.

We used VERB_OVM_NONE, VERB_OVM_LOW,VERB_OVM_MED, VERB_OVM_HIGH, VERB_OVM_FULLand VERB_OVM_DEBUG. These correspond to the verbosi-ty enum built into OVM using similar names.

You can specify each verbosity level using a single reporttag, a hierarchical name or a colon-separated list of tags ornames. Additionally, for each tag, you can specify that theverbosity setting should be applied recursively to all the


F P G A 1 0 1

Before adding any code to a testbench, it is important to define the methodology

that you will use to identify each component. It is also helpful (but not necessary)

to define what kind of information will be presented at each verbosity level—

consistency makes the debugging process easier.

specified component’s children. The included code uses theprefixes “^” (for a recursive setting) and “~” (for a hierar-chical name). Here are a few examples:

• Low verbosity for most of the testbench, high verbosi-ty for the scoreboard:<sim> +OVM_VERBOSITY=OVM_LOW+VERB_OVM_HIGH=SCOREBOARD

• Full verbosity for the slave and all its children:<sim> +VERB_OVM_FULL=^SLAVE

• Full verbosity for the TX agent and its children, exceptfor the startup module:<sim> +VERB_OVM_FULL=^~ovm_test_top.tb.tx.tx_agent +VERB_OVM_LOW=~ovm_test_top.tb.tx.tx_agent.startup

You must determine the passed-in arguments early in thesimulation so that components can pick up the changesbefore they start reporting. The example code accomplish-es this by using a singleton class instantiated in a package.By including the package somewhere in the testbench (wesuggest the top-level module), the command-line process-ing gets called automatically at the beginning of the simula-tion. The following shows a simplified version of the pack-age code, illustrating just the key concepts.

// SystemVerilog package which provides// runtime verbosity optionspackage runtime_verbosity_pkg;

// Class which parses the// command-line plusargsclass verbosity_plusargs extends

ovm_object;// “new” gets called automaticallyfunction new(string name =

"verbosity_plusargs");super.new(name);// parse_plusargs does all the// string-processing workparse_plusargs();

endfunction : newendclass : verbosity_plusargs

// Instantiate one copy of the// verbosity_plusargs class in the// package - this will be shared by// everything which imports// this packageverbosity_plusargs

verb_plusargs = new();endpackage : runtime_verbosity_pkg

Finally, you must change the verbosity setting for eachreporting item. It is important to apply the setting early inthe item’s life span, before reporting begins. OVM compo-nents operate in “phases,” namely build, connect,end_of_elaboration, start_of_simulation, run, extract,check and report. UVM has an expanded phasing mecha-nism, but the need for applying the verbosity setting earlystill applies.

To ensure that you have tweaked the verbosity beforereporting begins, you should make the adjustment in thebuild() task of the component. UVM adds the capability tocustomize the phases somewhat, but it still uses build() asthe first phase. The example code provides a simple call toupdate the component’s verbosity setting. Building on theexample agent shown above, the build() function nowlooks like this:

function void example_agent::build();

super.build();

// Void casting this call, since

// it's fine to use the default

// if not overridden

void'(get_config_string("report_id",

report_id));


// Set the verbosity for this component

runtime_verbosity_pkg::set_verbosity(

.comp(this),.report_id(report_id));

endfunction

For simplicity, we recommend adding this functionalityto the build() function of each component in the test-bench. However, if that is not feasible (for example, if it’stoo late to change the components or if you are using lega-cy or third-party verification components), you can over-ride the default reporter, which is called by each compo-nent. The downside of overriding the reporter is that itwon’t increase the verbosity when using the OVM conven-ience macro `ovm_info, since the macro checks the com-ponent’s verbosity level before calling the reporter (it stillworks for decreasing the verbosity). Overriding thedefault reporter in the example code is as simple as addingthe following line somewhere in the testbench (we recom-mend adding it to the top-level module):

import runtime_verbosity_server_pkg::*;

F P G A 1 0 1


You can combine the two methods if you need to supportlegacy or third-party components yet still want to use themacros in the code.

The methodology shown here also works with UVM,with small changes to a few function names. With a littlemore effort, you can also extend it to work with objects,such as transactions and sequences. We’ve included asmall example testbench showing the method presented inthis article in the methodology zip file at http://www.xilinx.

com/publications/xcellonline/downloads/verb-ovm-

environment.zip.

As the examples presented here show, with a littlethought and planning, enabling runtime verbosity in atestbench is a very simple matter. This approach hasproven itself in production testbenches and designs. Ourexperience shows that the effort required for this small

addition has a disproportionately large payoff—debug-ging tests become much quicker and easier to do. Thismeans the simulator is running simulations more oftenand simulation cycles are not wasted. This is crucial,since simulation cycles are not a renewable resource. If asimulator is not running at a particular moment, that timecould be viewed as a wasted resource. Quicker debuggingmeans more up-time for testbench development andshorter time-to-closure.

The end goal of writing a testbench is to get a designimplemented and in the customer’s hands. Any advantagethat accelerates this goal is a welcome addition to thedesigner’s bag of tricks. Adding finer-grained verbosity con-trol at runtime, instead of requiring a loopback to compila-tion, saves valuable time and energy, speeding delivery ofthe end product and startup of your next project.

F P G A 1 0 1

VERBOSITY MESSAGE TYPE EXAMPLES

OVM_LOW Major environment/DUT state changes Beginning reset sequence; link established

OVM_MEDIUM Other environment/DUT state changes Interrupt asserted; arbitration timeout

OVM_HIGH Sequences started and finished Beginning read sequence; ending reset

OVM_HIGH Scoreboard activity Compare passed; check skipped

OVM_HIGH Major driver and monitor events Packet corrupted due to reset

OVM_HIGHReporting high-level configuration settings at startup

32-bit datapath; two masters and four slaves instantiated

OVM_FULL Transactions driven and monitored Full dump of transaction contents

OVM_FULL Driver and monitor state changesArbitration request/grant; inserting <n> IDLE cycles

OVM_DEBUG Objections raised and lowered Raised objection; lowered objection

OVM_DEBUGDetailed driver, monitor and scoreboard activity

Loop iterations, if-branches followed, etc.


WHICH INFORMATION AT WHICH LEVEL?It can be helpful to define what type of information is presented at each verbosity level, especially if a large design

team is involved. Establish a list of all the types of information you expect to generate and create a table showing the

verbosity for each message type. Our example table shows a few common types of messages.

http://www.xilinx

http://www.xilinx.com/publications/xcellonline/downloads/verb-ovmenvironment.zip

Time to market, engineering expense, complex fabrication, and board trouble-shooting all point to a ‘buy’ instead of ‘build’ decision. Proven FPGA boards from theDini Group will provide you with a hardware solution that works — on time, andunder budget. For eight generations of FPGAs we have created the biggest, fastest,and most versatile prototyping boards. You can see the benefits of this experience inour latest Vir tex-6 Prototyping Platform.

We started with seven of the newest, most powerful FPGAs for 37 Million ASICGates on a single board. We hooked them up with FPGA to FPGA busses that runat 650 MHz (1.3 Gb/s in DDR mode) and made sure that 100% of the boardresources are dedicated to your application. A Marvell MV78200 with Dual ARMCPUs provides any high speed interface you might want, and after FPGA configu-ration, these 1 GHz floating point processors are available for your use.

Stuffing options for this board are extensive. Useful configurations start below$25,000. You can spend six months plus building a board to your exact specifications,or start now with the board you need to get your design running at speed. Best ofall, you can troubleshoot your design, not the board. Buy your prototyping hardware,and we will save you time and money.

www.dinigroup.com • 7469 Draper Avenue • La Jolla, CA 92037 • (858) 454-3419 • e-mail: [email protected]

DN2076K10 ASIC Prototyping Platform

Why build your own ASIC prototyping hardware?

Daughter Cards

Custom, or:

FMC

LCDDRIVER

ARM TILE

SODIMMR FS

FLASHSOCKET

ECT

Intercon

OBS

MICTOR DIFF

DVI

V5T

V5TPCIE

S2GX

AD-DA

USB30 USB20

PCIE SATA

Custom, or:

RLDRAM-IISSRAM

MICTOR

QUADMIC

INTERCON

FLASH

DDR1

DDR2

DDR3

SDR

SE

QDR

RLDRAM

USB

12V Power

JTAG Mictor SATA

SATA

Seven50A Supplies

10/100/1000Ethernet

SMAs forHigh Speed

Serial

SODIMM Sockets

PCIe 2.0

PCIe to MarvellCPU

PCIe 2.0

USB 2.0

USB 3.0

10/100/1000Ethernet

10 GbEthernet

GTXExpansionHeader

3rd PartyDeBug

TestConnector

12V Power

All the gates and features you need are — off the shelf.

SODIMM SODIMM

http://www.dinigroup.com



Breaking Barriers with FPGA-Based Design-for-Prototyping

Automation and a new design methodology can improve software quality and supercharge SoC project schedules.

XPERT OPINION

by Doug AmosBusiness Development ManagerSynopsys [email protected]



There has never been a bettertime to use FPGAs. The capabil-ity of the devices, the power of

the tools and the wealth of FPGA-readyIP mean that we can tackle almost anyapplication with confidence. However,there are still between 5,000 and 10,000designs per year that require an ASIC,system-on-chip or ASSP solution—let’suse the generic term SoC for all of them.Even in these situations, FPGAs stillhave a crucial role to play in the pro-ject’s success as prototyping vehicles.There are many good reasons to opt forFPGA-based prototyping, especially ifyou can find ways for SoC teams andtheir FPGA-expert colleagues to com-bine their talents to speed up schedulesand improve quality.

THREE ‘LAWS’ TO GET US STARTEDYou may have noticed that technicaldisciplines often boil down to threelaws. There’s Sir Isaac Newton’sfamous trio of laws of motion, forexample, or the three laws of thermo-dynamics and even Isaac Asimov’sthree laws of robotics. After nearly 30years of working with FPGAs, I’d liketo humbly add a new set of three“laws” that seem to govern or influ-ence FPGA-based prototyping. Theycan be stated as follows:

1. SoCs can be larger than FPGAs.

2. SoCs can be faster than FPGAs.

3. SoC designs can be hard to portto FPGA.

OK, they are not going to win me aNobel Prize and they are not reallylaws at all in the strict sense. Rather,they represent an attempt to describethe root causes of the challenges thatface us when prototyping. Prototypershave spent a lot of development effortand engineering ingenuity on dealingwith these challenges. Let’s take acloser look at each of the three and indoing so, find a way around them byadopting a different approach to theSoC project, one that we at Synopsyscall Design-for-Prototyping.

WORKING WITHIN THE THREE LAWSImplementing an FPGA-based proto-type of an SoC requires overcoming avariety of development challenges.Perhaps surprisingly, creating theFPGA hardware itself is not the mostdifficult problem. With the help ofadvanced FPGA devices, such as theXCV6LX760 from the Xilinx® Virtex®-6family, it is possible to create boardsand platforms that can host up to 100million SoC gates. It’s tempting todelve into the pros and cons of mak-ing vs. buying such boards but, leav-ing that subject for another day, let’sfocus instead on the particular expert-ise required to implement an SoCdesign in FPGA resources. Our three“laws” are a useful scaffolding uponwhich to build the argument forFPGA-based prototyping.

FIRST LAW: SOCS CAN BELARGER THAN FPGASEven though today’s largest FPGAscan handle millions of SoC gates, therewill be SoC designs which are largerstill. To prototype these hefty SoCs,we not only need to partition the SoCdesign across multiple FPGA devicesbut must then reconnect the signalsamong those devices. Typically, whenyou have to split the design betweentwo or more FPGAs, the number ofpins available to reconnect the FPGAsbecomes the limiting factor, evenwhen using the latest FPGAs sportingmore than 1,000 I/O pins each. Forexample, partitioning designs withwide buses or datapaths can require asurprisingly large number of I/O pins.

Our goal must be to modify the SoCRTL as little as possible, but partition-ing and reconnecting create artificialboundaries in the design. This pres-ents complex challenges in optimalpartitioning, pin allocation and board-trace assignment. Since the jobinvolves working with thousands ofpins at a time, automation will be ahelp, given that just one misplaced pincan cause our prototype not to func-

tion (not to mention, provide endlesshours of fun debugging).

Full automation is difficult toachieve, however, because every SoCdesign is unique and the variety ofdesigns is wide. It is difficult to devisean automated algorithm that will pro-vide optimal results across so manyscenarios. We usually find that it isbest to start with some engineer-guid-ed steps for critical sections of thedesign and then use automation tocomplete the parts where optimalresults are not so crucial. Indeed, thatis exactly the approach of many pro-duction FPGA designs, where somewell-judged manual intervention cangreatly improve results. For prototyp-ing, it is the task of companies likeSynopsys and Xilinx to provide toolsthat balance automation, optimizationand intervention, with the understand-ing that the tools and methods neededfor prototyping and for productiondesigns are not necessarily the same.

Can we break the first law?Actually, no; there will probablyalways be SoC designs that are largerthan the available FPGAs. However,with semi-automated partitioning,high-speed interconnect and some

X P E R T O P I N I O N

designer intelligence, we can optimizeresults to minimize the effect of thepartition boundaries on the overallprototype system speed.

Further, we can certainly mitigatethe effect of the first law by using time-division multiplexing, allowing severaldesign signals to pass through a singleboard trace between two FPGAs. Thisgreatly eases the partitioning and recon-nection tasks but may limit the proto-type’s overall performance, especially ifhigh multiplexing ratios are needed.Happily, designers can now easily usethe FPGA’s fast differential I/O capabili-ty and source-synchronous signaling asthe transport medium between theFPGAs to regain performance, even atmultiplexing ratios as high as 128:2. Youcan achieve all of this in an automatedway without any need for changing thesource RTL for the SoC.

SECOND LAW: SOCS CAN BE FASTER THAN FPGASSoC designs can achieve a very highcore speed for critical paths mappedinto particular high-performance libraryelements that trade speed for area,power or cost. The unfortunate sideeffect is that critical paths in SoCdesigns may then be synthesized intomore levels of logic than would other-wise have been the case. That leavesprototypers with the task of mappingthese long paths into FPGA fabrics. Youcan optimize performance in the FPGAusing physical synthesis or layout tech-niques such as relationally placedmacros, but it may also be sensibleto start with the right expectationof the maximum per-formance achiev-able for thesecritical paths. Inmany cases, itmay not be neces-sary to run thesepaths at full speed anyway.

Can we break the second law?FPGAs deliver the performance toexceed the needs of the majority ofprototype end users, namely the soft-

ware engineers performing presiliconvalidation. These software engineersintegrate embedded software into thetarget hardware and also validate itsbehavior in the context of complexmulticore operating systems using real-time data and real-world environments.

To be most useful, the prototypehas to run accurately and at a speed asclose as possible to that of the finalSoC silicon. However, we can alsoplan to run the prototype at, say, halfor quarter speed, and let the softwareteam adjust their timing parameters inorder to allow for the lower speed andstill run their tests. Even this reducedspeed will be an order of magnitudefaster than any comparable emulationsystem, or two or three orders of mag-nitude faster than an RTL simulator.The performance of FPGAs providesmonths of benefit before the first SoCsilicon samples are finally available.

THIRD LAW: SOC DESIGNS CANBE HARD TO PORT TO FPGAA very common problem facing theteam when porting an SoC design toan FPGA is that the authors of theoriginal SoC RTL didn’t plan ahead tomeet the prototyper’s needs. As aresult, the design usually containssome SoC-specific elements—such aslibrary-cell and RAM instantiations—that require special attention beforethey can be implemented in an FPGAtool flow. The two most dominantexamples of such “FPGA-hostile”design elements are overly complexclock networks and nonportable IP.

Today’s FPGA devices featurevery sophisticated and capableclock-generation and distributionresources, such as the MMCM blocksin Virtex-6. However, SoC designsoften feature very complex clocknetworks themselves, including

clock gating, frequency scaling and alarge number of clock domains.These SoC clock networks can over-flow the resources of an FPGA, andthe problem is compounded whenthe design is split across multipleFPGAs and requires board-levelclock generation and distribution.

Regarding intellectual property(IP), it is hard to conceive of an SoCtoday that does not include manyblocks of IP, from relatively simplefunctional blocks up to CPUs andrelated bus-connection subsystems,and standards-based peripheral IPsuch as PCI Express® or USB. Theoriginal SoC designers may have cho-sen IP to meet their own needs, with-out checking whether an FPGA-equiv-alent version exists. This means that



To be most useful, the prototype has to run accurately and at a speed as close as possible to that of the final SoC silicon.

However, we can also plan to run the prototype at, say, half or quarterspeed, and let the software team adjust their timing parameters in

order to allow for the lower speed and still run their tests.

the prototypers need to either remapthe SoC version of the RTL or substi-tute an FPGA equivalent that is func-tionally as close as possible to theoriginal IP. The good news is thatmany designers regularly use theFPGA’s built-in hard IP and otherFPGA-compatible IP available withinCORE Generator™ from Xilinx orDesignWare from Synopsys as replace-ments for SoC IP with great success.

Can we break the third law? Yes, wecan. First, we can use tools and ingenu-ity to work around the SoC-specificelements. For example, automatedtools can convert gated clocks into sys-tem clocks and the appropriate clock-enables, which are more optimized forFPGA resources. We can also replaceSoC RAM instantiations with theirFPGA equivalents, or extract largerRAMs to an external memory resourceon the prototype board.

HOW CAN WE MAKE FPGA-BASED PROTOTYPING EASIER? Prototypers worldwide are successful-ly employing all the techniques citedabove and many other methods today,but what if we could get the design in amuch more FPGA-friendly versionfrom the start? What if the whole SoCproject team included FPGA expertsfrom the outset, offering guidance onFPGA technology so that the designcould be optimized for both SoC andFPGA implementation? Then we wouldhave a Design-for-Prototyping method-ology that would be a game changer fortoday’s SoC projects, and this wouldhelp software engineers most of all.

DESIGN-FOR-PROTOTYPING IS COMINGSoftware development now dominatesSoC project costs and schedules, and

increasingly, the goal ofthe SoC silicon is to pro-vide an optimal runtimeenvironment for the soft-ware. The design priori-ties in SoC projects arechanging to include meth-ods for accelerating soft-ware development andvalidation. FPGA-basedprototyping is a criticalpart of this change and assuch, it must be integrat-ed more completely intothe SoC project flow,rather than being almostan afterthought in someworst-case examples.

Design teams mustadopt different methods throughoutthe SoC flow in order to ease the cre-ation of an FPGA-based prototype.Then they must provide this proto-type to the software engineers assoon as possible to deliver maximumbenefit before the first SoC siliconarrives. That means eliminating—orat the very least, clearly identifyingand isolating—nonportable elementsso that the prototyping team canmake the necessary changes quicklyand without error.

The happy coincidence is thatguidelines in a Design-for-Prototypingmethodology are also central to goodSoC design practice and will makedesigns more portable and easier toreuse. For example, using wrappers toisolate technology-specific elements,such as memory and arithmeticblocks, so that they can be more easi-ly replaced with FPGA equivalents willalso make it easier to use formal equiv-alency-checking methods in the SoCflow itself. By using only SoC IP forwhich there is a proven and available

FPGA equivalent, design teams canshave weeks off their prototype devel-opment cycle, leading to immeasura-ble benefits in final software qualityand product scheduling.

SoC labs with an interest in Design-for-Prototyping must pay heed to twomajor initiatives, namely, proceduralguidelines and design guidelines.Many examples of each exist; look forfurther literature from Xilinx andSynopsys very soon on this importantsubject area.

FPGA-based prototyping is of suchcrucial benefit to today’s SoC andembedded-software projects that weshould do all we can to ensure its suc-cess. The first order of business is tobreak or overcome the three “laws”that have governed many prototypingprojects up until now, using automationand a Design-for-Prototyping methodol-ogy. If we can integrate the proceduraland design guidelines of Design-for-Prototyping within the whole SoC proj-ect, then software quality and projectschedules will improve—and so willthe final product.



What if we could get the design in a much more FPGA-friendly version from the start? What if the whole SoC project team included FPGA experts from the outset, offering guidance

on FPGA technology so that the design could be optimized for both SoC and FPGA implementation?Then we would have a Design-for-Prototyping methodology that would be a game changer for today's SoC projects.


XAPP887: PRC/EPRC: DATA INTEGRITY AND SECURITY CONTROLLER FOR PARTIAL RECONFIGURATIONhttp://www.xilinx.com/support/documentation/application_

notes/xapp887_PRC_EPRC.pdf

Partial reconfiguration of Xilinx® Virtex® FPGAs requiresthe use of partial bitstreams. These bitstreams reprogramregions of an operational FPGA with new functionalitywithout disrupting the functionality or compromising theintegrity of the remainder of the device. However, loading afaulty or corrupted partial bitstream might cause incorrectdesign behavior, not only in the region undergoing repro-gramming but also in the entire device.

Reprogramming the device with a correct partial or fullbitstream can be the simplest recovery method. However,for some systems, it is essential to validate the integrity ofpartial bitstreams before loading them into the device. Thisis because halting the design operation to reprogram theentire device is undesirable or delivering partial bitstreamsis error prone (as in radio transmission, for example).

This application note by Amir Zeineddini and JimWesselkamper describes a data integrity controller for partialreconfiguration (PRC) that you can include in any partiallyreconfigurable FPGA design to process partial bitstreams fordata integrity. The controller implements cyclic redundancycheck (CRC) in FPGA logic for partial bitstreams before load-ing them through the internal configuration access port(ICAP). A special version of the controller for encrypted par-tial reconfiguration (EPRC) enables encrypted partial bit-streams to use the internal configuration logic for decryption.

This application note is applicable only to Virtex-5 andVirtex-6 devices.

XAPP884: AN ATTRIBUTE-PROGRAMMABLE PRBS GENERATOR AND CHECKERhttp://www.xilinx.com/support/documentation/application_

notes/xapp884_PRBS_GeneratorChecker.pdf

In serial interconnect technology, it is very common to usepseudorandom binary sequence (PRBS) patterns to test therobustness of links. Behind this choice is the fact that everyPRBS pattern has a white spectrum in the frequency domainor, alternatively, a delta-shaped autocorrelation function inthe time domain. This application note describes a PRBS

generator/checker circuit where the generator polynomial,the parallelism level and the functionality (generator orchecker) are programmable via attributes.

Along with “how to” information on circuit description,pinout and standard polynomials, authors DanieleRiccardi and Paolo Novellini offer a design example andthen delve into the theoretical background on PRBSsequences, their generators and how to select them toobtain optimal spectral properties.

XAPP107: IMPLEMENTING TRIPLE-RATE SDI WITHSPARTAN-6 FPGA GTP TRANSCEIVERShttp://www.xilinx.com/support/documentation/application_

notes/xapp1076_S6GTP_TripleRateSDI.pdf

Professional broadcast video equipment makes wide use ofthe triple-rate serial digital interface (SDI) supporting theSMPTE SD-SDI, HD-SDI and 3G-SDI standards. In broad-cast studios and video production centers, SDI interfacescarry uncompressed digital video along with embeddedancillary data, such as multiple audio channels. Spartan®-6FPGA GTP transceivers are well-suited for implementingtriple-rate SDI receivers and transmitters.

In this document, author Reed Tidwell describes triple-rateSDI receiver and transmitter reference designs for the GTPtransceivers in Spartan®-6 devices. The transceivers provide ahigh degree of performance and reliability in a low-costdevice for this application. The Spartan-6 FPGA SDI transmit-ter supplies the basic operations necessary to support SD-SDI, HD-SDI, dual-link HD-SDI and 3G-SDI transmission. (Itdoes not perform video mapping for 3G-SDI or dual-link HD-SDI. Video formats that require mapping must be mapped intoSDI data streams prior to the triple-rate SDI TX module.)

The Spartan-6 FPGA GTP transceiver triple-rate SDItransmitter has a 20-bit GTP interface, supported on -3 andfaster Spartan-6 devices. Only two reference clock frequen-cies are required to support all SDI modes.

XAPP495: IMPLEMENTING A TMDS VIDEOINTERFACE IN THE SPARTAN-6 FPGAhttp://www.xilinx.com/support/documentation/application_

notes/xapp495_S6TMDS_Video_Interface.pdf

Transition-minimized differential signaling (TMDS) is astandard used for transmitting video data over the Digital

XAMPLES. . .

Application NotesIf you want to do a bit more reading about how our FPGAs lend themselves to a broad number of applications, we recommend these notes.

http://www.xilinx.com/support/documentation/application_ notes/xapp887_PRC_EPRC.pdf

http://www.xilinx.com/support/documentation/application_


http://www.xilinx.com/support/documentation/application_notes/xapp887_PRC_EPRC.pdf

http://www.xilinx.com/support/documentation/application_notes/xapp884_PRBS_GeneratorChecker.pdf

http://www.xilinx.com/support/documentation/application_notes/xapp1076_S6GTP_TripleRateSDI.pdf

http://www.xilinx.com/support/documentation/application_notes/xapp495_S6TMDS_Video_Interface.pdf


Visual Interface (DVI) and High-Definition MultimediaInterface (HDMI). Both interfaces are popular in a widerange of market segments including consumer electronics,audio/video broadcasting, industrial surveillance and med-ical imaging. They are commonly seen in flat-panel TVs, PCmonitors, Blu-ray disc players, video camcorders and med-ical displays. This application note by Bob Feng describes aset of reference designs able to transmit and receive DVIand HDMI data streams up to 1,080 Mbits/second using theSpartan-6 FPGA’s native TMDS I/O interface.

The DVI and HDMI protocols use TMDS at the physicallayer. The TMDS throughput is a function of the serial datarate of the video screen mode being transmitted. This inturn determines the FPGA speed grade that’s needed to sup-port this throughput. In recent generations of Spartandevices, Xilinx has offered embedded electrically compliantTMDS I/O, allowing implementation of DVI and HDMI inter-faces inside the FPGA.

This application note successfully demonstrates a high-definition TMDS video connectivity solution featuring theprocessing of multiple HD channels in parallel. The designdemonstrates the Spartan-6 FPGA in DVI or HDMI applica-tions (including the 480p, 1080i and 720p video formats).

XAPP886: INTERFACING QDR II SRAM DEVICES WITH VIRTEX-6 FPGAShttp://www.xilinx.com/support/documentation/application_

notes/xapp886_Interfacing_QDR_II_SRAM_Devices_with_

V6_FPGAs.pdf

With an increasing need for lower latency and higheroperating frequencies, memory interface IP is becomingmore complex, and designers need to tailor it based on anumber of factors such as latency, burst length, interfacewidth and operating frequency. The Xilinx MemoryInterface Generator (MIG) tool enables the creation of alarge variety of memory interfaces for devices such as theVirtex-6 FPGA. However, in the Virtex-6, QDR II SRAM isnot one of the options available by default. Instead, thefocus has been on the QDR II+ technology using four-word burst-access mode.

This application note by Olivier Despaux presents aVerilog reference design that has been simulated, synthe-sized and verified on hardware using Virtex-6 FPGAs andQDR II SRAM two-word burst devices. The referencedesign is based on a MIG 3.4 QDR II+ four-word burstdesign. Despaux designed the code base to be adaptableso that designers can pass many parameters from the top-level RTL module to the levels lower down in the hierar-chy. The control layers of the design operate mostly athalf the rate of the memory interface, and the ISERDESand OSERDES are used to interface to the external mem-ory components.

XAPP88: FAST CONFIGURATION OF PCI EXPRESSTECHNOLOGY THROUGH PARTIALRECONFIGURATIONhttp://www.xilinx.com/support/documentation/application_

notes/xapp883_Fast_Config_PCIe.pdf

The PCI Express® specification requires ports to be readyfor link training at a minimum of 100 milliseconds after thepower supply is stable. This becomes a difficult task due tothe ever-increasing configuration memory size of each newgeneration of FPGA, such as the Xilinx Virtex-6 family. Oneinnovative approach to addressing this challenge is leverag-ing the advances made in the area of FPGA partial reconfig-uration to split the overall configuration of a PCIe® specifi-cation-based system in a large FPGA into two sequentialsteps: initial PCIe system link configuration and subsequentuser application reconfiguration.

It is feasible to configure only the FPGA PCIe systemblock and associated logic during the first stage within the100-ms window before the fundamental reset is released.Using the Virtex-6 FPGA’s partial reconfiguration capability,the host can then reconfigure the FPGA to apply the userapplication via the now-active PCIe system link.

This methodology not only provides a solution for fasterPCIe system configuration, it enhances user applicationsecurity because only the host can access the bitstream, sothe bitstream can be better encrypted. This approach alsohelps lower the system cost by reducing external configura-tion component costs and board space.

This application note by Simon Tam and MartinKellermann describes the methodology for building a FastPCIe Configuration (FPC) module using this two-step con-figuration approach. An accompanying reference designwill help designers quick-launch a PlanAhead™ softwarepartial reconfiguration project. The reference design imple-ments an eight-lane PCIe technology first-generation linkand targets the Virtex-6 FPGA ML605 Evaluation Board.The reference design serves as a guide for designers todevelop their own solutions.

XAPP492: EXTENDING THE SPARTAN-6 FPGACONNECTIVITY TARGETED REFERENCE DESIGN(PCIE-DMA-DDR3-GBE) TO SUPPORT THE AURORA8B/10B SERIAL PROTOCOLhttp://www.xilinx.com/support/documentation/application_

notes/xapp492_S6_ConnectivityTRD_Aurora.pdf

Targeted reference designs provide Xilinx designers withturnkey platforms to create FPGA-based solutions in a widevariety of industries. This application note by VasuDevunuri and Sunita Jain extends the Xilinx Spartan-6FPGA PCIe-DMA-DDR3-GbE targeted reference design tosupport the Aurora 8B/10B serial link-layer protocol for




http://www.xilinx.com/support/documentation/application_notes/xapp886_Interfacing_QDR_II_SRAM_Devices_with_V6_FPGAs.pdf

http://www.xilinx.com/support/documentation/application_notes/xapp883_Fast_Config_PCIe.pdf

http://www.xilinx.com/support/documentation/application_notes/xapp492_S6_ConnectivityTRD_Aurora.pdf


high-speed data communications. Designers use Aurora8B/10B to move data across point-to-point serial links. Itprovides a transparent interface to the physical serial linksand supports both framing and streaming modes of opera-tion. This application note uses Aurora in framing mode.

The PCIe-DMA base platform moves data between sys-tem memory and the FPGA. Thus transferred, the data canbe consumed within the FPGA, forwarded to another FPGAor sent over the backplane using a serial connectivity proto-col such as Aurora 8B/10B. Similarly, the Aurora protocolcan bring data in to the FPGA from another FPGA or back-plane, and send it to system memory for further processingor analysis. This enhancement retains the Ethernet opera-tion as is and demonstrates PCIe-to-Ethernet and PCIe-to-Aurora bridging functionality.

The application note demonstrates the Spartan-6 FPGAConnectivity Targeted Reference Design in a platform solu-tion. The example design uses this existing infrastructure asa base upon which to create a newer design with limitedeffort. The network path functioning as the network inter-face card remains the same. The memory path is modified tosupport packet FIFO over the Spartan-6 FPGA memory con-troller and to support integrating the Aurora 8B/10BLogiCORE® IP, which operates through the packet FIFO.

XAPP899: INTERFACING VIRTEX-6 FPGAS WITH 3.3V I/O STANDARDShttp://www.xilinx.com/support/documentation/application_

notes/xapp899.pdf

All the devices in the Virtex-6 family are compatible withand support 3.3-volt I/O standards. In this application note,Austin Tavares describes methodologies for interfacingVirtex-6 devices to 3.3V systems. The document coversinput, output and bidirectional buses, as well as signalintegrity issues and design guidelines.

The Virtex-6 FPGA I/O is designed for both high perform-ance and flexibility. Each I/O is homogenous, meaning everyI/O has all features and all functions. This high-performanceI/O allows the broadest flexibility to address a wide range ofapplications. Designers seeking to interface Virtex-6 FPGA I/Oto 3.3V devices must take care to ensure device reliability andproper interface operation by following the appropriate designguidelines. They must meet the DC and AC input voltage spec-ification when designing interfaces with Virtex-6 devices.

XAPP951: CONFIGURING XILINX FPGAS WITH SPI SERIAL FLASHhttp://www.xilinx.com/support/documentation/application_

notes/xapp951.pdf

The addition of the Serial Peripheral Interface (SPI) innewer Xilinx FPGA families allows designers to use multi-

vendor small-footprint SPI PROM for configuration.Systems with an existing onboard SPI PROM can lever-age the single memory source for storing configurationdata in addition to the existing user data. In this applica-tion note, author Stephanie Tapp discusses the requiredconnections to configure the FPGA from an SPI serialflash device and shows the configuration flow for the SPImode, noting special precautions for configuring from anSPI serial flash. The note details the ISE® Design SuiteiMPACT direct programming solution for SPI-formattedPROM file creation and programming during prototypingfor select vendors.

SPI serial flash memories are popular because they can beeasily accessed postconfiguration, offering random-access,nonvolatile data storage to the FPGA. Systems with SPI seri-al flash memory already onboard can also benefit from hav-ing the option to configure the FPGA from the same memorydevice. The SPI protocol does have a few variations amongvendors. This application note discusses these variants.

XAPP881: VIRTEX-6 FPGA LVDS 4XASYNCHRONOUS OVERSAMPLING AT 1.25 GB/Shttp://www.xilinx.com/support/documentation/application_

notes/xapp881_V6_4X_Asynch_OverSampling.pdf

The Virtex-6 FPGA’s SelectIO™ technology can perform 4xasynchronous oversampling at 1.25 Gbits/second, accom-plished using the ISERDESE1 primitive through the mixed-mode clock manager (MMCM) dedicated performancepath. Located in the SelectIO logic block, the ISERDES1primitive contains four phases of dedicated flip-flops usedfor sampling.

The MMCM is an advanced phase-locked loop that hasthe capability to provide a phase-shifted clock on a low-jitter performance path. The output of the ISERDESE1 isthen sorted using a data recovery unit. The DRU used inthis application note by Catalin Baetoniu and BrandonDay is based on a voter system that compares the outputsand selects the best data sample. The method consists ofoversampling the data with a clock of similar frequency(±100 ppm), taking multiple samples of the data at differ-ent clock phases to get a sample of the data at the mostideal point.

XAPP493: IMPLEMENTING A DISPLAYPORT SOURCE POLICY MAKER USING A MICROBLAZEEMBEDDED PROCESSORhttp://www.xilinx.com/support/documentation/application_

notes/xapp493_DisplayPort_SPM.pdf

The purpose of this application note and accompanying ref-erence design is to help you implement the DisplayPortSource Policy Maker in a MicroBlaze™ processor. Although

XAMPLES. . .





http://www.xilinx.com/support/documentation/application_notes/xapp899.pdf

http://www.xilinx.com/support/documentation/application_notes/xapp951.pdf

http://www.xilinx.com/support/documentation/application_notes/xapp881_V6_4X_Asynch_OverSampling.pdf

http://www.xilinx.com/support/documentation/application_notes/xapp493_DisplayPort_SPM.pdf


authors Tom Strader and Matt Ouellette target their designspecifically for the Spartan-6 FPGA Consumer Video Kit,they have designed it to be architecture independent.

The DisplayPort Source Policy Maker communicateswith both the transmitter (source) and the receiver (sink) toperform several tasks, such as initialization of GTP trans-ceiver links, probing of registers and other features usefulfor bring-up and use of the core. This reference systemimplements the DisplayPort Source Policy Maker at thesource side. The mechanism for communication to the sinkside is over the auxiliary channel, as described in theDisplayPort standard specification (available from theVideo Electronics Standards Association website).

The reference design encompasses the DisplayPortSource Policy Maker, a DisplayPort source core generatedusing the Xilinx CORE Generator™ tool and a video patterngenerator to create video data for transmission over theDisplayPort link.

XAPP1071: CONNECTING VIRTEX-6 FPGAS TO ADCSWITH SERIAL LVDS INTERFACES AND DACS WITHPARALLEL LVDS INTERFACES http://www.xilinx.com/support/documentation/applica-

tion_notes/xapp1071_V6_ADC_DAC_LVDS.pdf

Virtex-6 FPGA interfaces use the dedicated deserializer(ISERDES) and serializer (OSERDES) functionalities to pro-vide a flexible and versatile platform for building high-speedlow-voltage differential signaling (LVDS) interfaces to all thelatest data converter devices on the market. This applicationnote by Marc Defossez describes how to utilize the ISERDESand OSERDES features in Virtex-6 FPGAs to interface withanalog-to-digital converters (ADCs) that have serial LVDS out-puts and with digital-to-analog converters (DACs) that haveparallel LVDS inputs. The associated reference design illus-trates a basic LVDS interface connecting a Virtex-6 FPGA toany ADCs or DACs with high-speed serial interfaces.

http://www.xilinx.com/support/documentation/applica-tion_notes/xapp1071_V6_ADC_DAC_LVDS.pdf



http://www.xilinx.com/support/documentation/application_notes/xapp1071_V6_ADC_DAC_LVDS.pdf


http://www.gaterocket.com


Xpress Yourself in Our Caption Contest

NO PURCHASE NECESSARY. You must be 18 or older and a resident of the fifty United States, the District of Columbia or Canada (excluding Quebec) to enter. Entries must be entirely original and must be received by 5:00 pm Pacific Time (PT) on April 1, 2011. Official rules available online at www.xilinx.com/xcellcontest. Sponsored by Xilinx, Inc. 2100 Logic Drive, San Jose, CA 95124.

I f there’s a spring in your walk—or in your office chair—here’s your oppor-tunity to use it to good effect. We invite readers to bounce up to a verbal

challenge and submit an engineering- or technology-related caption for thisDaniel Guidera cartoon depicting something amiss in the lab. The imagemight inspire a caption like “The booby-trapped desk chair was Helen’s wayof breaking the glass ceiling in R&D.”

Send your entries to [email protected]. Include your name, job title, compa-ny affiliation and location, and indicate that you have read the contest rulesat www.xilinx.com/xcellcontest. After due deliberation, we will print thesubmissions we like the best in the next issue of Xcell Journal and award thewinner the new Xilinx® SP601 Evaluation Kit, our entry-level developmentenvironment for evaluating the Spartan®-6 family of FPGAs (approximateretail value, $295; see http://www.xilinx.com/sp601). Runners-up will gainnotoriety, fame and a cool, Xilinx-branded gift from our SWAG closet.

The deadline for submitting entries is 5:00 pm Pacific Time (PT) on April 1st,2011. So, get writing in time for April Fool’s Day!

GLENN BABECKI, principal system engineer at

Comcast, won a Spartan-6 EvaluationKit for this caption for the holiday

turkey cartoon in the last issue of Xcell Journal.

Congratulations as well

to our two runners-up:

“The boss had a strange sense of humor.Last week he had commented that thenew prototype was hot enough to cook

a Thanksgiving dinner, but...”

– Robin Findley,

Findley Consulting, LLC

“The new wafer fab has been producingsome real turkeys lately.”

– Vadim Vaynerman,

Bottom Line Technologies Inc.

“When the boss said ‘watch theslices in the fabric,’ I didn’t think

he was talking about servingThanksgiving turkey in the

conference room!”

DANIEL GUIDERA

DA

NIE

LG

UID

ER

AXCLAMATIONS!


http://www.xilinx.com/xcellcontest

http://www.xilinx.com/sp601

http://www.xilinx.com/xcellcontest

The Best Prototyping System Just Got Better

� Highest Performance

� Integrated Software Flow

� Pre-tested DesignWare® IP

� Advanced Verification Functionality

� High-speed TDM

HAPS™ High-performance ASIC Prototyping Systems™

For more information visit www.synopsys.com/fpga-based-prototyping

http://www.synopsys.com/fpga-based-prototyping

HALF THEPOWER

TWICE THE PERFORMANCEA WHOLE NEW WAY OF THINKING.

Powerful, flexible, and built on the only unified architecture to span low-cost to ultra high-end FPGA families. Leveraging next-generation ISE Design Suite, development times speed up, while protecting your IP investment. Innovate without compromise.

LEARN MORE AT WWW.XILINX.COM / 7

Lowest power and cost

Best price and performance

Highest system performance and capacity

Introducing the 7 Series. Highest performance,

lowest power family of FPGAs.

© Copyright 2010. Xilinx, Inc. XILINX, the Xilinx logo, Artix, ISE, Kintex, Spartan, Virtex, and other designated brands included herein are trademarks of Xilinx in the United States and other countries. All other trademarks are the property of their respective owners.

PN 2480

http://WWW.XILINX.COM/7