of 68 /68
www.xilinx.com/xcell/ SOLUTIONS FOR A PROGRAMMABLE WORLD FPGAs Speed the Computation of Complex Credit Derivatives Xcell journal Xcell journal ISSUE 74, FIRST QUARTER 2011 Using MicroBlaze to Build Mil-grade Navigation Systems FPGA Houses Single-Board Computer with SATA Old School: Using Xilinx Tools in Command-Line Mode Stacked & Loaded Stacked & Loaded Xilinx SSI, 28-Gbps I/O Yield Amazing FPGAs page 18

Xcell Journal issue 74

Embed Size (px)

DESCRIPTION

The winter 2011 issue of Xcell Journal is a must read for those interested in FPGA technology and innovative electronic design

Text of Xcell Journal issue 74

  • www.xilinx.com/xcell/

    S O L U T I O N S F O R A P R O G R A M M A B L E W O R L D

    FPGAs Speed theComputation ofComplex CreditDerivatives

    Xcell journalXcell journalI S SUE 74 , F I RST QUAR TER 2011

    Using MicroBlaze to Build Mil-grade Navigation Systems

    FPGA Houses Single-BoardComputer with SATA

    Old School: Using Xilinx Tools in Command-Line Mode

    Stacked & Loaded Stacked & Loaded Xilinx SSI, 28-Gbps I/O Yield Amazing FPGAs

    page18

  • Avnet, Inc. 2011. All rights reserved. AVNET is a registered trademark of Avnet, Inc. All other brands are the property of their respective owners.

    Expanding Upon theIntel Atom ProcessorAvnet Electronics Marketing has combined an

    Emerson Network Power Nano-ITX motherboard

    that provides a complete Intel Atom processor

    E640-based embedded system with a Xilinx

    Spartan-6 FPGA PCI Express (PCIe) daughter card

    for easy peripheral expansion. A reference design,

    based on an included version of the Microsoft

    Windows Embedded Standard 7 operating system,

    shows developers how to easily add custom FPGA-

    based peripherals to user applications.

    Nano-ITX/Xilinx Spartan-6 FPGADevelopment Kit Features

    tEmerson Nano-ITX-315 motherboard tXilinx Spartan-6 FPGA PCIe boardtFPGA Mezzanine Card (FMC)

    connector for expansion modulestGPIO connector for custom peripheral

    or LCD panel interfacestTwo independent memory

    banks of DDR3 SDRAMtMaxim SHA-1 EEPROM for

    FPGA design securitytDownloadable documentation

    and reference designs

    To purchase this kit, visit www.em.avnet.com/spartan6atomor call 800.332.8638.

  • Toggle among banks of internal signals for incremental real-time internal measurements without:

    2oyy`lU]E 2]!lU`lU]E

  • L E T T E R F R O M T H E P U B L I S H E R

    Xilinx, Inc.2100 Logic DriveSan Jose, CA 95124-3400Phone: 408-559-7778FAX: 408-879-4780www.xilinx.com/xcell/

    2011 Xilinx, Inc. All rights reserved. XILINX, the Xilinx Logo, and other designated brands includedherein are trademarks of Xilinx, Inc. All other trade-marks are the property of their respective owners.

    The articles, information, and other materials includedin this issue are provided solely for the convenience ofour readers. Xilinx makes no warranties, express,implied, statutory, or otherwise, and accepts no liabilitywith respect to any such articles, information, or othermaterials or their use, and any use thereof is solely atthe risk of the user. Any person or entity using suchinformation in any way releases and waives any claim itmight have against Xilinx for any loss, damage, orexpense caused thereby.

    PUBLISHER Mike [email protected]

    EDITOR Jacqueline Damian

    ART DIRECTOR Scott Blair

    DESIGN/PRODUCTION Teie, Gelwicks & Associates1-800-493-5551

    ADVERTISING SALES Dan [email protected]

    INTERNATIONAL Melissa Zhang, Asia [email protected]

    Christelle Moraga, Europe/Middle East/[email protected]

    Miyuki Takegoshi, [email protected]

    REPRINT ORDERS 1-800-493-5551

    Xcell journal

    www.xilinx.com/xcell/

    Xilinx Alliance Program Streamlines IP-Based FPGA Design

    In the last quarter, Xilinx has made several major announcements, including our 28-nanometer 7 series FPGAs with their new More than Moore stacked-die architec-ture, and even a Xilinx Virtex-7 FPGA family member that enables 400-Gbpsbandwidth, as described in this issues cover story. One announcement that you may haveoverlooked in this flurry of news came from the Xilinx Alliance Program group. Under theleadership of Dave Tokic, the team has completely revamped Xilinxs activities to strengthenthe ecosystem of design and integration services, embedded software, design tools, devel-opment and evaluation boards, complementary silicon device technology and, especially,IP providers. This massive effort marks quite an accomplishmentone that should yieldgreat benefits to Xilinx users as well as partners for years to come.

    The two biggest issues that have plagued the IP industry since its emergence in the 1990sare how to know that the piece of IP you are using is of high quality, and which bus to useto connect all these different pieces of IP. Tokics group at Xilinx has made huge inroads inaddressing these problems. With the Extensible Processing Platform (EPP) architecture,Xilinx is currently creating a device that will marry an ARM MPU core with FPGA fabricon the same device (see cover story, Xcell Journal, issue 71). We also worked with ARM tocreate the next version of the industry-standard AMBA interconnect protocol, calledAXI4, that is optimized for use in FPGAs. All of the EPP products as well as the 7 seriesdevices will use the AXI4 interconnect. This companywide standardization on AXI4 allowscustomers and ecosystem providers that have already developed IP for AMBA to moreeasily migrate their IP into Xilinx FPGAs. It also makes it easier for Xilinx to build our ownIP and better maintain our existing IP libraries moving forward.

    Xilinx has also adopted IP industry standards such as IP-XACT, or IEEE 1685 (formerlyknown as SPIRIT), which establishes a standard structure for packaging, integrating andreusing IP within tool flows. We have settled on a common set of quality and implementa-tion metrics (leveraging the efforts first undertaken by the VSIA) to create transparency forcustomers into IP quality and implementation details. Tokics group also helped establishthe industrys SignOnce program to facilitate customer licensing. Many IP vendors with-in the Alliance Program offer licensing via these simplified contracts. And with the releaseof Virtex-6 devices over a year ago, Xilinx standardized on the VITA FMC PCB standard,making it easier for IP vendors to offer development boards and daughtercards that easilyplug into Targeted Design Platforms from Xilinx, Avnet and Tokyo Electron Devices.

    Whats more, Tokics group took the guesswork out of choosing the right provider byestablishing a deeper qualification process and tiering these suppliers based on criteriaincluding business practices, technical and quality processes and track record. The pro-gram has three tiers: Alliance Program Members maintain credible customer satisfactionwith their products and services and a track record of success. Certified Members haveadditionally met specific technical qualification on FPGA implementation expertise. At thetop level, Premier Partners are audited to ensure they adhere to the highest standards.

    For a reporter who has spent many years covering the wars and dramas involved inestablishing standards, its awesome to see them finalized and being put to fabulous use.For more information, visit http://www.xilinx.com/alliance/index.htm.

    Mike SantariniPublisher

  • C O N T E N T S

    VIEWPOINTS XCELLENCE BY DESIGN APPLICATION FEATURES 1818

    3838

    Letter From the PublisherXilinx Alliance Program

    Streamlines IP-Based Design4

    Xpert OpinionBreaking Barriers with

    FPGA-Based Design-for-Prototyping58

    Xcellence in A&DMicroBlaze Hosts

    Mobile Multisensor

    Navigation System14

    Xcellence in High-Performance ComputingFPGAs Speed the Computation of

    Complex Credit Derivatives18

    Xcellence in IndustrialFPGA Houses Single-Board

    Computer with SATA26

    Xcellence in University ResearchGeneral-Purpose SoC Platform

    Handles Hardware/Software

    Co-Design32

    Xcellence in WirelessMIMO Sphere Detector

    Implemented in FPGA38

    Cover StoryStacked & Loaded: Xilinx SSI & 28-Gbps I/O Yield Novel FPGAs

    88

    1818

  • F I R S T Q U A R T E R 2 0 1 1 , I S S U E 7 4

    Xperts CornerUsing Xilinx Tools in

    Command-Line Mode46

    FPGA101When It Comes to Runtime Chatter,

    Less Is Best52

    XTRA READING

    THE XILINX XPERIENCE FEATURES

    5252

    Xamples A mix of new and popular application notes62

    Xclamations! Share your wit and wisdom by supplying a caption for our

    techy cartoon, and you may win

    a Spartan-6 development kit66

    5858

    4646

    Xcell Journal recently received 2010 APEX Awards of Excellence in the

    categories Magazine & Journal Writing and Magazine & Journal Design and Layout.

  • 8 Xcell Journal First Quarter 2011

    COVER STORY

    Stacked & Loaded: Xilinx SSI, 28-Gbps I/OYield Amazing FPGAs

    by Mike SantariniPublisher, Xcell JournalXilinx, [email protected]

  • First Quarter 2011 Xcell Journal 9

    X ilinx recently added to its line-up two innovations that willfurther expand the applicationpossibilities and market reach ofFPGAs. In late October, Xilinx

    announced it is adding stacked siliconinterconnect (SSI) FPGAs to the highend of its forthcoming 28-nanometerVirtex-7 series (see Xcell Journal,Issue 72). The new, innovative archi-tecture connects several dice on a sin-gle silicon interposer, allowing Xilinxto field Virtex-7 FPGAs that pack asmany as 2 million logic cellstwicethe logic capacity of any otherannounced 28-nm FPGAenablingnext-generation capabilities in the cur-rent generation of process technology.

    Then, in late November, Xilinxtipped the Virtex-7 HT line of devices.Leveraging this SSI technology to com-bine FPGA and high-speed transceiverdice on a single IC, the Virtex-7 HTdevices are a giant technological leapforward for customers in the commu-nications sector and for the growingnumber of applications requiring high-speed I/O. These new FPGAs carrymany 28-Gbit/second transceiversalong with dozens of 13.1-Gbps trans-ceivers in the same device, facilitatingthe development of 100-Gbps commu-nications equipment today and 400-Gbps communications line cards wellin advance of established standardsfor equipment running at that speed.

    MORE THAN MOOREEver since Intel co-founder GordonMoore published his seminal articleCramming More Components ontoIntegrated Circuits in the April 19,1965 issue of Electronics magazine, thesemiconductor industry has doubled

    the transistor counts of new devicesevery 22 months, in lockstep with theintroduction of every new siliconprocess. Like other companies in thesemiconductor business, Xilinx haslearned over the years that to lead themarket, it must keep pace with MooresLaw and create silicon on each new gen-eration of process technologyor bet-ter yet, be the first company to do so.

    Now, at a time when the complexity,cost and thus risk of designing on thelatest process geometries are becomingprohibitive for a greater number ofcompanies, Xilinx has devised a uniqueway to more than double the capacityof its next-generation devices, theVirtex-7 FPGAs. By introducing one ofthe semiconductor industrys firststacked-die architectures, Xilinx willfield a line of the worlds largestFPGAs. The biggest of these, the 28-nmVirtex-7 XC7V2000T, offers 2 millionlogic cells along with 46,512 kbits ofblock RAM, 2,160 DSP slices and 36GTX 10.3125-Gbps transceivers. TheVirtex-7 family includes multiple SSIFPGAs as well as monolithic FPGAconfigurations. Virtex-7 is the high endof the 7 series, which also includes thenew low-cost, low-power ArtixFPGAs and the midrange KintexFPGAsall implemented on a unifiedApplication Specific Modular BlockArchitecture (ASMBL) architecture.

    The new SSI technology is more thanjust a windfall for customers itching touse the biggest FPGAs the industry canmuster. The successful deployment ofstacked dice in a mainstream logic chipmarks a huge semiconductor engineer-ing accomplishment. Xilinx is deliver-ing a stacked silicon chip at a timewhen most companies are just evaluat-

    More than Moore stacked silicon interconnect technology and 28-Gbps transceivers lead new era of FPGA-driven innovations.

  • ing stacked-die architectures in hopesof reaping capacity, integration, PCBreal-estate and even yield benefits.Most of these companies are lookingto stacked-die technology to simplykeep up with Moores LawXilinx isleveraging it today as a way to exceedit and as a way to mix and match com-plementary types of dice on a single ICfootprint to offer vast leaps forward insystem performance, bill-of-materials(BOM) savings and power efficiency.

    THE STACKED SILICON ARCHITECTUREThis new stacked silicon intercon-nect technology allows Xilinx to offernext-generation density in the currentgeneration of process technology,said Liam Madden, corporate vicepresident of FPGA development andsilicon technology at Xilinx. As diesize gets larger, the yield goes downexponentially, so building large diceis quite difficult and very costly. Thenew architecture allows us to build a

    number of smaller dice and then usea silicon interposer to connect thosesmaller dice lying side-by-side on topof the interposer so they appear tobe, and function as, one integrateddie (Figure 1).

    Each of the dice is interconnectedvia layers in the silicon interposer inmuch in the same way that discretecomponents are interconnected on themany layers of a printed-circuit board(Figure 2). The die and silicon inter-poser layer connect by means of multi-ple microbumps. The architecture alsouses through-silicon vias (TSVs) thatrun through the passive silicon inter-poser to facilitate direct communica-tion between regions of each die onthe device and resources off-chip(Figure 3). Data flows between theadjacent FPGA die across more than10,000 routing connections.

    Madden said using a passive sili-con interposer rather than going witha system-in-package or multichip-module configuration has huge

    advantages. We use regular siliconinterconnect or metallization to con-nect up the dice on the device, saidMadden. We can get many moreconnections within the silicon thanyou can with a system-in-package.But the biggest advantage of thisapproach is power savings. Becausewe are using chip interconnect toconnect the dice, it is much moreeconomical in power than connect-ing dice through big traces, throughpackages or through circuit boards.

    In fact, the SSI technology providesmore than 100 times the die-to-dieconnectivity bandwidth per watt, atone-fifth the latency, without consum-ing any high-speed serial or parallelI/O resources.

    Madden also notes that themicrobumps are not directly con-nected to the package. Rather, theyare interconnected to the passiveinterposer, which in turn is linked tothe adjacent die. This setup offersgreat advantages by shielding the

    10 Xcell Journal First Quarter 2011

    C O V E R S T O R Y

    Silicon Interposer

    >10K routing connectionsbetween slices

    ~1ns latency

    ASMBLOptimizedFPGASlice

    FPGA SlicesSide-by-Side

    SiliconInterposer

    Figure 1 The stacked silicon architecture places several dice (aka slices) side-by-sideon a silicon interposer.

  • microbumps from electrostatic dis-charge. By positioning dice next toeach other and interfaced to the ball-grid array, the device avoids the ther-mal flux, signal integrity and designtool flow issues that would haveaccompanied a purely vertical die-stacking approach.

    As with the monolithic 7 seriesdevices, Xilinx implemented the SSImembers of the Virtex-7 family inTSMCs 28-nm HPL (high-perform-ance, low-power) process technolo-gy, which Xilinx and TSMC devel-oped to create FPGAs with the rightmix of power efficiency and perform-ance (see cover story sidebar, XcellJournal, Issue 72).

    NO NEW TOOLS REQUIREDWhile the SSI technology offers someradical leaps forward in terms ofcapacity, Madden said it will notforce a radical change in customerdesign methodologies. One of thebeautiful aspects of this architectureis that we were able to establish theedges of each slice [individual die inthe device] along natural partitionswhere we would have traditionallyrun long wires had these structuresbeen in our monolithic FPGA archi-tecture, said Madden. This meantthat we didnt have to do anythingradical in the tools to support thedevices. As a result, customersdont have to make any major adjust-

    ments to their design methods orflows, he said.

    At the same time, Madden said thatcustomers will benefit from addingfloor-planning tools to their flowsbecause they now have so many logiccells to use.

    A SUPPLY CHAIN FIRSTWhile the design is in and of itselfquite innovative, one of the biggestchallenges of fielding such a devicewas in putting together the supplychain to manufacture, assemble, testand distribute it. To create the endproduct, each of the individual dicemust first be tested extensively at thewafer level, binned and sorted, and

    First Quarter 2011 Xcell Journal 11

    C O V E R S T O R Y

    28nm FPGA Slice 28nm FPGA Slice

    Package Substrate

    28nm FPGA Slice 28nm FPGA Slice

    Microbumps

    Access to power / ground / IOs

    Access to logic regions

    Leverages ubiquitous image sensor microbump technology

    Through-Silicon Vias (TSVs)

    Only bridge power / ground / IOs to C4 bumps

    Coarse pitch, low density aid manufacturability

    Etch process (not laser drilled)

    Passive Silicon Interposer

    4 conventional metal layers connect microbumps & TSVs

    No transistors means low risk and no TSV-induced performance degradation

    Etch process (not laser drilled)

    Side-by-Side Die Layout

    Minimal heat flux issues

    Minimal design tool flow impact

    Microbumps

    Silicon Interposer

    Through-Silicon Vias

    C4 Bumps

    BGA Balls

    Figure 2 Xilinxs stacked silicon technology uses passive silicon-basedinterposers, microbumps and TSVs.

  • then attached to the interposer. Thecombined structure then needs to bepackaged and given a final test toensure connectivity before the endproduct ships to customers.

    Maddens group worked with TSMCand other partners to build this supplychain. This is another first in theindustry, as no other company has putin place a supply chain like this acrossa foundry and OSAT [outsourcedsemiconductor assembly and test],said Madden.

    Another beautiful aspect of thisapproach is that we can use essential-ly the same test approach that we usein our current devices, he wenton.Our current test technology allowsus to produce known-good dice, andthat is a big advantage for us becausein general, one of the biggest barriersof doing stacked-die technology is howdo you test at the wafer level.

    Because the stacked silicon tech-nology integrates multiple XilinxFPGA dice on a single IC, it logicallyfollows that the architecture wouldalso lend itself to mixing and matchingFPGA and other dice to create entirelynew devices. And thats exactly what

    Xilinx did with its ultrafast Virtex-7 HTline, announced just weeks after theSSI technology rollout.

    DRIVING COMMUNICATIONS TO 400 GBPSThe new Virtex-7 HT line of devices istargeted squarely at communicationscompanies that are developing 100- to400-Gbps equipment. The Virtex-7 HTcombines on a single IC multiple 28-nm FPGA dice, bearing dozens of13.1-Gbps transceivers, with 28-Gbpstransceiver dice. The result is toendow the final device with a formi-dable mix of logic cells as well as cut-ting-edge transceiver performanceand reliability.

    The largest of the Virex-7 HT lineincludes sixteen GTZ 28-Gbps trans-ceivers, seventy-two 13.1-Gbps trans-ceivers plus logic and memory, offeringtransceiver performance and capacityfar greater than competing devices(see Video 1, http://www.youtube.com/user/XilinxInc#p/c/71A9E924ED61B8F9/1/eTHjt67ViK0).

    C O V E R S T O R Y

    12 Xcell Journal First Quarter 2011

    Figure 3 Actual cross-section of the 28-nm Virtex-7 device. TSVs can be seenconnecting the microbumps (dotted line, top) through the silicon interposer.

    Video 1 Dr. Howard Johnsonintroduces the 28-Gbps transceiver-laden Virtex-7 HT.http://www.youtube.com/user/XilinxInc#p/c/71A9E924ED61B8F9/1/eTHjt67ViK0.

  • We leveraged stacked interconnecttechnology to offer Virtex-7 deviceswith a 28G capability, said Madden. Byoffering the transceivers on a separatedie, we can optimize our 28-Gbps trans-ceiver performance and electrically iso-late functions to offer an even higherdegree of reliability for applicationsrequiring cutting-edge transceiver per-formance and reliability.

    With the need for bandwidthexploding, the communications sectoris franticly racing to establish new net-works. The wireless industry is scram-bling to produce equipment support-ing 40-Gbps data transfer today, whilewired networking is approaching 100Gbps. FPGAs have played a key role injust about every generation of net-working equipment since their incep-tion (see cover stories in XcellJournal, Issues 65 and 67).

    Communications equipment designteams traditionally have used FPGAsto receive signals sent to equipmentin multiple protocols, translate thosesignals to common protocols that theequipment and network use, and thenforward the data to the next destina-tion. Traditionally companies haveplaced a processor in between FPGAsmonitoring and translating incomingsignals and those FPGAs forwardingsignals to their destination. But asFPGAs advance and grow in capacityand functionality, a single FPGA canboth send and receive, while also per-forming processing, to add greaterintelligence and monitoring to thesystem. This lowers the BOM and,more important, reduces the powerand cooling costs of networkingequipment, which must run reliably24 hours a day, seven days a week.

    In a white paper titled IndustrysHighest-Bandwidth FPGA EnablesWorlds First Single-FPGA Solution for400G Communications Line Cards,Xilinxs Greg Lara outlines several com-munications equipment applicationsthat can benefit from the Virtex-7 HTdevices (see http://www.xilinx.com/support/documentation/white_papers/wp385_V7_28G_for_400G_Comm_Line_Cards.pdf).

    To name a few, Virtex-7 HT FPGAscan find a home in 100-Gbps linecards supporting OTU-4 (OpticalTransfer Unit) transponders. Theycan be used as well in muxponders orservice aggregation routers, in lower-cost 120-Gbps packet-processing linecards for highly demanding data pro-cessing, in multiple 100G Ethernetports and bridges, and in 400-GbpsEthernet line cards. Other potentialapplications include base stationsand remote radio heads with 19.6-Gbps Common Public Radio Interfacerequirements, and 100-Gbps and 400-Gbps test equipment.

    JITTER AND EYE DIAGRAMA key to playing in these markets isensuring the FPGA transceiver sig-nals are robust, reliable and resistantto jitter or interference and to fluctu-ations caused by power system noise.For example, the CEI-28G specifica-

    tion calls for 28-Gbps networkingequipment to have extremely tightjitter budgets.

    Signal integrity is an extremelycrucial factor for 28-Gbps operation,said Panch Chandrasekaran, seniormarketing manager of FPGA compo-nents at Xilinx. To meet the stringentCEI-28G jitter budgets, the trans-ceivers in the new Xilinx FPGAsemploy phase-locked loops (PLLs)based on an LC tank design andadvanced equalization circuits to off-set deterministic jitter.

    Noise isolation becomes a veryimportant parameter at 28-Gbps sig-naling speeds, said Chandrasekaran.Because the FPGA fabric and trans-ceivers are on separate dice, the sen-sitive 28-Gbps analog circuitry is iso-lated from the digital FPGA circuits,providing superior isolation com-pared to monolithic implementa-tions (Figures 4a and 4b).

    The FPGA design also includesfeatures that minimize lane-to-laneskew, allowing the devices to sup-port stringent optical standards suchas the Scalable Serdes FramerInterface standard (SFI-S).

    Further, the GTZ transceiverdesign eliminates the need fordesigners to employ external refer-ence resistors, lowering the BOMcosts and simplifing the boarddesign. A built-in eye scan func-tion automatically measures theheight and width of the post-equal-ization data eye. Engineers can usethis diagnostic tool to perform jitterbudget analysis on an active chan-nel and optimize transceiver param-eters to get optimal signal integrity,all without the expense of special-ized equipment.

    ISE Design Suite software toolsupport for 7 series FPGAs is avail-able today. Virtex-7 FPGAs with mas-sive logic capacity, thanks to the SSItechnology, will be available thisyear. Samples of the first Virtex-7 HTdevices are scheduled to be availablein the first half of 2012. For moreinformation on Virtex-7 FPGAs andSSI technology, visit http://www.xilinx.com/technology/roadmap/7-series-fpgas.htm.

    First Quarter 2011 Xcell Journal 13

    C O V E R S T O R Y

    Figure 4a Xilinx 28-Gbps transceiver displays an excellent eye opening

    and jitter performance (using PRBS31 data pattern).

    Figure 4b This is a competing devices 28-Gbps signal using a much simpler

    PRBS7 pattern. The signal is extremelynoisy with a significantly smaller

    eye opening. Eye size is shown close to relative scale.

  • 14 Xcell Journal First Quarter 2011

    MicroBlaze Hosts Mobile Multisensor Navigation System

    Researchers used Xilinxs soft-core processor to develop an integrated navigation solution that works in places where GPS doesnt.

    XCELLENCE IN AEROSPACE & DEFENSE

    by Walid Farid AbdelfatahResearch AssistantNavigation and Instrumentation Research Group Queen's University [email protected]

    Jacques GeorgyAlgorithm Design EngineerTrusted Positioning Inc.

    Aboelmagd NoureldinCross-Appointment Associate Professor in ECE Dept. Queen's University / Royal Military College of Canada

  • First Quarter 2011 Xcell Journal 15

    T he global positioning system(GPS) is a satellite-based navi-gation system that is widelyused for different navigation applica-tions. In an open sky, GPS can providean accurate navigation solution.However, in urban canyons, tunnelsand indoor environments that blockthe satellite signals, GPS fails to pro-vide continuous and reliable coverage.

    In the search for a more accurate andlow-cost positioning solution in GPS-denied areas, researchers are develop-ing integrated navigation algorithmsthat utilize measurements from low-cost sensors such as accelerometers,gyroscopes, speedometers, barometersand others, and fuses them with themeasurements from the GPS receiver.The fusion is accomplished using eitherKalman filter (KF), particle filter or arti-ficial-intelligence techniques.

    Once a navigation algorithm isdeveloped, verified and proven to beworthy, the ultimate goal is to put iton a low-cost, real-time embeddedsystem. Such a system must acquireand synchronize the measurements

    from the different sensors, then applythe navigation algorithm, which willintegrate these aligned measure-ments, yielding a real-time solution ata defined rate.

    The transition from algorithmresearch to realization is a crucial stepin assessing the practicality and effec-tiveness of a navigation algorithm and,consequently, the developed embeddedsystem, either for a proof of concept orto be accepted as a consumer product.In the transition process, there is nounique methodology that designers canfollow to create an embedded system.Depending on the platform of choicesuch as microcontrollers, digital signalprocessors and field-programmablegate arrayssystem designers use dif-ferent methodologies to develop thefinal product.

    MICROBLAZE FRONT AND CENTERIn research at Queens University inKingston, Ontario, we utilized a Xilinx

    MicroBlaze soft-core processor builton the ML402 evaluation kit, whichfeatures a Virtex-4 SX35 FPGA. The

    applied navigation algorithm is a 2DGPS/reduced inertial sensor system(RISS) integration algorithm thatincorporates measurements from agyroscope and a vehicles odometeror a robots wheel encoders, alongwith those from a GPS receiver. Inthis way, the algorithm computes anintegrated navigation solution that ismore accurate than standalone GPS.For a vehicle moving in a 2D plane, itcomputes five navigation states con-sisting of two position parameters(latitude and longitude), two velocityparameters (East and North) and oneorientation parameter (azimuth).

    The 2D RISS/GPS integration algo-rithm involves two steps: RISS mecha-nization and RISS/GPS data fusionusing Kalman filtering. RISS mecha-nization is the process of transform-ing the measurements of an RISS sys-tem acquired in the vehicle frame intoposition, velocity and heading in ageneric frame. It is a recursiveprocess based on the initial condi-tions, or the previous output and thenew measurements. We fuse the posi-

    Timer Local Memory Debugger

    UART(1)

    UART(2)

    UART(3)

    MicroBlazeCore

    UART(4)

    LEDs GPIO

    PPS GPIOInterrupt

    Controller

    PushbuttonsGPIO

    FPU

    GPSMeasurements

    PPS Signal

    IMUMeasurements

    EncoderMeasurements

    Host Machine

    Simple UserInterfaceCrossbow

    IMU

    Microcontrollersending the

    robots encodersdata

    Nov/Atel SPAN

    Virtex-4

    X C E L L E N C E I N A E R O S PA C E & D E F E N S E

    Figure 1 MicroBlaze connections with the mobile robot sensors and the host machine

  • tion and velocity measurements fromthe GPS with the RISS-computedposition and velocity using the con-ventional KF method in a closed-loop, loosely coupled fashion. [1, 2]

    A soft-core processor such asMicroBlaze provides many advan-tages over off-the-shelf processors indeveloping a mobile multisensornavigation system. [3] They includecustomization, the ability to design amultiprocessor system and hard-ware acceleration.

    With a soft-core device, designersof the FPGA-based embedded proces-sor system have the total flexibility toadd any custom combination ofperipherals and controllers. You canalso design a unique set of peripheralsfor specific applications, adding asmany peripherals as needed to meetthe system requirements.

    Moreover, more-complex embed-ded systems can benefit from the exis-tence of multiple processors that canexecute the tasks in parallel. By usinga soft-core processor and accompany-ing tools, its an easy matter to createa multiprocessor-based system-on-chip. The only restriction to the num-

    ber of processors that you can add isthe availability of FPGA resources.

    Finally, one of the most compellingreasons for choosing a soft-coreprocessor is the ability to concurrentlydevelop hardware and software, andhave them coexist on a single chip.System designers need not worry if asegment of the algorithm is found to bea bottleneck; all you have to do to elim-inate such a problem is design a cus-tom coprocessor or a hardware circuitthat utilizes the FPGAs parallelism.

    INTERFACES AND ARITHMETICFor the developed navigation systemat Queens University, we mainly need-ed a system capable of interfacingwith three sensors and a personalcomputer via a total of four serialports, each of which requires a univer-sal asynchronous receiver transmitter(UART) component. We use the serialchannel between the embedded sys-tem and a PC to upload the sensorsmeasurements and the navigationsolution for analysis and debugging. Inaddition, the design also requires gen-eral-purpose input/output (GPIO)channels for user interfacing, a timer

    for code profiling and an interruptcontroller for the event-based system.

    We also needed a floating-pointunit (FPU), which can execute thenavigation algorithm once the meas-urements are available in the leasttime possible. Although the FPU usedin the MicroBlaze core is single-preci-sion, we developed our algorithmmaking use of double-precision arith-metic. Consequently, we used soft-ware libraries to emulate our double-precision operations on the single-precision FPU, resulting in a slightlydenser code but a far better naviga-tion solution.

    We first tested the developed navi-gation system on a mobile robot toreveal system bugs and integrationproblems, and then on a standardautomobile for further testing. Wechose to start the process with itera-tive testing on the mobile robot due tothe flexibility and ease of systemdebugging compared with a vehicle.We compared the solution with an off-line navigation solution computed on aPC and with the results from a high-end, high-cost, real-time NovAtelSPAN system. The SPAN system com-bines a GPS receiver and high-end tac-tical-grade inertial measurement unit(IMU), which acts as a reference.Figure 1 shows the MicroBlaze proces-sor connections with the RISS/GPSsystem sensors on a mobile robot.

    Figure 2 presents an open-sky tra-jectory which we acquired at theRoyal Military College of Canada for9.6 minutes; the solution of the devel-oped real-time system is compared tothe GPS navigation solution and theSPAN reference system. The differ-ence between the NovAtel SPAN ref-erence solution and our solution isshown in Figure 3. Both solutions arereal-time, and the difference in per-formance is due to our use of thelow-cost MEMS-based CrossbowIMU300CC-100 IMU in the developedsystem. The SPAN unit, by contrast,uses the high-cost, tactical-gradeHG1700 IMU.

    16 Xcell Journal First Quarter 2011

    X C E L L E N C E I N A E R O S PA C E & D E F E N S E

    Reference (SPAN)

    GPS

    MicroBlaze (real-time)

    Figure 2 Mobile robot trajectory, open-sky area

  • Figure 4 presents the differencebetween the positioning solutionsfrom the C double-precision off-linealgorithm and the C single-precisionoff-line algorithm computations,where the difference is in the range ofhalf a meter. The results show that thechoice of emulating the double-preci-sion operations on the MicroBlazecore is far more accurate than usingsingle-precision arithmetic, whichwould have consumed less memoryand less processing time.

    MULTIPROCESSING ANDHARDWARE ACCELERATIONUtilizing a soft-core processor such asthe MicroBlaze in a mobile navigationsystem composed of various sensorsprovides the flexibility to communi-cate with any number of sensors thatuse various interfaces, without anyconcern with the availability of theperipherals beforehand, as would betrue if we had used off-the-shelfprocessors. In terms of computation,designers can build the navigation

    algorithm using embedded softwarethat utilizes the single-precision FPUavailable within the MicroBlaze, as isthe case in this research. You can alsopartition a design between softwareand hardware by implementing thesegments of the algorithm that arefound to be a bottleneck as a customcoprocessor or a hardware circuit thatspeeds up the algorithm.

    The future of this research canprogress in two directions. The first isto use a double-precision floating-point unit directly with the MicroBlazeprocessor instead of double-precisionemulation. The second is related toimplementing more powerful and chal-lenging navigation algorithms that uti-lize more sensors to be integrated viaparticle filtering or artificial-intelli-gence techniques, which can make useof the power of multiprocessing andhardware acceleration.

    More information about the embed-ded navigation system design processand more results can be found in themasters thesis from which this workwas derived. [4]

    REFERENCES

    [1] U. Iqbal, A. F. Okou and A.

    Noureldin, An integrated reduced iner-

    tial sensor system RISS / GPS for land

    vehicle, Position, Location and

    Navigation Symposium proceedings,

    2008 IEEE/ION.

    [2] U. Iqbal and A. Noureldin, Integrated

    Reduced Inertial Sensor System/GPS for

    Vehicle Navigation: Multisensor position-

    ing system for land applications involving

    single-axis gyroscope augmented with a

    vehicle odometer and integrated with

    GPS, VDM Verlag Dr. Mller, 2010.

    [3] B. H. Fletcher, FPGA Embedded

    ProcessorsRevealing True System

    Performance, Embedded Systems

    Conference, 2005.

    [4] Walid Farid Abdelfatah, Real-Time

    Embedded System Design and Realization

    for Integrated Navigation Systems 2010.

    http://hdl.handle.net/1974/6128

    First Quarter 2011 Xcell Journal 17

    X C E L L E N C E I N A E R O S PA C E & D E F E N S E

    Figure 3 Mobile robot trajectory: Difference between NovAtel SPAN reference and MicroBlaze real-time solutions

    Figure 4 Mobile robot trajectory: Difference between C double-precision and single-precision off-line solutions

  • 18 Xcell Journal First Quarter 2011

    Innovation in the global creditderivatives markets rests on thedevelopment and intensive use ofcomplex mathematical models. Asportfolios of complicated credit instru-ments have expanded, the process ofvaluing them and managing the riskhas grown to a point where financialorganizations use thousands of CPUcores to calculate value and risk daily.This task in turn requires vast amounts

    of electricity for bothpower and cooling.

    In 2005, the worlds estimat-ed 27 million servers consumed

    around 0.5 percent of all electricityproduced on the planet, and the figureedges closer to 1 percent when the ener-gy for associated cooling and auxiliaryequipment (for example, backup power,power conditioning, power distribution,air handling, lighting and chillers) isincluded. Any savings from falling hard-ware costs are increasingly offset byrapidly rising power-related indirectcosts. No wonder, then, that large finan-cial institutions are searching for a wayto add ever-greater computationalpower at a much lower operating cost.

    FPGAs Speed theComputation of ComplexCredit Derivatives

    XCELLENCE IN HIGH-PERFORMANCE COMPUTING

    by Stephen WestonManaging DirectorJ.P. [email protected]

    James SpoonerHead of Acceleration (Finance)Maxeler [email protected]

    Jean-Tristan MarinVice PresidentJ.P. [email protected]

    Oliver PellVice President of EngineeringMaxeler [email protected]

    Oskar MencerCEOMaxeler [email protected]

    Financial organizations can assess value and risk

    30x faster on Xilinx-accelerated systems than those

    using standard multicore processors.

    2010 IEEE. Reprinted, with permission, from Stephen Weston, Jean-Tristan Marin, James Spooner, Oliver Pell, Oskar Mencer,Accelerating thecomputation of portfolios of tranched credit derivatives, IEEE Workshop on High Performance Computational Finance, November 2010.

  • First Quarter 2011 Xcell Journal 19

    Toward this end, late in 2008 theApplied Analytics group at J.P. Morganin London launched a collaborativeproject with the acceleration-solutionsprovider Maxeler Technologies thatremains ongoing. The core of the proj-ect was the design and delivery of aMaxRack hybrid (Xilinx FPGA andIntel CPU) cluster solution byMaxeler. The system has demonstrat-ed more than 31x acceleration of thevaluation for a substantial portfolio ofcomplex credit derivatives comparedwith an identically sized system that

    uses only eight-core Intel CPUs.The project reducesoperational expendi-

    tures more thanthirtyfold by build-

    ing a customizedhigh-efficiency high-performance com-

    puting (HPC) system.At the same time, it deliv-

    ers a disk-to-disk speedup ofmore than an order of magnitude

    over comparable systems, enablingthe computation of an order-of-mag-nitude more scenarios, with directimpact on the credit derivatives busi-ness at J.P. Morgan. The Maxeler sys-tem clearly demonstrates the feasibil-ity of using FPGA technology to sig-nificantly accelerate computationsfor the finance industry and, in par-ticular, complex credit derivatives.

    RELATED WORKThere is a considerable body of pub-lished work in the area of accelerationfor applications in finance. A sharedtheme is adapting technology to accel-erate the performance of computation-ally demanding valuation models. Ourwork shares this thrust and is distin-guished in two ways.

    The first feature is the computa-tional method we used to calculatefair value. A common method is asingle-factor Gaussian copula, amathematical model used in finance,to model the underlying credits,along with Monte Carlo simulation

    to evaluate the expected value of theportfolio. Our model employs stan-dard base correlation methodology,with a Gaussian copula for defaultcorrelation and a stochastic recoveryprocess. We compute the expected(fair) value of a portfolio using a con-volution approach. The combinationof these methods presents a distinctcomputational, algorithmic and per-formance challenge.

    In addition, our work encompassesthe full population of live complexcredit trades for a major investmentbank, as opposed to relying on theusual approach of using synthetic datasets. To fully grasp the computingchallenges, its helpful to take a closerlook at the financial products knownas credit derivatives.

    WHAT ARE CREDIT DERIVATIVES?Companies and governments issuebonds which provide the ability to fundthemselves for a predefined period,rather like taking a bank loan for afixed term. During their life, bonds typ-ically pay a return to the holder, whichis referred to as a coupon, and atmaturity return the borrowed amount.The holder of the bond, therefore, faces

    a number of risks, the most obviousbeing that the value of the bond may fallor rise in response to changes in thefinancial markets. However, the risk weare concerned with here is that when thebond reaches maturity, the issuer maydefault and fail to pay the coupon, thefull redemption value or both.

    Credit derivatives began life by offer-ing bond holders a way of insuringagainst loss on default. They haveexpanded to the point where they nowoffer considerably greater functionalityacross a range of assets beyond simplebonds, to both issuers and holders, byenabling the transfer of default risk inexchange for regular payments. The sim-plest example of such a contract is acredit default swap (CDS), where thedefault risk of a single issuer isexchanged for regular payments. The keydifference between a bond and a CDS isthat the CDS only involves payment of aredemption amount (minus any recov-ered amount) should default occur.

    Building on CDS contracts, creditdefault swap indexes (CDSIs) allow thetrading of risk using a portfolio ofunderlying assets. A collateralizeddefault obligation (CDO) is an exten-sion of a CDSI in which default losses

    X C E L L E N C E I N H I G H - P E R F O R M A N C E C O M P U T I N G

    ReferencePortfolio

    100 EquallyWeighted

    Credits100%

    6%attachment

    Super

    Mezzanine

    Equity

    5%attachment

    5-6%tranche

    Tranche1%

    5%subordination

    Source: J.P. Morgan

    Figure 1 Tranched credit exposure

  • on the portfolio of assets are dividedinto different layers or tranches,according to the risk of default. Atranche allows an investor to buy orsell protection for losses in a certainpart of the pool.

    Figure 1 shows a tranche between 5percent and 6 percent of the losses ina pool. If an investor sells protectionon this tranche, he or she will not beexposed to any losses until 5 percentof the total pool has been wiped out.The lower (less senior) tranches havehigher risk and those selling protec-tion on these tranches will receivehigher compensation than thoseinvesting in upper (more senior, hence

    less likely to default) tranches. CDOs can also be defined on non-

    standard portfolios. This variety ofCDO, referred to as bespoke, pres-ents additional computational chal-lenges, since the behavior of a givenbespoke portfolio is not directlyobservable in the market.

    The modeling of behavior for a poolof credits becomes complex, as it isimportant to model the likelihood ofdefault among issuers. Corporatedefaults, for example, tend to be sig-nificantly correlated because all firmsare exposed to a common or correlat-ed set of economic risk factors, so thatthe failure of a single company tends

    to weaken the remaining firms.The market for credit derivatives

    has grown rapidly over the past 10years, reaching a peak of $62 trillionin January 2008, in response to a widerange of influences on both demandand supply. CDSes account forapproximately 57 percent of thenotional outstanding, with theremainder accounted for by productswith multiple underlying credits.Notional refers to the amount ofmoney underlying a credit derivativecontract; notional outstanding isthe total amount of notional in cur-rent contracts in the market.

    A CREDIT DERIVATIVES MODELThe standard way of pricing CDOtranches where the underlying is abespoke portfolio of reference creditsis to use the base correlationapproach, coupled with a convolutionalgorithm that sums conditionallyindependent loss-random variables,with the addition of a coupon modelto price exotic coupon features. Sincemid-2007 the standard approach in thecredit derivatives markets has movedaway from the original Gaussian cop-ula model. Two alternative approach-es to valuation now dominate: the ran-dom-factor loading (RFL) model andthe stochastic recovery model.

    Our project has adopted the latterapproach and uses the followingmethodology as a basis for accelerat-ing pricing: In the first step, we dis-cretize and compute the loss distri-bution using convolution, given theconditional survival probabilitiesand losses from the copula model.Discretization is a method of chunk-ing a continious space into a discretespace and separating the space intodiscrete bins, allowing algorithms(like integration) to be done numeri-

    20 Xcell Journal First Quarter 2011

    X C E L L E N C E I N H I G H - P E R F O R M A N C E C O M P U T I N G

    for i in 0 ... markets - 1

    for j in 0 ... names - 1

    prob = cum_norm ((inv_norm(Q[j])- sqrt(rho)*M[i])/sqrt(1 - rho);

    loss=calc_loss(prob,Q2[j],RR[j],RM[j])*notional[j];

    n = integer(loss);

    L = fractional(loss);

    for k in 0 ... bins - 1

    if j == 0

    dist[k] = k == 0 ? 1.0 : 0.0;

    dist[k] = dist[k]*(1- prob) +

    dist[k - n]* prob *(1 - L) +

    dist[k - n - 1]* prob*L;

    if j == credits - 1

    final_dist [k] += weight[ i ] * dist[k];

    end # for k

    end # for j

    end # for i

    The modeling of behavior for a pool of credits becomes complex, as it is important to modelthe likelihood of default among issuers. Corporate defaults, for example, tend to be

    significantly correlatedthe failure of a single company tends to weaken the remaining firms.

    Figure 2 Pseudo code for the bespoke CDO tranche pricing

  • cally. We then use the standardmethod of discretizing over the twoclosest bins with a weighting suchthat the expected loss is conserved.We compute the final loss distribu-tion using a weighted sum over all ofthe market factors evaluated, usingthe copula model.

    Pseudo code for this algorithm isshown in Figure 2. For brevity, weveremoved the edge cases of the convolu-tion and the detail of the copula andrecovery model.

    ACCELERATED CDO PRICINGEach day the credit hybrids businesswithin J.P. Morgan needs to evaluatehundreds of thousands of credit deriv-ative instruments. A large proportionof these are standard, single-nameCDSes that need very little computa-tional time. However, a substantialminority of the instruments aretranched credit derivatives thatrequire the use of complex models likethe one discussed above. These dailyruns are so computationally intensivethat, without the application ofMaxelers acceleration techniques,they could only be meaningfully car-ried out overnight. To complete thetask, we used approximately 2,000standard Intel cores. Even with suchresources available, the calculationtime is around four and a half hoursand the total end-to-end runtime isclose to seven hours, when batchpreparation and results writeback aretaken into consideration.

    With this information, the acceler-ation stage of the project focused ondesigning a solution capable of deal-ing with only the complex bespoketranche products, with a specific goalof exploring the HPC architecturedesign space in order to maximizethe acceleration.

    MAXELERS ACCELERATION PROCESSThe key to maximizing the speed ofthe final design is a systematic anddisciplined end-to-end acceleration

    methodology. Maxeler follows the fourstages of acceleration shown in Figure3, from the initial C++ design to thefinal implementation, which ensureswe arrive at the optimal solution.

    In the analysis stage, we conduct-ed a detailed examination of the algo-rithms contained in the original C++model code. Through extensive codeand data profiling with the MaxelerParton profiling tool suite, we wereable to clearly understand and mapthe relationships between the com-putation and the input and outputdata. Part of this analysis involvedacquiring a full understanding of howthe core algorithm performs in prac-tice, which allowed us to identify themajor computational and data move-ments, as well as storage costs.Dynamic analysis using call graphs ofthe running software, combined withdetailed analysis of data values and

    First Quarter 2011 Xcell Journal 21

    X C E L L E N C E I N H I G H - P E R F O R M A N C E C O M P U T I N G

    Original Application

    Analysis

    Transformation

    Partitioning

    Implementation

    AcceleratedApplication

    Achievesperformance

    Sets theoreticalperformancebounds

    // Shared library and call overhead 1.2 %

    for d in 0 ... dates - 1

    // Curve Interpolation 19.7%

    for j in 0 ... names - 1

    Q[j] = Interpolate(d, curve)

    end # for j

    for i in 0 ... markets - 1

    for j in 0 ... names - 1

    // Copula Section 22.0%

    prob = cum_norm (( inv_norm (Q[j]) - sqrt(rho)*M[i])/sqrt (1 - rho);

    loss= calc_loss (prob,Q2[j],RR[j],RM[j])*notional[j];

    n = integer(loss);

    lower = fractional(loss);

    for k in 0 ... bins - 1

    if j == 0

    dist[k] = k == 0 ? 1.0 : 0.0 ;

    // Convolution 51.4%

    dist[k] = dist[k]*(1- prob) +

    dist[k- n]* prob *(1 - L) +

    dist[k- n - 1]* prob *L;

    // Reintegration, 5.1%

    if j == credits - 1

    final_dist [k] += weight[ i ] * dist[k];

    end # for k

    end # for j

    end # for i

    end # for d

    // Code outside main loop 0.5%

    Figure 3 Iterative Maxeler process for accelerating software

    Figure 4 Profiled version of original pricing algorithm in pseudo-code form

  • runtime performance, were neces-sary steps to identifying bottlenecksin execution as well as memory uti-lization patterns.

    Profiling the core 898 lines of origi-nal source code (see Figure 4) focusedattention on the need to acceleratetwo main areas of the computation:calculation of the conditional-survivalprobabilities (Copula evaluation) andcalculation of the probability distribu-tion (convolution).

    The next stage, transformation,extracted and modified loops and pro-gram control structure from the existingC++ code, leading to code and data lay-out transformations that in turn enabledthe acceleration of the core algorithm.We removed data storage abstractionusing object orientation, allowing datato pass efficiently to the accelerator,with low CPU overhead. Critical trans-formations included loop unrolling,reordering, tiling and operating on vec-tors of data rather than single objects.

    PARTITIONING AND IMPLEMENTATIONThe aim for the partitioning stage ofthe acceleration process was to cre-ate a contiguous block of operations.This block needed to be tractable toaccelerate, and had to achieve themaximum possible runtime coverage,

    when balanced with the overall CPUperformance and data input and out-put considerations.

    The profiling process we identifiedduring the analysis stage provided thenecessary insight to make partitioningdecisions.

    The implementation of the parti-tioned design led to migrating the loopscontaining the Copula evaluation andconvolution onto the FPGA accelera-tor. The remaining loops and associat-

    ed computation stayed on the CPU.Within the FPGA design, we split theCopula and convoluter into separatecode implementations that could exe-cute and be parallelized independentlyof each other and could be sizedaccording to the bounds of the loopsdescribed in the pseudo code.

    In the past, some design teams haveshied away from FPGA-based accelera-tion because of the complexity of theprogramming task. Maxelers program-

    ming environment, called MaxCompiler,raises the level of abstraction of FPGAdesign to enable rapid development andmodification of streaming applications,even when faced with frequent designupdates and fixes.

    IMPLEMENTATION DETAILSMaxCompiler builds FPGA designs in asimple, modular fashion. A design hasone or more kernels, which are high-ly parallel pipelined blocks for execut-

    ing a specific computation. The man-ager dynamically oversees the data-flow I/O between these kernels.Separating computation and communi-cation into kernels and managerenables a high degree of pipeline paral-lelism within the kernels. This paral-lelism is pivotal to achieving the per-formance our application enjoys. Weachieved a second level of parallelismby replicating the compute pipelinemany times within the kernel itself, fur-ther multiplying speedup.

    The number of pipelines that can bemapped to the accelerator is limitedonly by the size of the FPGAs used in theMaxNode and available parallelizationin the application. MaxCompiler lets youimplement all of the FPGA design effi-ciently in Java, without resorting tolower-level languages such as VHDL.

    For our accelerator, we used theJ.P. Morgan 10-node MaxRack config-ured with MaxNode-1821 computenodes. Figure 5 sketches the systemarchitecture of a MaxNode-1821. Eachnode has eight Intel Xeon cores andtwo Xilinx FPGAs connected to theCPU via PCI Express. A MaxRinghigh-speed interconnect is also avail-able, providing a dedicated high-band-

    22 Xcell Journal First Quarter 2011

    X C E L L E N C E I N H I G H - P E R F O R M A N C E C O M P U T I N G

    MaxRing

    FPGA FPGAPCIe PCIe

    XeonCores

    Some design teams have shied away fromFPGA-based acceleration because of the

    complexity of programming. MaxCompiler lets you implement the FPGA design in Java, without

    resorting to lower-level languages such as VHDL.

    Figure 5 MaxNode-1821 architecture diagram containing eight Intel Xeon cores and two Xilinx FPGAs

  • width communication channel directlybetween the FPGAs.

    One of the main focus points dur-ing this stage was finding a designthat balanced arithmetic optimiza-tions, desired accuracy, power con-sumption and reliability. We decidedto implement two variations of thedesign: a full-precision design forfair-value (FV) calculations and areduced-precision version for sce-nario analysis. We designed the full-precision variant for an average rela-tive accuracy of 10-8 and the reduced-precision for an average relativeaccuracy of 10-4. These design pointsshare identical implementations,varying only in the compile-timeparameters of precision, parallelismand clock frequency.

    We built the Copula and convolutionkernels to implement one or morepipelines that effectively parallelize

    loops within the computation. Since theconvolution uses each value the Copulamodel generates many times, the kerneland manager components were scala-ble to enable us to tailor the exact ratioof Copula or convolution resources tothe workload for the design. Figure 6shows the structure of the convolutionkernel in the design.

    When implemented in MaxCompiler,the code for the convolution in theinner loop resembles the originalstructure. The first difference is thatthere is an implied loop, as datastreams through the design, ratherthan the design iterating over the data.Another important distinction is thatthe code now operates in terms of adata stream (rather than on an array),such that the code is now describing astreaming graph, with offsets forwardand backward in the stream as neces-sary, to perform the convolution.

    Figure 7 shows the core of the codefor the convoluter kernel.

    IMPLEMENTATION EFFORTIn general, software programming ina high-level language such as C++ ismuch easier than interacting directlywith FPGA hardware using a low-level hardware description language.Therefore, in addition to performance,development and support time areincreasingly significant components ofoverall effort from the software-engi-neering perspective. For our project,we measured programming effort asone aspect in the examination of pro-gramming models. Because it is diffi-cult to get accurate development-timestatistics and to measure the quality ofcode, we use lines of code (LOC) as ourmetric to estimate programming effort.

    Due to the lower-level nature ofcoding for the FPGA architecture

    First Quarter 2011 Xcell Journal 23

    X C E L L E N C E I N H I G H - P E R F O R M A N C E C O M P U T I N G

    Conditional Survival Probabilitiesand Discretized Loss Inputs

    Market Factors Unrolled

    Market Weight Inputs

    Accumulated Loss Distribution(weighted sum)

    Loss Distribution Output

    Convoluter

    Buffer

    Nam

    es U

    nro

    lled

    Figure 6 Convoluter architecture diagram

  • when compared with standard C++,the original 898 lines of code generat-ed 3,696 lines of code for the FPGA(or a growth factor of just over four).

    Starting with the program that runson a CPU and iterating in time, wetransformed the program into the spa-tial domain running on the MaxNode,creating a structure on the FPGA thatmatched the data-flow structure ofthe program (at least the computa-tionally intensive parts). Thus, weoptimized the computer based on theprogram, rather than optimizing theprogram based on the computer.Obtaining the results of the computa-tion then became a simple matter ofstreaming the data through theMaxeler MaxRack system.

    In particular, the fact we can usefixed-point representation for manyof the variables in our application isa big advantage, because FPGAsoffer the opportunity to optimizeprograms on the bit level, allowingus to pick the optimal representationfor the internal variables of an algo-rithm, choosing precision and rangefor different encodins such as float-ing-point, fixed-point and logarith-mic numbers.

    We used Maxeler Parton to meas-ure the amount of real time taken forthe computation calls in the originalimplementation and in the acceleratedsoftware. We divided the total soft-ware time running on a single core byeight to attain the upper bound on themulticore performance, and com-pared it to the real time it took to per-form the equivalent computationusing eight cores and two FPGAs withthe accelerated software.

    For power measurements, we usedan Electrocorder AL-2VA from AcksenLtd. With the averaging window set toone second, we recorded a 220-sec-ond window of current and voltagemeasurements while the software wasin its core computation routine.

    As a benchmark for demonstrationand performance evaluation, we used afixed population of 29,250 CDO tranch-

    24 Xcell Journal First Quarter 2011

    X C E L L E N C E I N H I G H - P E R F O R M A N C E C O M P U T I N G

    HWVar d = io.input("inputDist", _distType);

    HWVar p = io.input("probNonzeroLoss", _probType);

    HWVar L = io.input("lowerProportion", _propType);

    HWVar n = io.input("discretisedLoss", _operType);

    HWVar lower = stream.offset(-n-1,-maxBins,0,d);

    HWVar upper = stream.offset(-n,-maxBins,0,d);

    HWVar o = ((1-p)*d + L*p*lower + (1-L)*p*upper);

    io.output("outputDist", _distType, o);

    // Shared library and call overhead 5%

    for d in 0 ... dates - 1

    // Curve Interpolation 54.5%

    for j in 0 ... names - 1

    Q[j] = Interpolate(d, curve)

    end # for j

    for i in 0 ... markets - 1

    for j in 0 ... names - 1

    // Copula Section 9%

    prob = cum_norm (( inv_norm (Q[j]) - sqrt (rho)*M[i])/sqrt(1- rho);

    loss= calc_loss (prob,Q2[j],RR[j],RM[j])*notional[j];

    // (FPGA Data preparation and post processing) 11.2%

    n = integer(loss);

    lower = fractional(loss);

    for k in 0 ... bins - 1

    if j == 0

    dist[k] = k == 0 ? 1.0 : 0.0 ;

    dist[k] = dist[k]*(1 - prob) +

    dist[k - n]* prob *(1 -L) +

    dist[k - n - 1]* prob *L;

    if j == credits - 1

    final_dist [k] += weight[ i ] * dist[k];

    end # for k

    end # for j

    end # for i

    end # for d

    // Other overhead (object construction, etc) 19.9%

    Figure 7 MaxCompiler code for adding a name to a basket

    Figure 8 Profiled version of FPGA take on pricing

  • es. This population comprised realbespoke tranche trades of varying matu-rity, coupon, portfolio composition andattachment/detachment points.

    The MaxNode-1821 delivered a 31xspeedup over an eight-core (Intel XeonE5430 2.66-GHz) server in full-preci-sion mode and a 37x speedup atreduced precision, both nodes usingmultiple processes to price up to eighttranches in parallel.

    The significant differences betweenthe two hardware configurationsinclude the addition of the MAX2-4412C card with two Xilinx FPGAs and24 Gbytes of additional DDR DRAM.Figure 8 shows the CPU profile of thecode running with FPGA acceleration.

    As Table 1 shows, the power usageper node decreases by 6 percent in thehybrid FPGA solution, even with a 31xincrease in computational performance.It follows that the speedup per wattwhen computing is actually greater thanthe speedup per cubic foot.

    Table 2 shows a breakdown of CPUtime of the Copula vs. convolution

    computations and their equivalentresource utilization on the FPGA. Aswe have exchanged time for spacewhen moving to the FPGA design,this gives a representative indicationof the relative speedup between thedifferent pieces of computation.

    THREE KEY BENEFITSThe credit hybrids trading groupwithin J.P. Morgan is reaping sub-stantial benefits from applyingacceleration technology. The order-of-magnitude increase in availablecomputation has led to three keybenefits: computations run muchfaster; additional applications andalgorithms, or those that were onceimpossible to resource, are nowpossible; and operational costsresulting from given computationsdrop dramatically.

    The design features of the Maxelerhybrid CPU/FPGA computer meanthat its low consumption of electrici-ty, physically small footprint and lowheat output make it an extremely

    attractive alternative to the traditionalcluster of standard cores.

    One of the key insights of the proj-ect has been the substantial benefits tobe gained from changing the computerto fit the algorithm, not changing thealgorithm to fit the computer (whichwould be the usual approach usingstandard CPU cores). We have foundthat executing complex calculations incustomizable hardware with Maxelerinfrastructure is much faster than exe-cuting them in software.

    FUTURE WORKThe project has expanded to includethe delivery of a 40-node Maxelerhybrid computer designed to provideportfolio-level risk analysis for thecredit hybrids trading desk in nearreal time. For credit derivatives, a sec-ond (more general) CDO model is cur-rently undergoing migration to run onthe architecture.

    In addition, we have also applied theapproach to two interest-rate models.The first was a four-factor Monte Carlomodel covering equities, interest rates,credit and foreign exchange. So far,weve achieved an accelerationspeedup of 284x over a Xeon core. Thesecond is a general tree-based model,for which acceleration results are pro-jected to be of a similar order of mag-nitude. We are currently evaluating fur-ther, more wide-ranging applicationsof the acceleration approach, and areexpecting similar gains across a widerrange of computational challenges.

    X C E L L E N C E I N H I G H - P E R F O R M A N C E C O M P U T I N G

    COMPUTATIONPERCENT TIMEIN SOFTWARE

    6-INPUTLOOKUP TABLES

    FLIP-FLOPS36-KBIT BLOCK RAMS

    18X25-BITMULTIPLIERS

    Copula Kernel 22.0% 30.05% 35.12% 11.85% 15.79%

    Convolutionand Integration

    56.1% 46.54% 52.84% 67.54% 84.21%

    PLATFORM IDLE PROCESSING

    Dual Xeon L5430 2.66 GHzQuad-Core 48-GB DDR DRAM

    185W 255W

    (as above) with MAX2-4412CDual Xilinx SX240T FPGAs24-GB DDR DRAM

    210W 240W

    First Quarter 2011 Xcell Journal 25

    Table 1 Power usage for 1U compute nodes when idle and while processing

    Table 2 Computational breakdown vs. FPGA resource usage of total used

  • 26 Xcell Journal First Quarter 2011

    Look Ma, No Motherboard!

    How one design team put a full single-board computer with SATA into a Xilinx FPGA.

    XCELLENCE IN INDUSTRIAL

  • First Quarter 2011 Xcell Journal 27

    Embedded systems for industri-al, scientific and medical (ISM)applications must support aplethora of interfaces. Thats whymany design teams choose FPGA-based daughtercards that plug rightinto a PCs motherboard to add thosespecial I/Os. Given the huge capacityof modern FPGAs, entire systems-on-chips can be implemented inside aXilinx device. These systems includehardware, operating system and soft-ware, and can provide almost the com-plete functionality of a PC, diminish-

    ing the need for a PC motherboard.The result is a more compact, lesspower-hungry, configurable single-board computer system.

    Many of those ISM applications relyon fast, dependable mass storage tohold and store the results of dataacquisition, for example. Solid-statedrives have become the de facto stan-dard in this application because oftheir high reliability and fast perform-ance. These SSDs almost always con-nect via a Serial ATA (SATA) interface.

    Lets examine the steps that wetook to extend a single-board comput-er system, built around a Xilinx chip,with high-speed SATA connectivity toadd SSD RAID functionality. For thistask, the Xilinx Alliance Programecosystem brought together ASICS

    World Services (ASICS ws) expertise inhigh-quality IP cores and Missing LinkElectronics (MLE) expertise in pro-grammable-systems design.

    But before delving into the detailsof the project, its useful to take adeeper look at SATA itself. As shownin Figure 1, multiple layers areinvolved for full SATA host controllerfunctionality. Therefore, when itcomes to implementing a completeSATA solution for an FPGA-basedprogrammable system, designersneed much more than just a high-quality intellectual-property (IP)core. Some aspects of the designoften get overlooked.

    First, it makes sense to implementonly the Physical (PHY), Link andsome portions of the Transport Layerin FPGA hardware; thats why IP ven-dors provide these layers in the IPthey sell. The SATA Host IP core fromASICS World Service utilizes the so-called MultiGigabit Transceivers, orMGT, [1] to implement the PHYlayerwhich comprises an out-of-band signaling block similar to theone described in Xilinx applicationnote 870 [2]completely within theFPGA. The higher levels of theTransport Layer, along with theApplications, Device and UserProgram layers, are better implement-ed in software and, thus, typically IPvendors do not provide these layers tocustomers. This, however, places theburden of creating the layers on thesystem design team and can add unan-ticipated cost to the design project.

    The reason vendors do not includethese layers in their IP is becauseeach architecture is different andeach will be used in a different man-ner. Therefore, to deliver a completesolution that ties together the IP corewith the user programs, you mustimplement, test and integrate compo-nents such as scatter-gather DMA(SGDMA) engines, which consist ofhardware and software.

    In addition, communication at theTransport Layer is done via so-called

    by Lorenz KolbMember of the Technical StaffMissing Link Electronics, [email protected]

    Endric SchubertCo-founder Missing Link Electronics, [email protected]

    Rudolf UsselmannFounder ASICS World Service [email protected]

    mkf

    sfs

    ck

    fdis

    k

    md

    adm

    dd

    hd

    par

    m

    Block Device Layer(/dev/sdX)

    libATA(SMART, hot swap, NCQ,TRIM, PATA/SATA/ATAPI)

    SATA HCI Driver

    SATA HCI

    Transport

    Link

    PHY

    ShadowRegister

    FISConstruct

    FISDecomp

    CRCChecker

    CRCGenerate

    Scrambler Descrambler

    8b/10bEncoder

    8b/10bDecoder

    OOB SpeedNegotiate

    Figure 1 Serial ATA function layers

    X C E L L E N C E I N I N D U S T R I A L

  • frame information structures (FIS). TheSATA standard [3] defines the set of FIStypes and it is instructive to look at thedetailed FIS flow between host anddevice for read and write operations.

    As illustrated in Figure 2, a hostinforms the device about a new opera-tion via a Register FIS, which holds a

    standard ATA command. In case of aread DMA operation, the device sendsone (or more) Data FIS as soon as it isready. The device completes the trans-action via a Register FIS, from deviceto host. This FIS can inform of either asuccessful or a failed operation.

    Figure 2 also shows the FIS flow

    between host and device for a writeDMA operation. Again, the hostinforms the device of the operation viaa Register FIS. When the device isready to receive data, it sends a DMAActivate FIS and the host will starttransmitting a single Data FIS. Whenthe device has processed this FIS and

    28 Xcell Journal First Quarter 2011

    X C E L L E N C E I N I N D U S T R I A L

    HO

    ST

    Read DMA Write DMA

    DEV

    ICE

    DEV

    ICEHO

    ST

    RegisterRegister

    Data

    Data

    Data

    Data

    DMA Activa

    te

    DMA Activa

    te

    Data

    Data

    Register

    Register

    HO

    ST

    HO

    ST

    Read FPDMA Queued Write FPDMA Queued

    DEV

    ICE

    DEV

    ICE

    DataData

    Data

    Data

    Register

    Register

    Register RegisterData

    DataDMA SETUP (

    TAG=4) DMA SETUP (TAG=

    4)Set Device Bit

    s (Busy=0)

    Register (TAG=1)

    Register (TAG=4)

    Set Device Bits (Busy=0)

    Set Device Bits (Busy=0)

    Register (TAG=1)

    Register (TAG=4)

    Set Device Bits (Busy=0)

    DMA SETUP (TAG=1)

    DMA SETUP (TAG=1)

    DMA Activate

    DMA Activate

    Figure 2 FIS flow between host and device during a DMA operation

    Figure 3 FIS flow between host and device during first-party DMA queued operation

  • it still expects data, it again sends aDMA Activate FIS. The process iscompleted in the same way as the readDMA operation.

    A new feature introduced withSATA and not found in parallel ATA isthe so-called first-party DMA. Thisfeature transfers some control overthe DMA engine to the device. In thisway the device can cache a list ofcommands and reorder them for opti-mized performance, a techniquecalled native command queuing. NewATA commands are used for first-party DMA transfers. Because thedevice does not necessarily completethese commands instantaneously, but

    rather queues them, the FIS flow is abit different for this mode of opera-tion. The flow for a read first-partyDMA queued command is shown onthe left side of Figure 3.

    Communication on the ApplicationLayer, meanwhile, uses ATA com-mands. [4] While you can certainlyimplement a limited number of thesecommands as a finite state machine inFPGA hardware, a software imple-mentation is much more efficient andflexible. Here, the open-source Linuxkernel provides a known-good imple-mentation that almost exactly followsthe ATA standard and is proven inmore than a billion devices shipped.

    The Linux ATA library, libATA,copes with more than 100 differentATA commands to communicate withATA devices. These commands includedata transfers but also provide func-tionality for SMART (Self-MonitoringAnalysis and Reporting Technology)and for security features such assecure erase and device locking.

    The ability to utilize this code base,however, requires the extra work ofimplementing hardware-dependentsoftware in the form of Linux devicedrivers as so-called Linux KernelModules. As Figure 4 shows, theMissing Link Electronics SoftHardware Platform comes with a full

    First Quarter 2011 Xcell Journal 29

    X C E L L E N C E I N I N D U S T R I A L

    MLE Storage Test Suite

    File Systems

    Linux

    Programming/Scripting

    mkt

    sfs

    ck

    fdis

    k

    md

    adm

    dd

    hd

    par

    m

    bo

    nn

    ie

    gcc

    g+

    +

    mak

    e

    pyt

    ho

    n

    BA

    SH ... X11

    vfat

    Linux Kernel Drivers

    ext2 ext3 btrfs

    RAID Devices(/dev/mdx) Block Device Layer (/dev/sdx)

    libATA(SMART, hot swap, NCOTRIM, PATA/SATA/ATAPI)

    SATA HCI Driver

    AC97

    AC97

    DDR2

    Flash

    DVI RS232 USB Ethernet WLAN Bluetooth SPI GPIO SATA1 SATA2

    RS232 WLAN

    AC97 RS232 WLAN

    GPIO

    GPIO

    DVI USB Ethernet

    DVIUSB Ethernet

    SPIBluetooth

    SPIBluetooth

    DDR2Ctrl

    FlashCtrl

    DMAEngine

    CPUASICS ws

    FPGA

    MLEApplication

    Operating System

    System-on-Chip

    I/O Connectivity

    ShadowRegister

    CRCChecker

    8b/10bEncoder

    8b/10bDecoder

    SpeedNegotiateODB

    Scrambler Descrambler

    CRCGenerate

    FISConstruct

    FISDecomp

    SATA HCITransport

    Link

    PHY

    Figure 4 Complete SATA solution

  • GNU/Linux software stack prein-stalled, along with a tested and opti-mized device driver for the SATA hostIP core from ASICS World Service.

    When integrating a SATA IP core intoan FPGA-based system, there are manydegrees of freedom. So, pushing the lim-its of the whole system requires knowl-edge of not just software or hardware,but both. Integration must proceed intandem for software and hardware.

    Figure 5 shows examples of how toaccomplish system integration of a

    SATA IP core. The most obvious way isto add the IP core as a slave to the bus(A) and let the CPU do the transfersbetween memory and the IP. To be sure,data will pass twice over the systembus, but if high data rates are notrequired, this easy-to-implementapproach may be sufficient. In this case,however, you can use the CPU only fora small application layer, since most ofthe time it will be busy copying data.

    The moment the CPU has to run afull operating system, the impact on

    performance will be really dramatic.In this case, you will have to considerreducing the CPU load by adding adedicated copy engine, the XilinxCentral DMA (option B in the figure).This way, you are still transferring datatwice over the bus, but the CPU doesnot spend all of its time copying data.

    Still, the performance of a systemwith a full operating system is faraway from a standalone application,and both are far from the theoreticalperformance limits. The third architec-ture option (C in the figure) changesthis picture by reducing the load of thesystem bus and using simple dedicatedcopy engines via Xilinxs streamingNPI port and Multiport MemoryController (MPMC). This boosts theperformance of the standalone appli-cation up to the theoretical limit.However, the Linux performance ofsuch a system is still limited.

    From the standalone application,we know that the bottleneck is notwithin the interconnection. This timethe bottleneck is the memory manage-ment in Linux. Linux handles memoryin blocks of a page size. This page sizeis 4,096 bytes for typical systems. Witha simple DMA engine and free memoryscattered all over the RAM in 4,096-byte blocks, you may move only 4,096bytes with each transfer. The finalarchitectural option (D in the figure)tackles this problem.

    For example, the PowerPC PPC440core included in the Virtex-5 FXTFPGA has dedicated engines that arecapable of SGDMA. This way, the DMAengine gets passed a pointer to a list ofmemory entries and scatters/gathersdata to and from this list. This results in

    30 Xcell Journal First Quarter 2011

    X C E L L E N C E I N I N D U S T R I A L

    Mem

    ory

    A

    PLB

    1

    2

    CPU

    Mem

    Ctr

    l

    SATA0

    SATA1

    Mem

    ory

    C

    PLB

    CPU

    Mem

    Ctr

    l

    Mem

    ory

    B

    PLB

    1

    2

    CPU

    Mem

    Ctr

    l

    SATA0

    SATA1

    DMA

    NPI

    NPI

    Mem

    ory

    D

    Mem

    Ctr

    lDMA DMADMA

    PPC440

    X-B

    AR

    DMA

    MC

    I

    Loca

    lLin

    k

    LL

    SATA0

    SATA1

    SATA0

    SATA1

    DM

    AD

    MA

    When integrating a SATA IP core into an FPGA-based system,

    there are many degrees of freedom. So, pushing the limits of the whole

    system requires knowledge of not just software or hardware, but both.

    Integration must proceed in tandem for software and hardware.

    Figure 5 Four architectural choices for integrating a SATA IP core

  • larger transfer sizes and brings the sys-tem very close to the standalone per-formance. Figure 6 summarizes the per-formance results of these differentarchitectural choices.

    Today, the decision whether tomake or buy a SATA host controllercore is obvious: Very few design teamsare capable of implementing a func-tioning SATA host controller for thecost of licensing one. At the sametime, it is common for design teams tospend significant time and money in-house to integrate this core into a pro-grammable system-on-chip, developdevice drivers for this core and imple-ment application software for operat-ing (and testing) the IP.

    The joint solution our team craftedwould not have been possible with-out short turnaround times betweentwo Xilinx Alliance ProgramPartners: ASICS World Service Ltd.and Missing Link Electronics, Inc. Tolearn more about our complete SATA

    solution, please visit the MLE LiveOnline Evaluation site at http://www.missinglinkelectronics.com/LOE.There, you will get more technicalinsight along with an invitation totest-drive a complete SATA systemvia the Internet.

    References:

    1. Xilinx, Virtex-5 FPGA RocketIO GTXTransceiver User Guide, October 2009.http://www.xilinx.com/bvdocs/userguides/ug198.pdf.

    2. Xilinx, Inc., Serial ATA Physical LinkInitialization with the GTP Transceiver ofVirtex-5 LXT FPGAs, 2008 applicationnote. http://www.xilinx.com/support/docu-mentation/application_notes/xapp870.pdf.

    3. Serial ATA International Organization,Serial ATA Revision 2.6, February 2007.http://www.sata-io.org/

    4. International Committee forInformation Technology Standards, ATAttachment 8 - ATA/ATAPI Command Set,September 2008. http://www.t13.org/

    First Quarter 2011 Xcell Journal 31

    X C E L L E N C E I N I N D U S T R I A L

    MB/s

    225

    150

    75

    PIO Central DMA

    DMA over NPI

    LocalLink

    SGDMA

    Gen 1/2 only

    Gen 2 Limit

    with FullLinux System

    Standalone

    Figure 6 Performance of complete SATA solution

  • 32 Xcell Journal First Quarter 2011

    Built on a pair of Spartan FPGAs,

    the fourth-generation Gecko

    system serves industrial and

    research applications as well

    as educational projects.

    XCELLENCE IN UNIVERSITY RESEARCH

    By Theo Kluter, PhD LecturerBern University of Applied Sciences (BFH)[email protected]

    Marcel Jacomet, PhDProfessorBern University of Applied Sciences (BFH)[email protected]

    General-Purpose SoC Platform HandlesHardware/Software Co-Design

  • First Quarter 2011 Xcell Journal 33

    he Gecko system is ag e n e r a l - p u r p o s ehardware/software

    co-design environment for real-timeinformation processing in system-on-chip (SoC) solutions. The systemsupports the co-design of software,fast hardware and dedicated real-time signal-processing hardware. Thelatest member of the Gecko family,the Gecko4main module, is an exper-imental platform that offers the nec-essary computing power to speedintensive real-time algorithms andthe necessary flexibility for control-intensive software tasks. Addingextension boards equipped with sen-sors, motors and a mechanical hous-ing suits the Gecko4main module forall manner of industrial, research andeducational projects.

    Developed at the Bern University ofApplied Sciences (BFH) in Biel/Bienne,Switzerland, the Gecko program got itsstart more than 10 years ago in aneffort to find a better way to handlecomplex SoC design. The goal of thelatest, fourth-generation Gecko proj-ect was to develop a new state-of-the-art version of the general-purpose sys-tem-on-chip hardware/software co-design development boards, using thecredit-card-size form factor (54 x 85mm). With Gecko4, we have realized acost-effective teaching platform whilestill preserving all the requirements forindustrial-grade projects.

    The Gecko4main module containsa large, central FPGAa 5 million-gate Xilinx Spartan-3and asmaller, secondary Spartan for boardconfiguration and interfacing purpos-es. This powerful, general-purposeSoC platform for hardware/softwareco-designs allows you to integrate aMicroBlaze 32-bit RISC processorwith closely coupled application-spe-

    cific hardware blocks for high-speed hardware algorithms withinthe single multimillion-gate FPGAchip on the Gecko4main module.To increase design flexibility, theboard also is equipped with severallarge static and flash memorybanks, along with RS232 and USB2.0 interfaces.

    In addition, you can stack a set ofextension boards on top of theGecko4main module. The currentextension boardswhich featuresensors, power supply and amechanical housing with motorscan help in developing system-on-chip research and industrial proj-ects, or a complete robot platformready to use for student exercises indifferent subjects and fields.

    FEATURES AND EXTENSION BOARDSThe heart of the Gecko4 system is theGecko4main module, shown in Figure1. The board can be divided into threedistinct parts:

    Configuration and IP abstraction(shown on the right half of Figure 1)

    Board powering

    Processing and interfacing(shown on the left half of Figure 1)

    One of the main goals of theGecko4main module is the abstractionof the hardware from designers or stu-dent users, allowing them to concen-trate on the topics of interest without

    Storage Flash2 Gbit (64M x 32)

    SDRAM128 Mbit(8M x 16)

    SDRAM128 Mbit(8M x 16)

    SDRAM128 Mbit(8M x 16)

    FPGA2XC3S1000XC3S1500XC3S2000XC3S4000XC3S5000

    3.3V

    I/O

    an

    d S

    up

    ply

    Clocks25 MHz16 MHz

    Bit file Flash16 Mbit

    USB 2.0 IFCypress FX2

    Dual Function: JTAG (PC USB) VGA and RS232

    DC/DCand

    PowerSwitch

    Mu

    lti-

    Stan

    dar

    d I/

    O56

    IO/2

    GN

    D/2

    Vio

    Mu

    lti-

    Stan

    dar

    d I/

    O56

    IO/2

    GN

    D/2

    Vio

    FPGA1XC3S200AN

    Du

    al C

    olo

    rLE

    Ds

    Figure 1 The Gecko4main module is an eight-layer PCB with the components all on the top layer, reducing production cost.

    T

  • having to deal too much with the hard-ware details. The onboard Spartan-3ANcontains an internal flash memory,allowing autonomous operation. ThisFPGA performs the abstraction by pro-viding memory-mapped FIFOs andinterfaces to a second FPGA.

    Users can accomplish the dynamicor autonomous configuration of thatsecond Spartan in three ways: via abit file that is stored in the bit-fileflash for autonomous operation, bythe PC through the USB 2.0 and theUSBTMC protocol, or by connecting a

    Xilinx JTAG programmer to the dual-function connecter.

    The main Spartan-3AN handles thefirst two options using a parallel-slaveconfiguration protocol. The JTAG con-figuration method involves the directintervention of the Xilinx design tools.

    NEW TWISTS ON CHIP DESIGN

    The inherently interdisciplinary character of modern systems has revealed the limits of the electronicsindustrys classical chip-design approach. The capture-and-simulate method of design, for example, hasgiven way to the more efficient describe-and-synthesize approach. More sophisticated hardware/software co-design methodologies have emerged in the past several years.

    Techniques such as the specify-explore-refine method give designers the opportunity of a high-level approachto system design, addressing lower-level system-implementation considerations separately in subsequent designsteps. Speed-sensitive tasks are best implemented with dedicated hardware, whereas software programs run-ning on microprocessor cores are the best choice for control-intensive tasks.

    A certain amount of risk is always present when designing SoCs to solve new problems for which only limitedknowledge and coarse models of the analog signals to be processed are available. The key problem in suchsituations is the lack of experience with the developedsolution. This is especially true where real-time informa-tion processing in a closed control loop is needed. Pure-software approaches limit the risk of nonoptimal solu-tions, simply because its possible to improve softwarealgorithms in iterative design cycles. Monitoring the effi-ciency and quality of hardware algorithms is, however,no easy task. In addition, hardware algorithms cannoteasily be improved on SoC solutions in the same itera-tive way as software approaches.

    In control-oriented reactive applications, the systemreacts continuously to the environment. Therefore, themonitoring of different tasks is crucial, especially ifthere are real-time constraints. As long as precise mod-els of the information to be processed are not available,as often occurs in analog signal processing, a straightfor-ward design approach is impossible or at least risky. Theflexibility of iteratively improving analog/digital hard-ware-implemented algorithms is thus necessary in orderto explore the architecture and to refine its implementa-tion. Again, this involves monitoring capabilities of theSoC in its real-time environment.

    Thus, an important problem in the classical ASIC/FPGAhardware/software co-design approach is decoupling ofthe design steps with the prototyping hardware in thedesign flow, which inhibits the necessary monitoring task in an early design phase. Adding an extra prototypingstep and an analysis step opens the possibility to iteratively improve the SoC design. The refine-prototype-analysisdesign loop best matches the needs of SoC designs with real-time constraints (see figure). If precise models of theplant to be controlled are missing, thorough data analysis and subsequent refinement of the data-processing algo-rithms are needed for designing a quality SoC. Theo Kluter and Marcel Jacomet

    specify

    explore

    refine

    analyze

    prototype

    implement

    System Specification

    System Co-Design

    Algorithm Improvement

    Signal & Data Analysis

    Real-Time Verification

    Final SoC Design

    34 Xcell Journal First Quarter 2011

    The specify-explore-refine design flow transforms intoa specify-explore-refine-prototype-analyze design flowfor SoC designs with real-time constraints.

    X C E L L E N C E I N U N I V E R S I T Y R E S E A R C H

  • While the latter option allows for on-chip debugging with, for example,ChipScope, the former two providedynamic reconfiguration of even a 5 million-gate-equivalent second FPGAin only a fraction of a second.

    Many of our teaching and researchtargets require a fast communicationlink for features like hardware-in-the-loop, regression testing and semi-real-time sensing, among others. TheGecko4 system realizes this fast linkby interfacing a USB 2.0 slave con-troller (the Cypress Fx2) with theUSBTMC protocol. To both abstractthe protocol and provide easy interfac-ing for users, the Spartan-3AN doesprotocol interpretation and places thepayload into memory-mapped FIFOs.

    Designers who do not need a largeFPGA can use the Gecko4main mod-ule without mounting the left half ofthe PCB, reducing board costs signifi-cantly. In this setting, designers canuse the Spartan-3AN directly to housesimple systemsindeed, they caneven implement a simple CPU-basedsystem in this 200 million-gate-equiva-lent FPGA. For debugging and displaypurposes, the device provides eightdual-color LEDs, three buttons, anunbuffered RS232 connection and atext-display VGA connection at a reso-lution of 1,024 x 768 pixels (60 Hz).These interfaces are also availablememory-mapped to the second FPGA.

    BOARD POWERINGDue to the various usages of theG