Xcell Journal: SOLUTIONS FOR A PROGRAMMABLE WORLD
ISSUE 79, SECOND QUARTER 2012
www.xilinx.com/xcell/

Using Formal Verification for HW/SW Co-verification of an FPGA IP Core
How to Use the CORDIC Algorithm in Your FPGA Design
Smart, Fast Financial Trading Platforms Start with FPGAs
Xilinx Unveils Vivado Design Suite for the Next Decade of ‘All Programmable’ Devices
FPGAs Provide Flexible Platform for High School Robotics (page 28)




  • LETTER FROM THE PUBLISHER

    Xilinx, Inc.
    2100 Logic Drive
    San Jose, CA 95124-3400
    Phone: 408-559-7778
    FAX: 408-879-4780
    www.xilinx.com/xcell/

    © 2012 Xilinx, Inc. All rights reserved. XILINX, the Xilinx Logo, and other designated brands included herein are trademarks of Xilinx, Inc. All other trademarks are the property of their respective owners.

    The articles, information, and other materials included in this issue are provided solely for the convenience of our readers. Xilinx makes no warranties, express, implied, statutory, or otherwise, and accepts no liability with respect to any such articles, information, or other materials or their use, and any use thereof is solely at the risk of the user. Any person or entity using such information in any way releases and waives any claim it might have against Xilinx for any loss, damage, or expense caused thereby.

    PUBLISHER Mike Santarini
    [email protected]

    EDITOR Jacqueline Damian

    ART DIRECTOR Scott Blair

    DESIGN/PRODUCTION Teie, Gelwicks & Associates1-800-493-5551

    ADVERTISING SALES Dan [email protected]

    INTERNATIONAL Melissa Zhang, Asia [email protected]

    Christelle Moraga, Europe/Middle East/[email protected]

    Miyuki Takegoshi, [email protected]

    REPRINT ORDERS 1-800-493-5551

    Xcell journal

    www.xilinx.com/xcell/

    Welcome to the Programmable Renaissance

    Time flies. I recently celebrated my fourth anniversary here at Xilinx and have had a great time participating in the Herculean task of bringing to market two new generations of silicon: the 40-nanometer 6 series FPGAs and the 28-nm 7 series devices. I'm proud to be a part of what is most likely the single most inspirational and innovative moment in the history of programmable logic since Xilinx introduced the very first FPGA, the XC2064, in 1985.

    Along with being the first to market with 28-nm silicon in 2011, Xilinx introduced two revolutionary technologies: the Zynq-7000 Extensible Processing Platform and the Virtex-7 2000T FPGA.

    If you've been a faithful reader of Xcell Journal over the last couple of years, you are familiar with these two great devices. The Zynq-7000 EPP marries a dual ARM Cortex-A9 MPCore processor with programmable logic on the same device, and boots from the processor core rather than from programmable logic. It enables new vistas in system-level integration for traditional FPGA designers and opens up the world of programmable logic to a huge user base of software engineers. The possibilities are endless.

    I'm not alone in my enthusiasm for this device: The editors and readers of EE Times and EDN recently voted the Zynq-7000 the Ultimate SoC of 2011 in the UBM Electronics ACE Awards competition (see http://www.eetimes.com/electronics-news/4370156/Xilinx-Zynq-7000-receives-product-of-the-year-ACE-award).

    Not to be outdone, the Virtex-7 2000T, an ACE Awards finalist in the Ultimate Digital IC category, is in my opinion an equally if not even more technologically impressive accomplishment. It is the first commercially available FPGA with Xilinx's 3D stacked-silicon interconnect (SSI) technology, in which four 28-nm programmable logic dice (what we call slices) reside side-by-side on a passive silicon interposer. By stacking the dice, Xilinx was able to make the Virtex-7 2000T the world's single largest device in terms of transistor counts and by far the highest-capacity programmable logic device that has ever existed.

    The SSI technology not only allows customers to speed past Moore's Law but also opens up new integration possibilities in which Xilinx can integrate different types of dice on a single device, speeding up the pace of user innovation. For example, Xilinx has announced the Virtex-7 HT family of devices, enabled by SSI technology. Each member of this family will include transceiver slices alongside programmable logic slices. The Virtex-7 HT family will allow wired communications companies to create equipment to conform to new bandwidth standards for 100 Gbps and beyond. The biggest device in the family, the Virtex-7 H870T, will allow companies to create equipment that can run at up to 400 Gbps, developing equipment at the leading edge of advanced communications standards.

    And now, to put the icing on the cake so to speak, Xilinx is launching its new Vivado Design Suite (cover story). Vivado, which the company started developing four years ago, not only blows away the runtimes of the ISE Design Suite but is built from the ground up using open standards and modern EDA technologies, even high-level synthesis, that should dramatically speed up productivity for the 7 series devices and many generations of FPGAs to come.

    I highly recommend you check out the new 7 series devices and the Vivado Design Suite. If you happen to be available for a trip to San Francisco in early June, Xilinx will be exhibiting at the Design Automation Conference (www.dac.com) from June 3 to 7 at Booth 730. You'll find me there, or at three of the Pavilion Panels I'm organizing on DAC's show floor (Booth 310): "Gary Smith on EDA: Trends and What's Hot at DAC," on Monday, June 4, 9:15 to 10:15 a.m.; "Town Hall: Dark Side of Moore's Law," on Wednesday, June 6, 9:15 to 10:15 a.m.; and "Hardware-Assisted Prototyping and Verification: Make vs. Buy?" on Wednesday, June 6, 4:30 to 5:15 p.m. I hope to see you there.

    Mike Santarini
    Publisher



  • CONTENTS

    SECOND QUARTER 2012, ISSUE 79

    VIEWPOINTS

    Letter From the Publisher: Welcome to the Programmable Renaissance

    Cover Story: Xilinx Unveils Vivado Design Suite for the Next Decade of 'All Programmable' Devices

    XCELLENCE BY DESIGN APPLICATION FEATURES

    Xcellence in Communications: High-Level Synthesis Tool Delivers Optimized Packet Engine Design
    Xcellence in Distributed Computing: Accelerating Distributed Computing with FPGAs
    Xcellence in Education: FPGAs Enable Flexible Platform for High School Robotics
    Xcellence in Financial: Smart, Fast Trading Systems Start with FPGAs

    THE XILINX XPERIENCE FEATURES

    Xperts Corner: Accelerate Partial Reconfiguration with a 100% Hardware Solution
    Xplanation: FPGA 101: How to Use the CORDIC Algorithm in Your FPGA Design
    Xplanation: FPGA 101: Using Formal Verification for HW/SW Co-verification of an FPGA IP Core

    XTRA READING

    Tools of Xcellence: New tools take the pain out of FPGA synthesis
    Xamples: A mix of new and popular application notes
    Xtra, Xtra: The latest Xilinx tool updates and patches, as of May 2012
    Xclamations!: Share your wit and wisdom by supplying a caption for our techy cartoon, for a chance to win an Avnet Spartan-6 LX9 MicroBoard

    Excellence in Magazine & Journal Writing: 2010, 2011
    Excellence in Magazine & Journal Design and Layout: 2010, 2011

  • 8 Xcell Journal Second Quarter 2012

    COVER STORY

    Xilinx Unveils Vivado Design Suite for the Next Decade of 'All Programmable' Devices

    State-of-the-art EDA technologies and methods underlie a new tool suite that will radically improve design productivity and quality of results, allowing designers to create better systems faster and with fewer chips.

    After four years of development and a year of beta testing, Xilinx is making its Vivado Design Suite available to customers via its early-access program, ahead of public access this summer. Vivado provides a highly integrated design environment with a completely new generation of system- to IC-level tools, all built on the backbone of a shared scalable data model and a common debug environment. It is also an open environment based on industry standards such as the AMBA AXI4 interconnect, IP-XACT IP packaging metadata, the Tool Command Language (Tcl), Synopsys Design Constraints (SDC) and others that facilitate design flows tailored to the user's needs. Xilinx architected the Vivado Design Suite to enable the combination of all types of programmable technologies and to scale up to 100 million ASIC equivalent-gate designs.

    by Mike Santarini
    Publisher, Xcell Journal
    Xilinx, Inc.
    [email protected]


    "Over the last four years, Xilinx has pushed semiconductor innovation to new heights and unleashed the full system-level capabilities of programmable devices," said Steve Glaser, senior vice president of corporate strategy and marketing. "Over this time, Xilinx has evolved into a company that develops All Programmable Devices, extending programmability beyond programmable logic and I/O to software-programmable ARM subsystems, 3D ICs and analog mixed signal. We are enabling new levels of programmable system integration with devices such as the award-winning Zynq-7000 Extensible Processing Platform, the 3D Virtex-7 stacked-silicon interconnect (SSI) technology devices and the world's most advanced FPGAs. Now, with Vivado, we are offering a state-of-the-art tool suite that will accelerate the productivity of customers using these All Programmable Devices for the next decade."

    Glaser said Xilinx developed All Programmable Devices to enable customers to achieve new levels of programmable systems integration, increased system performance, lower BOM cost and total system power reduction, and ultimately to accelerate design productivity so they can get their innovations to market quickly. To accomplish this, Xilinx needed to create a tool suite as innovative as its new silicon, a suite that would address nagging integration and implementation design-productivity bottlenecks.

    "Customers face a number of integration bottlenecks, including integrating algorithmic C and register-transfer level (RTL) IP; mixing the DSP, embedded, connectivity and logic domains; verifying blocks and systems; and reusing designs and IP," said Glaser. "They also face several implementation bottlenecks, including hierarchical chip planning and partitioning; multidomain and multidie physical optimization; multivariant 'design' vs. 'timing' closure; and late ECOs and the rippling effects of design changes. The new Vivado Design Suite addresses these bottlenecks and empowers users to take full advantage of the system integration capabilities of our All Programmable Devices."

    In developing the Vivado Design Suite, Xilinx leveraged industry standards and employed state-of-the-art EDA technologies and techniques. The result is that all designers, from those who require a highly automated, pushbutton flow to those who are extremely hands-on, will be able to design even the largest Xilinx devices far faster and more effectively than before, while working in a state-of-the-art EDA environment that retains a familiar, intuitive look and feel.

    The Vivado Design Suite gives customers a modern set of tools with full-system programmability features that far surpass the capabilities of the longtime flagship ISE Design Suite. To help customers transition smoothly, Xilinx will continue to develop and support ISE indefinitely for those targeting 7 series and older Xilinx FPGA technologies. Going forward, the Vivado Design Suite will be the company's flagship design environment, supporting all 7 series and future devices from Xilinx.

    Tom Feist, senior director of design methodology marketing at Xilinx, expects that when customers launch the Vivado Design Suite, the benefits over ISE will become immediately evident.

    "The Vivado Design Suite improves user productivity by offering up to 4X runtime improvements over competing tools, while heavily leveraging industry standards such as SystemVerilog, SDC, C/C++/SystemC, ARM's AMBA AXI version 4 interconnect and interactive Tcl scripting," said Feist. "Other highlights include comprehensive cross-probing of Vivado's many reports and design views, state-of-the-art graphics-based IP integration and, last but not least, the first fully supported commercial deployment of high-level synthesis (C++ to HDL) by an FPGA vendor."

    TOOLS FOR THE NEXT ERA OF PROGRAMMABLE DESIGN

    Xilinx originally introduced its ISE Design Suite back in 1997. The suite featured a then very innovative timing-driven place-and-route engine that Xilinx had gained in its April 1995 acquisition of NeoCAD. Over a decade and a half, Xilinx added numerous new technologies to the suite, including multilanguage synthesis and simulation, IP integration and a host of editing and test utilities, striving to constantly improve its design tools on all fronts as FPGAs became capable of performing increasingly complex functions. In creating the new Vivado Design Suite, Feist said that Xilinx drew upon all the lessons learned with ISE, appropriating its key technologies while also leveraging modern EDA algorithms, tools and techniques.

    "The Vivado Design Suite will greatly improve design productivity for today's designs and will easily scale for the capacity and design-complexity challenges of 20-nanometer silicon and beyond," said Feist. "EDA technology has evolved greatly over the last 15 years. In building this tool from scratch, we were able to create a suite that employs the latest EDA technologies and standards and will scale nicely into the foreseeable future."

    DETERMINISTIC DESIGN CLOSURE

    At the heart of any FPGA vendor's integrated design suite is the physical-implementation flow: synthesis, floorplanning, placement, routing, power and timing analysis, optimization and ECO. With Vivado, Xilinx has built a state-of-the-art implementation flow to help customers quickly achieve design closure.


  • SCALABLE DATA MODEL ARCHITECTURE

    To cut down on iterations and overall design time and to improve overall productivity, Xilinx built its implementation flow using a single, shared, scalable data model, a framework also found in today's most advanced ASIC design environments. "This shared scalable data model allows all the steps in the flow (synthesis, simulation, floorplanning, place and route, etc.) to operate on an in-memory data model that enables debug and analysis at every step in the process, so that users have visibility into key design metrics such as timing, power, resource utilization and routing congestion much earlier in the design process," said Feist. "These estimates become progressively more accurate as the design progresses through the steps in the implementation process."

    Specifically, the unified data model allowed Xilinx to tightly link its new multidimensional, analytical place-and-route engine with the suite's RTL synthesis engine, new multiple-language simulation engines as well as individual tools such as the IP Integrator, Pin Editor, Floor Planner and Device Editor. Customers can use the tool suite's comprehensive cross-probing function to track and cross-probe a given problem from schematics, timing reports or logic cells to any other view and all the way back to HDL code.

    "You now have analysis at every step of the design process and every step is connected," said Feist. "We also provide analysis for timing, power, noise and resource utilization at every stage of the flow after synthesis. So if I learn early that my timing or power is way off, I can do short iterations to address the issue proactively rather than run long iterations, perhaps several of them, after it's been placed and routed."

    Feist said that the tight integration afforded by the scalable data model enhanced the effectiveness of pushbutton flows for users who want maximum automation, relying on their tools to do the vast majority of the work. At the same time, he said, it also gives those users who require more-advanced controls better analysis and command of their every design move.

    HIERARCHICAL CHIP PLANNING, FAST SYNTHESIS

    Feist said that Vivado provides users with the ability to partition the design for processing by synthesis, implementation and verification, facilitating a divide-and-conquer team approach to big projects. A new design-preservation feature enables repeatable timing results and the ability to perform partial reconfiguration of the design.

    Vivado also includes an entirely new synthesis engine that is designed to handle millions of logic cells. Key to the new synthesis engine is superior support for SystemVerilog. "Vivado's synthesis engine supports the synthesizable subset of the SystemVerilog language better than any other tool in the market," said Feist. "It is three times faster than XST, the Xilinx Synthesis Technology in the ISE Design Suite, and supports a 'quick' option that lets designers rapidly get a feeling for the area and size of the design, allowing them to debug issues 15 times faster than before with an RTL or gate-level schematic. With more and more ASIC designers moving to programmable platforms, Xilinx is also leveraging Synopsys Design Constraints throughout the Vivado flow. The use of standards opens up new levels of automation where customers can now access state-of-the-industry EDA tools for things like constraint generation, cross-domain clock checking, formal verification and even static timing analysis with tools like PrimeTime from Synopsys."

    Figure 1: The Vivado Design Suite implements large and small designs more quickly and with better-quality results than other FPGA tools. (Plot of runtime in hours versus design size in logic cells, from 0 to 2.0E+06 LC, for Vivado, ISE and competitor tools; the plot is annotated with 12h/MLC and 4.6h/MLC.)

    MULTIDIMENSIONAL ANALYTICAL PLACER

    Feist explained that the older-generation FPGA vendor design suites use one-dimensional timing-driven place-and-route engines powered by simulated annealing algorithms that determine randomly where the tool should place logic cells. With these routers, users enter timing; then the simulated annealing algorithm pseudorandomly places features to get a "best as it can" match to timing requirements. "In those days it made sense, because designs were much smaller and logic cells were the main cause of delays," said Feist. "But today, with complex designs and advances in silicon processes, interconnect and design congestion contribute to the delay far more."

    "Place-and-route engines with simulated annealing algorithms do an adequate job for FPGAs below 1 million gates, but they really start to underperform as designs grow," said Feist. "Not only do they struggle with congestion, but the results become increasingly unpredictable as designs grow further beyond 1 million gates."
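To make the "local move" behavior described above concrete, here is a minimal sketch of simulated annealing on a toy one-dimensional placement. It is a textbook illustration only, not the algorithm used by ISE or any vendor tool; the cell names, the chain netlist, the cooling schedule and the `anneal_placement` helper are all assumptions made for the example.

```python
import math
import random


def wire_length(order, nets):
    """Total 1-D wire length: sum of slot distances between connected cells."""
    slot = {cell: i for i, cell in enumerate(order)}
    return sum(abs(slot[a] - slot[b]) for a, b in nets)


def anneal_placement(cells, nets, seed=0, steps=5000, t_start=5.0, t_end=0.01):
    """Place cells in a row by randomly swapping pairs (simulated annealing)."""
    rng = random.Random(seed)
    order = list(cells)
    rng.shuffle(order)
    cost = wire_length(order, nets)
    best_order, best_cost = list(order), cost
    for step in range(steps):
        # Geometric cooling schedule from t_start down to t_end.
        temp = t_start * (t_end / t_start) ** (step / (steps - 1))
        i, j = rng.sample(range(len(order)), 2)
        order[i], order[j] = order[j], order[i]
        new_cost = wire_length(order, nets)
        delta = new_cost - cost
        # Always accept improvements; accept worse moves with
        # probability exp(-delta / temp).
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            cost = new_cost
            if cost < best_cost:
                best_order, best_cost = list(order), cost
        else:
            order[i], order[j] = order[j], order[i]  # undo the swap
    return best_order, best_cost


cells = list("ABCDEFGH")
# Hypothetical netlist: a simple chain A-B-C-...-H (7 two-pin nets).
nets = list(zip(cells, cells[1:]))

order_a, cost_a = anneal_placement(cells, nets, seed=1)
order_b, cost_b = anneal_placement(cells, nets, seed=2)
print(cost_a, order_a)
print(cost_b, order_b)
```

Because every move is drawn from a random number generator, two runs with different seeds may settle on different placements and costs, which is the kind of run-to-run variation Feist attributes to this class of engine.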

    With an eye toward the multimillion-gate future, Xilinx developed a modern multidimensional analytic placement engine for the Vivado Design Suite that is on par with those found in million-dollar ASIC place-and-route tools. This engine analytically finds a solution that primarily minimizes three dimensions of a design: timing, congestion and wire length. "The Vivado Design Suite's algorithm globally optimizes for best timing, congestion and wire length simultaneously, taking into account the entire design instead of the local-move approach done with simulated annealing," said Feist. "As a result, the tool can place and route 10 million gates quickly, deterministically and with consistently strong quality of results (see Figure 1). Because it is solving for all three factors simultaneously, it means you run fewer iterations in your flow."
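For contrast with random local moves, the sketch below shows one classic analytical formulation, quadratic placement, which solves for all cell positions together. It minimizes squared wire length only, in one dimension; it does not model timing or congestion and should not be read as Xilinx's algorithm. The pads, cells, netlist and the `quadratic_placement` helper are hypothetical.

```python
def quadratic_placement(movable, fixed, nets, sweeps=200):
    """Minimize the sum of squared two-pin net lengths by Gauss-Seidel iteration.

    At the minimum, each movable cell sits at the average position of the
    cells it connects to, so the update below is applied repeatedly.
    Every movable cell must appear in at least one net.
    """
    pos = dict(fixed)
    pos.update({cell: 0.0 for cell in movable})
    neighbors = {cell: [] for cell in movable}
    for a, b in nets:
        if a in neighbors:
            neighbors[a].append(b)
        if b in neighbors:
            neighbors[b].append(a)
    for _ in range(sweeps):
        for cell in movable:
            linked = neighbors[cell]
            pos[cell] = sum(pos[n] for n in linked) / len(linked)
    return {cell: pos[cell] for cell in movable}


# Hypothetical example: two fixed I/O pads with three cells chained between them.
fixed = {"pad_left": 0.0, "pad_right": 10.0}
movable = ["c1", "c2", "c3"]
nets = [("pad_left", "c1"), ("c1", "c2"), ("c2", "c3"), ("c3", "pad_right")]

placement = quadratic_placement(movable, fixed, nets)
print(placement)
```

The same inputs always produce the same positions (here the three cells spread evenly between the two pads), which illustrates why analytical methods are described as deterministic.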

    To illustrate this advantage, Xilinx ran the raw RTL for the Zynq-7000 EPP emulation platform, a very large and complex design, in both the ISE Design Suite and Vivado Design Suite in a pushbutton mode. Each tool was instructed to target Xilinx's largest FPGA device, the SSI-enabled Virtex-7 2000T FPGA. The Vivado Design Suite's place-and-route engine took five hours to place the 1.2 million logic cells, while the ISE Design Suite version 13.4 took 13 hours (Figure 2). The Vivado Design Suite also implemented the design with much less congestion (as seen in the gray and yellow portions of the design) and in a smaller area, reflecting the total wire-length reduction. In addition, the Vivado Design Suite implementation had better memory compilation efficiency, taking only 9 Gbytes to implement the design's required memory versus the ISE Design Suite's 16 Gbytes.
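The figures quoted above can be reduced to a few ratios. The short calculation below uses only the numbers given in the text (13 hours versus five hours, 16 Gbytes versus 9 Gbytes, 1.2 million logic cells); the variable names are ours, and the per-million-logic-cell rates are simply derived from this one design, not taken from Figure 1.

```python
# Numbers quoted in the article for the Zynq-7000 EPP emulation platform run.
ise_hours, vivado_hours = 13.0, 5.0
ise_mem_gb, vivado_mem_gb = 16.0, 9.0
logic_cells = 1.2e6

speedup = ise_hours / vivado_hours               # place-and-route runtime ratio
memory_saving = 1.0 - vivado_mem_gb / ise_mem_gb  # fraction of memory saved
mlc = logic_cells / 1e6                           # design size in millions of logic cells
ise_rate = ise_hours / mlc                        # hours per million logic cells
vivado_rate = vivado_hours / mlc

print(f"runtime speedup: {speedup:.1f}x")
print(f"memory saving:   {memory_saving:.1%}")
print(f"ISE:    {ise_rate:.2f} h/MLC")
print(f"Vivado: {vivado_rate:.2f} h/MLC")
```

For this one design, that works out to a 2.6x shorter place-and-route runtime and roughly 44 percent less memory.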

    "Essentially what you're seeing is that the Vivado Design Suite met all constraints and only needed three-quarters of the device to implement the entire design," said Feist. "That means users could add even more logic functionality and on-chip memory to their designs [in the extra space] or, alternatively, even move to a smaller device."

    POWER OPTIMIZATION AND ANALYSIS

    Today, power is one of the most critical aspects of FPGA design. As such, the Vivado Design Suite focuses on advanced power-optimization techniques to provide greater power reductions for users' designs. "The technology uses advanced clock-gating techniques found in today's advanced ASIC tool suites and is capable of analyzing design logic and removing unnecessary switching activity by applying clock gating," said Feist. "Specifically, the new technology focuses on the switching-activity factor 'alpha.' It is able to achieve up to a 30 percent reduction in dynamic power." Feist said Xilinx introduced the technology in the ISE Design Suite last year but is carrying it forward and will continue to enhance it in Vivado.
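The "alpha" Feist mentions is the activity factor in the standard first-order model of CMOS switching power, P = alpha * C * V^2 * f. The article gives neither this formula nor any component values, so the capacitance, voltage, frequency and baseline activity below are hypothetical, chosen only to show how a lower activity factor translates into lower dynamic power.

```python
def dynamic_power(alpha, capacitance_f, voltage_v, frequency_hz):
    """First-order CMOS switching power: P = alpha * C * V^2 * f (watts)."""
    return alpha * capacitance_f * voltage_v ** 2 * frequency_hz


# Hypothetical operating point, for illustration only.
capacitance_f = 1e-9   # total switched capacitance, farads
voltage_v = 1.0        # core supply voltage, volts
frequency_hz = 200e6   # clock frequency, hertz
alpha_before = 0.20    # fraction of capacitance switched per cycle

# Assume clock gating removes 30 percent of the switching activity.
alpha_after = alpha_before * (1.0 - 0.30)

p_before = dynamic_power(alpha_before, capacitance_f, voltage_v, frequency_hz)
p_after = dynamic_power(alpha_after, capacitance_f, voltage_v, frequency_hz)
reduction = 1.0 - p_after / p_before

print(f"before: {p_before * 1e3:.1f} mW, after: {p_after * 1e3:.1f} mW")
print(f"dynamic power reduction: {reduction:.0%}")
```

In this model dynamic power is linear in alpha, so with capacitance, voltage and frequency unchanged, any fractional cut in switching activity yields the same fractional cut in dynamic power.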

    "In addition, with the new shared scalable data model, users can get power estimates at every stage of the design flow, enabling up-front analysis so that problem areas can be addressed early in the design flow," said Feist.

    Figure 2: The Vivado Design Suite's multidimensional analytic algorithm optimizes layouts for best timing, congestion and wire length, not just best timing. (Zynq emulation platform: ISE, 13 hrs. P&R runtime and 16 GB memory usage; Vivado, 5 hrs. and 9 GB; the layout views compare wire length and congestion.)

    SIMPLIFYING ENGINEERING CHANGE ORDERS

    Incremental flows make it possible to quickly process small design changes by simply reimplementing a small part of the design, making iterations faster after each change. They also enable performance preservation after each incremental change, thus reducing the need for multiple design iterations. Toward this end, the Vivado Design Suite includes a new extension to the popular ISE FPGA Editor tool called the Vivado Device Editor. Feist said that using the Vivado Device Editor on a placed-and-routed design, designers now have the power to make engineering change orders (ECOs) late in the design cycle, such as moving instances, rerouting nets, tapping a register to a primary output for debug with a scope, or changing the parameters on a digital clock manager (DCM) or a lookup table (LUT), without needing to go back through synthesis and implementation. "No other FPGA design environment offers this level of flexibility," he said.

    FLOW AUTOMATION, NOT FLOW DICTATION

    In building the Vivado Design Suite, the Xilinx tool team's mantra was to automate, not dictate, the way people design. "Whether they start in C, C++, SystemC, VHDL, Verilog or SystemVerilog, MATLAB or Simulink, and whether they use our IP or third-party IP, we offer a way to automate all those flows and help customers be more productive," said Feist. "We also accounted for the broad range of skill sets and preferences of our users, from folks who want an entirely pushbutton flow to folks who do analysis at each phase of the design, and even for those who think GUIs are for wimps and want to do everything in command-line or batch mode via Tcl. Users are able to suit the suite's features to their specific needs."

    THE IP PACKAGER, INTEGRATOR AND CATALOG

    Xilinx's tool architecture team placed top priority on giving the new suite specialized IP features to facilitate the creation, integration and archiving of intellectual property. To this end, Xilinx has created three new IP capabilities in Vivado, called IP Packager, IP Integrator and the Extensible IP Catalog.

    "Today, it is hard to find an IC design that doesn't incorporate some amount of IP," said Feist. "By adopting industry standards and offering tools to specifically facilitate the creation, integration and archiving/upkeep of IP, we are helping IP vendors in our ecosystem and customers to quickly build IP and improve design productivity. More than 20 vendors are already offering IP supporting the new suite."

    IP Packager allows Xilinx customers, IP developers and ecosystem partners to turn any part of their design (or indeed, the entire design) into a reusable core at any level of the design flow: RTL, netlist, placed netlist and even placed-and-routed netlist. The tool creates an IP-XACT description of the IP that users can easily integrate into future designs. For its part, the IP Packager specifies the data for each piece of IP in an XML file. Feist said that once you have the IP packaged, you can use the new IP Integrator to stitch it into the rest of your design.
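As a rough idea of what "specifying the data for each piece of IP in an XML file" can look like, the sketch below writes a small XML description of a core. The vendor, library, name and version fields mirror the identifier IP-XACT uses for a component, but everything else is simplified: the snippet omits the IP-XACT namespace and most of the schema, so it is not a valid IP-XACT file, and the `describe_ip` helper and the example core are hypothetical. It does not reproduce the output of the IP Packager.

```python
import xml.etree.ElementTree as ET


def describe_ip(vendor, library, name, version, ports):
    """Build a simplified, IP-XACT-inspired XML description of a core."""
    component = ET.Element("component")
    for tag, text in (("vendor", vendor), ("library", library),
                      ("name", name), ("version", version)):
        ET.SubElement(component, tag).text = text
    ports_el = ET.SubElement(component, "ports")
    for port_name, direction in ports:
        port = ET.SubElement(ports_el, "port")
        ET.SubElement(port, "name").text = port_name
        ET.SubElement(port, "direction").text = direction
    return ET.tostring(component, encoding="unicode")


# Hypothetical core, for illustration only.
xml_text = describe_ip(
    vendor="example.com",
    library="demo",
    name="fir_filter",
    version="1.0",
    ports=[("clk", "in"), ("data_in", "in"), ("data_out", "out")],
)
print(xml_text)
```

A machine-readable description of this kind is what lets a downstream tool discover a core's identity and ports without parsing its RTL.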

    "IP Integrator allows customers to integrate IP into their designs at the interconnect level rather than at the pin level," said Feist. "You can drag and drop the pieces of IP onto your design and it will check up front that the respective interfaces are compatible. If they are, you draw one line between the cores and it will automatically write the detailed RTL that connects all the pins."
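The idea of connecting cores at the interface level and letting the tool expand that into pin-level wiring can be sketched as follows. This is a toy model, not the IP Integrator's implementation: the "stream" interface with its valid/ready/data signals, the endpoint dictionaries and the `connect` helper are all hypothetical.

```python
# Hypothetical interface definitions: interface type -> signal names.
INTERFACE_SIGNALS = {
    "stream": ["valid", "ready", "data"],
}


def connect(master, slave):
    """Check two interface endpoints for compatibility, then expand the
    single interface-level connection into a list of pin-to-pin wires."""
    if master["role"] != "master" or slave["role"] != "slave":
        raise ValueError("connection must run from a master to a slave")
    if master["type"] != slave["type"]:
        raise ValueError("interface types differ")
    if master["data_width"] != slave["data_width"]:
        raise ValueError("data widths differ")
    return [
        (f'{master["core"]}.{signal}', f'{slave["core"]}.{signal}')
        for signal in INTERFACE_SIGNALS[master["type"]]
    ]


# Hypothetical cores: a DMA engine streaming 32-bit data into a FIFO.
dma_out = {"core": "dma", "type": "stream", "role": "master", "data_width": 32}
fifo_in = {"core": "fifo", "type": "stream", "role": "slave", "data_width": 32}

wires = connect(dma_out, fifo_in)
for source, sink in wires:
    print(f"{source} -> {sink}")
```

The designer states one connection; the compatibility check runs before any wiring is generated, and a mismatch (for example, different data widths) is rejected up front rather than discovered later in simulation.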

    "Once you've merged, say, four or five blocks into your design with IP Integrator," he said, "you can take the output of that [process] and run it back through the IP Packager. The result then becomes a piece of IP that other people can reuse," said Feist. "And this IP isn't just RTL, it can be a placed netlist or even a placed-and-routed IP netlist block, which further saves integration and verification time."

    A third feature, the Extensible IP Catalog, allows users to build their own standard repositories from IP they've created or licensed from Xilinx and third-party vendors. The catalog, which Xilinx built to conform to the requirements of the IP-XACT standard, allows design teams and even enterprises to better organize their IP and share it across their organization. Feist said that the Xilinx System Generator and IP Integrator are part of the Vivado Extensible IP Catalog so that users can easily access catalogued IP and integrate it into their design projects.

    "Instead of having third-party IP vendors deliver their IP in a zip file and with various deliverables, they can now deliver it to you in a unified format that is instantly accessible and compatible with the Vivado suite," said Ramine Roane, director of product marketing for Vivado.

    VIVADO HLS TAKES ESL MAINSTREAMPerhaps the most forward looking ofthe many new technologies in theVivado Design Suite release is VivadoHLS (high-level synthesis), which Xilinxgained in its acquisition of AutoESL in2010. Xilinx conducted an extensiveevaluation of commercial electronicsystem-level (ESL) design offerings

    C O V E R S T O R Y

    12 Xcell Journal Second Quarter 2012

    The tools will work for all levels of users, 'from folks who want an entirely pushbutton flow to folks who do analysis at each phase of the design.'

    before acquiring the best in the industry. A study by research firm BDTI helped guide Xilinx's acquisition choice (see Xcell Journal issue 71, "BDTI Study Certifies High-Level Synthesis Flows for DSP-Centric FPGA Design," http://www.xilinx.com/publications/archives/xcell/Xcell71.pdf).

    "Vivado HLS provides comprehensive coverage of C, C++ and SystemC, and does floating-point as well as arbitrary precision floating-point [calculations]," said Feist. "This means that you can work with the tool in an algorithm-development environment rather than a typical hardware environment, if you wish. A key advantage of doing this is that the algorithms you developed at that level can be verified orders of magnitude faster than at the RTL. That means you get simulation acceleration but also the ability to explore the feasibility of algorithms and make, at an architectural level, trade-offs in terms of throughput, latency and power."

    Designers can use the Vivado HLS tool in many ways to perform a wide range of functions. But for demonstration purposes, Feist outlined a common flow users can employ for developing IP and integrating it into their designs.

    In this flow, users create a C, C++ or SystemC representation of their design and a C testbench that describes its desired behavior. They then verify the system behavior of their design by compiling and running it with the GNU Compiler Collection (GCC/G++) or Visual C++. Once the behavioral design is functioning satisfactorily and the accompanying testbench is ironed out, they run the design through Vivado HLS synthesis, which will generate an RTL design: Verilog or VHDL. With the RTL they can then perform Verilog or VHDL simulation of the design or have the tool create a SystemC version using the C-wrapper technology. Users can then perform SystemC architectural-level simulation and further verify the architectural behavior and functionality of the design against the previously created C testbench.
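As a rough illustration of the first step in this flow, the sketch below pairs a small C design function with a self-checking C testbench. The names `accumulate` and `run_testbench`, the block size and the stimulus values are illustrative assumptions rather than anything taken from a Xilinx example; the point is only that the same C source can be checked with an ordinary compiler before it is handed to synthesis.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define N_SAMPLES 8  /* hypothetical block size, chosen only for illustration */

/* Design function: sums a fixed-size block of samples.
 * The fixed trip count keeps the loop simple for a synthesis tool to schedule. */
uint32_t accumulate(const uint16_t samples[N_SAMPLES])
{
    assert(samples != NULL);
    uint32_t sum = 0;
    for (size_t i = 0; i < N_SAMPLES; i++) {
        sum += samples[i];
    }
    return sum;
}

/* Testbench: drives the design with known stimulus and compares the result
 * with an expected value computed independently (closed-form series sum).
 * Returns 0 on a match and 1 on a mismatch. */
int run_testbench(void)
{
    uint16_t stimulus[N_SAMPLES];
    for (size_t i = 0; i < N_SAMPLES; i++) {
        stimulus[i] = (uint16_t)(1000u * (i + 1));   /* 1000, 2000, ... 8000 */
    }
    const uint32_t expected = 1000u * N_SAMPLES * (N_SAMPLES + 1) / 2;
    return accumulate(stimulus) == expected ? 0 : 1;
}
```

In a complete testbench, `main()` would simply return the value of `run_testbench()`, so that a nonzero exit status signals a mismatch at both the C and the RTL simulation stage.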

    Once the design has been solidified, users can put it through the Vivado Design Suite's physical-implementation flow to program their design into a device and run it in hardware. Alternatively, they can use the IP Packager to turn the design into a reusable piece of IP, stitch the IP into a design using IP Integrator or run it in System Generator.

    This is merely one way to use the tool. In fact, in this issue of Xcell Journal, Agilent's Nathan Jachimiec and Xilinx's Fernando Martinez Vallina describe how they used the Vivado HLS technology (called AutoESL technology in the ISE Design Suite flow) to develop a UDP packet engine for Agilent.

    VIVADO SIMULATOR

    In addition to Vivado HLS, Xilinx also created a new mixed-language simulator for the suite that supports Verilog and VHDL. With a single click of the mouse, Feist said, users can launch behavioral simulations and view results in an integrated waveform viewer. Simulations are accelerated at the behavioral level using a new performance-optimized simulation kernel that executes up to three times faster than the ISE simulator. Gate-level simulations can also run up to 100 times faster using hardware co-simulation.

    AVAILABILITY IN 2012

    Where Xilinx offered the ISE Design Suite in four editions aimed at different types of designers (Logic, Embedded, DSP and System), the company will offer the Vivado Design Suite in two editions. The base Design Edition includes the new IP tools in addition to Vivado's synthesis-to-bitstream flow. Meanwhile, the System Edition includes all the tools of the Design Edition plus System Generator and Xilinx's new Vivado HLS.

    The Vivado Design Suite version 2012.1 is available now as part of an early-access program. Customers should contact their local Xilinx representative for more information. Public access will commence with version 2012.2 in the middle of the second quarter, followed by WebPACK availability later in the year. ISE Design Suite Edition customers with current support will receive the new Vivado Design Suite Editions in addition to ISE at no additional cost.

    Xilinx will continue to support and develop the ISE Design Suite for customers targeting devices prior to the 28-nm generation. To learn more about Vivado, please visit www.xilinx.com/design-tools or come see the suite in action at the Design Automation Conference (DAC), June 3-7 in San Francisco, Booth 730.


    [Figure 3 diagram labels: Functional Specification; C Design; C Testbench; Synthesis; RTL Design; C Wrapper; Verification; Architectural Verification; Packaging (Vivado IP Packager); IP Integrator; System Generator; RTL. Starts at C: C, C++, SystemC. Produces RTL: Verilog, VHDL, SystemC. Automates flow: verification, implementation.]

    Figure 3: Vivado HLS allows design teams to begin their designs at a system level.


    Gigabit Ethernet is one of the most ubiquitous interconnect options available to link a workstation or laptop to an FPGA-based embedded platform due to the availability of the hardened tri-mode Ethernet MAC (TEMAC) primitive. The primary impediment in developing Ethernet-based FPGA designs is the perceived processor requirement necessary to handle the Internet Protocol (IP) stack. We approached the problem using the AutoESL high-level synthesis tool to develop a high-performance IPv4 User Datagram Protocol (UDP) packet transfer engine.

    Our team at Agilent's Measurement Research Lab wrote original C source code based on Internet Engineering Task Force requests for comments (RFCs) detailing packet exchanges among several protocols, namely UDP, the Address Resolution Protocol (ARP) and the Dynamic Host Configuration Protocol (DHCP). This design implements a hardware packet-processing engine without any need for a CPU. The architecture is capable of handling traffic at line rate with minimum latency and is compact in logic-resource area. The usage of AutoESL makes it easy to modify the user interface with minimum effort to adapt to one or more FIFO streams or to multiple RAM interface ports. AutoESL is a new addition to the Xilinx ISE Design Suite and is called Vivado HLS in the new Vivado Design Suite (see cover story).

    AutoESL enabled the creation of an in-fabric,processor-free UDP network packet engine.

    XCELLENCE IN COMMUNICATIONS

    by Nathan Jachimiec, PhD
    R&D Engineer
    Agilent Technologies
    Technology Leadership [email protected]

    Fernando Martinez Vallina, PhD
    Software Applications Engineer
    Xilinx, [email protected]

    High-Level Synthesis Tool Delivers Optimized Packet Engine Design


    IPV4 USER DATAGRAM PROTOCOL

    Internet Protocol version 4 (IPv4) is the dominant protocol of the Internet, with version 6 (IPv6) growing steadily in popularity. When most developers discuss IP, they commonly refer to the Transmission Control Protocol, or TCP, a connection-based protocol that provides reliability and congestion management. But for many applications such as video streaming, telephony, gaming or distributed sensor networks, increased bandwidth and minimal latency trump reliability. Hence, these applications typically use UDP instead.

    UDP is connectionless and provides no inherent reliability. If packets are lost, duplicated or sent out of order, the sender has no way of knowing, and it is the responsibility of the user's application to perform some packet inspection to handle these errors. In this regard, UDP has been nicknamed the "unreliable" protocol, but in comparison to TCP, it offers higher performance. UDP support is available in nearly every major operating system that supports IP. High-level software programming languages refer to network streams as "sockets" and UDP as a datagram socket.

    SENSOR NETWORK ARCHITECTURE

    At Agilent, we developed a LAN-based sensor network that interfaces an analog-to-digital converter (ADC) with a Xilinx Virtex-5 FPGA. The FPGA performs data aggregation and then streams a requested number of samples to a predetermined IP address, that is, a host PC. Because the block RAM of our FPGA was almost completely devoted to signal processing, we did not have enough memory to contain the firmware for a soft processor. Instead, we opted to implement a minimal set of networking functions to transfer sensor data via UDP back to a host. Due to the need for high bandwidth and low latency, UDP packet streaming was the preferred network mode.

    Because of the time-sensitive nature of the data, a new set of sample data is more pertinent than any retransmission of lost samples. One of the two challenging issues we faced was to avoid overloading the host device. That meant we had to find a way of efficiently handling the large number of inbound samples. The second major challenge was quickly formatting the UDP packet and calculating the required IP header fields and the optional, but necessary, UDP payload checksum, before the next set of samples overflowed internal buffers.

    INITIAL HDL DESIGN

    An HDL implementation of the packet engine was straightforward given pre-existing pseudocode, but not optimal for our FPGA hardware. C and pseudocode provided from various sources simplified verification. In addition, tools such as Wireshark, the open-source packet analyzer, and high-level languages such as Java simplified the process of simulation and in-lab verification.

    Using provided pseudocode, the task of developing Verilog to generate the packet headers involved coding a state machine, reading the sample FIFO and assembling the packet into a RAM-based buffer. We broke the design into three main modules, RX Flow, TX Flow and LAN MCU, as shown in Figure 1. As packets arrive from the LAN, the RX Flow inspects them and passes them either to the instrument core or to the LAN MCU for processing, such as when handling ARP or DHCP packets.

    The TX Flow packet engine reads N ADC samples from a TX FIFO and computes a running payload checksum for calculating the UDP checksum. The TX FIFO buffers new samples as they arrive, while the LAN MCU prepares the payload of a yet-to-be-transmitted packet. After fetching the last requested sample, the LAN MCU computes the remaining header fields of the IP/UDP packet. In network terminology, this procedure is a TX checksum offload.
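The running payload checksum can be pictured with a short C sketch of the one's-complement sum that the Internet checksum is built on (RFC 1071). This is a minimal illustration, not the authors' source code: the type and function names are hypothetical, and the samples are assumed to arrive as 16-bit words.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Running one's-complement accumulator in the style of RFC 1071. */
typedef struct {
    uint32_t sum;   /* wide accumulator; carries are folded only at the end */
} csum_state;

void csum_init(csum_state *s) { s->sum = 0; }

/* Add one 16-bit word, for example as each sample leaves the FIFO. */
void csum_add16(csum_state *s, uint16_t word) { s->sum += word; }

/* Fold the carries back into the low 16 bits and return the complement. */
uint16_t csum_finish(const csum_state *s)
{
    uint32_t sum = s->sum;
    while (sum >> 16) {
        sum = (sum & 0xFFFFu) + (sum >> 16);
    }
    return (uint16_t)~sum;
}

/* Convenience wrapper: checksum over a block of 16-bit words.
 * The 32-bit accumulator cannot overflow for n up to 65,536 words. */
uint16_t csum_words(const uint16_t *words, size_t n)
{
    assert(words != NULL || n == 0);
    csum_state s;
    csum_init(&s);
    for (size_t i = 0; i < n; i++) {
        csum_add16(&s, words[i]);
    }
    return csum_finish(&s);
}
```

Because the accumulator is updated one word at a time, the payload contribution is already known when the last sample has been fetched, and only the header words remain to be added before the final fold.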

    Figure 1: Our UDP packet engine design consisted of three main modules: RX Flow, TX Flow and LAN MCU. [Block diagram labels: Instrument Core Logic and ADC I/F; RX FIFO; TX FIFO; RX Flow; TX Flow; LAN MCU (AutoESL); TEMAC; Control Packets, UDP, Data Streaming, ARP, DHCP.]

    Once the packet fields are generated, the LAN MCU sends the packet to the TEMAC for transmission but retains it until the TEMAC acknowledges successful transmission, not reception by the destination device. As this first packet is awaiting transmission by the TEMAC, new sensor samples are arriving into the TX FIFO. When the first packet is finished, our packet engine releases the buffer to prepare for the next packet. The process continues in a double-buffered fashion. If the TEMAC signals an error and the next transmit buffer overflow is imminent, then the packet is lost to allow the next sample set to continue, but an exception is noted. Due to time-stamping of the sample set incorporated into our packet format, the host will realize a discontinuity in the set and accommodate it.
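The double-buffered hand-off described above might be modeled in C roughly as follows. This is a simplified sketch under stated assumptions: the structure, the function names and the single "busy" flag are hypothetical, and the real design's error signaling from the TEMAC is reduced here to "the previous packet was not acknowledged in time."

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    int      fill;      /* index (0 or 1) of the buffer currently being filled */
    bool     tx_busy;   /* true while the other buffer awaits the MAC's acknowledgment */
    uint32_t dropped;   /* packets abandoned so that sampling could continue */
} pingpong;

void pp_init(pingpong *p)
{
    p->fill = 0;
    p->tx_busy = false;
    p->dropped = 0;
}

/* The buffer being filled now holds a complete packet.
 * If the previous packet is still unacknowledged, it is abandoned and the
 * exception is counted. Returns the index of the buffer handed to the MAC. */
int pp_packet_ready(pingpong *p)
{
    assert(p->fill == 0 || p->fill == 1);
    if (p->tx_busy) {
        p->dropped++;
    }
    int tx = p->fill;
    p->tx_busy = true;
    p->fill ^= 1;   /* new samples go to the other buffer */
    return tx;
}

/* The MAC has acknowledged transmission; the held buffer is released. */
void pp_tx_done(pingpong *p) { p->tx_busy = false; }

/* Small scenario driver: acked[i] says whether packet i was acknowledged
 * before packet i+1 became ready. Returns the number of dropped packets. */
uint32_t pp_simulate(const bool *acked, size_t n)
{
    pingpong p;
    pp_init(&p);
    for (size_t i = 0; i < n; i++) {
        pp_packet_ready(&p);
        if (acked[i]) {
            pp_tx_done(&p);
        }
    }
    return p.dropped;
}
```

The drop counter plays the role of the noted exception; on the host side, the time stamps in the packet format reveal the resulting gap.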

    The latency to transmit a packet is the number of cycles it takes to read in N ADC samples plus the cycles to generate the packet header fields, including the IPv4 flags, source and destination address fields, UDP pseudo header and both the IP and UDP checksums. The checksum computations are rather problematic since they require reading the entire packet, yet they lie before the payload bytes.

    CODING HDL IN THE DARK

    To support the high-bandwidth and low-latency requirements of the sensor network, we needed an optimal hardware design to keep up with the required sample rate. The straightforward approach we implemented first in Verilog failed to meet a 125-MHz clock rate without floorplanning, and took 17 clock cycles to generate the IP/UDP packet header fields. As we developed the initial HDL design, ChipScope was vital to understanding the nuances of the TEMAC interface, but it also impeded the goal of achieving a 125-MHz clock. The additional logic-capture circuits altered the critical path and would require manual floorplanning for timing closure.

    The critical path was calculating the IP and UDP header checksums, because our straightforward design used a four-operand adder to sum multiple header fields together in various states of our design. Our HDL design attempted a "greedy" scheduling algorithm that tried to do as much work as possible per cycle of the state machine. By removing ChipScope on these operations and by floorplanning, we closed timing.

    The HDL design also used only one port of a 32-bit-wide block RAM that acted as our transmit packet buffer. We chose a 32-bit-wide memory because that's the native width of the BRAM primitive and it allowed for byte-enable write accesses that would avoid the need for read-modify-write access to the transmit buffer.

    Using byte enables, the finite state machine (FSM) writes directly to the header field bytes needing modification at a RAM address. However, what seemed like good design choices based on knowledge of the underlying Xilinx fabric and algorithm yielded a nonoptimal design that failed to meet timing without manual placement of the four-input adders.

    Because the UDP algorithms were already available in various forms in C code or written as pseudocode in IP-related RFC documentation, recoding the UDP packet engine in C was not a major task and proved to yield a better insight to the packet header processing. Just taking the pseudocode and starting to write Verilog may have made for quicker coding, but this methodology would have sacrificed performance without fully studying the data and control flows involved.

    ADVANTAGE AUTOESL

    The ability for AutoESL to abstract the FIFO and RAM interfaces proved to be one of the most beneficial optimizations for performance. With the ability to code directly in C, we could now easily include both ARP and DHCP routines into our packet engine. Figure 2 shows a flowchart of our design. Our HDL design utilized a byte-wide FIFO interface that connected to the aggregation and sensor interface of our design, which remained in Verilog. Also, our Verilog design utilized a 32-bit memory interface that collected 4 bytes of sample data and then saved it in the transmit buffer RAM as a 32-bit word.

    By means of its "array reshape" directive, AutoESL optimized the memory interface so that the transmit buffers, while written in C code as an 8-bit memory, became a 32-bit memory. This meant the C code could avoid having to do many bit manipulations of the header fields, as they would require bit shifting to place into a 32-bit word. It also alleviated little-endian vs. big-endian byte-ordering issues. This optimization reduced the latency of the TX offload function that computes the packet checksums and generates header fields from 17 clocks, as originally written in Verilog, to just seven clock cycles while easily meeting timing. AutoESL could do better in the future, since this current version does not have the ability to manipulate byte enables on RAM writes. Byte-enabled memory support is on the long-term road map for the tool.
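To make the idea concrete, the sketch below writes a UDP header into a byte-wide C array, which is how the transmit buffer is described above, and then models what one 32-bit word of the reshaped memory would hold. The function names are hypothetical, the tool-specific directive syntax is deliberately left out, and the mapping of bytes to bit positions within a word is an assumption made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HDR_BYTES 8   /* a UDP header is 8 bytes long */

/* Write the four 16-bit UDP header fields in network (big-endian) byte order.
 * The C code addresses the buffer one byte at a time; an array-reshape
 * directive applied to `buf` would let the tool pack four consecutive bytes
 * into each 32-bit RAM word. */
void udp_write_header(uint8_t buf[HDR_BYTES],
                      uint16_t src_port, uint16_t dst_port,
                      uint16_t length, uint16_t checksum)
{
    const uint16_t fields[4] = { src_port, dst_port, length, checksum };
    for (size_t i = 0; i < 4; i++) {
        buf[2 * i]     = (uint8_t)(fields[i] >> 8);     /* high byte first */
        buf[2 * i + 1] = (uint8_t)(fields[i] & 0xFFu);
    }
}

/* Model of one 32-bit word of the reshaped memory: bytes 4*w to 4*w+3,
 * with the lowest-addressed byte in the top bits (assumed mapping). */
uint32_t reshaped_word(const uint8_t *buf, size_t w)
{
    assert(buf != NULL);
    return ((uint32_t)buf[4 * w] << 24) | ((uint32_t)buf[4 * w + 1] << 16) |
           ((uint32_t)buf[4 * w + 2] << 8) | (uint32_t)buf[4 * w + 3];
}

/* Helper for experiments: build a header and return word `w` of it. */
uint32_t udp_header_word(uint16_t src_port, uint16_t dst_port,
                         uint16_t length, uint16_t checksum, size_t w)
{
    assert(w < HDR_BYTES / 4);
    uint8_t buf[HDR_BYTES];
    udp_write_header(buf, src_port, dst_port, length, checksum);
    return reshaped_word(buf, w);
}
```

With ports 0x1234 and 0x5678, the first word comes out as 0x12345678, so two header fields travel in a single 32-bit access even though the source code only ever touches bytes.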

    Another optimization that AutoESL performed, which we found by serendipity, was to access both ports of our memory, since Xilinx block RAM is inherently dual-port. Our Verilog design reserved the second port of the transmit buffer so that its interface to the TEMAC would be able to access the buffer without any need for arbitration. By allowing AutoESL to optimize for our true dual-port RAM, it was capable of performing reads or writes from two different locations of the buffer. In effect, this wound up halving the number of cycles necessary to generate the header. The reduction in latency was well worth the effort in creating a simple arbiter in Verilog for the second port of the memory so that the TEMAC interface could access the memory port that AutoESL usurped.


    We controlled the bit widths of the transmit buffer and the sample FIFO interfaces via directives. Unfortunately, AutoESL does not automatically optimize your design. Instead, you have to experiment with a variety of directives and determine through trial and error which of them is delivering an improvement. For our design, reducing the number of clock cycles to process the packet fields while operating at 125 MHz was the goal.

    The "array reshape" and "loop pipeline" directives were important for optimizing the design. The reshape directive alters the bit width of the RAM and FIFO interfaces, which ultimately led to processing multiple header fields in parallel per clock cycle and writeback to memory. The optimal combination that yielded the fewest cycles was a transmit buffer bit width of 32. The width of the FIFO feeding ADC samples was not a factor in reducing the overall latency because it's impossible to force samples to arrive any faster.

    The loop-pipelining directive is extremely important too, because it indicates to the compiler that our loops that push and pop from our FIFO interfaces can operate back-to-back. Otherwise, without the pipeline directive, AutoESL spent three to 20 clock cycles between pops of the FIFO due to scheduling reasons. It is therefore vital to utilize pipelining as much as possible to attain low latency when streaming data between memories.

    Xilinx block RAM also has a programmable data output latency of one to three clock cycles. Allowing three cycles of read latency enables the minimum clock-to-Q timing. Experimenting with different read latencies was only a matter of changing the "latency" directive for the RAM primitive or core resource. Because of the scheduling algorithms that AutoESL performed, adding a read latency of three cycles to access the RAM only tacked on one additional cycle of latency to the overall packet header generation. The extra cycle of memory latency allowed for more slack in the design, and that aided the place-and-route effort.

    We also implemented ARP and DHCP routines in our AutoESL design that we had avoided doing before because of the level of effort required to code them in Verilog. While not difficult, both ARP and DHCP are extremely cumbersome to write in Verilog and would require a great number of states to perform. For instance, the ARP request/response exchange required more than 70 states. One coding error in the Verilog FSM would likely require multiple days to undo. For this reason alone, many designers would prefer just to use a CPU to run these network routines.

    Overall, AutoESL excelled at generating a synthesizable netlist for the UDP packet engine. The module it generated fit between our two preexisting ADC and TEMAC interface modules and performed the necessary packet header generation and additional tasks. We were able to integrate the design it created into our core design and simulate it with Mentor Graphics' ModelSim to perform functional verification. With the streamlined design, we were able to reach timing closure with less synthesis, map and place-and-route effort than with our original HDL design. Yet we have significantly more functionality now, such as ARP and DHCP support.

    Comparing our original design in Verilog with our hybrid design that utilized AutoESL to craft our LAN MCU and TX Flow modules yielded impressive results. Table 1 shows a comparison of lookup table (LUT) usage. Our HDL version of TX Flow was smaller by more than 37 percent, but our AutoESL design incorporated more functionality. Most impressive is that AutoESL reduced the number of cycles to perform our packet header generation by 59 percent. Table 2 shows the latency of the TX Offload algorithm.

    The critical path of the HDL design was computing the UDP checksum. Comparing this with the AutoESL design shows that the HDL design suffered from 10 levels of logic and a total path delay of 6.4 nanoseconds, whereas AutoESL optimized this to only three levels of logic and a path delay of 3.5 ns. Our development time for the HDL design was about a month of effort. We took about the same amount of time with AutoESL, but incorporated more functionality while gaining familiarity with the nuances of the tool.

    Figure 2: Packet engine flowchart shows inclusion of ARP and DHCP. [Flowchart labels: RX Interrupt; Identify Packet; UDP Control; UDP DHCP; DHCP Exchange; ARP Request; ARP Response; Stream Control Instruction to Core; ADC Samples from Core; Prepare UDP Packet; Generate Checksums; Stream to TEMAC.]

    LATENCY AND THROUGHPUT

    AutoESL has a significant advantage over HDL design in that it performs control and data-flow analyses and can use this information to reorder operations to minimize latency and increase throughput. In our particular case, we used a greedy algorithm that tried to do too many arithmetic operations per clock cycle. This tool rescheduled our checksum calculations so as to use only two-input adders, but scheduled them in such a way to avoid increasing overall execution latency.
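The difference between a single multi-operand sum and a schedule built from two-input adders can be sketched in C. This is only an illustration of the kind of restructuring involved, not the schedule AutoESL actually produced; the function names are hypothetical, and 16-bit one's-complement addition is used because that is the arithmetic of the IP and UDP checksums.

```c
#include <assert.h>
#include <stdint.h>

/* One two-input, 16-bit one's-complement add with end-around carry. */
uint16_t oc_add(uint16_t a, uint16_t b)
{
    uint32_t s = (uint32_t)a + b;
    return (uint16_t)((s & 0xFFFFu) + (s >> 16));
}

/* "Greedy" form: all four operands summed in one step, then folded. */
uint16_t sum4_single(uint16_t a, uint16_t b, uint16_t c, uint16_t d)
{
    uint32_t s = (uint32_t)a + b + c + d;
    while (s >> 16) {
        s = (s & 0xFFFFu) + (s >> 16);
    }
    return (uint16_t)s;
}

/* Rescheduled form: the same sum as a tree of two-input adds. The two inner
 * adds do not depend on each other, so they can run in parallel. */
uint16_t sum4_tree(uint16_t a, uint16_t b, uint16_t c, uint16_t d)
{
    return oc_add(oc_add(a, b), oc_add(c, d));
}

/* Sanity check that both forms agree for one set of operands. */
void check_sum4(uint16_t a, uint16_t b, uint16_t c, uint16_t d)
{
    assert(sum4_single(a, b, c, d) == sum4_tree(a, b, c, d));
}
```

Both forms return the same value, but the tree keeps every adder at two inputs, which shortens the combinational path each one contributes.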

    Software compilers intrinsically perform these types of exercises. As state machines become more complex, the HDL designer is at a disadvantage compared to the omniscience of the compiler. An HDL designer would typically not have the opportunity to explore the effect of more than just two architectural choices because of time constraints to deliver a design, but this may be a vital task to deliver a low-power design.

    The most important benefit of this tool was its ability to try a variety of scenarios, which would be tedious in Verilog, such as changing bit widths of FIFOs and RAMs, partitioning a large RAM into smaller memories, reordering arithmetic operations and utilizing dual-port instead of single-port RAM. In an HDL design, each scenario would likely cost an additional day of writing code and then modifying the testbench to verify correct functionality. With AutoESL these changes took minutes, were seamless and did not entail any major modification of the source code.

    Modifying large state machines is extremely cumbersome in Verilog. The advent of tools like AutoESL is reminiscent of the days when processor designers began to employ microprogramming instead of hand-constructing the microcoded state machines of early microprocessors such as the 8086 and 68000. With the arrival of RISC architectures and hardware description languages, microprogramming is now mostly a lost art form, but its lesson is well learned in that abstraction is necessary to manage complexity. As microprogramming offered a higher layer of abstraction of state machine design, so too does AutoESL, or high-level synthesis tools in general. Tools of this caliber allow a designer to focus more on the algorithms themselves rather than the low-level implementation, which is error prone, difficult to modify and inflexible with future requirements.


    TX Flow Resource Usage

            HDL TX    AutoESL TX    % Increase
    LUTs    858       1,372         37.5

    Table 1: The AutoESL design used more lookup tables but incorporated more functionality.

    Latency

                    HDL    AutoESL    % Improved
    Clock Cycles    17     7          58.8%

    Table 2: AutoESL improved the latency of the TX Offload algorithm.
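The percentages in Tables 1 and 2 can be reproduced with simple arithmetic, sketched here as a small C helper (the function name is hypothetical). The 58.8 percent latency improvement is the 10-cycle saving relative to the 17-cycle HDL design. The 37.5 figure in Table 1 corresponds to the 514-LUT difference taken relative to the AutoESL count of 1,372, which matches the statement that the HDL version was smaller by more than 37 percent; taken relative to the HDL count of 858, the same difference works out to roughly 59.9 percent.

```c
#include <assert.h>

/* Difference between `a` and `b`, as a percentage of `reference`. */
double percent_diff(double a, double b, double reference)
{
    assert(reference != 0.0);
    return 100.0 * (a - b) / reference;
}
```

For example, `percent_diff(17, 7, 17)` gives the latency improvement, `percent_diff(1372, 858, 1372)` the LUT difference relative to the AutoESL design, and `percent_diff(1372, 858, 858)` the same difference relative to the HDL design.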


    XCELLENCE IN DISTRIBUTED COMPUTING

    by Frank Opitz, MSc
    Hamburg University of Applied Sciences
    Faculty of Engineering and Computer Science
    Department of Computer [email protected]

    Edris Sahak, BSc
    Hamburg University of Applied Sciences
    Faculty of Engineering and Computer Science
    Department of Computer [email protected]

    Bernd Schwarz, Prof. Dr.-Ing.
    Hamburg University of Applied Sciences
    Faculty of Engineering and Computer Science
    Department of Computer [email protected]

    An SoC network that uses Xilinx partial-reconfiguration technology offers cloud computing for algorithms under test with large stimulus data sets.

    Accelerating Distributed Computing with FPGAs


    Rather than install faster, more power-hungry supercomputers to tackle increasingly complex scientific algorithms, universities and private companies are applying distributed platforms upon which projects like SETI@home compute their data using thousands of personal computers. [1, 2] Current distributed computing networks typically use CPUs or GPUs to compute the project data.

    FPGAs, too, are being harnessed in projects like COPACOBANA, which employs 120 Xilinx FPGAs to crack DES-encrypted files using brute-force processing. [3] But in this case, the FPGAs are all collected in one place, an expensive proposition not appropriate for small university or company budgets. Currently FPGAs are not noted as a distributed computing utility because their use demands the involvement of a PC to continually reconfigure the whole FPGA with a new bitstream. But now, with the application of the Xilinx partial-reconfiguration technology, it's feasible to design FPGA-based clients for a distributed computing network.

    Our team at the Hamburg University of Applied Sciences created a prototype for such a client and implemented it in a single FPGA. We structured the design to consist of two sections: a static and a dynamic part. The static part loads at startup of the FPGA, while its implemented processor downloads the dynamic part from a network server. The dynamic part is the partial-reconfiguration region, which offers shared FPGA resources. [4] With this configuration, the FPGAs may be situated anywhere in the world, offering computing projects access to a high amount of computing power with a lower budget.

    DISTRIBUTED SOC NETWORK

    With their parallel signal-processing resources, FPGAs provide four times the data throughput of a microprocessor by using a clock that is eight times slower and with eight times lower power consumption. [5] To leverage this computational power for high-data-input rates, designers typically implement algorithms as a pipeline, like DES encryption. [3] We developed the distributed SoC network (DSN) prototype to increase the speed of such algorithms and to process large data sets using distributed FPGA resources. Our network design applies a client-broker-server architecture so that we can assign all registered system-on-chip (SoC) clients to every network participant's computational project (Figure 1). This would be impossible in a client-server architecture, which connects every SoC client to only one project.

    Furthermore, we chose this broker-server architecture to reduce the number of TCP/IP connections of each FPGA to just one. The DSN FPGAs compute the algorithms with dedicated data sets while the broker-server manages the SoC clients and the project clients. The broker schedules the connected SoC clients so that each project has nearly the same computing power at the same time, or uses time slices if there are fewer SoCs than projects with computational requests available.
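The article does not spell out the broker's scheduling algorithm, so the following C sketch is only one plausible reading of the policy just described, using a simple round-robin rule. The function name and the rotation scheme are assumptions: with at least as many SoC clients as projects, the clients are spread evenly; with fewer, the clients rotate through the projects from one time slice to the next.

```c
#include <assert.h>

/* Returns the project index served by SoC client `soc` during time slice
 * `slice`. Round-robin assignment is an assumption made for illustration. */
int project_for_soc(int soc, int n_socs, int n_projects, int slice)
{
    assert(n_socs > 0 && n_projects > 0);
    assert(soc >= 0 && soc < n_socs && slice >= 0);

    if (n_socs >= n_projects) {
        /* Enough clients: spread them evenly, so that the number of clients
         * per project differs by at most one. */
        return soc % n_projects;
    }
    /* Fewer clients than projects: rotate the clients through the projects,
     * advancing by n_socs projects per time slice. */
    return (int)(((long long)slice * n_socs + soc) % n_projects);
}
```

With two SoC clients and five projects, for instance, the clients serve projects 0 and 1 in the first slice, 2 and 3 in the second, and 4 and 0 in the third, so every project receives a turn.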

    The project client delivers the partial-reconfiguration module (PRM) and a set of stimulus input data. After connecting to the broker-server, the project client sends the PRM bit files to the server, which distributes them to SoC clients with a free partially reconfigurable region (PRR). The SoC client's static part, a MicroBlaze-based microcontroller, reconfigures the PRR dynamically with the received PRM. In the next step, the project client starts sending data sets and receives the computed response from the SoC client via the broker-server. Depending on the project client's intentions, it compares different computed sets or evaluates them for its computational aims, for example.

    Figure 1: Distributed SoC network with SoC clients provided by FPGAs and managed by a central broker-server. Project clients distribute the partial-reconfiguration modules and data sets. The dynamic part of an SoC client supplies resources via the PRR, and a microcontroller contained in the static part processes the reconfiguration. [Diagram labels: SoC Client 1 to SoC Client n, each with a static part and a dynamic part; network infrastructure with TCP/IP connections; broker-server computer with SoC 1 to SoC n and Project 1 to Project m; Project Client 1 to Project Client m, each with PRM and data.]

    THE SOC CLIENT

    We developed the SoC client for a Xilinx Virtex-6 FPGA, the XC6VLX240T, which comes with the ML605 evaluation board. A MicroBlaze processor runs the client's software, which manages partial reconfigurations along with bitstream and data exchanges (Figure 2). A Processor Local Bus (PLB) peripheral that encapsulates the PRR in its user logic is the interface between the static and the dynamic parts. In the dynamic part reside the shared FPGA resources for accelerator IP cores supplied by the received PRM. To store received and computed data, we chose DDR3 memory instead of CompactFlash because of its higher data throughput and the unlimited amount of write accesses. The PRM is stored in a dedicated data section to control its size and to avoid conflicts with other data sets. The section is set to 10 Mbytes, which is big enough to store a complete FPGA configuration. Thus, every PRM should fit in this section.

    We also created data sections for the received and the computed data sets. These are 50 Mbytes in size so as to ensure enough address space for images or encrypted text files, for example. Managing these data sections relies on an array of 10 administration structures; the latter contain the start and end addresses of each data set pair and a flag that indicates computed sets.

    To connect the static part to the PRR, we evaluated IP connections given by the Xilinx EDK, such as the Fast Simplex Link (FSL), PLB slave and PLB master. We chose a PLB master/slave combination to get an easy-to-configure IP that sends and receives data requests without the MicroBlaze's support, significantly reducing the number of clock cycles per word transfer.

    For the client-server communication, the FPGA's internal hard Ethernet IP is an essential peripheral of the processor system's static part. With the soft-direct-memory access (SDMA) of the local-link TEMAC to the memory controller, the data and bit file transfers produce less PLB load. After receiving a frame of 1,518 bytes, the SDMA generates an interrupt request, so that the lwip_read() function unblocks and can handle this piece of data. The lwip_write() function tells the SDMA to perform a DMA transfer over the TX channel to the TEMAC.

    Figure 2: The SoC client is a processor system with a static part and a bus master peripheral, which contains the partially reconfigurable region (PRR). Implemented with Virtex-6 FPGA XC6VLX240T on an ML605 board. [Block diagram labels: static part with MicroBlaze (100 MHz), MPMC (400 MHz), DDR3 SDRAM, LL_TEMAC, GMII, Ethernet PHY, SDMA, SysACE, CF card, interrupt controller, HW-ICAP, Xilkernel timer and PLBv46; dynamic part with PLB-master IP IF and PRR; signals IXCL, DXCL, TX Local Link, RX Local Link, IRQ, timer IRQ, TEMAC IRQ, SDMA Rx and Tx IRQ.]


    We implemented the Xilkernel, a kernel for Xilinx embedded processors, as the underlying real-time operating system of the SoC client's software in order to use the lightweight TCP/IP stack (LwIP) library in socket mode for the TCP/IP server connection. Figure 3 provides an overview of the client's thread initialization, creation, transmission and processing sequences. The SoC client thread initiates a connection to the server and receives a PRM bitstream ("pr"), which it stores in DDR3 memory using the XILMFS file system. Thereafter the Xps_hwicap (hardware internal configuration access point) reconfigures the PRR with the PRM. Finally, the bus master peripheral sets a status bit that instructs the SoC client to send a request to the server. The server responds with a data set ("dr"), which the SoC client stores in the onboard memory as well. These data files contain a content sequence such as "output_length+ol+data_to_compute". The output_length field gives the byte length used to reserve the memory range for the result data; it is followed by the character pair "ol". With the first received "dr" message, a compute thread and a send thread get created.
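The "dr" content sequence described above can be illustrated with a small parser. This is only a sketch in Python, not the client's C code: it assumes that output_length arrives as ASCII decimal digits, that the "+" signs in the description denote plain concatenation, and that the first occurrence of "ol" ends the length field. The function name parse_data_set is hypothetical.

```python
from typing import Tuple

MARKER = b"ol"  # character pair named in the text


def parse_data_set(message: bytes) -> Tuple[int, bytes]:
    """Split a 'dr' payload into (output_length, data_to_compute).

    Assumption (not confirmed by the article): output_length is sent as
    ASCII decimal digits, terminated by the first occurrence of b"ol".
    """
    head, sep, payload = message.partition(MARKER)
    if not sep:
        raise ValueError("marker 'ol' not found")
    if not head.isdigit():
        raise ValueError("output_length is not a decimal number")
    return int(head), payload


if __name__ == "__main__":
    length, data = parse_data_set(b"8ol\x01\x02\x03\x04")
    result_buffer = bytearray(length)  # memory range reserved for the result
    print(length, len(data), len(result_buffer))
```

Parsing the length first lets the receiver reserve the result range before any input data is stored, which is the purpose the text gives for the field.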

    The compute thread transfers the addresses of the input and result data sets to the slave interface of the PRR peripheral and starts the PRM's autonomous data set processing. An administration structure provides these addresses for each data set and contains a done flag, which is set after the result data is completely available. In the current version of the client's software concept, the compute and send threads communicate via this structure, with the send thread checking the done bit repeatedly and applying the lwip_write() calls to results stored in memory.
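The administration structures and the polling send thread can be modeled roughly as follows. This is a Python sketch of the idea only; the field names, the in_use flag and the choice to place the result range directly after the input range are illustrative assumptions, and the real client is C code running under the Xilkernel.

```python
from dataclasses import dataclass
from typing import List, Optional

TABLE_SIZE = 10  # the text mentions an array of 10 administration structures


@dataclass
class DataSetEntry:
    """One administration structure (field names are hypothetical)."""
    in_start: int = 0
    in_end: int = 0
    out_start: int = 0
    out_end: int = 0
    in_use: bool = False  # assumption: marks an occupied slot
    done: bool = False    # set once the result data is completely available


def new_table() -> List[DataSetEntry]:
    return [DataSetEntry() for _ in range(TABLE_SIZE)]


def register_data_set(table: List[DataSetEntry], in_start: int,
                      in_len: int, out_len: int) -> int:
    """Fill the first free entry; the result range follows the input range."""
    for index, entry in enumerate(table):
        if not entry.in_use:
            entry.in_start = in_start
            entry.in_end = in_start + in_len
            entry.out_start = entry.in_end
            entry.out_end = entry.out_start + out_len
            entry.in_use = True
            entry.done = False
            return index
    raise RuntimeError("all administration structures are in use")


def poll_done(table: List[DataSetEntry]) -> Optional[int]:
    """Send-thread side: index of the first finished data set, or None."""
    for index, entry in enumerate(table):
        if entry.in_use and entry.done:
            return index
    return None
```

In this model the compute side sets done, and the send side calls poll_done() repeatedly, transmits the bytes between out_start and out_end, and then clears in_use to free the slot.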

    When testing the SoC client, we determined that with all interrupts enabled while the reconfiguration of the PRR is in progress, this process gets stuck randomly after the Xilkernel's timer generates a scheduling call to the MicroBlaze. This didn't happen with all interrupts disabled or while using a standalone software module for the SoC client's MicroBlaze processor without the Xilkernel's support.

    Figure 3: The SoC client's software initialization and processing cycles include reconfiguration of the PRR with a PRM, data set retrieval from the server, start of processing and the data set's return to the server threads. Black bars indicate thread creation by sys_thread_new() calls from the Xilkernel library.

    BUS MASTER PERIPHERAL WITH PRM INSTANTIATION

    To achieve a self-controlled stimulus data and result exchange between the PRM and the external memory, we structured the bus master peripheral as a processor element with a data and a control path (Figure 4). Within the data path, we embedded the PRM interface between two FIFO blocks with a depth of 16 words each in order to compensate for communication and data transfer delays. Both FIFOs of the data path are connected directly to the PLB's bus master interface. In this way, we obtain a significant timing advantage from a straightforward data transfer operated by a finite state machine (FSM).

    No software is involved, so no intermediate data storage takes place in the MicroBlaze's register file. This RISC processor's load-store architecture always requires two bus transfer cycles for loading a CPU register from an address location and storing the register's content to another PLB participant. With the DXCL data cache link of the MicroBlaze to the memory controller as a bypass to the PLB, the timing of these load-store cycles would not improve. That's because the received data and the transmitted computing results are all handled once, word by word, without utilizing caching benefits. As a consequence, the PRR peripheral's activities are decoupled from the MicroBlaze's master software processing. Thus, the PRR data transfer causes no additional Xilkernel context switches. But there is still the competition of two masters for a bus access, which can't be avoided.

    The peripheral's slave interface contains four software-driven registers that provide the control path with the start and end addresses of the input and output data sets. A fifth software register supplies a start bit to the FSM, which initiates the master data transfer cycles. The status of a completed data-processing cycle is available to the client's software at the address of this fifth software register.

    Figure 4: Bus master peripheral operates as a processor element. The PRM interface includes the dynamic part with a component instantiation of the PRM.

    The state diagram of the control path's FSM makes clear the strategy of prioritizing write cycles to the PLB (Figure 5). Pulling data out of the OUT_FIFO takes precedence over filling the IN_FIFO, to prevent a full OUT_FIFO from stopping the PRM from processing the algorithm. Reading from and writing to the external memory occur in alternating sequences, because only one kind of bus access is available at a time. When a software reset from the client's compute thread starts the FSM (Figure 3), the first thing that happens is a read from the external memory (state READ_REQ). From then on, the bus master follows the decision logic given by the transition conditions from state STARTED (Table 1).

    Table 1: FSM control decisions in state STARTED with write priority

    | IN_FIFO    | OUT_FIFO  | Memory access | Next state |
    |------------|-----------|---------------|------------|
    | don't care | not empty | writing       | WRITE_REQ  |
    | not full   | empty     | reading       | READ_REQ   |
    | full       | empty     | none          | STARTED    |
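The write-priority decision of Table 1 can be written down as a small next-state function. The sketch below is a Python model for illustration, not the VHDL of the peripheral; the function name and the boolean arguments are hypothetical, and only the transitions out of state STARTED are covered.

```python
def next_state_from_started(in_fifo_full: bool, out_fifo_empty: bool) -> str:
    """Next state when the FSM is in STARTED, following Table 1.

    Emptying the OUT_FIFO is checked first, so a write request always
    wins over a read request (write priority).
    """
    if not out_fifo_empty:
        return "WRITE_REQ"   # results are waiting: write them to memory
    if not in_fifo_full:
        return "READ_REQ"    # room in the IN_FIFO: fetch more input data
    return "STARTED"         # nothing to write, no room to read: wait


if __name__ == "__main__":
    for full in (False, True):
        for empty in (False, True):
            print(full, empty, next_state_from_started(full, empty))
```

Because the OUT_FIFO test comes first, the state of the IN_FIFO is a "don't care" whenever result data is pending, which matches the first row of the table.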

    The FSM Mealy outputs (label "Exit/") prepare the address counters to increment when a bus transfer is completed. Here, the two counters are introduced directly into the FSM code. Usually we prefer timers and address counters as separate clocked processes enabled simply by FSM outputs, in order to keep the counters' transition logic small and free from unnecessary multiplexer inputs for counter state feedback. At this point, the XST synthesis compiler results present RTL schematics with a clear FSM extraction parallel to loadable counters, with clock-enable inputs driven by the expected state decoding logic. Despite a more readable behavioral VHDL coding style, the FPGA resources and simple primitives get utilized without a loss of features.

    DEFINING THE DYNAMIC PART WITH PLANAHEAD

    The design flow for configuring a static and a dynamic part within the FPGA is a complex development process that involves several steps with the physical-design constraints tool PlanAhead. Our first effort was a script-based design flow for a PetaLinux-driven dynamic-reconfiguration platform implemented on an ML505 board [6]. With the current iteration, the design steps for integrating a PRR directly into a peripheral's user logic are much more practical than the former method of adding bus macros and a device control register (DCR) as a PLB interface for the PRM and an extra PLB-DCR bridge for enabling the bus macros.

    Here is how we fixed the dynamic part's size and position with the AREA_GROUP constraints, which are included in the UCF file of the PlanAhead project, as shown in the code below.

    INST "dyn_interface_0/dyn_interface_0/USER_LOGIC_I/PRR" AREA_GROUP = "pblock_dyn_interface_0_USER_LOGIC_I_PRR";

    AREA_GROUP "pblock_dyn_interface_0_USER_LOGIC_I_PRR" RANGE=SLICE_X0Y0:SLICE_X57Y239;

    AREA_GROUP "pblock_dyn_interface_0_USER_LOGIC_I_PRR" RANGE=RAMB18_X0Y0:RAMB18_X3Y95;

    AREA_GROUP "pblock_dyn_interface_0_USER_LOGIC_I_PRR" RANGE=RAMB36_X0Y0:RAMB36_X3Y47;

    An instance name concatenation specifies the inner partial-reconfiguration region (prm_interface.vhd) with instance name PRR. For all FPGA resources we want to include in the desired PRR, we specify a rectangular region with its lower-left and upper-right coordinates.
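As a cross-check of the rectangles above against the resource counts in Table 2, the number of sites in each RANGE can be computed from its corner coordinates. The Python sketch below assumes that every coordinate inside a rectangle is an existing site, and it uses the Virtex-6 slice composition of four LUTs and eight flip-flops per slice; the helper name site_count is hypothetical.

```python
import re

# Matches e.g. "SLICE_X0Y0:SLICE_X57Y239" (same site type on both corners).
RANGE_RE = re.compile(r"^([A-Z]+[0-9]*)_X(\d+)Y(\d+):\1_X(\d+)Y(\d+)$")

LUTS_PER_SLICE = 4  # Virtex-6 slice: four 6-input LUTs
FFS_PER_SLICE = 8   # Virtex-6 slice: eight storage elements


def site_count(area_range: str) -> int:
    """Number of sites in a rectangular RANGE, corners included."""
    match = RANGE_RE.match(area_range.strip())
    if match is None:
        raise ValueError(f"unrecognized range: {area_range!r}")
    x0, y0, x1, y1 = (int(g) for g in match.groups()[1:])
    return (x1 - x0 + 1) * (y1 - y0 + 1)


if __name__ == "__main__":
    slices = site_count("SLICE_X0Y0:SLICE_X57Y239")
    print("slices:", slices)
    print("LUTs:", slices * LUTS_PER_SLICE)
    print("flip-flops:", slices * FFS_PER_SLICE)
    print("RAMB36:", site_count("RAMB36_X0Y0:RAMB36_X3Y47"))
```

The slice rectangle spans 58 x 240 = 13,920 sites, which equals the sum of the SLICEL and SLICEM rows of Table 2 (7,440 + 6,480) and leads to the listed 55,680 LUTs and 111,360 flip-flops; the RAMB36 rectangle gives 4 x 48 = 192 blocks.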

    This particular choice covers slices and BRAM only, because the available DSP elements belong to dedicated clock regions and are used for the Multiport Memory Controller (MPMC) implementation (Table 2).

    Table 2: Allocated resources for the dynamic part of the SoC client

    | Resource     | Amount  |
    |--------------|---------|
    | LUT          | 55,680  |
    | FD_LD        | 111,360 |
    | SLICEL       | 7,440   |
    | SLICEM       | 6,480   |
    | RAMBFIFO36E1 | 192     |

    Figure 5 (state diagram of the control path's FSM) labels: states IDLE, STARTED, READ_REQ, WRITE_REQ, WAIT_FOR_CMP and WAIT_FOR_WCMP; transition condition "IN_FIFO_full_n == 1 and Read_address != end_address_read_data"; exit action on Read_address.

    To prevent the PRM netlists that ISE generates from using excluded resources, we set the synthesis options to dsp_utilization_ratio = 0; use_dsp48 = false; iobuf = false. Finally, the FPGA Editor shows that the static part's placement is located in an area separated completely from the PRR, which in this particular case uses very few resources (Figure 6).

    AN SOC CLIENT WITH IMAGE-PROCESSING PRM

    We proved the SoC client's operation and its TCP/IP server communication with a Sobel/median filter combination implemented in a PRM (Figure 7). We developed the image-processing neighborhood operations with the Xilinx System Generator, which gave us the advantage of Simulink simulation and automatic RTL code generation. A deserializer converted the input pixel stream to a 3 x 3-pixel array, which moves like a mask over the whole image and provides the input to the filter's parallel sum of products or to the successive comparisons of the median filter [7]. Input and output pixel vectors of the filters have a width of 4 bits, so we inserted a PRM wrapper that multiplexes the eight nibbles of the 32-bit input vector from the synchronization FIFO. With a MATLAB script, we convert an 800 x 600 PNG image to 4-bit gray-scale pixels for the PRM input stimulus. At the filter's output, eight 4-bit registers are successively filled and concatenated for the word transfer to the OUT_FIFO (Figure 4).
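The nibble handling around the filters can be sketched in a few lines. The Python model below only illustrates the data format: the nibble order (least-significant nibble first) and the reduction from 8-bit to 4-bit gray values by dropping the lower four bits are assumptions, since the article does not state how the wrapper or the MATLAB script do it. All function names are hypothetical.

```python
from typing import List

NIBBLES_PER_WORD = 8  # eight 4-bit pixels in one 32-bit word


def to_4bit_gray(pixel8: int) -> int:
    """Reduce an 8-bit gray value to 4 bits (assumption: keep the top bits)."""
    return (pixel8 & 0xFF) >> 4


def unpack_nibbles(word: int) -> List[int]:
    """Split a 32-bit word into eight 4-bit pixels, low nibble first."""
    return [(word >> (4 * i)) & 0xF for i in range(NIBBLES_PER_WORD)]


def pack_nibbles(nibbles: List[int]) -> int:
    """Concatenate eight 4-bit pixels into one 32-bit word, low nibble first."""
    if len(nibbles) != NIBBLES_PER_WORD:
        raise ValueError("expected exactly eight nibbles")
    word = 0
    for i, nibble in enumerate(nibbles):
        word |= (nibble & 0xF) << (4 * i)
    return word


if __name__ == "__main__":
    words_per_image = 800 * 600 // NIBBLES_PER_WORD
    print("32-bit words per 800 x 600 image:", words_per_image)
    print([hex(n) for n in unpack_nibbles(0x76543210)])
```

With eight pixels per word, an 800 x 600 image occupies 60,000 words, which is the word count behind the bus transfer estimate given later in the article.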

    Table 3 summarizes the results of timing measurements performed with three operational steps of the SoC client: receiving a PRM bit file, reconfiguration of the PRR and image-processing sequences. We captured the receiving and image-processing cycles, from the first to the last data transfer, with a digital oscilloscope measurement at a GPIO output toggled by XGpio_WriteReg() calls.

    Table 3: Timing measurement results (interval durations); reconfiguration with disabled interrupts. Processor and peripheral clock rate fclk = 100 MHz.

    | Filter module | Slices | PRM receive (s) | Reconfiguration, 3.5-Mbyte bit file (s) | Image processing (ms) |
    |---------------|--------|-----------------|------------------------------------------|-----------------------|
    | Binarize      | 3      | 77              | 31.25                                    | 25.25                 |
    | Erosion 3x3   | 237    | 73              | 31.25                                    | 85.93                 |
    | Median 3x3    | 531    | 73              | 31.25                                    | 77.09                 |
    | Sobel 3x3     | 479    | 73              | 31.25                                    | 86.45                 |

    Figure 6: Resource placement of the static part (right side) and dynamic part (left side, with white oval) according to the area specification for the PRR

    The reconfiguration intervals all have the same duration, because no Xilkernel scheduling events disrupted the software-driven HWICAP operation. An FSM-controlled HWICAP operation without MicroBlaze interaction will yield a shorter duration with a reconfiguration speed of more than 112 kbytes/second, even with enabled interrupts.
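The 112-kbytes/second figure follows from the values in Table 3. The short Python calculation below reproduces it, assuming that 3.5 Mbytes means 3.5 x 10^6 bytes and that 1 kbyte means 1,000 bytes; the function name is hypothetical.

```python
def throughput_kbytes_per_s(size_bytes: float, seconds: float) -> float:
    """Average transfer rate in kbytes/s (1 kbyte taken as 1,000 bytes)."""
    return size_bytes / seconds / 1000.0


if __name__ == "__main__":
    BITFILE_BYTES = 3.5e6     # 3.5-Mbyte bit file (Table 3)
    RECONFIG_SECONDS = 31.25  # measured software-driven reconfiguration
    rate = throughput_kbytes_per_s(BITFILE_BYTES, RECONFIG_SECONDS)
    print(f"software-driven HWICAP: {rate:.0f} kbytes/s")
```

The measured, software-driven reconfiguration therefore runs at about 112 kbytes/second, which is the baseline an FSM-controlled HWICAP operation is expected to exceed.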

    During PRM transmission from the broker to the SoC client, the connection soon aborted. With a 1-millisecond delay inserted after every 100 transmitted bytes, the SoC client communicated without disturbance. Parallel to the image-processing cycles, normal Xilkernel threading caused PLB access competition and therefore, the SoC client operated under typical conditions. The binarize sequence has a duration of 600 x 800/100 MHz = 4.8 ms, because only a single comparison is active. This sequence is nested in two image transfers via the PLB, which take a minimum of five clocks per word, as extracted from a functional bus simulation: 2 x 5 x 600 x 800/(8 x 100 MHz) = 6 ms. Because all measured data transfer times are larger than rough estimates led us to expect, we are in the midst of a detailed analysis of the full timing chain built up by bus reading, FIFO filling and emptying, the image-processing pipeline and bus writing.
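The two rough estimates in this paragraph can be reproduced with the following Python sketch. The constants are taken from the text (100-MHz clock, 600 x 800 pixels, eight 4-bit pixels per 32-bit word, five clocks per word); the function names are hypothetical, and the assumption that one pixel is processed per clock is implied by the article's formula rather than stated outright.

```python
F_CLK_HZ = 100e6      # processor and peripheral clock rate
PIXELS = 600 * 800    # pixels per image
PIXELS_PER_WORD = 8   # eight 4-bit pixels in one 32-bit word
CLOCKS_PER_WORD = 5   # minimum PLB clocks per word (bus simulation)
TRANSFERS = 2         # image in and image out


def processing_ms() -> float:
    """Binarize estimate: one pixel per clock."""
    return PIXELS / F_CLK_HZ * 1e3


def transfer_ms() -> float:
    """Estimate for both image transfers over the PLB."""
    words = PIXELS / PIXELS_PER_WORD
    return TRANSFERS * CLOCKS_PER_WORD * words / F_CLK_HZ * 1e3


if __name__ == "__main__":
    measured_ms = 25.25  # Binarize row of Table 3
    estimate_ms = processing_ms() + transfer_ms()
    print(f"processing: {processing_ms():.1f} ms")
    print(f"transfers:  {transfer_ms():.1f} ms")
    print(f"measured / estimated: {measured_ms / estimate_ms:.1f}")
```

If the two contributions simply added up, the estimate would be about 10.8 ms, so the measured 25.25 ms for the binarize module is roughly 2.3 times larger, which is the kind of gap the authors set out to analyze.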


    POWER OF PARTIAL RECONFIGURATION

    To compute complex algorithms, it is profitable to employ the power of distributed computing networks. State-of-the-art implementations of these networks operate with CPUs and GPUs only. Our prototype of an FPGA-based distributed SoC network architecture utilizes the parallel signal-processing features of FPGAs to compute complex algorithms.

    The Xilinx partial-reconfiguration technology holds the key to utilizing shared FPGA resources all over the world. In our architecture, the static part of the SoC client reconfigures the dynamic part of the FPGA with updated accelerators in a self-controlled way. We have to improve the SoC client to run the HWICAP with enabled interrupts, so that it stays fully reactive. A step in that direction is an FSM-controlled reconfiguration, which puts no load on the processor. But we need to analyze the influence of PLB transfers and the MPMC bottleneck as well.

    To manage the SoC client, a Xilkernel linked with the LwIP supplies concurrency with threads for the reconfiguration drivers, the dynamic part's bus interface and other applications. We will concentrate further on timing analysis of the client-server system and the dynamic part's processing cycles in order to identify the software/RTL-model configuration with improved data throughput and reliable communication.

    For the next stage of our SoC client design, we have to take the AXI4 bus features into account. In general, PRM exchanges can be treated as additional hardware tasks operating in conjunction with a set of software tasks. Last but not least, we are still refining the server's software design to achieve improved user satisfaction.

    References

    1. Unofficial BOINC Wiki, "BoincFAQ: Introduction to boinc," http://www.boinc-wiki.info/

    2. Markus Tervooren, allprojectstats.com, http://www.allprojectstats.com/

    3. S. Kumar, J. Pelzl, G. Pfeiffer, M. Schimmler and C. Paar, "Breaking ciphers with COPACOBANA, a cost-optimized parallel code breaker, or how to break DES for 8,980 eur," http://www.copacobana.org/paper/CHES2006_copacobana_slides.pdf

    4. Frank Opitz, "Development of an FPGA-based distributed computing platform," Master's thesis, HAW Hamburg, 2011, http://opus.haw-hamburg.de/volltexte/2012/1450/pdf/Masterarbeit_Frank_Opitz.pdf

    5. Ivo Bolsens, "Programming Modern FPGAs," http://www.xilinx.com/univ/mpsoc2006keynote.pdf

    6. Armin Jeyrani Mamegani, "Implementation and evaluation of methods for partial and dynamic reconfiguration of SoC-FPGAs," Master's thesis, HAW Hamburg, 2010, http://opus.haw-hamburg.de/volltexte/2010/1083/pdf/MA_A_Jeyrani.pdf

    7. Edris Sahak, "Partial reconfiguration of an SoC-based image-processing pipeline," Bachelor's thesis, HAW Hamburg, 2011, http://opus.haw-hamburg.de/volltexte/2011/1420/pdf/BA_E_Sahak.pdf

    Figure 7: PRM processing results for edge detection. Gray-scale input stimulus image to the PRM is shown at left, while the response from a PRM with a Sobel/median filter combination is seen at right.



    XCELLENCE IN EDUCATION

    FPGAs Enable Flexible Platform for High School Robotics

    A Xilinx Spartan FPGA forms the basis for a powerful teaching tool that's able to evolve, reinventing itself according to student needs.

    by Giulio Vitale
    Professor and FPGA Design Consultant
    ITCS Erasmo da Rotterdam, Bollate, Italy

    There's probably no better vehicle than a robot to get high school and middle school students hooked on science and technology. For pupils in grades 8-12, tinkering with a robot is a hands-on way to grasp new ideas and to see technology in action. Building a robot can provide a powerful motivation for students to overcome intellectual challenges and achieve levels of excellence in scientific and technical subjects.

    To that end, I would like to present an educational platform that teachers can use to support educational robotics activities in technical high schools. The idea for this platform was born at the high school where I teach, ITCS Erasmo da Rotterdam, near Milan, Italy, in the context of building an entry for the RoboCup Junior competition. Our group participated in the "rescue robot" category; that is, a machine able to identify victims within a re-created disaster scenario. These robots must accomplish tasks varying in complexity from walking a line on a flat surface up to negotiating paths through obstacles on uneven terrain and saving some well-defined victims.

    The fundamental characteristics of our FPGA-based design, compared with the widespread platforms based on different kinds of microcomputers, are openness, flexibility, capability to evolve and reusability.


    We built the 2011 version of the robot on a Xilinx Spartan-3E device, using the Digilent Nexys2 educational board. As of this writing, we are porting this version, which will compete in the next RoboCup Junior Italy in April 2012, to a Spartan-6 FPGA. The next, 2013 version is scheduled to run on a Zynq-7000 Extensible Processing Platform device.

    Working within this paradigm, we were able to design a rescue robot that evolved, year after year, from the first prototype to the current version, named Nessie 2011 (a pun on both Nexys and the Loch Ness monster, whose long neck is reminiscent of our robot's). The flexibility of the FPGA allowed a complete remodeling of the robot's architecture, following the progression of the students' knowledge, while leaving its basic physical structure substantially unaltered and maintaining the same design infrastructure.

    THE CHALLENGE

    The ITCS Erasmo da Rotterdam is a technical high school located in a suburb of Milan in northern Italy, with a student population that is strongly heterogeneous and, often, not so easy to coax into deepening their scientific and technological knowledge.

    Four years ago, I decided to set up an open space, which I named the Permanent Laboratory for Didactic Robotics, where students can experiment in a different way from the standard classroom. Here, they approach the technical disciplines in an interactive environment in which pupils can negotiate some unexpected aspects of the subject matter, make choices regarding the subjects they will pursue, organize their own jobs and receive direct feedback from the results of their actions. In other words, they get to experiment in an active learning space based on the old and well-known model of situated cognition [1], where students collaborate with one another and with their instructor while pursuing a common goal with a shared understanding.

    In this learning space, pupils can practice a cognitive apprenticeship that uses problem-solving methodologies. The teacher assumes the role of a professional expert who proposes specific processes involving authentic tasks and strategies, and allows students to try them independently, coaching only as needed.

    Robotics was the natural choice to provide a fertile breeding ground for the convergence of different disciplines and the exchange of knowledge. We decided that the fun of taking part in the RoboCup Junior competition would provide a strong incentive for student participation.

    THE SOLUTION

    I understood that to be effective, I would have to propose topics normally covered in regular classes on digital electronics and informatics, but aimed at more complex applications than those the students would be able to solve alone. Instead, they would need to work in groups or with the support of an experienced teacher who could propose appropriate models.

    I knew what to build but I did not know how to build it. Everything had to be born and developed in the laboratory, with students discussing the design and seeking solutions together.

    After some deliberation, I came to the conclusion that the most promising solution was one based on a flexible platform, such as an FPGA, rather than on standard microcomputers. That's because an FPGA was the only device able to provide the required characteristics, and to keep pace with the dynamic and evolutionary scope of the laboratory activities.

    I chose, initially, to use an educational card based on the Spartan-3E because it could provide the characteristics we were seeking, namely openness, flexibility, ability to evolve, reusability of the hardware and richness of performance.

    Openness, because students must actively participate in the entire design flow, from the sensor interface to the CPU and from this to the actuators.

    Flexibility, because the complete architecture of the system and the nature and type of devices should not be fixed in advance, but must emerge from the research process activated within a creative context for learning.

    Ability to evolve, because after each RoboCup competition, students must learn the shortcomings of their work and know how to make the appropriate modifications to try to reach more advanced solutions. The system must grow in parallel with the students' expertise.

    Reusability, in order to avoid unnecessary waste of the hardware and the school budget.

    High performance at an affordable cost. We had to control a large number of devices and peripherals that were not fully defined, but needed to operate with a high degree of parallelism. The CPU should be very powerful but relatively simple in its architecture and easy to interface.

    Figure 1: Thanks to FPGA flexibility, the same platform can evolve in parallel with student understanding, reusing the same hardware and redesigning as needed. Photos show Nessie's evolution from 2008 through 2011.

    NESSIE 2012: BLOCK DIAGRAM AND DESCRIPTION

    We achieved our goal using a Digilent card carrying a Spartan-3E 1200 onboard, which was the common thread of the four-year project development. As you can see in Figure 1, the rescue robot the students designed showed a clear evolution from a machine that barely moved in 2008 to one that, in 2011, made us one of just 15 teams, out of 65 participants, to reach the finals.

    The level of student expertise has grown from year to year, laying a foundation for further improvements that we are planning for this year's RoboCup Junior competition in April.

    First, we have moved from the Spartan-3E to the Spartan-6 family, converting the bus infrastructure of the standard Processor Local Bus to the AXI4 interface. Second, we h