Click here to load reader

Xcell Journal issue 84

  • View
    257

  • Download
    10

Embed Size (px)

DESCRIPTION

The summer 2013 edition of Xcell Journal includes a cover story that examines Xilinx new innovative UltraScale architecture, which Xilinx will deploy in its 20nm planar and 16nm FinFET All Programmable devices. The issue also includes a variety of fantastic methodology and how-to articles for Xilinx users of all skill levels.

Text of Xcell Journal issue 84

  • www.xilinx.com/xcell/

    S O L U T I O N S F O R A P R O G R A M M A B L E W O R L D

    Xcell journalXcell journalI S SUE 84 , TH IRD QUAR TER 2013 Efficient Bitcoin Miner SystemImplemented on Zynq SoC

    Benchmark: Vivados ESLCapabilities Speed IP Design on Zynq SoC

    How to Boost Zynq SoCPerformance by Creating Your Own Peripheral

    Xilinx GoesUltraScale at 20 nm and FinFET

    XRadio: A Rockin FPGA-only Radio for Teaching System Design

    page28

  • Unlock Your Systems True

    Potential

    Xilinx Zynq-7000 AP SoC Mini-Module Plus System

    Avnet, Inc. 2013. All rights reserved. AVNET is a registered trademark of Avnet, Inc.Xilinx and Zynq are trademarks or registered trademarks of Xilinx, Inc.

    The Xilinx Zynq-7000 All Programmable SoC Mini-Module Plus designed by Avnet, is a

    small system-on-module (SOM) that integrates all the necessary functions and interfaces

    for a Zynq-7000 AP SoC system. For development purposes, the module can be combined

    with the Mini-Module Plus Baseboard II and a Mini-Module Plus Power Supply to provide an

    out-of-box development system ready for prototyping. The modular concept also allows the

    same Zynq-7000 AP SoC to be placed on a user-created baseboard, simplifying the users

    baseboard design and reducing the development risk.

    www.em.avnet.com/MMP-7Z045-G

  • L E T T E R F R O M T H E P U B L I S H E R

    Xilinx, Inc.2100 Logic DriveSan Jose, CA 95124-3400Phone: 408-559-7778FAX: 408-879-4780www.xilinx.com/xcell/

    2013 Xilinx, Inc. All rights reserved. XILINX, the Xilinx Logo, and other designated brands includedherein are trademarks of Xilinx, Inc. All other trade-marks are the property of their respective owners.

    The articles, information, and other materials includedin this issue are provided solely for the convenience ofour readers. Xilinx makes no warranties, express,implied, statutory, or otherwise, and accepts no liabilitywith respect to any such articles, information, or othermaterials or their use, and any use thereof is solely atthe risk of the user. Any person or entity using suchinformation in any way releases and waives any claim itmight have against Xilinx for any loss, damage, orexpense caused thereby.

    PUBLISHER Mike [email protected]

    EDITOR Jacqueline Damian

    ART DIRECTOR Scott Blair

    DESIGN/PRODUCTION Teie, Gelwicks & Associates1-800-493-5551

    ADVERTISING SALES Dan [email protected]

    INTERNATIONAL Melissa Zhang, Asia [email protected]

    Christelle Moraga, Europe/Middle East/[email protected]

    Tomoko Suto, [email protected]

    REPRINT ORDERS 1-800-493-5551

    Xcell journal

    www.xilinx.com/xcell/

    Xilinx UltraScale Raises Value of All Programmable

    If youve been keeping up with developments in the road map for advanced semi-conductor process technologies, youre probably aware that the cost of producingsilicon is accelerating. Some have asserted that the cost benefit afforded by MooresLawnamely, that the cost stays the same even though the transistor count doublesis running out of gas as processes shrink below 28 nanometers. Indeed, for the 20-nm planarand 14/16-nm FinFET generation, industry analysts foresee a tandem trend of rising NRE costsand spiraling price tags for each transistor at each new node.

    Analysts say that to make a decent profit on a chip implemented at 20 nm, you will firstneed to clear an estimated lowball cost of $150 million for designing and manufacturing thedevice. And to hit that lowball mark, youd better get it right the first time, because maskcosts are approaching $5 million to $8 million per mask.

    FinFET looks even more challenging. Analysts predict that the average cost for the16/14-nm FinFET node will be around $325 millionmore than double the cost of 20-nmmanufacturing! Why so high? Analysts believe companies will encounter excessiveretooling, methodology and IP-requalification costs in switching from planar-transistorprocesses to FinFETs. Tools and methods will have to deal with new complexities in signalintegrity, electromigration, width quantization, and resistance and capacitance. The combi-nation of new transistor, lithography and manufacturing techniques and toolsalong withrevamped and more expensive IPrepresents a perfect storm for design teams.

    To say the least, its hard to justify targeting 16/14-nm FinFET unless the end productwill sell in multiple tens of millions of units and have returns counted in billions rather thanmillions of dollars. As such, eliminating the NRE costs for customers designing on theseadvanced nodes is a growing sales advantage for all vendors playing in the FPGA space. Butif you have been following Xilinxs progress for the last five years, youve seen that the com-pany has clearly demonstrated technical superiority and a unique value proposition. TodaysXilinx All Programmable devices offer unmatchable system functionality with highcapacity and performance, lower power and reduced BOM costs, paired with more flexi-bility and greater programmability than any other silicon device on the market.

    Xilinx executed several breakout moves at the 28-nm node, becoming the first to marketwith 28-nm FPGAs, 3D ICs and Zynq-7000 All Programmable SoCs. At the same time, thecompany introduced the Vivado Design Suite, incorporating state-of-the-art EDA tech-nologies. There are more breakthroughs in store at the 20-nm and 16-nm FinFET nodes.In July, Xilinx once again became the first merechant chip company to tape out an IC ata new manufacturing node: 20 nm. These new devices will include numerous architec-tural innovations in Xilinxs new UltraScale architecture.

    As you will read in the cover story, the UltraScale architecture will deliver many benefitsfor customers, perhaps the most noteworthy of which is a vast increase in system usagebetter than 90 percent. Its worth noting that the power of the UltraScale architecture couldnot have been fully realized without Vivado. The tool suite has greatly boosted productivityfor Xilinx customers using the 28-nm devices. Now, this sophisticated EDA technology ishelping Xilinx unleash the benefits of 20-nm and FinFET silicon. Xilinx will deliver thatunmatchable advantage to customerswith no NRE charges.

    Mike SantariniPublisher

  • C O N T E N T S

    VIEWPOINTS XCELLENCE BY DESIGN APPLICATION FEATURES

    Cover StoryXilinx 20-nm Planar and 16-nm FinFET Go UltraScale

    88

    Xcellence in Networking

    Efficient Bitcoin Miner System Implemented on Zynq SoC 16

    Xcellence in Aerospace & Defense

    Ensuring FPGA Reconfiguration in Space23

    Letter From the PublisherXilinx UltraScale Raises

    Value of All Programmable 4

    16162323

  • T H I R D Q U A R T E R 2 0 1 3 , I S S U E 8 4

    THE XILINX XPERIENCE FEATURES

    Excellence in Magazine & Journal Writing2010, 2011

    Excellence in Magazine & Journal Design and Layout2010, 2011, 2012

    Tools of Xcellence Designing a Low-Latency H.264

    System Using the Zynq SoC 54

    Innovative Power Design Maximizes

    Virtex-7 Transceiver Performance 60

    Xtra, Xtra The latest Xilinx tool updates and patches, as of July 2013 64

    Xclamations! Share your wit and wisdom by supplyinga caption for our wild and wacky artwork 66

    XTRA READING

    2828

    Xperiment

    XRadio: A Rockin FPGA-only Radio for Teaching System Design 28

    Xperts Corner

    Benchmark: Vivados ESL Capabilities Speed IP Design on Zynq SoC 34

    Xplanation: FPGA 101

    How to Boost Zynq Performance by Creating Your Own Peripheral 42

    Xplanation: FPGA 101

    JESD204B: Critical Link in the FPGA-Converter Chain 48

    3434

    4848

  • COVER STORY

    8 Xcell Journal Third Quarter 2013

    Xilinx 20-nm Planar and 16-nmFinFET Go UltraScale by Mike SantariniPublisher, Xcell JournalXilinx, Inc. [email protected]

  • Third Quarter 2013 Xcell Journal 9

    C O V E R S T O R Y

    Building on its breakout move at 28 nanome-ters, Xilinx has announced two industry firstsat the 20-nm node. Xilinx is the first merchantchip company to tape out a 20-nm device.Whats more, the new device will be the firstXilinx will bring to market using its UltraScale

    technologythe programmable industrys firstASIC-class architecture. The UltraScale archi-tecture takes full advantage of the state-of-the-art EDA technologies in the Vivado DesignSuite, enabling customers to quickly create anew generation of All Programmable innova-tions. The move continues the Generation

    Ahead advantage over competitors thatbegan at 28 nm, when Xilinx deliv-ered two first-of-their-kind devices:the Zynq-7000 All ProgrammableSoC and Virtex-7 3D ICs.

    Xilinx was the first company todeliver 28-nm devices and at 20 nm,

    we have maintained the industrys mostaggressive tapeout schedule, said Steve

    Glaser, senior vice president of marketingand corporate strategy at Xilinx. Our hardwork is paying off yet again, and we areclearly a year ahead of our competition indelivering high-end devices and half a yearahead delivering midrange devices.

    But as Xilinx did at 28 nm, the companyis also adding some industry-first innova-tions to its new portfolio. Among the firstthe company is revealing is the newUltraScale architecture, which it will beimplementing at 20-nm, 16-nm FinFET and

    Xilinxs industry-first 20-nmtapeout signals the dawning ofan UltraScale-era, ASIC-classprogrammable architecture. B

  • beyond. Its the first programmablearchitecture to enable users to imple-ment ASIC-class designs using AllProgrammable devices, said Glaser.The UltraScale architecture enablesXilinx to deliver 20-nm and FinFET16-nm All Programmable deviceswith massive I/O and memory band-width, the fastest packet processing,the fastest DSP processing, ASIC-likeclocking, power management andmultilevel security.

    AN ARCHITECTURAL ADVANTAGEThe UltraScale architecture includeshundreds of structural enhancements,many of which Xilinx would havebeen unable to fully implement had itnot already developed its Vivado

    Design Suite, which brings leading-edge EDA tool capabilities to Xilinxcustomers. For example, Vivadosadvanced placement-and-routing capa-bilities enable users to take full advan-tage of UltraScales massive data-pro-cessing abilities, so that design teamswill be able to achieve better than 90percent utilization in UltraScaledevices while also hitting ambitiousperformance and power targets. Thisutilization is far beyond whats possiblewith competing devices, which todayrequire users to trade off performancefor utilization. Customers are thusforced to move to that vendors biggerand more expensive device, only tofind they have to slow down the clockyet again to meet system power goals.

    This problem didnt plague Xilinx atthe 28-nm process node thanks toXilinxs routing architecture. Nor will itbe an issue at 20 nm, because theUltraScale architecture can implementmassive data flow for wide buses tohelp customers develop systems withmultiterabit throughput and with mini-mal latency. The UltraScale architec-ture co-optimized with Vivado resultsin highly optimized critical paths withbuilt-in, high-speed memory cascadingto remove bottlenecks in DSP andpacket processing. Enhanced DSP sub-systems combine critical-path opti-mization with new 27x18-bit multipli-ers and dual adders that enable a mas-sive jump in fixed- and IEEE-754 float-ing-point arithmetic performance and

    10 Xcell Journal Third Quarter 2013

    C O V E R S T O R Y

    Figure 1 The UltraScale architecture enablesmassive I/O and memory bandwidth, fastpacket processing and DSP processing, alongwith ASIC-like clocking, power managementand multilevel security.

    Monolithic to 3D IC

    Planar to FinFET

    ASIC-class performance

    UltraSCALEArchitecture

  • efficiency. The wide-memory imple-mentation also applies to UltraScale3D ICs, which will see a step-functionimprovement in interdie bandwidthfor even higher overall performance.

    UltraScale devices feature furtherI/O and memory-bandwidth enhance-ments, including support for next-gen-eration memory interfacing that offersa dramatic reduction in latency andoptimized I/O performance. The archi-tecture offers multiple hardened, ASIC-class IP cores, including 10/100GEthernet, Interlaken and PCIe.

    The UltraScale architecture alsoenables multiregion clockinga fea-ture typically found only in ASIC-classdevices. Multiregion clocking allowsdesigners to build high-performance,low-power clock networks withextremely low clock skew in their sys-tems. The co-optimization of theUltraScale architecture and Vivadoalso allows design teams to employ awider variety of power-gating tech-niques across a wider range of func-tional elements in Xilinxs 20-nm AllProgrammable devices to achieve fur-ther power savings in their designs.

    Last but not least, UltraScale sup-ports advanced approaches to AES bit-stream decryption and authentication,key obfuscation and secure deviceprogramming to deliver best-in-classsystem security.

    TAILORED FOR KEY MARKETSGlaser said that UltraScale deviceswill enable design teams to achieveeven greater levels of system integra-tion while maximizing overall systemperformance, lowering total systempower and reducing the overall BOMcost of their systems. Building onwhat we did at 28 nm, Xilinx at the20-nm and FinFET 16/14 nodes is rais-

    ing the value proposition of our AllProgrammable technology far beyondthe days when FPGAs were consideredmerely a nice alternative to logicdevices, he said. Our unique systemvalue is abundantly clear across abroad number of applications.

    Glaser said devices using Xilinxs 20-nm and FinFET 16-nm UltraScalearchitecture squarely address a num-ber of growth market segments, suchas optical transport networking (OTN),high-performance computing in thenetwork, digital video and wirelesscommunications (see Figure 2). All ofthese sectors must address increasingrequirements for performance, cost,lower power and massive integration.

    ULTRASCALE TO SMARTER OTN NETWORKSThe OTN market is currently undergo-ing a change to smarter networks, asdata traffic dictates a massive buildoutto create ever-more-advanced wired

    networking equipment that can movedata from 100 Gbps today to 400 Gbpssoon and, tomorrow, to 1 Tbps. Whilethe traditional business approachwould be to simply throw massiveamounts of newer and faster equipmentat the problem, today network opera-tors are seeking smarter ways to buildout their networks and to improve thenetworks they have installed while find-ing better ways to manage costsall tovastly improve service and profitability.

    For example, the data center marketis looking to software-defined network-ing (SDN) as a way to optimize networkutilization and cut costs. SDN will allownetwork managers to leverage thecloud and virtual networks to providedata access to any type of Internet-con-nected device. The hardware to enablethese software-defined networks willhave to be extremely flexible, reliableand high performance. Greater pro-

    Third Quarter 2013 Xcell Journal 11

    C O V E R S T O R Y

    100 Gbps

    400 Gbps

    1 Tbps

    1080P

    4K/2K

    8K

    3G

    LTE

    LTE-A

    OTNNetworking

    Digital Video

    WirelessCommunications

    SmarterApplications

    Requirements

    Massive PacketProcessing

    >400 Gbps wire-speed

    Massive Data Flow>5 Tbps

    Massive I/O & Memory BW

    >5 Tbps

    Massive DSPPerformance

    >7 TMACs

    UltraScaleArchitecture

    Figure 2 The UltraScale architecture will speed the development of multiple types of next-generation systems, especially in wired and wireless communications.

    Typically found only in ASIC-class devices, multiregion clocking allows designers to build high-performance, low-power clocknetworks with extremely low clock skew in their systems.

    (Continued on page 14)

  • C O V E R S T O R Y

    12 Xcell Journal Third Quarter 2013

    by Mike Santarini

    If youve been keeping up with the latest news in semi-conductor process technology, youve likely beenreading that the worlds most advanced foundries willmanufacture devices using new transistor structurescalled FinFETs as the fundamental building blocks forchips theyll produce at whats collectively called the 16/14-nanometer node. But you may be wondering just whatFinFETs are, how they differ from standard transistors,and what benefits and challenges they bring.

    One of the key inventions in semiconductor manufac-turing that has enabled todays $292 billion semiconduc-tor business was Jean Hoernis creation of the planarprocess at Fairchild Semiconductor back in the 1950s.The planar process makes it possible to implement ever-smaller transistors, arrange them into a vast array of cir-cuits and interconnect them efficiently on a horizontalplane, rather than having them stick up, as discrete com-ponents do on printed-circuit boards. The planar processhas enabled the extreme miniaturization of ICs. With theplanar process, semiconductors are built in layers on topof, and etched into, ultrapure silicon wafers. Figure 1ashows a planar transistor (essentially an elaborate on/offswitch) composed of three main features: source, gateand drain, which have been etched and layered into a sil-icon substrate. Figure 1b shows a FinFET. Notice thathere, the gate surrounds the channel on three sides,whereas in the planar transistor the gate only covers thetop of the channel.

    The source is where electrons enter the transistor.Depending on the voltage at the gate, the transistors gatewill be either opened or closed in the channel under thegate, similar to the way a light switch can be switched onor off. If the gate allows the signal through the channel, thedrain will move the electrons to the next transistor in thecircuit. The ideal transistor allows lots of current throughwhen it is on and as little current as possible when off,and can switch between on and off billions of times persecond. The speed of the switching is the fundamentalparameter that determines the performance of an IC. Chipdesign companies arrange and group transistors into anarray of circuits, which in turn are arranged and groupedinto functional blocks (such as processors, memory andlogic blocks). Then, these blocks too are arranged andgrouped to create the marvelous ICs in the plethora ofelectronic marvels we benefit from today.

    Since the 1960s, the semiconductor industry has intro-duced a myriad of silicon innovations to keep pace withMoores Law, which states that the number of transistors in anIC will double every two years. This doubling of transistorsmeans that todays leading-edge ICs contain billions of tran-sistors, crammed into essentially the same die size as 1960schips but running exponentially faster with exponentiallylower power, at 1.2 volts or below. Today, a transistors fea-tures are so small that some are only dozens of atoms wide.

    KEEPING UP WITH MOOREThe industry is always struggling to keep pace with MooresLaw in the face of what appear to be true physical limita-tions to further advancement in semiconductors. Over thelast 10 years, process specialists have striven to ensure theelectrical integrity of the planar transistor. Simply put, theplanar transistors source, gate and drain features were sosmall and performance demands so high that these transis-tors could not adequately control electrons moving throughthem. The electrons would essentially leak or get pulledthrough by the drain even if a device was turned off.

    For battery-powered devices such as mobile phones, thismeans the battery would drain charge faster, even if thephone was turned off. For AC-powered devices (those thatplug into a wall socket), it means they waste power and runhotter. This heat, if profound and unmanaged by tricks ofcooling, can lessen the lifetime of a product, because leakagecreates heat and heat increases leakage. Of course, the cool-ing adds extra cost to products, too. The leakage problembecame strikingly noticeable in semiconductors at the 130-nm node. It was in this era of extreme leakage that Microsofthad to recall its Xbox 360 worldwide due to overheating;leakage is one of the reasons why microprocessor companieshad to move to less efficient multicore processors over fasterbut much hotter single-processor architectures.

    After 130 nm, several refinements came along to reduceheat and improve planar transistors on all fronts to keep upwith Moores Law. At 90 nm, the industry moved to low-k-dielectric insulators to improve switching, while the 65- and40-nm nodes introduced further refinements through stressedsilicon. At 28 nm, the industry moved to high-k gate dielectricsand metal gates. Xilinxs use of the TSMC 28-nm HPL (high-performance, low-power) process provided the ideal mix ofperformance and power, giving Xilinx a Generation Aheadadvantage over competitors at that geometry.

    The planar transistor process will be viable at the 20-nmnode thanks to a high-k gate dielectric, metal gates and a

    THANKS, MR. HOERNI, BUT THE FIN IS INDawning of a new era in transistor technology benefits Xilinx customers.

  • Third Quarter 2013 Xcell Journal 13

    C O V E R S T O R Y

    series of more elaborate manufacturing steps such as dou-ble patterning. The 20-nm node provides phenomenal ben-efits in terms of power as well as performance, but the costis increasing marginally because of elaborate manufactur-ing to ensure silicon integrity.

    Thanks in large part to remarkable research started byCal Berkeley professor Chenming Hu under a DARPA con-tract, the 20-nm process will likely be the last hurrah for theplanar transistor (at least as we know it today), as theindustry moves to FETs built with fins.

    INS AND OUTS OF FINSIn a planar transistor of today, electrical current flows fromsource to drain through a flat, 2D horizontal channel under-neath the gate. The gate voltage controls current flow through

    the channel. As transistorsize shrinks with the intro-duction of each new siliconprocess, the planar transistor cannotadequately stop the flow of currentwhen it is in an off state, which results inleakage and heat. In a FinFET MOSFET transis-tor, the gate wraps around the channel on threesides, giving the gate much better electrostatic control to stopthe current when the transistor is in the off state. Superiorgate control in turn allows designers to increase the currentand switching speed and, thus, the performance of the IC.Because the gate wraps around three sides of the fin-shapedchannel, the FinFET is often called a 3D transistor (not to beconfused with 3D ICs, like the Virtex-7 2000T, which Xilinxpioneered with its stacked-silicon technology).

    In a three-dimensional transistor (see Figure 1b), gate con-trol of the channel is on three sides rather than just one, as in

    conventional two-dimensional planar transistors (see Figure1a). Even better channel control can be achieved with a thin-ner fin, or in the future with a gate-all-around structurewhere the channel will be enclosed by a gate on all sides.

    The industry believes the 16-nm/14-nm FinFET processwill enable a 50 percent performance increase at the samepower as a device built at 28 nm. Alternatively, the FinFETdevice will consume 50 percent less power at the same per-formance. The performance-per-watt benefits added to thecontinued increases in capacity make FinFET processesextremely promising for devices at 16 or 14 nm and beyond.

    That said, the cost and complexity of designing andmanufacturing 3D transistors is going to be higher at leastfor the short term, as EDA companies figure out ways toadequately model the device characteristics of these new

    processes and augment their tools and flows toaccount for signal integrity, electromigration,width quantization, resistance and capaci-

    tance. This complexity makes designing ASICs and ASSPseven riskier and more expensive than before.

    Xilinx, however, shields users from the manufacturingdetails. Customers can reap the benefits of increased per-formance per watt and Xilinxs Generation Ahead designflows to bring innovations based on the new UltraScalearchitecture to market faster.

    Figure 1 The position of the gatediffers in the two-dimensional traditional planar transistor (a) vs. the three-dimensional FinFETtransistor (b).

    (a)

    (b)

  • grammability coupled more tightlywith processing will be mandatory.

    While network manufacturing com-panies develop next-generation soft-ware-defined networks, their cus-tomers are constantly looking to pro-long the life of existing equipment.Companies have implemented single-chip CFP2 optical modules in OTNswitches with Xilinx 28-nm Virtex-7F580T 3D ICs to exponentiallyincrease network bandwidth (seecover story, Xcell Journal issue 80,http://issuu.com/xcelljournal/docs/xcell80/8?e=2232228/2002872).With UltraScale, Xilinx is introducingdevices that enable further integra-tion and innovation, allowing compa-nies to create systems capable ofdriving multiple CFP4 optical mod-ules for yet another exponential gainin data throughput (see Figure 3).

    ULTRASCALE TO SMARTER DIGITAL VIDEO AND BROADCASTDigital video is another market facing amassive buildout. HDTV manufactur-ers and the supporting broadcast infra-structure are rapidly building TVs andbroadcast infrastructure equipment(cameras, communications and pro-duction gear) for 1080p, 4k/2k and 8kvideo. TVs will not only need to con-form to these upcoming standards,they must also be feature rich and rea-sonably priced. The market is extreme-ly competitive, so having devices thatcan adjust to changing standardsquickly and can help manufacturersdifferentiate their TVs is imperative.And while screen resolution and pic-ture quality will increase greatly withthe transition from 1080p to 4k/2k to8k, TV customers will want ever thin-ner models. This means new TVdesigns will require even greater inte-

    gration while achieving the perform-ance, power and BOM cost require-ments the consumer market demands.

    While broadcast equipment willneed to advance drastically to con-form to the requirements for 4k/2k and8k video, perhaps the biggest cus-tomer requirements for these systemswill be flexibility and upgradability.Broadcast standards are ever evolving.Meanwhile, the equipment to supportthese standards is expensive and noteasily swapped in and out. As a result,broadcast companies are gettingsmarter about their buying and aremaking longevity and upgradabilitytheir top requirements. This makesequipment based on All Programmabledevices a must-have for all sectors ofthe next-generation broadcast market,from the camera at the ballgame to theproduction equipment in the controlroom to the TV in your living room.

    C O V E R S T O R Y

    14 Xcell Journal Third Quarter 2013

    OTL4.10100GGFEC

    OTU4Framer

    100GMux &Mon

    100GSAR

    100GInter-Laken

    CFP10x11G 10x12.5G

    2x100G

    7VX690T

    OTL4.44x100GGFEC

    4x100GOTU4

    Framer

    4x100GMux &Mon

    4x100GSAR

    4x100GInter-Laken

    CFP 4x28G

    CFP 4x28G

    CFP 4x28G

    CFP 4x28G

    5x25G

    5x25G

    5x25G

    5x25G

    4x100G

    OTL4.10100GGFEC

    OTU4Framer

    100GMux &Mon

    100GSAR

    100GInter-Laken

    CFP10x11G 10x12.5G

    7VX690T

    InterlakenOTL4.10

    InterlakenOTL4.10

    Existing Infrastrucure

    Virtex UltraScale Solution

    Solution Benefits

    Programmable System Integration2 Devices 1 Device

    System Performance2x

    FPGA BOM Cost+40%

    FPGA Total Power+25%

    Accelerated ProductivityHardened 4x100GMAC-to-Interlaken

    2x Port Density

    UltraSCALE

    Figure 3 The UltraScale architecture will enable higher-performance, lower-power single-chip CFP4 modules that will vastly improve the performance of existing OTN equipment with a low-cost card-swap upgrade.

    (Continued from page 11)

  • Figure 4 illustrates how amidrange Kintex UltraScale devicewill enable professional and pro-sumer camera manufacturers toquickly bring to market new cameraswith advanced capabilities for 4k/2kand 8k while dramatically cuttingcost, size and weight.

    ULTRASCALE TO SMARTERWIRELESS COMMUNICATIONS Without a doubt, the most rapidlyexpanding business in electronics ismobile communications. With usageof mobile devices increasing tenfoldover the last 10 years and projec-tions for even more dramaticgrowth in the coming 10, carriersare scrambling to devise ways towoo customers to service plans inthe face of increased competition.Wireless coverage and network per-formance are key differentiatorsand so carriers are always looking

    for network equipment that is fasterand can carry a vast amount of dataquickly. They could, of course, buymore and more higher-performancebasestations, but the infrastructurecosts of building out and maintainingmore equipment cuts into profitabili-ty, especially as competitors try tolure customers away by offering morecost-effective plans. Very much liketheir counterparts in wired communi-cations, wireless carriers are seekingsmarter ways to run their businessand are looking for innovative net-work architectures.

    One such architecture, called cloudradio-access networks (C-RAN), willenable carriers to pool all the base-band processing resources that wouldnormally be associated with multiplebasestations and perform basebandprocessing in the cloud. Carriers canautomatically shift their resources toareas experiencing high demand, such

    as those near live sporting events, andthen reallocate the processing to otherareas as the crowd disperses. Theresult? Better quality of service for cus-tomers and a reduction in capital aswell as operating expenditures for carri-ers, with improved profitability.

    Creating viable C-RAN architecturesrequires highly sophisticated and high-ly flexible systems-on-chip that inte-grate high-speed I/O with parallel andserial processing. Xilinxs UltraScaleFPGAs and SoCs will play key roles inthis buildout.

    Xilinx is building UltraScale 20-nmversions of its Kintex and Virtex AllProgrammable FPGAs, 3D ICs and ZynqAll Programmable SoCs. The company isalso developing Virtex UltraScaledevices in TSMCs FinFET 16 technol-ogy (see sidebar, page 12). For moreinformation on Xilinxs UltraScale archi-tecture, visit Xilinxs UltraScale pageand read the backgrounder.

    Third Quarter 2013 Xcell Journal 15

    C O V E R S T O R Y

    DDR2 /DDR3

    Virtex-7 X485T

    Image Processing

    DDR2 /DDR3

    ImageSensor

    LVDS

    CPU

    Virtex-7 X485T

    Video Processing

    HD-SDI3G-SDI

    Virtex-7 X485T

    Connectivity

    Existing Infrastrucure

    DDR4

    Image Processing,Video Processingand Connectivity

    ImageSensor

    Serial Interface

    CPU 10G-SDI12G-SDI10G VoIP

    Kintex UltraScale Solution

    DDR4

    Solution Benefits

    Programmable System Integration4 Devices 2 Devices

    System Performance+30%

    FPGA BOM Cost+40%

    FPGA Total Power-50%

    Accelerated ProductivityVivado High-Level Synthesis

    Hardened PCIe Gen3

    UltraSCALE

    Figure 4 The UltraScale architecture will let professional and consumer-grade camera manufacturers bring 8k cameras to market sooner, saving power, reducing weight, improving performance and lowering BOM cost.

  • 16 Xcell Journal Third Quarter 2013

    XCELLENCE IN NETWORKING

    Efficient Bitcoin Miner System Implemented on Zynq SoC

  • Third Quarter 2013 Xcell Journal 17

    Bitcoin is a virtual currency that has become increas-ingly popular over the past few years. As a result,Bitcoin followers have invested some of their assets insupport of this currency by either purchasing or miningBitcoins. Mining is the process of implementing mathematicalcalculations for the Bitcoin Network using computer hard-ware. In return for their services, Bitcoin miners receive alump-sum reward, currently 25 Bitcoins, and any includedtransaction fees. Mining is extremely competitive, since net-work rewards are divided up according to how much calcula-tion all miners have done.

    Bitcoin mining began as a software process on cost-ineffi-cient hardware such as CPUs and GPUs. But as Bitcoins pop-ularity has grown, the mining process has made dramaticshifts. Early-stage miners had to invest in power-hungryprocessors in order to obtain decent hash rates, as miningrates are called. Although CPU/GPU mining is very inefficient,it is flexible enough to accommodate changes in the Bitcoinprotocol. Over the years, mining operations have slowly grav-itated toward dedicated or semi-dedicated ASIC hardware foroptimized and efficient hash rates. This shift to hardware hasgenerated greater efficiency in mining, but at the cost of lowerflexibility in responding to changes in the mining protocol.

    An ASIC is dedicated hardware that is used in a specificapplication to perform certain specified tasks efficiently. Soalthough ASIC Bitcoin miners are relatively inexpensive andproduce excellent hash rates, the trade-off is their inflexibili-ty to protocol changes.

    Like an ASIC, an FPGA is also an efficient miner and is rel-atively inexpensive. In addition, FPGAs are more flexible thanASICs and can adjust to Bitcoin protocol changes. The currentproblem is to design a completely efficient and fairly flexiblemining system that does not rely on a PC host or relay deviceto connect with the Bitcoin Network. Our team accomplishedthis task using a Xilinx Zynq-7000 All Programmable SoC.

    THE BIG PICTURETo accomplish our goal of implementing a complete mining sys-tem, including a working Bitcoin node and a miner that is bothefficient and flexible, we needed a powerful FPGA chip notonly for flexibility, but also for performance. In addition to theFPGA, we also needed to use a processing engine for efficiency.

    X C E L L E N C E I N N E T W O R K I N G

    by Alexander StandridgeMSc CandidateCalifornia State University, [email protected]

    Calvin Ho MSc CandidateCalifornia State University, [email protected]

    Shahnam Mirzaei Assistant ProfessorCalifornia State University, [email protected]

    Ramin RoostaProfessorCalifornia State University, [email protected]

    The integration of programmable logic with a processor subsystemin one device allows for the development of an adaptable and affordable Bitcoin miner.

  • On this complete system-on-chip(SoC), we would need an optimizedkernel to run all required Bitcoin activ-ities, including network maintenanceand transactions handling. The hard-ware that satisfied all these conditionswas the Zynq-7020 SoC residing on theZedBoard development board. Ataround $300 to $400, the ZedBoard is

    fairly inexpensive for its class (seehttp://www.zedboard.org/buy). TheZynq-7020 SoC chip contains a pair ofARM Cortex-A9 processors embed-ded in 85,000 Artix-7 FPGA logiccells. The ZedBoard also comes with512 Mbytes of DDR3 memory thatenabled us to run our SoC designfaster. Lastly, the ZedBoard has an SD

    card slot for mass storage. It enabledus to put the entire updated Bitcoinprogram onto it.

    Using the ZedBoard, we implement-ed our SoC Bitcoin Miner. It includes ahost, a relay, a driver and a miner, asseen in Figure 1. We used the non-graphical version of the originalBitcoin client, known as Bitcoind, as a

    X C E L L E N C E I N N E T W O R K I N G

    Bitcoin Network

    Bitcoind (Host)

    Relay

    Driver

    Miner

    SOC Miner

    Figure 1 The basic data flow ofthe miner showing the relation-ships between the components

    and the Bitcoin Network

    At its simplest, the Bitcoin mining process boils down to a SHA-256process and a comparator. The SHA-256 process manipulates the block header information twice, and then compares it with

    the Bitcoin Networks expanded target difficulty.

    #define ROTR(x,n) ( (x >> n) | (x > 3))#define s1(x) (ROTR(x,17) ^ ROTR(x,19) ^ (x >> 10))#define s2(a) (ROTR(a,2) ^ ROTR(a,13) ^ ROTR(a,22))#define s3(e) (ROTR(e,6) ^ ROTR(e,11) ^ ROTR(e,25))

    #define w(a,b,c,d) (s1(d) + c + s0(b) + a)

    #define maj(x,y,z) ((x&y)^(x&z)^(y&z))#define ch(x,y,z) ((x&y)^((~x)&z))

    #define t1(e,f,g,h,k,w) (h + s3(e) + ch(e,f,g) + k + w)#define t2(a,b,c) (s2(a) + maj(a,b,c))

    for( i = 0; i < 64; i++){// loop 64 times

    if( i > 15) {W[i] = w(W[i-16],W[i-15],W[i-7],W[i-

    2]); // Temp values}unsigned int temp1 = t1(e,f,g,h,K[i],W[i]);unsigned int temp2 = t2(a,b,c);h = g;

    // Round resultsg = f;f = e;e = d + temp1;d = c;c = b;b = a;a = temp1 + temp2;

    }

    Figure 2 Vivado HLS implementation of the SHA-256 process

    18 Xcell Journal Third Quarter 2013

  • host to interact with the BitcoinNetwork. The relay uses the driver topass work from the host to the miner.

    THE DEPTHS OF THE MINING COREWe started developing the mining corewith Vivado HLS (high-level synthe-sis), using the U.S. Department ofCommerces specification for SHA-256.With Vivado HLS ability of fast behav-ioral testing, we quickly laid out sever-al prototype mining cores, rangingfrom a simple single-process to a com-plex multiprocess system. But before

    exploring the details of the mining-core structure, its useful to look at thebasics of the SHA-256 process.

    The SHA-256 process pushes 64bytes through 64 iterations of bitwiseshifts, additions and exclusive-ors,producing a 32-byte hash. During theprocess, eight 4-byte registers holdthe results of each round, and whenthe process is done these registersare concatenated to produce thehash. If the input data is less than 63bytes, it is padded to 63 bytes and the64th byte is used to store the length ofthe input data. More commonly, the

    input data is larger than 63 bytes. Inthis case, the data is padded out to thenearest multiple of 64, with the lastbyte reserved for the data length. Each64-byte chunk is then run through theSHA-256 process one after another,using the output of the previouschunk, known as the midstate, as thebasis for the next chunk.

    With Vivado HLS, the implementa-tion of the SHA-256 process is a simplefor loop; an array to hold the con-stants required by each round; anadditional array to hold the temporaryvalues used by later rounds; eight vari-ables to hold the results of eachround; and definitions for the logicoperations, as seen in Figure 2.

    At its simplest, the Bitcoin miningprocess boils down to a SHA-256process and a comparator. The SHA-256 process manipulates the blockheader information twice, and thencompares it with the Bitcoin Networksexpanded target difficulty, as seen inFigure 3A. This is a simple concept thatbecomes complicated in hardware,because the first-version Bitcoin blockheader is 80 bytes long. This means thatthe initial pass requires two runs of theSHA-256 process, while the second passrequires only a single run, as seen inFigure 3B. This double pass, definingtwo separate SHA-256 modules, compli-cates the situation because we want tokeep the SHA-256 module as generic aspossible to save development time andto allow reuse. This generic require-ment on the SHA-256 module definesthe inputs and outputs for us, settingthem as the 32-byte initial values, 64-byte data and 32-byte hash. Thus, withthe core aspect of the SHA-256 processisolated, we can then manipulate theseinputs to any layout we desire.

    The first of the three prototypes wedeveloped used a single SHA-256Process module. From the beginning,we knew that this core would be theslowest of the three because of thelack of pipelining; however, we werelooking to see how compact we couldmake the SHA-256 process.

    X C E L L E N C E I N N E T W O R K I N G

    Third Quarter 2013 Xcell Journal 19

    SHA-256 Process

    Comparator

    Header

    32 Bytes

    1 Bit

    A

    SHA-256 Process

    Comparator

    80-Byte Input

    32 Bytes

    SHA-256 Process

    32 Bytes

    1 Bit B

    SHA-256 Process

    Comparator

    80-Byte Input

    32 Bytes

    SHA-256 Process

    32 Bytes

    1 Bit1 Bit D

    SHA-256 Process

    SHA-256 Process

    SHA-256 Process

    SHA-256 Process

    64-Byte Input

    32 BytesMidstate

    32 Bytes

    32 Bytes

    16 bytes+Padding

    Padding

    Midstate

    Padding

    C

    Figure 3 A (top left) is the simplest form of the core, showing the general flow.B splits the single, universal SHA-256 process module into a pair of dedicatedmodules for each phase. C breaks down the specific SHA-256 processes into a

    single generic module that can be repeatedly used. D utilizes a quirk in the blockheader information allowing the removal of the first generic SHA-256 process.

  • The second prototype containedthree separate SHA-256 Process mod-ules chained together, as seen inFigure 3C. This configuration allowedfor static inputs where the firstprocess handles the first 64 bytes ofthe header information. The secondprocess handles the remaining 16bytes from the header plus the 48 bytesof required padding. The third processhandles the resulting hash from thefirst two. These three separateprocesses simplified the control logicand allowed for simple pipelining.

    The third and final prototype usedtwo SHA-256 Process modules andtook advantage of a quirk in theBitcoin header information and min-ing process that the Bitcoin miningcommunity makes use of. The blockheader information is laid out in sucha way that only the first 64 byteschange when the block is modified.This means that once the first 64bytes are processed, we can save theoutput data (midstate) and only

    process the last 16 bytes of the head-er, for a significant hash rate boost(see Figure 3D).

    The comparator is a major compo-nent, and properly designed can signif-icantly increase performance. A solu-tion is a hash that has a numeric valueless than the 32-byte expanded targetdifficulty defined by the Bitcoin sys-tem. This means we must compareeach hash we produce and wait to seeif a solution is found. Due to a quirkwith the expanded target difficulty, thefirst 4 bytes will always be zero regard-less of what the difficulty is, and as thenetworks difficulty increases so willthe number of leading zeros. At thetime of this writing, the first 13 bytesof the expanded target difficulty werezero. Thus, instead of comparing thewhole hash to the target difficulty, wecheck to see if the first x number ofbytes are zero. If they are zero, thenwe compare the hash to the target dif-ficulty. If they are not zero, then wetoss the hash out and start over.

    With the results of our prototypes inhand, we settled on a version of thethird core developed by the Bitcoinopen-source community specificallyfor the Xilinx FPGAs.

    ISE DEVELOPMENTBitcoin mining is a race to find a num-ber between zero and 232-1 that givesyou a solution. Thus, there are reallyonly two ways to improve mining-core performance: faster processingor divide-and-conquer. Using aSpartan-3 and a Spartan-6 develop-ment board, we tested a series of dif-ferent frequencies, pipelining tech-niques and parallelization.

    For our frequency tests, we startedwith the Spartan-3E developmentboard. However, it quickly becameapparent that nothing could be donebeyond 50 MHz and as such, we leftthe frequency tests to the Spartan-6.

    We tested 50, 100 and 150 MHz onthe Spartan-6 development board,which resulted in predictable results,achieving 0.8, 1.6 and 2.4 mega hashesper second (MHps) respectively. Wealso tried 75 MHz and 125 MHz duringthe parallelization and pipeliningtests, but these frequencies were onlya compromise to make the miner fitonto the Spartan-6.

    One of the reasons we chose themining core from the open-source com-munity was that it was designed with adepth variable to control the number ofpipelined stages. The depth is an expo-nential value between 0 and 6, whichcontrols the number of power-two iter-ations the VHDL generate commandexecutesnamely, 20 to 26. We per-formed the initial frequency tests with adepth of 0that is, only a single roundimplemented for each process.

    On the Spartan-3e we testeddepths of 0 and 1 before running intotiming-constraint issues. The depth of0 at 50 MHz gave us roughly 0.8 MHps,while the depth of 1 achieved roughly1.6 MHps.

    On the Spartan-6 we were able toreach a depth of 3 before running into

    20 Xcell Journal Third Quarter 2013

    X C E L L E N C E I N N E T W O R K I N G

    4

    3.5

    3

    2.5

    2

    1.5

    1

    0.5

    000:00 12:00 00:00 12:00 00:00 12:00 00:00 12:00

    Wo

    rker

    Sp

    eed

    (M

    Hp

    s)

    Figure 4 The hash rate of the mining core over a four-day period of core-optimizationtesting. The graph is an average over time, giving a general idea of performance of the

    mining core. The results decreased due to hitting the routing limits of the test platforms.

  • timing-constraint issues. At 50 MHz, theSpartan-6 achieved the same results asthe Spartan-3 at a depth of 0 and 1. Wenoticed an interesting trend at thispoint. Doubling the frequency had thesame effect as increasing the depth ofthe pipeline. The upper limit was set bythe availability of routing resources,with our peak averaging out at roughly3.8 MHps, as seen in Figure 4.

    The SHA-256 Process module has10 different 32-bit additions. In thisimplementation, we are attempting todo all of them in a single clock cycle.To shorten the longest path, we triedto divide the adder chains into a seriesof stages. However, doing so radicallychanged the control logic for thewhole mining core and required acomplete rewrite. We dropped thesechanges to save time and effort.

    The final performance improvementwe tested was parallelization. Withonly minor modifications, we doubledand quadrupled the number of SHA-256 sets. A set consists of two SHA-256Process modules. This modificationeffectively cut the potential numberrange for each set into halves, for thedoubled number of SHA-256 sets, orfourths for the quadrupled number. Tofit these extra SHA processes onto theSpartan-6, we had to reduce the systemfrequency. Four sets of the SHA-256Process modules would run at 75 MHz,while two sets would run at 125 MHz.The hash rate improvements were dif-ficult to document. We could easily seethe hash rate of an individual SHA-256set; however, a multiple-set miningcore could find a solution faster thanthe hash rate implied.

    USING THE EDKAfter FPGA testing, our next step wasto tie the mining core into the ZynqSoCs AXI4 bus. Xilinxs EmbeddedDevelopment Kit (EDK) comesloaded with a configuration utilityspecifically designed for the ZynqSoC, allowing us to easily configureevery aspect. By default, the systemhas the 512 Mbytes of DDR3, theEthernet, the USB and the SD inter-face enabled, which is everything weneeded for our Bitcoin SoC.

    The twin Cortex-A9 processors fea-ture the AXI4 interface instead of thePLB system used by previous soft-coresystems. All of the peripherals areattached to processors through theAXI4 interface, as shown in Figure 5.

    The EDK custom peripheral wizardprovides code stubs for the differentvariants of the AXI4 interface andserves as the basis for the develop-ment of the mining cores AXI inter-face. For simplicity, we used the AXI4-Lite interface, which provides simpleread and write functionality to themining core. Ideally, developers wouldwant to use the regular AXI4 interfaceto take advantage of the advancedinterface controls, such as data bursts.

    Keeping with the goal of simplicity,we used three memory-mapped regis-ters to handle the mining-core I/O. Thefirst register feeds the host data packetto the miner. This register keeps trackof the amount of data passed and locksitself after 11 words have been trans-ferred. This lock is removed automati-cally when a solution is found. We canremove the lock manually if necessarythrough the status register.

    We defined the second register asthe status register, flagging certainbits for the various states the miningcore goes through. With the simplicityof our design, we used only threeflags: a loaded flag, a running flag anda solution-found flag. The loaded flagis triggered when the mining corereceives 11 words, as mentionedbefore, and is cleared when we writeto it. The start/running flag is set by us

    X C E L L E N C E I N N E T W O R K I N G

    Third Quarter 2013 Xcell Journal 21

    SD Card

    Ethernet

    ARM Cortex-A9

    ARM Cortex-A9 Mining Core

    DDR3 RAM

    ProcessorSubsystem

    (PS)Programmable

    Logic (PL)

    AXI4 Bus

    Figure 5 The relatively new AXI4 interface connects the ZedBoards hard peripherals with the ARM processors and Artix FPGA logic.

  • that before we can use the reservedaddress we have to request an addressremap for the mining core. The kernelgives us a virtual address through theremap, which we can then use to com-municate to the mining-core registers.

    Now that we have access to thehardware, the initialization functionthen runs a simple test on the miningcore to confirm that the core is operat-ing properly. If it is, we register themajor and minor numberswhich areused to identify the device in user pro-gramswith the kernel. Every func-tion has a counter, and the exit functionis the counter to the initialization func-tion. In this function, we undo every-thing done during initializationspecifically, release the major andminor numbers used by the driver, thenrelease the mapped virtual address.

    When we first call the mining corefrom the relay program, the open func-tion runs. All we do here is to confirmthat the self-test run during the initial-ization function is successful. If theself-test fails, then we throw a system-error message and fail out. The closefunction is called when the device isreleased from the relay program.However, since we only check the self-test results in the open function, thereis nothing to do in the close function.The read function looks at the databuffer and determines which port theuser is reading, and it then gets thedata from the mining core and passesit back. The write function determineswhich register the user is writing toand passes the data to the mining core.

    The final component of our systemis a small relay program to pass work

    to start the mining core, and iscleared by the mining core when asolution is found. The final register isthe output register, which is loadedwith the solution when one is found.

    We tested each stage of the AXI4-Lite interface development beforeadding the next component. We alsotested the AXI4-Lite interface inXilinxs Software Development Kitusing a bare-bones system. The testserved two purposes: first, to confirmthat the mining cores AXI4-Lite inter-face was working correctly and sec-ond, to ensure that the mining corewas receiving the data in the correctendian format.

    EMBEDDED LINUXWith the mining core tied into theprocessors, we moved to softwaredevelopment. Ideally we wanted tobuild our own firmware, potentiallywith a Linux core, to maximize per-formance. For ease of development,we leveraged a full Linux installation.We used Xillinux, developed byXillybus, an Ubuntu derivative forrapid embedded-system development.

    Our first order of business was tocompile Bitcoind for the Cortex-A9architecture. We used the original open-source Bitcoin software to ensure com-patibility with the Bitcoin Network.

    Testing Bitcoind was a simplematter. We ran the daemon, waitedfor the sizable block chain to down-load and then used the line commandto start Bitcoinds built-in CPU min-ing software.

    For the Linux driver, you mustimplement a handful of specific func-tions and link them to the Linux kernelfor correct operation. When the driveris loaded a specific initialization func-tion runs, preparing the system forinteraction with the hardware. In thisfunction, we first confirm that thememory address to which the miningcore is tied is available for use, thenreserve the address. Unlike a bare-bones system, Linux uses a virtualmemory address scheme, which means

    22 Xcell Journal Third Quarter 2013

    X C E L L E N C E I N N E T W O R K I N G

    from Bitcoind to the mining corethrough the driver, and return results. Asa matter of course, the relay programwould check to see if Bitcoind was run-ning and that the mining core was readyand operating properly. Since the relayprogram would normally sit idle for themajority of the time, we will implementa statistics compilation subsystem togather data and organize it in a log file.Ideally, we would use a Web host inter-face for configuration, statistical displayand device status.

    COMPLETE, EFFICIENT MINING SYSTEMOur solution to an efficient and com-plete Bitcoin mining system is the XilinxZynq-7000 All Programmable SoC on theZedBoard. This board is flexible tochanges in the Bitcoin protocol and alsooffers a high-performance FPGA solu-tion along with SoC capability. Toimprove on our current design withoutjeopardizing the adaptability, we couldadd more mining cores that connect tothe main Bitcoin node. These additionscould boost the performance significant-ly and provide a higher overall hash rate.

    Another improvement that couldfurther optimize this design is to havededicated firmware. Our currentdesign is running Ubuntu Linux 12.04,which has many unnecessary process-es running concurrently with theBitcoin program, like SSH. It is awaste of the boards resources to runthese processes while Bitcoin is run-ning. In a future build, we would elim-inate these processes by running ourown custom firmware dedicated toonly Bitcoin tasks.

    To improve on our current design without jeopardizing the adaptability, we could add more

    mining cores that connect to the main Bitcoin node.These additions could boost the performance

    and provide a higher overall hash rate.

  • Third Quarter 2013 Xcell Journal 23

    XCELLENCE IN AEROSPACE & DEFENSE

    Ensuring FPGA Reconfiguration in Space by Robrt GleinSystem EngineerFriedrich-Alexander-Universitt Erlangen-Nrnberg (FAU), Information

    Technology (Communication Electronics)[email protected]

    Florian RittnerFirmware DesignerFraunhofer IISRF and Microwave Design Department

    Alexander HofmannTelecommunications EngineerFraunhofer IISRF and Microwave Design Department

  • 24 Xcell Journal Third Quarter 2013

    In 2007, the German FederalMinistry of Economics andTechnology commissionedthe German Aerospace Center

    to perform a feasibility study of anexperimental satellite to explore com-munications methods. Government aswell as academic researchers will usethe satellite, called the Heinrich Hertzcommunications satellite, to performexperiments in the Ka and K band. Itis scheduled for launch in 2017. A key element of this system is theFraunhofer-designed onboard proces-sor, which we implemented on Xilinx

    Virtex-5QV FPGAs. The Fraunhofer onboard proces-

    sorwhich is involved in the in-orbitverification of modules in spaceis theessential module of the satellites regen-erative transponder. Because theHeinrich Hertz satellite will be used fora broad number of communicationsexperiments, we chose to leverage theFPGAs capability to be reconfigured forvarious tasks. This saves space, reducesweight and enables the use of futurecommunications protocols. Figure 1illustrates this onboard processor andother parts of the scientific in-orbit veri-fication payload of the satellite.

    But because space is perhaps theharshest environment for electronics,we had to take a few key steps toensure the system could be reconfig-ured reliably. The radiation-hardenedXilinx FPGAs offer a perfect solutionfor reconfigurable space applicationsregarding total ionizing dose, single-event effects, vibration and thermalcycles. The reconfiguration of the dig-ital hardware of these FPGAs is anessential function in this geostationarysatellite mission in order to ensure

    system flexibility. To achieve a robustreconfiguration without a single pointof failure, we devised a two-prongedmethodology that employs two con-figuration methods in parallel. Besidesa state-of-the-art (SOTA) method, wealso introduced a fail-safe reconfigura-tion method via partial reconfigura-tion, which enables a self-reconfigura-tion of the FPGAs.

    A straightforward way to configureFPGAs is to use an external radiation-hardened configuration processor witha rad-hard mass memory (for example,flash). In this case, the (partial) bit fileis stored in the memory and the proces-sor configures the FPGA with the bit-stream. This approach offers good flex-ibility if your design requires a fullFPGA reconfiguration and has manybit files onboard. You can update orverify bit files via a low-data-ratetelemetry/telecommand channel or adigital communication up- or downlinkwith a higher data rate. This SOTA con-figuration method is nonredundant. Letus assume a failure of the processor orof the configuration interface, asshown in Figure 2. Now it is impossibleto reconfigure the FPGA and all Virtex-5QV devices are useless.

    OVERALL CONFIGURATIONCONCEPT FOR THE FPGAFigure 2 illustrates both configurationmethods for one FPGAthe SOTAmethod and the fail-safe configuration(also the initial configuration) method,which only needs one radiation-hard-ened nonvolatile memory (PROM ormagnetoresistive RAM) as an externalcomponent. This component stores theinitial firmware, recognizable as con-tent of the FPGA. We use these two con-

    figuration methods in parallel in order tobenefit from the advantages of both.With this approach we also increase thereconfiguration reliability and overcomethe single-point-of-failure problem.

    DETAILS OF FAIL-SAFE CONFIGURATION METHODWe designed this fail-safe initial con-figuration method in order to enableself-reconfiguration of the FPGA. TheFPGA configures itself with the initialconfiguration at startup of theonboard processor. We use the serialmaster configuration protocol for thePROM or the Byte Peripheral Interfacefor the magnetoresistive RAM (MRAM).However, only one nonvolatile memoryper FPGA is required for this fail-safeconfiguration method.

    We separated the firmware of the ini-tial configuration in dynamic (blue) andstatic (light-red and red) areas as shownin Figure 2. With partial reconfiguration(PR), it is possible to reconfiguredynamic areas while static and otherdynamic areas are still running.

    After a power-up sequence, theFPGA automatically configures itselfwith the initial firmware and is readyto receive new partial or even com-plete (only by MRAM) bit files. A trans-mitter on Earth packs these bit filesinto a data stream, encrypts, encodesand modulates it, and sends it over theuplink to the satellite. After signal con-ditioning and digitizing, the first recon-figurable partition (RP0) inside theFPGA receives the digital data stream.RP0 performs a digital demodulationand decoding. The Block RAM (BRAM)shares the packed bit file with theMicroBlaze system. The MicroBlazeunpacks the bit file and stores it in the

    A fail-safe method of reconfiguring rad-hard Xilinx FPGAs improves reliability in remote space applications.

    X C E L L E N C E I N A E R O S PA C E & D E F E N S E

  • X C E L L E N C E I N A E R O S PA C E & D E F E N S E

    SRAM or SDRAM. The MicroBlazereconfigures a dynamic reconfigurablepartition (RP) or all RPs with a direct-memory access (DMA) via the InternalConfiguration Access Port (ICAP).The static part of the FPGA should notbe reconfigured.

    In case of an overwritten RP0, a high-priority command over the satellite con-trol bus could reboot the onboardprocessor. This command would restorethe initial fail-safe configuration.

    There are plenty of new possibleapplications for reconfigurable parti-tions, including digital transmitters,digital receivers and replacement of ahighly reliable communications proto-col, among others. You can also recon-figure and switch to new submoduleslike demappers or decoders for adap-tive coding and modulation.

    You can even replace the whole fail-safe configuration bit file with a newdouble-checked initial bit file viaMRAM as storage for the initial config-uration. With a digital transmitter, theonboard processor can also send theconfiguration data back to Earth inorder to verify bit files on the ground.

    CONFIGURATION RELIABILITYA major benefit of using both configu-ration methods (SOTA and initial) isan increase of the overall system relia-bility. There is a drawback in the initialfail-safe method regarding the usableFPGA resources. In a geosynchronous

    orbit mission with a term of 15 years,reliability of the most important pro-cedures such as configurationdepends on the failure-in-time (FIT)rates of both configuration methods.We assume a typical FIT rate of 500 ina representative configuration proces-sor and its peripherals for the SOTAconfiguration method. The PROMachieves a much better FIT rate of 61(part stress method for failure-rateprediction at operating conditionsregarding MIL-HDBK 217).

    Nevertheless, both methods aresingle points of failure. Only the par-

    allel operation guarantees a signifi-cant increase in the reliability of theconfiguration procedure. Calculatingthe standalone reliability of theSOTA configuration method resultsin Rsc = 0.9364 and of the initialmethod in Ric = 0.9920, with the

    given mission parameters. The over-all configuration reliability for sys-tems in parallel is Rc = 0.9929.

    We have to compare Rc and Rsc inorder to obtain the gain of the recon-figuration reliability. The achievedgain (Rc - Rsc) is 5.65 percent. The ini-tial configuration has some restric-tions in terms of FPGA resources.Nevertheless, you can still reconfigurethe whole FPGA anytime with themore complex SOTA configurationmethod if the external configurationprocessor is still operating.

    Both methods can correct failures

    of the FPGA caused by configurationsingle-event upsets. For the SOTA con-figuration method, the external config-uration processor performs a scrub-bing, as does the self-scrubbedMicroBlaze for the initial configura-tion method.

    Third Quarter 2013 Quarter 2013 Xcell Journal 25

    Two-Spot-Beam

    Rx

    Two-Spot-Beam

    TxFilter

    Filter

    LTWTA

    Filter

    LNA Down-converter

    Down-converter

    Up-converter

    SwitchMatrix

    SwitchMatrix

    FOBP

    Figure 1 Part of the scientific in-orbit verification payload of the Heinrich Hertz satellite; the gold box in the middle,

    labeled FOBP, is the Fraunhofer onboard processor. LTWTA stands for linear traveling-wave tube amplifier.

    In case of an overwrite, a high-priority command over thesatellite control bus could reboot

    the onboard processor and restorethe initial fail-safe configuration.

  • IMPLEMENTING FAIL-SAFECONFIGURATIONFigure 2 illustrates the detailed archi-tecture of one of the onboard proces-sors FPGAs. The FPGA is divided intothree parts: MicroBlaze, static intellec-tual-property (IP) cores and RPs. Anissue of the MicroBlaze system is mem-ory management. Alongside the PROMor MRAM, we attached an SRAM andan SDRAM. These two different memo-ry types enable applications like tempo-rary packet buffers and content storage(for example, video files). The memorymultiplexer block (Mem Mux) enableshigher flexibility. It multiplexes thememory to RPs or to the MicroBlazeregion. Semaphores control the multi-plexer via a general-purpose input/out-put (GPIO) connection. So, only oneregion can use the single-port memoryat any given time. The initial configura-

    tion also contains internal memory withBRAM, presenting an opportunity for afast data exchange of both regions.

    The build wrapper in XilinxPlatform Studio (XPS) enables a uni-versal integration of the MicroBlazesystem with support and connection ofindependent ports.

    We implemented IP cores such asBRAM, RocketIO and digital clockmanager (DCM). The RP switchingblock connects the RPs by means ofsettings in the RP switching configura-tion, which flexibly interconnects theRPs. A data flow control (DFC) man-ages the data flow among the RPs. TheFPGAs communicate to each other bymeans of RocketIO and low-voltage dif-ferential signaling (LVDS).

    We divided the RPs into two parti-tions with 20 percent configurable logicblock (CLB) resources of the FPGA and

    two partitions with 10 percent CLBresources. We also tried to share theDSP48 slices and BRAM equally andmaximally to all partitions.

    The main part of the implementationis the system design and floorplanning,which we realized with the PlanAheadtool from Xilinx. We combined theMicroBlaze system, the IP cores, thesoftware design and the partial userhardware description language (HDL)modules, called reconfigurable mod-ules. We also defined the location ofRPs, planned design runs, chose a syn-thesis strategy and constrained thedesign in this step. A BRAM memorymap file defines the BRAM resourcesfor data and instruction memory. Thisenables an integration of the initialMicroBlaze software in the bit file.

    Figure 3 shows the implementationresult. Currently we are developing a

    26 Xcell Journal Third Quarter 2013

    RP0Size 20%

    RP1Size 20%

    RP2Size 10%

    RP3Size 10%

    RP Switching Block

    BRAMBlock

    Clocks

    Uplink/Downlink

    RIO GTX RIO CoresFPGA Core

    TOP Wrapper (top_wrap.vhd)

    Mem MuxMPMC

    SDRAM

    BlockEnable Bank

    GPIO2

    GPIO1

    MEM Semaphores

    EMCSRAM

    EXTBC

    BRAMBlock

    INTBC

    INTBC

    INTBC

    Micro-Blaze

    BRAM(Data &Instr.)

    ResetModule

    CLKGen

    GPIOLED

    UART

    Timer &WD

    MDM

    ICAPController

    8 x GPIO8 x LED

    JTAG

    ConfigInter-face

    SRAM SDRAM PROM/MRAMExternal Configuration

    Microprocessor

    External PCB

    External Config-uration Flash

    TM/TC

    Interlinks TOP/UCF

    Dynamic Reconfigurable Partitions Static Partitions

    Same Board Static Partitions

    Data-Path PLB MB

    BRAM-Bus LMB MB

    DCM120 MHz

    DCM360 MHz

    Other FPGAs LVDS

    ,

    RP Switching Configuration

    Figure 2 Architecture of the first FPGA in the initial configuration system. The static sections are in red and light red,

    and the dynamic portions in light blue (left). The yellow lightning bolt represents a SOTA configuration failure.

    X C E L L E N C E I N A E R O S PA C E & D E F E N S E

  • X C E L L E N C E I N A E R O S PA C E & D E F E N S E

    highly reliable receiver for the firstpartition (RP0), which will be used inthe initial configuration (on the satel-lite) to receive (demodulating, demap-ping, deinterleaving and decoding) theconfiguration data.

    Table 1 illustrates target and actualsizes of the RPs. We achieved a slight-ly higher CLB count than we targeted,whereby the DSP slices nearly reachthe maximum. The BRAM resources

    correspond to the CLBs because ofthe MicroBlazes demand for BRAMs.The overhead of the static area isapproximately 30 percent. Wedesigned this firmware for the com-mercial Virtex-5 FX130 in order toprove the concept with an elegantbreadboard. At this point, we areabout to port this design to the Virtex-5QV and implement further features,such as triple-modular redundancy orerror-correction codes.

    PROBLEM SOLVEDA single point of failure regarding FPGAconfiguration is a problem in highly reli-able applications. You can overcomethis issue with an initial configuration,which needs only one additional deviceper FPGA. Partial reconfigurationenables the remote FPGA configurationvia this fail-safe initial configuration.The parallel use of both configurationscombines their main advantages: theincrease of 5.65 percent in configura-tion reliability and the flexibility toreconfigure the complete FPGA.

    System design and implementationchallenges require a trade-offbetween complexity regarding therouting or clock resources and theusable design resources. An elegantbreadboard shows the successfulimplementation and verification in areal-world scenario.

    Third Quarter 2013Third Quarter 2013 Xcell Journal 27

    CLBs DSP Slices BRAM

    Target Result Target Result Target Result

    4,096 4,400 max. 80 max. 60

    20% 21% max. 25% max. 20%

    2,048 2,720 max. 76 max. 38

    10% 13% max. 24% max. 13%

    12,288 14,240 max. 312 max. 196

    60% 70% max. 98% max. 66%

    RP0/RP1

    RP2/RP3

    sum of allRP

    resources

    Figure 3 PlanAhead view of the implemented design. The blue rectangles represent the RP regions. RP0 is seen as purple, the MicroBlaze as yellow, BRAM as green and other static regions as red.

    Table 1 FPGA resources of the reconfigurable partitions (RPx)

    FPGA S OLUTIONS

    ? 100K Logic Cells + 240 DSP Slices? DDR3 SDRAM + Gigabit Ethernet? SO-DIMM form factor (68 x 30 mm)

    Mars AX3 Artix-7 FPGA Module

    from

    Mars ZX3 SoC Module

    ? Xilinx Zynq-7000 All Programmable SoC (Dual Cortex-A9 + Xilinx Artix-7 FPGA)

    ? DDR3 SDRAM + NAND Flash? Gigabit Ethernet + USB 2.0 OTG? SO-DIMM form factor (68 x 30 mm)

    We speak FPGA.

    www.enclustra.com

    ? Xilinx Kintex-7 FPGA? High-performance DDR3 SDRAM? USB 3.0, PCIe 2.0 + 2 Gigabit Ethernet ports? Smaller than a credit card

    Mercury KX1 FPGA Module

    FPGA Manager Solution

    Simple, fast host-to-FPGA data transfer, for PCI Express, USB 3.0 and Gigabit Ethernet. Supports user applications written in C, C++, C#, VB.net,

    MATLAB, Simulink and LabVIEW.

  • 28 Xcell Journal Third Quarter 2013

    XPERIMENT

    Its easy to build an attractivedemonstration platform for students and beginners around a Xilinx Spartan-6 device and a few peripheral components.

    XRadio:A Rockin' FPGA-onlyRadio for TeachingSystem Design

    XRadio:A Rockin' FPGA-onlyRadio for TeachingSystem Design

  • Third Quarter 2013 Xcell Journal 29

    Recently here at the Beijing Institute of Technology, we setout to develop a digital design educational platform thatdemonstrated the practical use of FPGAs in communica-tions and signal processing. The platform needed to beintuitive and easy to use, to help students understand allaspects of digital design. It also had to be easy for studentsto customize for their own unique system designs.

    Around that time, we in the electrical engineering depart-ment had been having a lively debate as to whether it waspossible to use an FPGAs I/O pins as a comparator or a 1-bitA/D converter directly. We decided to test that premise by try-ing to incorporate this FPGA-based comparator in the designof our XRadio platform, an entirely digital FM radio receiverthat we designed using a low-cost Xilinx Spartan-6 FPGAand a few general peripheral components. We demonstratedthe working radio to Xilinx University Program (XUP) direc-tor Patrick Lysaght, during his visit last year to the BeijingInstitute of Technology. He was excited by the simplicity ofthe design, and encouraged us to share the story of theXRadio with the academic community worldwide.

    SYSTEM ARCHITECTUREWe built the XRadio platform almost entirely in the FPGAand omitted the use of traditional analog components suchas amplifiers or discrete filters, as shown in Figure 1. As afirst step, we created a basic antenna by linking a simplecoupling circuit with a wire connected to the FPGAs I/Opins. This antenna transmits RF signals to the FPGA, whichimplements the signal processing of the FM receiver as digi-tal downconversion and frequency demodulation. We thenoutput the audio signal through the I/O pins to the head-phones. We added mechanical rotary incremental encodersto control the XRadios tuning and volume. We designed thesystem so that the tuning and volume information is dis-played on the seven-segment LED module.

    Figure 2 shows a top-level logic diagram of the FPGA.In this design, the RF signals coupled into the input bufferof the FPGA are quantified as a 1-bit digital signal. Thequantified signal is then multiplied by the local oscillation

    X P E R I M E N T

    Rby Qiongzhi WuLecturerBeijing Institute of [email protected] YangMS StudentBeijing Institute of [email protected]

  • signal generated by a numericallycontrolled oscillator (NCO). Themultiplied signal is filtered to obtainan orthogonal IQ (in-phase plusquadrature phase) baseband signal.It then obtains the audio data streamfrom the IQ signal through the frequen-cy demodulator and low-pass filter.

    The data stream next goes to thepulse-width modulation (PWM) mod-ule, which generates pulses on itsoutput. By filtering the pulses, weobtain a proportional analog value todrive the headphones. We connecteda controller module to the mechani-cal rotary incremental encoders and

    LEDs. This module gets a pulse sig-nal from incremental encoders toadjust the output frequency of theNCO and audio volume controlledby the PWM module.

    IMPLEMENTATION DETAILSThe first challenge we had to over-come was how to couple the signalfrom the antenna to the FPGA. In ourfirst experimental design, we config-ured an FPGA I/O as a standard sin-gle-ended I/O; then we used resistorsR1 and R2 to build a voltage divider togenerate a bias voltage between VIHand VIL at the FPGA pin. The signalthat the antenna receives can drivethe input buffer through the couplingcapacitor C1. The signal is sampledby two levels of D flip-flops, whichare driven by an internal 240-MHzclock. The output of the flip-flopsobtains an equal interval 1-bit sam-pling data stream.

    To test this circuit, we fed theresult to another FPGA pin and meas-ured it with a spectrum analyzer tosee if the FPGA received the signalaccurately. However, it didnt workwell because the Spartan-6 FPGAsinput buffer has a small, 120-millivolthysteresis, according to the analyzer.Although the hysteresis is good for

    30 Xcell Journal Third Quarter 2013

    X P E R I M E N T

    Incremental Encoder

    Earphone

    LED Module

    SimpleAntenna

    Antenna BUF

    NCO

    SIN & COS

    Encoder

    CICLPFs

    I/Q

    LEDDisplay

    LPF PWM EarphoneFD

    Controller

    Figure 1 The XRadio hardware concept

    Figure 2 Top-level diagram of the XRadio FPGA

  • preventing noise generally, in thisapplication it wasnt wanted. We hadto devise a way to increase thestrength of the signal.

    To solve this problem, we foundthat the differential input bufferprimitive, IBUFDS, has a very highsensitivity between its positive andnegative terminals. According to ourtest, a differential voltage as low as

    1 mV peak-to-peak is enough tomake IBUFDS swing between 0 and1. Figure 3 shows the resultinginput circuit. In this implementa-tion, the resistors R1, R2 and R3generate a common voltage in boththe P and N terminals of IBUFDS.The received signals feed into the Psidefile:///app/ds/terminal throughcoupling capacitor C1. The AC signal

    is filtered out by the capacitor C2 onthe N side, which acts as an AC refer-ence. With this circuit, the FPGA con-verts the FM broadcast signals into a1-bit data stream successfully.

    The radio picks up about seven FMchannels with different amplitudesincluding 103.9-MHz Beijing TrafficRadio. The design significantlyreduces the signal-to-noise ratiobecause of the quantizing noise gener-ated by 1-bit sampling. But there arefour channels that are still distin-guishable against the noise floor.Therefore, we were able to prove ourtheory that an FPGAs I/O pins can beused effectively as a comparator or a1-bit A/D converter in our XRadio.

    For digital downconversion, weused the DDS Compiler 4.0 IP core tobuild a numerically controlled oscilla-tor with a 240-MHz system clock. Thesine and cosine output frequencyranged from 87 MHz to 108 MHz. Thelocal-oscillator signals that the NCOgenerates are multiplied with the 1-bitsampling stream and travel through a

    X P E R I M E N T

    Third Quarter 2013 Xcell Journal 31

    VCCO

    FPGA

    P

    NC1

    R3 R2

    R1

    C2

    CLK

    D Q D QIBUFDS

    BasebandData ADDR DATA

    ROM

    Phase Angle

    CLK

    D Q LPFAudioData

    +

    The design slashes the signal-to-noise ratio because of the quantizing noise generated by 1-bit sampling. But fourchannels are still distinguishable against the noise floor.

    Thus, we were able to prove our theory that an FPGAs I/O pinscan be used as a comparator or as a 1-bit A/D converter.

    Figure 3 The antenna input method, making use of the differential input buffer primitive (IBUFDS)

    Figure 4 Frequency demodulator, which converts the frequency of the IQ baseband signal to our audio signal

  • low-pass filter to obtain the orthogo-nal baseband signal. Here, we usedthe CIC Compiler 2.0 IP core to builda three-order, low-pass CIC decima-tion filter with a downsampling fac-tor R = 240. The sample rate drops to1 MSPS because the filtered base-band orthogonal signal is a narrow-band signal.

    Figure 4 shows the frequencydemodulator, which converts thefrequency of the IQ baseband signalto our audio signal. We extracted theIQ datas instantaneous phase angleusing a lookup table in ROM. The IQdata is merged and is used as anaddress of the ROM; then the ROMoutputs both the real and the imagi-nary part of the corresponding com-plex angle. Next we have a differ-ence operation in which one tap by a

    32 Xcell Journal Third Quarter 2013

    X P E R I M E N T

    Slice Logic Utilization Used Available Utilization

    Slice registers

    Slice LUTs

    Occupied slices

    MUXCYs

    RAMB16BWERS

    RAMB8Bwers

    BUFG/BUFGMUXs

    PLL_ADVS

    1,181

    1,062

    406

    544

    2

    1

    4

    1

    11,440

    5,720

    1,430

    2,860

    32

    64

    16

    2

    10%

    18%

    28%

    19%

    6%

    1%

    25%

    50%

    Table 1 Resource utilization on the XC6SLX9 FPGA

    Figure 5 Co-author and grad student Xingran Yang shows off the working model.

  • flip-flop delays the phase-angle data.When the delayed data is subtractedfrom the original data, the result isexactly the wanted audio signal. Toimprove the output signal-to-noiseratio, we filtered the audio signalthrough a low-pass filter using sim-ple averaging.

    The frequency demodulators 8-bitaudio data stream output is scaledaccording to the volume controlparameter and sent into an 8-bitPWM module. The duty cycle ofPWM pulses represents the ampli-tude of the audio signal. The pulsesoutput at the FPGAs I/O pins anddrive the headphone through capaci-

    tance. Here, the headphone acts as alow-pass filter, eliminating the high-frequency part of the pulses remain-ing in the audio signal.

    Two rotary incremental encoderscontrol the frequency and volume ofthe radio. Each of the encoders out-puts two pulse signals. The rotationaldirection and rotational speed canbe determined by the pulse widthand phase. State machines and coun-ters translate the rotation states intoa frequency control word and a vol-ume control word. Meanwhile, thefrequency volume values are decod-ed and then displayed on the seven-segment LEDs.

    ROOM TO CUSTOMIZEThe printed-circuit board for theXRadio platform is shown in Figure5. With a Spartan-6 XC6SLX9 FPGA,you can implement a radio programthat consumes only a small part ofthe FPGAs logic resources, as shownin Table 1. This means there is roomfor students to customize the designand make improvements. For exam-ple, today with the 1-bit sampling,XRadios audio quality is not as goodas a standard FM radio. However,this leaves room for students to fig-ure out how to improve the XRadiosaudio quality and how to implementit in stereo.

    X P E R I M E N T

    Third Quarter 2013 Xcell Journal 33

    Debugging Xilinx's ZynqTM -7000 family

    with ARM CoreSight

    RTOS support, including Linux

    kernel and process debugging

    SMP/AMP multicore CortexTM-

    A9 MPCoreTMs debugging

    Up to 4 GByte realtime trace

    including PTM/ITM

    Profiling, performance and

    statistical analysis of Zynq's

    multicore CortexTM -A9 MPCoreTM

  • 34 Xcell Journal Third Quarter 2013

    XPERTS CORNER

    by Syed Zahid Ahmed, PhDSenior ResearcherCNRS-ENSEA-UCP ETIS Lab, Paris [email protected]

    Sbastien Fuhrmann, MScR&D EngineerCNRS-ENSEA-UCP ETIS Lab (now at Elsys Design), Paris [email protected]

    Bertrand Granado, PhDFull ProfessorCNRS-ENSEA-UCP ETIS Lab and UPMC LIP6 Lab, Paris [email protected]

    Benchmark: VivadosESL Capabilities SpeedIP Design on Zynq SoC

  • Third Quarter 2013 Xcell Journal 35

    F PGAs are widely used as prototyping oractual SoC implementation vehicles forsignal-processing applications. Theirmassive parallelism and abundant heteroge-neous blocks of on-chip memory along withDSP building blocks allow efficient implemen-tations that often rival standard microproces-sors, DSPs and GPUs. With the emergence ofthe Xilinx 28-nanometer Zynq-7000 AllProgrammable SoC, equipped with a hardARM processor alongside the programmablelogic, Xilinx devices have become even moreappealing for embedded systems.

    Despite the devices attractive properties,some designers avoid FPGAs because of thecomplexity of programming theman esotericjob involving hardware description languagessuch as VHDL and Verilog. But the advent ofpractical electronic system-level (ESL) designpromises to ease the task and put FPGA pro-gramming within the grasp of a broader range ofdesign engineers.

    ESL tools, which raise design abstraction tolevels above the mainstream register transferlevel (RTL), have a long history in the industry. [1]After numerous false starts, these high-leveldesign tools are finally gaining traction in real-world applications. Xilinx has embraced theESL methodology to solve the programmabilityissue of its FPGAs in its latest design tool offer-ing, the Vivado Design Suite. Vivado HLS, thehigh-level synthesis tool, incorporates built-inESL to automate the transformation of a designfrom C, C++ or SystemC to HDL.

    The availability of Vivado HLS and the ZynqSoC motivated our team to do a cross-analysisof an ESL approach vs. hand-coding as part ofour ongoing research into IP-based image-pro-cessing solutions, undertaken in a collaborativeindustrial-academic project. We hoped that ourwork would provide an unbiased, independentcase study of interest to other design teams,since the tools are still new and not much liter-ature is available on the general user experi-ence. BDTI experiments [2, 3] provide a niceoverview of the quality and potential of ESLtools for FPGAs at a higher level, but the BDTIteam rewrote the source application in an opti-mal manner for implementation with ESL tools.Our work, by contrast, used the original legacysource code of the algorithms, which was writ-ten for the PC.

    X P E R T S C O R N E R

    Automated methodology delivers results similar to hand-coded RTL for twoimage-processing IP cores

  • In comparing hand-coded blocks ofIP that were previously validated on aXilinx competitors 40-nm midrangeproducts to ESL programming on aZynq SoC, we found that the ESLresults rivaled the manual implemen-tations in multiple aspects, with onenotable exception: latency. Moreover,the ESL approach dramaticallyslashed the development time.

    Key objectives of our work were:

    Assess the effort/quality of porting existing (C/C++) code to hardware with no or minimalmodification;

    Do a detailed design spaceexploration (ESL vs. hand-codedIP);

    Investigate productivity trade-offs (efficiency vs. design time);

    Examine portability issues(migration to other FPGAs orASICs);

    Gauge the ease of use of ESLtools for hardware, software andsystem designers.

    VIVADO-BASED METHODOLOGYWe used Xilinx Vivado HLS 2012.4(ISE 14.4) design tools and the ZynqSoC-based ZedBoard (XC7z20clg484-1)for prototyping of the presented exper-iments. We chose two IP cores out of

    10 from our ongoing Image ProcessingFPGA-SoC collaborative project forscanning applications. The chosenpieces of IP were among the most com-plex in the project. Optimally designedin Verilog and validated on the Xilinxcompetitors 40-nm product, they per-form specific image-processing func-tions for scanning products of theindustrial partner, Sagemcom.

    Figure 1 outlines the experimenta-tion flow using Xilinx tools. For pur-poses of our experiments, we beganwith the original C source code of thealgorithms. It is worth noting that thecode was written for the PC in a clas-sical way and is not optimal for effi-cient porting to an embedded system.

    Our objective was to make mini-malideally, zeromodifications tothe code. While Vivado HLS supportsANSI C/C++, we did have to makesome basic tweaks for issues likedynamic memory allocation and com-plex structure pointers. We accom-plished this task without touching thecore of the coding, as recommendedby Xilinx documentation. We createdthe AXI4 master and slave (AXI-Lite)interfaces on function arguments andprocessed the code within VivadoHLS with varying constraints. Wespecified these constraints usingdirectives to create different solu-tions that we validated on the Zynq

    Z20 SoC using the ZedBoard. The sin-gle constraint we had to make forcomparisons sake involved latency.We had to assume the Zynq SoCsAXI4 interface would have compara-ble latency at the same frequency asthe bus of the competitors FPGA inthe hand-coded design. Porting theRTL code (the core of the IP) to theXilinx environment was straightfor-ward, but redesigning the interfacewas beyond the scope of our work.

    We used the Vivado HLS toolsbuilt-in SystemC co-simulation fea-ture to verify the generated hardwarebefore prototyping on the board. Thefact that Vivado HLS also generatessoftware drivers (access functions)for the IP further accelerated the timeit took to validate and debug the IP.

    We used Core-0 of the Zynqdevices ARM Cortex-A9 MPCore at666.7 MHz, with a DDR3 interface at533.3 MHz. An AXI4 timer took careof latency measurements. Figure 2presents the validation flow we usedon the FPGA prototype. The image isprocessed with the original C sourceon ARM to get the golden referenceresults. We then processed the sameimage with the IP and compared themto validate the results.

    As a standalone project, we alsoevaluated the latency measurementsof the two blocks of IP for a 100-MHzMicroBlaze processor with 8+8KBI+D cache, and hardware multiplierand divider units connected directlywith DDR3 via the AXI4 bus. We usedthe latency measurements of the twoIP cores on the MicroBlaze as a refer-ence to compare the speedupachieved by the Vivado HLS-generat-ed IP and the ARM processor, whichhas brought significant software com-pute power to FPGA designers.

    EXPERIMENTAL RESULTSLets now take a closer look at thedesign space explorations of the twopieces of IP, which we have labeledIP1 and IP2, using the HLS tools.Weve outlined the IP1 experiment in

    36 Xcell Journal Third Quarter 2013

    X P E R T S C O R N E R

    C Source

    Mallocs AXI Interfaces Constraints

    Vivado HLS

    ARM Cortex-A9Processing System

    DDR3Controller

    AXI4

    Vivado HLS

    IPAXI

    Timer Zynq Z20SoC

    SW D

    river

    s

    HW

    Figure 1 The test hardware system generation for the Zynq Z20 SoC

  • detail but to avoid redundancy, for IP2weve just shown the results. Finally,we will also discuss some feedbackwe received from the experts at Xilinxto our query regarding the efficiencyof the AXI4 burst implementationused in the Xilinx ESL tools.

    Tables 1 and 2 outline the details ofour design space explorations for IP1in the form of steps (we have listedthem as S1, S2, etc.). For our experi-ments, we used a gray-scale testimage of 256 x 256 pixels. It is impor-tant to note that the clock period,latency and power figures are just ref-erence values given by a tool afterquick synthesis for the cross-analysisof multiple implementations. Theactual values can significantly varyafter final implementation, as you willsee in the discussion of the FPGA pro-totype results (Table 2).

    S1: RAW CODE TRANSFORMATIONIn the first step, we compiled the codewith no optimization constraints(except the obligatory modificationsdescribed earlier). As yo