68
www.xilinx.com/xcell/ SOLUTIONS FOR A PROGRAMMABLE WORLD Xcell journal Xcell journal ISSUE 84, THIRD QUARTER 2013 Efficient Bitcoin Miner System Implemented on Zynq SoC Benchmark: Vivado’s ESL Capabilities Speed IP Design on Zynq SoC How to Boost Zynq SoC Performance by Creating Your Own Peripheral Xilinx Goes UltraScale at 20 nm and FinFET XRadio: A Rockin’ FPGA-only Radio for Teaching System Design page 28

Xcell Journal issue 84

Embed Size (px)

DESCRIPTION

The summer 2013 edition of Xcell Journal includes a cover story that examines Xilinx new innovative UltraScale architecture, which Xilinx will deploy in its 20nm planar and 16nm FinFET All Programmable devices. The issue also includes a variety of fantastic methodology and how-to articles for Xilinx users of all skill levels.

Citation preview

Page 1: Xcell Journal issue 84

www.xilinx.com/xcell/

S O L U T I O N S F O R A P R O G R A M M A B L E W O R L D

Xcell journalXcell journalI S SUE 84 , TH IRD QUAR TER 2013

Efficient Bitcoin Miner SystemImplemented on Zynq SoC

Benchmark: Vivado’s ESLCapabilities Speed IP Design on Zynq SoC

How to Boost Zynq SoCPerformance by Creating Your Own Peripheral

Xilinx GoesUltraScale at 20 nm and FinFET

XRadio: A Rockin’ FPGA-only Radio for Teaching System Design

page28

Page 2: Xcell Journal issue 84

Unlock Your System’s True

Potential

Xilinx® Zynq®-7000 AP SoC Mini-Module Plus System

© Avnet, Inc. 2013. All rights reserved. AVNET is a registered trademark of Avnet, Inc.Xilinx and Zynq are trademarks or registered trademarks of Xilinx, Inc.

The Xilinx® Zynq®-7000 All Programmable SoC Mini-Module Plus designed by Avnet, is a

small system-on-module (SOM) that integrates all the necessary functions and interfaces

for a Zynq-7000 AP SoC system. For development purposes, the module can be combined

with the Mini-Module Plus Baseboard II and a Mini-Module Plus Power Supply to provide an

out-of-box development system ready for prototyping. The modular concept also allows the

same Zynq-7000 AP SoC to be placed on a user-created baseboard, simplifying the user’s

baseboard design and reducing the development risk.

www.em.avnet.com/MMP-7Z045-G

Page 4: Xcell Journal issue 84

L E T T E R F R O M T H E P U B L I S H E R

Xilinx, Inc.2100 Logic DriveSan Jose, CA 95124-3400Phone: 408-559-7778FAX: 408-879-4780www.xilinx.com/xcell/

© 2013 Xilinx, Inc. All rights reserved. XILINX, the Xilinx Logo, and other designated brands includedherein are trademarks of Xilinx, Inc. All other trade-marks are the property of their respective owners.

The articles, information, and other materials includedin this issue are provided solely for the convenience ofour readers. Xilinx makes no warranties, express,implied, statutory, or otherwise, and accepts no liabilitywith respect to any such articles, information, or othermaterials or their use, and any use thereof is solely atthe risk of the user. Any person or entity using suchinformation in any way releases and waives any claim itmight have against Xilinx for any loss, damage, orexpense caused thereby.

PUBLISHER Mike [email protected]

EDITOR Jacqueline Damian

ART DIRECTOR Scott Blair

DESIGN/PRODUCTION Teie, Gelwicks & Associates1-800-493-5551

ADVERTISING SALES Dan [email protected]

INTERNATIONAL Melissa Zhang, Asia [email protected]

Christelle Moraga, Europe/Middle East/[email protected]

Tomoko Suto, [email protected]

REPRINT ORDERS 1-800-493-5551

Xcell journal

www.xilinx.com/xcell/

Xilinx UltraScale Raises Value of All Programmable

If you’ve been keeping up with developments in the road map for advanced semi-conductor process technologies, you’re probably aware that the cost of producingsilicon is accelerating. Some have asserted that the cost benefit afforded by Moore’s

Law—namely, that the cost stays the same even though the transistor count doubles—is running out of gas as processes shrink below 28 nanometers. Indeed, for the 20-nm planarand 14/16-nm FinFET generation, industry analysts foresee a tandem trend of rising NRE costsand spiraling price tags for each transistor at each new node.

Analysts say that to make a decent profit on a chip implemented at 20 nm, you will firstneed to clear an estimated lowball cost of $150 million for designing and manufacturing thedevice. And to hit that lowball mark, you’d better get it right the first time, because maskcosts are approaching $5 million to $8 million per mask.

FinFET looks even more challenging. Analysts predict that the average cost for the16/14-nm FinFET node will be around $325 million—more than double the cost of 20-nmmanufacturing! Why so high? Analysts believe companies will encounter excessiveretooling, methodology and IP-requalification costs in switching from planar-transistorprocesses to FinFETs. Tools and methods will have to deal with new complexities in signalintegrity, electromigration, width quantization, and resistance and capacitance. The combi-nation of new transistor, lithography and manufacturing techniques and tools—along withrevamped and more expensive IP—represents a perfect storm for design teams.

To say the least, it’s hard to justify targeting 16/14-nm FinFET unless the end productwill sell in multiple tens of millions of units and have returns counted in billions rather thanmillions of dollars. As such, eliminating the NRE costs for customers designing on theseadvanced nodes is a growing sales advantage for all vendors playing in the FPGA space. Butif you have been following Xilinx’s progress for the last five years, you’ve seen that the com-pany has clearly demonstrated technical superiority and a unique value proposition. Today’sXilinx® All Programmable devices offer unmatchable system functionality with highcapacity and performance, lower power and reduced BOM costs, paired with more flexi-bility and greater programmability than any other silicon device on the market.

Xilinx executed several breakout moves at the 28-nm node, becoming the first to marketwith 28-nm FPGAs, 3D ICs and Zynq®-7000 All Programmable SoCs. At the same time, thecompany introduced the Vivado® Design Suite, incorporating state-of-the-art EDA tech-nologies. There are more breakthroughs in store at the 20-nm and 16-nm FinFET nodes.In July, Xilinx once again became the first merechant chip company to tape out an IC ata new manufacturing node: 20 nm. These new devices will include numerous architec-tural innovations in Xilinx’s new UltraScale™ architecture.

As you will read in the cover story, the UltraScale architecture will deliver many benefitsfor customers, perhaps the most noteworthy of which is a vast increase in system usage—better than 90 percent. It’s worth noting that the power of the UltraScale architecture couldnot have been fully realized without Vivado. The tool suite has greatly boosted productivityfor Xilinx customers using the 28-nm devices. Now, this sophisticated EDA technology ishelping Xilinx unleash the benefits of 20-nm and FinFET silicon. Xilinx will deliver thatunmatchable advantage to customers—with no NRE charges.

Mike SantariniPublisher

Page 6: Xcell Journal issue 84

C O N T E N T S

VIEWPOINTS XCELLENCE BY DESIGN APPLICATION FEATURES

Cover StoryXilinx 20-nm Planar and 16-nm FinFET Go UltraScale

88

Xcellence in Networking

Efficient Bitcoin Miner System Implemented on Zynq SoC… 16

Xcellence in Aerospace & Defense

Ensuring FPGA Reconfiguration in Space…23

Letter From the PublisherXilinx UltraScale Raises

Value of All Programmable… 4

16162323

Page 7: Xcell Journal issue 84

T H I R D Q U A R T E R 2 0 1 3 , I S S U E 8 4

THE XILINX XPERIENCE FEATURES

Excellence in Magazine & Journal Writing2010, 2011

Excellence in Magazine & Journal Design and Layout2010, 2011, 2012

Tools of Xcellence Designing a Low-Latency H.264

System Using the Zynq SoC… 54

Innovative Power Design Maximizes

Virtex-7 Transceiver Performance… 60

Xtra, Xtra The latest Xilinx tool updates

and patches, as of July 2013… 64

Xclamations! Share your wit and wisdom by supplying

a caption for our wild and wacky artwork… 66

XTRA READING

2828

Xperiment

XRadio: A Rockin’ FPGA-only Radio for Teaching System Design… 28

Xperts Corner

Benchmark: Vivado’s ESL Capabilities Speed IP Design on Zynq SoC… 34

Xplanation: FPGA 101

How to Boost Zynq Performance by Creating Your Own Peripheral… 42

Xplanation: FPGA 101

JESD204B: Critical Link in the FPGA-Converter Chain… 48

3434

4848

Page 8: Xcell Journal issue 84

COVER STORY

8 Xcell Journal Third Quarter 2013

Xilinx 20-nm Planar and 16-nmFinFET Go UltraScale by Mike SantariniPublisher, Xcell JournalXilinx, Inc. [email protected]

Page 9: Xcell Journal issue 84

Third Quarter 2013 Xcell Journal 9

C O V E R S T O R Y

Building on its breakout move at 28 nanome-ters, Xilinx® has announced two industry firstsat the 20-nm node. Xilinx is the first merchantchip company to tape out a 20-nm device.What’s more, the new device will be the firstXilinx will bring to market using its UltraScale®

technology—the programmable industry’s firstASIC-class architecture. The UltraScale archi-tecture takes full advantage of the state-of-the-art EDA technologies in the Vivado® DesignSuite, enabling customers to quickly create anew generation of All Programmable innova-tions. The move continues the Generation

Ahead advantage over competitors thatbegan at 28 nm, when Xilinx deliv-ered two first-of-their-kind devices:the Zynq®-7000 All ProgrammableSoC and Virtex®-7 3D ICs.

“Xilinx was the first company todeliver 28-nm devices and at 20 nm,

we have maintained the industry’s mostaggressive tapeout schedule,” said Steve

Glaser, senior vice president of marketingand corporate strategy at Xilinx. “Our hardwork is paying off yet again, and we areclearly a year ahead of our competition indelivering high-end devices and half a yearahead delivering midrange devices.”

But as Xilinx did at 28 nm, the companyis also adding some industry-first innova-tions to its new portfolio. Among the firstthe company is revealing is the newUltraScale architecture, which it will beimplementing at 20-nm, 16-nm FinFET and

Xilinx’s industry-first 20-nmtapeout signals the dawning ofan UltraScale-era, ASIC-classprogrammable architecture. B

Page 10: Xcell Journal issue 84

beyond. “It’s the first programmablearchitecture to enable users to imple-ment ASIC-class designs using AllProgrammable devices,” said Glaser.“The UltraScale architecture enablesXilinx to deliver 20-nm and FinFET16-nm All Programmable deviceswith massive I/O and memory band-width, the fastest packet processing,the fastest DSP processing, ASIC-likeclocking, power management andmultilevel security.”

AN ARCHITECTURAL ADVANTAGEThe UltraScale architecture includeshundreds of structural enhancements,many of which Xilinx would havebeen unable to fully implement had itnot already developed its Vivado

Design Suite, which brings leading-edge EDA tool capabilities to Xilinxcustomers. For example, Vivado’sadvanced placement-and-routing capa-bilities enable users to take full advan-tage of UltraScale’s massive data-pro-cessing abilities, so that design teamswill be able to achieve better than 90percent utilization in UltraScaledevices while also hitting ambitiousperformance and power targets. Thisutilization is far beyond what’s possiblewith competing devices, which todayrequire users to trade off performancefor utilization. Customers are thusforced to move to that vendor’s biggerand more expensive device, only tofind they have to slow down the clockyet again to meet system power goals.

This problem didn’t plague Xilinx atthe 28-nm process node thanks toXilinx’s routing architecture. Nor will itbe an issue at 20 nm, because theUltraScale architecture can implementmassive data flow for wide buses tohelp customers develop systems withmultiterabit throughput and with mini-mal latency. The UltraScale architec-ture co-optimized with Vivado resultsin highly optimized critical paths withbuilt-in, high-speed memory cascadingto remove bottlenecks in DSP andpacket processing. Enhanced DSP sub-systems combine critical-path opti-mization with new 27x18-bit multipli-ers and dual adders that enable a mas-sive jump in fixed- and IEEE-754 float-ing-point arithmetic performance and

10 Xcell Journal Third Quarter 2013

C O V E R S T O R Y

Figure 1 – The UltraScale architecture enablesmassive I/O and memory bandwidth, fastpacket processing and DSP processing, alongwith ASIC-like clocking, power managementand multilevel security.

➤ Monolithic to 3D IC

➤ Planar to FinFET

➤ ASIC-class performance

UltraSCALEArchitecture

Page 11: Xcell Journal issue 84

efficiency. The wide-memory imple-mentation also applies to UltraScale3D ICs, which will see a step-functionimprovement in interdie bandwidthfor even higher overall performance.

UltraScale devices feature furtherI/O and memory-bandwidth enhance-ments, including support for next-gen-eration memory interfacing that offersa dramatic reduction in latency andoptimized I/O performance. The archi-tecture offers multiple hardened, ASIC-class IP cores, including 10/100GEthernet, Interlaken and PCIe®.

The UltraScale architecture alsoenables multiregion clocking—a fea-ture typically found only in ASIC-classdevices. Multiregion clocking allowsdesigners to build high-performance,low-power clock networks withextremely low clock skew in their sys-tems. The co-optimization of theUltraScale architecture and Vivadoalso allows design teams to employ awider variety of power-gating tech-niques across a wider range of func-tional elements in Xilinx’s 20-nm AllProgrammable devices to achieve fur-ther power savings in their designs.

Last but not least, UltraScale sup-ports advanced approaches to AES bit-stream decryption and authentication,key obfuscation and secure deviceprogramming to deliver best-in-classsystem security.

TAILORED FOR KEY MARKETSGlaser said that UltraScale deviceswill enable design teams to achieveeven greater levels of system integra-tion while maximizing overall systemperformance, lowering total systempower and reducing the overall BOMcost of their systems. “Building onwhat we did at 28 nm, Xilinx at the20-nm and FinFET 16/14 nodes is rais-

ing the value proposition of our AllProgrammable technology far beyondthe days when FPGAs were consideredmerely a nice alternative to logicdevices,” he said. “Our unique systemvalue is abundantly clear across abroad number of applications.”

Glaser said devices using Xilinx’s 20-nm and FinFET 16-nm UltraScalearchitecture squarely address a num-ber of growth market segments, suchas optical transport networking (OTN),high-performance computing in thenetwork, digital video and wirelesscommunications (see Figure 2). All ofthese sectors must address increasingrequirements for performance, cost,lower power and massive integration.

ULTRASCALE TO SMARTER OTN NETWORKSThe OTN market is currently undergo-ing a change to smarter networks, asdata traffic dictates a massive buildoutto create ever-more-advanced wired

networking equipment that can movedata from 100 Gbps today to 400 Gbpssoon and, tomorrow, to 1 Tbps. Whilethe traditional business approachwould be to simply throw massiveamounts of newer and faster equipmentat the problem, today network opera-tors are seeking smarter ways to buildout their networks and to improve thenetworks they have installed while find-ing better ways to manage costs—all tovastly improve service and profitability.

For example, the data center marketis looking to software-defined network-ing (SDN) as a way to optimize networkutilization and cut costs. SDN will allownetwork managers to leverage thecloud and virtual networks to providedata access to any type of Internet-con-nected device. The hardware to enablethese software-defined networks willhave to be extremely flexible, reliableand high performance. Greater pro-

Third Quarter 2013 Xcell Journal 11

C O V E R S T O R Y

100 Gbps

400 Gbps

1 Tbps

1080P

4K/2K

8K

3G

LTE

LTE-A

OTNNetworking

Digital Video

WirelessCommunications

SmarterApplications

Requirements

Massive PacketProcessing

>400 Gbps wire-speed

Massive Data Flow>5 Tbps

Massive I/O & Memory BW

>5 Tbps

Massive DSPPerformance

>7 TMACs

UltraScaleArchitecture

Figure 2 – The UltraScale architecture will speed the development of multiple types of next-generation systems, especially in wired and wireless communications.

Typically found only in ASIC-class devices, multiregion clocking allows designers to build high-performance, low-power clocknetworks with extremely low clock skew in their systems.

(Continued on page 14)

Page 12: Xcell Journal issue 84

C O V E R S T O R Y

12 Xcell Journal Third Quarter 2013

by Mike Santarini

If you’ve been keeping up with the latest news in semi-conductor process technology, you’ve likely beenreading that the world’s most advanced foundries will

manufacture devices using new transistor structurescalled “FinFETs” as the fundamental building blocks forchips they’ll produce at what’s collectively called the 16/14-nanometer node. But you may be wondering just whatFinFETs are, how they differ from standard transistors,and what benefits and challenges they bring.

One of the key inventions in semiconductor manufac-turing that has enabled today’s $292 billion semiconduc-tor business was Jean Hoerni’s creation of the planarprocess at Fairchild Semiconductor back in the 1950s.The planar process makes it possible to implement ever-smaller transistors, arrange them into a vast array of cir-cuits and interconnect them efficiently on a horizontalplane, rather than having them stick up, as discrete com-ponents do on printed-circuit boards. The planar processhas enabled the extreme miniaturization of ICs. With theplanar process, semiconductors are built in layers on topof, and etched into, ultrapure silicon wafers. Figure 1ashows a planar transistor (essentially an elaborate on/offswitch) composed of three main features: source, gateand drain, which have been etched and layered into a sil-icon substrate. Figure 1b shows a FinFET. Notice thathere, the gate surrounds the channel on three sides,whereas in the planar transistor the gate only covers thetop of the channel.

The source is where electrons enter the transistor.Depending on the voltage at the gate, the transistor’s gatewill be either opened or closed in the channel under thegate, similar to the way a light switch can be switched onor off. If the gate allows the signal through the channel, thedrain will move the electrons to the next transistor in thecircuit. The ideal transistor allows lots of current throughwhen it is “on” and as little current as possible when “off,”and can switch between on and off billions of times persecond. The speed of the switching is the fundamentalparameter that determines the performance of an IC. Chipdesign companies arrange and group transistors into anarray of circuits, which in turn are arranged and groupedinto functional blocks (such as processors, memory andlogic blocks). Then, these blocks too are arranged andgrouped to create the marvelous ICs in the plethora ofelectronic marvels we benefit from today.

Since the 1960s, the semiconductor industry has intro-duced a myriad of silicon innovations to keep pace withMoore’s Law, which states that the number of transistors in anIC will double every two years. This doubling of transistorsmeans that today’s leading-edge ICs contain billions of tran-sistors, crammed into essentially the same die size as 1960schips but running exponentially faster with exponentiallylower power, at 1.2 volts or below. Today, a transistor’s fea-tures are so small that some are only dozens of atoms wide.

KEEPING UP WITH MOOREThe industry is always struggling to keep pace with Moore’sLaw in the face of what appear to be true physical limita-tions to further advancement in semiconductors. Over thelast 10 years, process specialists have striven to ensure theelectrical integrity of the planar transistor. Simply put, theplanar transistor’s source, gate and drain features were sosmall and performance demands so high that these transis-tors could not adequately control electrons moving throughthem. The electrons would essentially leak or get pulledthrough by the drain even if a device was turned off.

For battery-powered devices such as mobile phones, thismeans the battery would drain charge faster, even if thephone was turned off. For AC-powered devices (those thatplug into a wall socket), it means they waste power and runhotter. This heat, if profound and unmanaged by tricks ofcooling, can lessen the lifetime of a product, because leakagecreates heat and heat increases leakage. Of course, the cool-ing adds extra cost to products, too. The leakage problembecame strikingly noticeable in semiconductors at the 130-nm node. It was in this era of extreme leakage that Microsofthad to recall its Xbox 360 worldwide due to overheating;leakage is one of the reasons why microprocessor companieshad to move to less efficient multicore processors over fasterbut much hotter single-processor architectures.

After 130 nm, several refinements came along to reduceheat and improve planar transistors on all fronts to keep upwith Moore’s Law. At 90 nm, the industry moved to low-k-dielectric insulators to improve switching, while the 65- and40-nm nodes introduced further refinements through stressedsilicon. At 28 nm, the industry moved to high-k gate dielectricsand metal gates. Xilinx’s use of the TSMC 28-nm HPL (high-performance, low-power) process provided the ideal mix ofperformance and power, giving Xilinx a Generation Aheadadvantage over competitors at that geometry.

The planar transistor process will be viable at the 20-nmnode thanks to a high-k gate dielectric, metal gates and a

THANKS, MR. HOERNI, BUT THE FIN IS INDawning of a new era in transistor technology benefits Xilinx customers.

Page 13: Xcell Journal issue 84

Third Quarter 2013 Xcell Journal 13

C O V E R S T O R Y

series of more elaborate manufacturing steps such as dou-ble patterning. The 20-nm node provides phenomenal ben-efits in terms of power as well as performance, but the costis increasing marginally because of elaborate manufactur-ing to ensure silicon integrity.

Thanks in large part to remarkable research started byCal Berkeley professor Chenming Hu under a DARPA con-tract, the 20-nm process will likely be the last hurrah for theplanar transistor (at least as we know it today), as theindustry moves to FETs built with fins.

INS AND OUTS OF FINSIn a planar transistor of today, electrical current flows fromsource to drain through a flat, 2D horizontal channel under-neath the gate. The gate voltage controls current flow through

the channel. As transistorsize shrinks with the intro-duction of each new siliconprocess, the planar transistor cannotadequately stop the flow of currentwhen it is in an “off” state, which results inleakage and heat. In a FinFET MOSFET transis-tor, the gate wraps around the channel on threesides, giving the gate much better electrostatic control to stopthe current when the transistor is in the “off” state. Superiorgate control in turn allows designers to increase the currentand switching speed and, thus, the performance of the IC.Because the gate wraps around three sides of the fin-shapedchannel, the FinFET is often called a 3D transistor (not to beconfused with 3D ICs, like the Virtex-7 2000T, which Xilinxpioneered with its stacked-silicon technology).

In a three-dimensional transistor (see Figure 1b), gate con-trol of the channel is on three sides rather than just one, as in

conventional two-dimensional planar transistors (see Figure1a). Even better channel control can be achieved with a thin-ner fin, or in the future with a gate-all-around structurewhere the channel will be enclosed by a gate on all sides.

The industry believes the 16-nm/14-nm FinFET processwill enable a 50 percent performance increase at the samepower as a device built at 28 nm. Alternatively, the FinFETdevice will consume 50 percent less power at the same per-formance. The performance-per-watt benefits added to thecontinued increases in capacity make FinFET processesextremely promising for devices at 16 or 14 nm and beyond.

That said, the cost and complexity of designing andmanufacturing 3D transistors is going to be higher at leastfor the short term, as EDA companies figure out ways toadequately model the device characteristics of these new

processes and augment their tools and flows toaccount for signal integrity, electromigration,width quantization, resistance and capaci-

tance. This complexity makes designing ASICs and ASSPseven riskier and more expensive than before.

Xilinx, however, shields users from the manufacturingdetails. Customers can reap the benefits of increased per-formance per watt and Xilinx’s Generation Ahead designflows to bring innovations based on the new UltraScalearchitecture to market faster.

Figure 1 – The position of the gatediffers in the two-dimensional traditional planar transistor (a) vs. the three-dimensional FinFETtransistor (b).

(a)

(b)

Page 14: Xcell Journal issue 84

grammability coupled more tightlywith processing will be mandatory.

While network manufacturing com-panies develop next-generation soft-ware-defined networks, their cus-tomers are constantly looking to pro-long the life of existing equipment.Companies have implemented single-chip CFP2 optical modules in OTNswitches with Xilinx 28-nm Virtex-7F580T 3D ICs to exponentiallyincrease network bandwidth (seecover story, Xcell Journal issue 80,http://issuu.com/xcelljournal/docs/

xcell80/8?e=2232228/2002872).With UltraScale, Xilinx is introducingdevices that enable further integra-tion and innovation, allowing compa-nies to create systems capable ofdriving multiple CFP4 optical mod-ules for yet another exponential gainin data throughput (see Figure 3).

ULTRASCALE TO SMARTER DIGITAL VIDEO AND BROADCASTDigital video is another market facing amassive buildout. HDTV manufactur-ers and the supporting broadcast infra-structure are rapidly building TVs andbroadcast infrastructure equipment(cameras, communications and pro-duction gear) for 1080p, 4k/2k and 8kvideo. TVs will not only need to con-form to these upcoming standards,they must also be feature rich and rea-sonably priced. The market is extreme-ly competitive, so having devices thatcan adjust to changing standardsquickly and can help manufacturersdifferentiate their TVs is imperative.And while screen resolution and pic-ture quality will increase greatly withthe transition from 1080p to 4k/2k to8k, TV customers will want ever thin-ner models. This means new TVdesigns will require even greater inte-

gration while achieving the perform-ance, power and BOM cost require-ments the consumer market demands.

While broadcast equipment willneed to advance drastically to con-form to the requirements for 4k/2k and8k video, perhaps the biggest cus-tomer requirements for these systemswill be flexibility and upgradability.Broadcast standards are ever evolving.Meanwhile, the equipment to supportthese standards is expensive and noteasily swapped in and out. As a result,broadcast companies are gettingsmarter about their buying and aremaking longevity and upgradabilitytheir top requirements. This makesequipment based on All Programmabledevices a must-have for all sectors ofthe next-generation broadcast market,from the camera at the ballgame to theproduction equipment in the controlroom to the TV in your living room.

C O V E R S T O R Y

14 Xcell Journal Third Quarter 2013

OTL4.10100GGFEC

OTU4Framer

100GMux &Mon

100GSAR

100GInter-Laken

CFP10x11G 10x12.5G

2x100G

7VX690T

OTL4.44x100GGFEC

4x100GOTU4

Framer

4x100GMux &Mon

4x100GSAR

4x100GInter-Laken

CFP 4x28G

CFP 4x28G

CFP 4x28G

CFP 4x28G

5x25G

5x25G

5x25G

5x25G

4x100G

OTL4.10100GGFEC

OTU4Framer

100GMux &Mon

100GSAR

100GInter-Laken

CFP10x11G 10x12.5G

7VX690T

InterlakenOTL4.10

InterlakenOTL4.10

Existing Infrastrucure

Virtex UltraScale Solution

Solution Benefits

Programmable System Integration2 Devices ‡ 1 Device

System Performance2x

FPGA BOM Cost+40%

FPGA Total Power+25%

Accelerated ProductivityHardened 4x100GMAC-to-Interlaken

2x Port Density

UltraSCALE

Figure 3 – The UltraScale architecture will enable higher-performance, lower-power single-chip CFP4 modules that will vastly improve the performance of existing OTN equipment with a low-cost card-swap upgrade.

(Continued from page 11)

Page 15: Xcell Journal issue 84

Figure 4 illustrates how amidrange Kintex® UltraScale devicewill enable professional and “pro-sumer” camera manufacturers toquickly bring to market new cameraswith advanced capabilities for 4k/2kand 8k while dramatically cuttingcost, size and weight.

ULTRASCALE TO SMARTERWIRELESS COMMUNICATIONS Without a doubt, the most rapidlyexpanding business in electronics ismobile communications. With usageof mobile devices increasing tenfoldover the last 10 years and projec-tions for even more dramaticgrowth in the coming 10, carriersare scrambling to devise ways towoo customers to service plans inthe face of increased competition.Wireless coverage and network per-formance are key differentiatorsand so carriers are always looking

for network equipment that is fasterand can carry a vast amount of dataquickly. They could, of course, buymore and more higher-performancebasestations, but the infrastructurecosts of building out and maintainingmore equipment cuts into profitabili-ty, especially as competitors try tolure customers away by offering morecost-effective plans. Very much liketheir counterparts in wired communi-cations, wireless carriers are seekingsmarter ways to run their businessand are looking for innovative net-work architectures.

One such architecture, called cloudradio-access networks (C-RAN), willenable carriers to pool all the base-band processing resources that wouldnormally be associated with multiplebasestations and perform basebandprocessing “in the cloud.” Carriers canautomatically shift their resources toareas experiencing high demand, such

as those near live sporting events, andthen reallocate the processing to otherareas as the crowd disperses. Theresult? Better quality of service for cus-tomers and a reduction in capital aswell as operating expenditures for carri-ers, with improved profitability.

Creating viable C-RAN architecturesrequires highly sophisticated and high-ly flexible systems-on-chip that inte-grate high-speed I/O with parallel andserial processing. Xilinx’s UltraScaleFPGAs and SoCs will play key roles inthis buildout.

Xilinx is building UltraScale 20-nmversions of its Kintex and Virtex AllProgrammable FPGAs, 3D ICs and ZynqAll Programmable SoCs. The company isalso developing Virtex UltraScaledevices in TSMC’s “FinFET 16” technol-ogy (see sidebar, page 12). For moreinformation on Xilinx’s UltraScale archi-tecture, visit Xilinx’s UltraScale pageand read the backgrounder.

Third Quarter 2013 Xcell Journal 15

C O V E R S T O R Y

DDR2 /DDR3

Virtex-7 X485T

Image Processing

DDR2 /DDR3

ImageSensor

LVDS

CPU

Virtex-7 X485T

Video Processing

HD-SDI3G-SDI

Virtex-7 X485T

Connectivity

Existing Infrastrucure

DDR4

Image Processing,Video Processingand Connectivity

ImageSensor

Serial Interface

CPU 10G-SDI12G-SDI10G VoIP

Kintex UltraScale Solution

DDR4

Solution Benefits

Programmable System Integration4 Devices ‡ 2 Devices

System Performance+30%

FPGA BOM Cost+40%

FPGA Total Power-50%

Accelerated ProductivityVivado High-Level Synthesis

Hardened PCIe Gen3

UltraSCALE

Figure 4 – The UltraScale architecture will let professional and consumer-grade camera manufacturers bring 8k cameras to market sooner, saving power, reducing weight, improving performance and lowering BOM cost.

Page 16: Xcell Journal issue 84

16 Xcell Journal Third Quarter 2013

XCELLENCE IN NETWORKING

Efficient Bitcoin Miner System Implemented on Zynq SoC

Page 17: Xcell Journal issue 84

Third Quarter 2013 Xcell Journal 17

Bitcoin is a virtual currency that has become increas-ingly popular over the past few years. As a result,Bitcoin followers have invested some of their assets in

support of this currency by either purchasing or “mining”Bitcoins. Mining is the process of implementing mathematicalcalculations for the Bitcoin Network using computer hard-ware. In return for their services, Bitcoin miners receive alump-sum reward, currently 25 Bitcoins, and any includedtransaction fees. Mining is extremely competitive, since net-work rewards are divided up according to how much calcula-tion all miners have done.

Bitcoin mining began as a software process on cost-ineffi-cient hardware such as CPUs and GPUs. But as Bitcoin’s pop-ularity has grown, the mining process has made dramaticshifts. Early-stage miners had to invest in power-hungryprocessors in order to obtain decent hash rates, as miningrates are called. Although CPU/GPU mining is very inefficient,it is flexible enough to accommodate changes in the Bitcoinprotocol. Over the years, mining operations have slowly grav-itated toward dedicated or semi-dedicated ASIC hardware foroptimized and efficient hash rates. This shift to hardware hasgenerated greater efficiency in mining, but at the cost of lowerflexibility in responding to changes in the mining protocol.

An ASIC is dedicated hardware that is used in a specificapplication to perform certain specified tasks efficiently. Soalthough ASIC Bitcoin miners are relatively inexpensive andproduce excellent hash rates, the trade-off is their inflexibili-ty to protocol changes.

Like an ASIC, an FPGA is also an efficient miner and is rel-atively inexpensive. In addition, FPGAs are more flexible thanASICs and can adjust to Bitcoin protocol changes. The currentproblem is to design a completely efficient and fairly flexiblemining system that does not rely on a PC host or relay deviceto connect with the Bitcoin Network. Our team accomplishedthis task using a Xilinx® Zynq®-7000 All Programmable SoC.

THE BIG PICTURETo accomplish our goal of implementing a complete mining sys-tem, including a working Bitcoin node and a miner that is bothefficient and flexible, we needed a powerful FPGA chip notonly for flexibility, but also for performance. In addition to theFPGA, we also needed to use a processing engine for efficiency.

X C E L L E N C E I N N E T W O R K I N G

by Alexander StandridgeMSc CandidateCalifornia State University, [email protected]

Calvin Ho MSc CandidateCalifornia State University, [email protected]

Shahnam Mirzaei Assistant ProfessorCalifornia State University, [email protected]

Ramin RoostaProfessorCalifornia State University, [email protected]

The integration of programmable logic with a processor subsystemin one device allows for the development of an adaptable and affordable Bitcoin miner.

Page 18: Xcell Journal issue 84

On this complete system-on-chip(SoC), we would need an optimizedkernel to run all required Bitcoin activ-ities, including network maintenanceand transactions handling. The hard-ware that satisfied all these conditionswas the Zynq-7020 SoC residing on theZedBoard development board. Ataround $300 to $400, the ZedBoard is

fairly inexpensive for its class (seehttp://www.zedboard.org/buy). TheZynq-7020 SoC chip contains a pair ofARM® Cortex™-A9 processors embed-ded in 85,000 Artix®-7 FPGA logiccells. The ZedBoard also comes with512 Mbytes of DDR3 memory thatenabled us to run our SoC designfaster. Lastly, the ZedBoard has an SD

card slot for mass storage. It enabledus to put the entire updated Bitcoinprogram onto it.

Using the ZedBoard, we implement-ed our SoC Bitcoin Miner. It includes ahost, a relay, a driver and a miner, asseen in Figure 1. We used the non-graphical version of the originalBitcoin client, known as Bitcoind, as a

X C E L L E N C E I N N E T W O R K I N G

Bitcoin Network

Bitcoind (Host)

Relay

Driver

Miner

SOC Miner

Figure 1 – The basic data flow ofthe miner showing the relation-ships between the components

and the Bitcoin Network

At its simplest, the Bitcoin mining process boils down to a SHA-256process and a comparator. The SHA-256 process manipulates the block header information twice, and then compares it with

the Bitcoin Network’s expanded target difficulty.

#define ROTR(x,n) ( (x >> n) | (x << (32 - n)))// Logic operations

#define s0(x) (ROTR(x,7) ^ ROTR(x,18) ^ (x >> 3))#define s1(x) (ROTR(x,17) ^ ROTR(x,19) ^ (x >> 10))#define s2(a) (ROTR(a,2) ^ ROTR(a,13) ^ ROTR(a,22))#define s3(e) (ROTR(e,6) ^ ROTR(e,11) ^ ROTR(e,25))

#define w(a,b,c,d) (s1(d) + c + s0(b) + a)

#define maj(x,y,z) ((x&y)^(x&z)^(y&z))#define ch(x,y,z) ((x&y)^((~x)&z))

#define t1(e,f,g,h,k,w) (h + s3(e) + ch(e,f,g) + k + w)#define t2(a,b,c) (s2(a) + maj(a,b,c))

for( i = 0; i < 64; i++){// loop 64 times

if( i > 15) {W[i] = w(W[i-16],W[i-15],W[i-7],W[i-

2]); // Temp values}unsigned int temp1 = t1(e,f,g,h,K[i],W[i]);unsigned int temp2 = t2(a,b,c);h = g;

// Round resultsg = f;f = e;e = d + temp1;d = c;c = b;b = a;a = temp1 + temp2;

}

Figure 2 – Vivado HLS implementation of the SHA-256 process

18 Xcell Journal Third Quarter 2013

Page 19: Xcell Journal issue 84

host to interact with the BitcoinNetwork. The relay uses the driver topass work from the host to the miner.

THE DEPTHS OF THE MINING COREWe started developing the mining corewith Vivado® HLS (high-level synthe-sis), using the U.S. Department ofCommerce’s specification for SHA-256.With Vivado HLS’ ability of fast behav-ioral testing, we quickly laid out sever-al prototype mining cores, rangingfrom a simple single-process to a com-plex multiprocess system. But before

exploring the details of the mining-core structure, it’s useful to look at thebasics of the SHA-256 process.

The SHA-256 process pushes 64bytes through 64 iterations of bitwiseshifts, additions and exclusive-ors,producing a 32-byte hash. During theprocess, eight 4-byte registers holdthe results of each round, and whenthe process is done these registersare concatenated to produce thehash. If the input data is less than 63bytes, it is padded to 63 bytes and the64th byte is used to store the length ofthe input data. More commonly, the

input data is larger than 63 bytes. Inthis case, the data is padded out to thenearest multiple of 64, with the lastbyte reserved for the data length. Each64-byte chunk is then run through theSHA-256 process one after another,using the output of the previouschunk, known as the midstate, as thebasis for the next chunk.

With Vivado HLS, the implementa-tion of the SHA-256 process is a simple“for” loop; an array to hold the con-stants required by each round; anadditional array to hold the temporaryvalues used by later rounds; eight vari-ables to hold the results of eachround; and definitions for the logicoperations, as seen in Figure 2.

At its simplest, the Bitcoin miningprocess boils down to a SHA-256process and a comparator. The SHA-256 process manipulates the blockheader information twice, and thencompares it with the Bitcoin Network’sexpanded target difficulty, as seen inFigure 3A. This is a simple concept thatbecomes complicated in hardware,because the first-version Bitcoin blockheader is 80 bytes long. This means thatthe initial pass requires two runs of theSHA-256 process, while the second passrequires only a single run, as seen inFigure 3B. This double pass, definingtwo separate SHA-256 modules, compli-cates the situation because we want tokeep the SHA-256 module as generic aspossible to save development time andto allow reuse. This generic require-ment on the SHA-256 module definesthe inputs and outputs for us, settingthem as the 32-byte initial values, 64-byte data and 32-byte hash. Thus, withthe core aspect of the SHA-256 processisolated, we can then manipulate theseinputs to any layout we desire.

The first of the three prototypes wedeveloped used a single SHA-256Process module. From the beginning,we knew that this core would be theslowest of the three because of thelack of pipelining; however, we werelooking to see how compact we couldmake the SHA-256 process.

X C E L L E N C E I N N E T W O R K I N G

Third Quarter 2013 Xcell Journal 19

SHA-256 Process

Comparator

Header

32 Bytes

1 Bit

A

SHA-256 Process

Comparator

80-Byte Input

32 Bytes

SHA-256 Process

32 Bytes

1 Bit B

SHA-256 Process

Comparator

80-Byte Input

32 Bytes

SHA-256 Process

32 Bytes

1 Bit1 Bit D

SHA-256 Process

SHA-256 Process

SHA-256 Process

SHA-256 Process

64-Byte Input

32 BytesMidstate

32 Bytes

32 Bytes

16 bytes+Padding

Padding

Midstate

Padding

C

Figure 3 – A (top left) is the simplest form of the core, showing the general flow.B splits the single, universal SHA-256 process module into a pair of dedicatedmodules for each phase. C breaks down the specific SHA-256 processes into a

single generic module that can be repeatedly used. D utilizes a quirk in the blockheader information allowing the removal of the first generic SHA-256 process.

Page 20: Xcell Journal issue 84

The second prototype containedthree separate SHA-256 Process mod-ules chained together, as seen inFigure 3C. This configuration allowedfor static inputs where the firstprocess handles the first 64 bytes ofthe header information. The secondprocess handles the remaining 16bytes from the header plus the 48 bytesof required padding. The third processhandles the resulting hash from thefirst two. These three separateprocesses simplified the control logicand allowed for simple pipelining.

The third and final prototype usedtwo SHA-256 Process modules andtook advantage of a quirk in theBitcoin header information and min-ing process that the Bitcoin miningcommunity makes use of. The blockheader information is laid out in sucha way that only the first 64 byteschange when the block is modified.This means that once the first 64bytes are processed, we can save theoutput data (midstate) and only

process the last 16 bytes of the head-er, for a significant hash rate boost(see Figure 3D).

The comparator is a major compo-nent, and properly designed can signif-icantly increase performance. A solu-tion is a hash that has a numeric valueless than the 32-byte expanded targetdifficulty defined by the Bitcoin sys-tem. This means we must compareeach hash we produce and wait to seeif a solution is found. Due to a quirkwith the expanded target difficulty, thefirst 4 bytes will always be zero regard-less of what the difficulty is, and as thenetwork’s difficulty increases so willthe number of leading zeros. At thetime of this writing, the first 13 bytesof the expanded target difficulty werezero. Thus, instead of comparing thewhole hash to the target difficulty, wecheck to see if the first x number ofbytes are zero. If they are zero, thenwe compare the hash to the target dif-ficulty. If they are not zero, then wetoss the hash out and start over.

With the results of our prototypes inhand, we settled on a version of thethird core developed by the Bitcoinopen-source community specificallyfor the Xilinx FPGAs.

ISE DEVELOPMENTBitcoin mining is a race to find a num-ber between zero and 232-1 that givesyou a solution. Thus, there are reallyonly two ways to improve mining-core performance: faster processingor divide-and-conquer. Using aSpartan®-3 and a Spartan-6 develop-ment board, we tested a series of dif-ferent frequencies, pipelining tech-niques and parallelization.

For our frequency tests, we startedwith the Spartan-3E developmentboard. However, it quickly becameapparent that nothing could be donebeyond 50 MHz and as such, we leftthe frequency tests to the Spartan-6.

We tested 50, 100 and 150 MHz onthe Spartan-6 development board,which resulted in predictable results,achieving 0.8, 1.6 and 2.4 mega hashesper second (MHps) respectively. Wealso tried 75 MHz and 125 MHz duringthe parallelization and pipeliningtests, but these frequencies were onlya compromise to make the miner fitonto the Spartan-6.

One of the reasons we chose themining core from the open-source com-munity was that it was designed with adepth variable to control the number ofpipelined stages. The depth is an expo-nential value between 0 and 6, whichcontrols the number of power-two iter-ations the VHDL “generate” commandexecutes—namely, 20 to 26. We per-formed the initial frequency tests with adepth of 0—that is, only a single roundimplemented for each process.

On the Spartan-3e we testeddepths of 0 and 1 before running intotiming-constraint issues. The depth of0 at 50 MHz gave us roughly 0.8 MHps,while the depth of 1 achieved roughly1.6 MHps.

On the Spartan-6 we were able toreach a depth of 3 before running into

20 Xcell Journal Third Quarter 2013

X C E L L E N C E I N N E T W O R K I N G

4

3.5

3

2.5

2

1.5

1

0.5

000:00 12:00 00:00 12:00 00:00 12:00 00:00 12:00

Wo

rker

Sp

eed

(M

Hp

s)

Figure 4 – The hash rate of the mining core over a four-day period of core-optimizationtesting. The graph is an average over time, giving a general idea of performance of the

mining core. The results decreased due to hitting the routing limits of the test platforms.

Page 21: Xcell Journal issue 84

timing-constraint issues. At 50 MHz, theSpartan-6 achieved the same results asthe Spartan-3 at a depth of 0 and 1. Wenoticed an interesting trend at thispoint. Doubling the frequency had thesame effect as increasing the depth ofthe pipeline. The upper limit was set bythe availability of routing resources,with our peak averaging out at roughly3.8 MHps, as seen in Figure 4.

The SHA-256 Process module has10 different 32-bit additions. In thisimplementation, we are attempting todo all of them in a single clock cycle.To shorten the longest path, we triedto divide the adder chains into a seriesof stages. However, doing so radicallychanged the control logic for thewhole mining core and required acomplete rewrite. We dropped thesechanges to save time and effort.

The final performance improvementwe tested was parallelization. Withonly minor modifications, we doubledand quadrupled the number of SHA-256 sets. A set consists of two SHA-256Process modules. This modificationeffectively cut the potential numberrange for each set into halves, for thedoubled number of SHA-256 sets, orfourths for the quadrupled number. Tofit these extra SHA processes onto theSpartan-6, we had to reduce the systemfrequency. Four sets of the SHA-256Process modules would run at 75 MHz,while two sets would run at 125 MHz.The hash rate improvements were dif-ficult to document. We could easily seethe hash rate of an individual SHA-256set; however, a multiple-set miningcore could find a solution faster thanthe hash rate implied.

USING THE EDKAfter FPGA testing, our next step wasto tie the mining core into the ZynqSoC’s AXI4 bus. Xilinx’s EmbeddedDevelopment Kit (EDK) comesloaded with a configuration utilityspecifically designed for the ZynqSoC, allowing us to easily configureevery aspect. By default, the systemhas the 512 Mbytes of DDR3, theEthernet, the USB and the SD inter-face enabled, which is everything weneeded for our Bitcoin SoC.

The twin Cortex-A9 processors fea-ture the AXI4 interface instead of thePLB system used by previous soft-coresystems. All of the peripherals areattached to processors through theAXI4 interface, as shown in Figure 5.

The EDK custom peripheral wizardprovides code stubs for the differentvariants of the AXI4 interface andserves as the basis for the develop-ment of the mining core’s AXI inter-face. For simplicity, we used the AXI4-Lite interface, which provides simpleread and write functionality to themining core. Ideally, developers wouldwant to use the regular AXI4 interfaceto take advantage of the advancedinterface controls, such as data bursts.

Keeping with the goal of simplicity,we used three memory-mapped regis-ters to handle the mining-core I/O. Thefirst register feeds the host data packetto the miner. This register keeps trackof the amount of data passed and locksitself after 11 words have been trans-ferred. This lock is removed automati-cally when a solution is found. We canremove the lock manually if necessarythrough the status register.

We defined the second register asthe status register, flagging certainbits for the various states the miningcore goes through. With the simplicityof our design, we used only threeflags: a loaded flag, a running flag anda solution-found flag. The loaded flagis triggered when the mining corereceives 11 words, as mentionedbefore, and is cleared when we writeto it. The start/running flag is set by us

X C E L L E N C E I N N E T W O R K I N G

Third Quarter 2013 Xcell Journal 21

SD Card

Ethernet

ARM Cortex-A9

ARM Cortex-A9 Mining Core

DDR3 RAM

ProcessorSubsystem

(PS)Programmable

Logic (PL)

AXI4 Bus

Figure 5 – The relatively new AXI4 interface connects the ZedBoard’s hard peripherals with the ARM processors and Artix FPGA logic.

Page 22: Xcell Journal issue 84

that before we can use the reservedaddress we have to request an addressremap for the mining core. The kernelgives us a virtual address through theremap, which we can then use to com-municate to the mining-core registers.

Now that we have access to thehardware, the initialization functionthen runs a simple test on the miningcore to confirm that the core is operat-ing properly. If it is, we register themajor and minor numbers—which areused to identify the device in user pro-grams—with the kernel. Every func-tion has a counter, and the exit functionis the counter to the initialization func-tion. In this function, we undo every-thing done during initialization—specifically, release the major andminor numbers used by the driver, thenrelease the mapped virtual address.

When we first call the mining corefrom the relay program, the open func-tion runs. All we do here is to confirmthat the self-test run during the initial-ization function is successful. If theself-test fails, then we throw a system-error message and fail out. The closefunction is called when the device isreleased from the relay program.However, since we only check the self-test results in the open function, thereis nothing to do in the close function.The read function looks at the databuffer and determines which port theuser is reading, and it then gets thedata from the mining core and passesit back. The write function determineswhich register the user is writing toand passes the data to the mining core.

The final component of our systemis a small relay program to pass work

to start the mining core, and iscleared by the mining core when asolution is found. The final register isthe output register, which is loadedwith the solution when one is found.

We tested each stage of the AXI4-Lite interface development beforeadding the next component. We alsotested the AXI4-Lite interface inXilinx’s Software Development Kitusing a bare-bones system. The testserved two purposes: first, to confirmthat the mining core’s AXI4-Lite inter-face was working correctly and sec-ond, to ensure that the mining corewas receiving the data in the correctendian format.

EMBEDDED LINUXWith the mining core tied into theprocessors, we moved to softwaredevelopment. Ideally we wanted tobuild our own firmware, potentiallywith a Linux core, to maximize per-formance. For ease of development,we leveraged a full Linux installation.We used Xillinux, developed byXillybus, an Ubuntu derivative forrapid embedded-system development.

Our first order of business was tocompile Bitcoind for the Cortex-A9architecture. We used the original open-source Bitcoin software to ensure com-patibility with the Bitcoin Network.

Testing Bitcoind was a simplematter. We ran the daemon, waitedfor the sizable block chain to down-load and then used the line commandto start Bitcoind’s built-in CPU min-ing software.

For the Linux driver, you mustimplement a handful of specific func-tions and link them to the Linux kernelfor correct operation. When the driveris loaded a specific initialization func-tion runs, preparing the system forinteraction with the hardware. In thisfunction, we first confirm that thememory address to which the miningcore is tied is available for use, thenreserve the address. Unlike a bare-bones system, Linux uses a virtualmemory address scheme, which means

22 Xcell Journal Third Quarter 2013

X C E L L E N C E I N N E T W O R K I N G

from Bitcoind to the mining corethrough the driver, and return results. Asa matter of course, the relay programwould check to see if Bitcoind was run-ning and that the mining core was readyand operating properly. Since the relayprogram would normally sit idle for themajority of the time, we will implementa statistics compilation subsystem togather data and organize it in a log file.Ideally, we would use a Web host inter-face for configuration, statistical displayand device status.

COMPLETE, EFFICIENT MINING SYSTEMOur solution to an efficient and com-plete Bitcoin mining system is the XilinxZynq-7000 All Programmable SoC on theZedBoard. This board is flexible tochanges in the Bitcoin protocol and alsooffers a high-performance FPGA solu-tion along with SoC capability. Toimprove on our current design withoutjeopardizing the adaptability, we couldadd more mining cores that connect tothe main Bitcoin node. These additionscould boost the performance significant-ly and provide a higher overall hash rate.

Another improvement that couldfurther optimize this design is to havededicated firmware. Our currentdesign is running Ubuntu Linux 12.04,which has many unnecessary process-es running concurrently with theBitcoin program, like SSH. It is awaste of the board’s resources to runthese processes while Bitcoin is run-ning. In a future build, we would elim-inate these processes by running ourown custom firmware dedicated toonly Bitcoin tasks.

To improve on our current design without jeopardizing the adaptability, we could add more

mining cores that connect to the main Bitcoin node.These additions could boost the performance

and provide a higher overall hash rate.

Page 23: Xcell Journal issue 84

Third Quarter 2013 Xcell Journal 23

XCELLENCE IN AEROSPACE & DEFENSE

Ensuring FPGA Reconfiguration in Space by Robért Glein

System EngineerFriedrich-Alexander-Universität Erlangen-Nürnberg (FAU), InformationTechnology (Communication Electronics)[email protected]

Florian RittnerFirmware DesignerFraunhofer IISRF and Microwave Design Department

Alexander HofmannTelecommunications EngineerFraunhofer IISRF and Microwave Design Department

Page 24: Xcell Journal issue 84

24 Xcell Journal Third Quarter 2013

In 2007, the German FederalMinistry of Economics andTechnology commissionedthe German Aerospace Center

to perform a feasibility study of anexperimental satellite to explore com-munications methods. Government aswell as academic researchers will usethe satellite, called the Heinrich Hertzcommunications satellite, to performexperiments in the Ka and K band. Itis scheduled for launch in 2017. A key element of this system is theFraunhofer-designed onboard proces-sor, which we implemented on Xilinx®

Virtex®-5QV FPGAs. The Fraunhofer onboard proces-

sor—which is involved in the in-orbitverification of modules in space—is theessential module of the satellite’s regen-erative transponder. Because theHeinrich Hertz satellite will be used fora broad number of communicationsexperiments, we chose to leverage theFPGAs’ capability to be reconfigured forvarious tasks. This saves space, reducesweight and enables the use of futurecommunications protocols. Figure 1illustrates this onboard processor andother parts of the scientific in-orbit veri-fication payload of the satellite.

But because space is perhaps theharshest environment for electronics,we had to take a few key steps toensure the system could be reconfig-ured reliably. The radiation-hardenedXilinx FPGAs offer a perfect solutionfor reconfigurable space applicationsregarding total ionizing dose, single-event effects, vibration and thermalcycles. The reconfiguration of the dig-ital hardware of these FPGAs is anessential function in this geostationarysatellite mission in order to ensure

system flexibility. To achieve a robustreconfiguration without a single pointof failure, we devised a two-prongedmethodology that employs two con-figuration methods in parallel. Besidesa state-of-the-art (SOTA) method, wealso introduced a fail-safe reconfigura-tion method via partial reconfigura-tion, which enables a self-reconfigura-tion of the FPGAs.

A straightforward way to configureFPGAs is to use an external radiation-hardened configuration processor witha rad-hard mass memory (for example,flash). In this case, the (partial) bit fileis stored in the memory and the proces-sor configures the FPGA with the bit-stream. This approach offers good flex-ibility if your design requires a fullFPGA reconfiguration and has manybit files onboard. You can update orverify bit files via a low-data-ratetelemetry/telecommand channel or adigital communication up- or downlinkwith a higher data rate. This SOTA con-figuration method is nonredundant. Letus assume a failure of the processor orof the configuration interface, asshown in Figure 2. Now it is impossibleto reconfigure the FPGA and all Virtex-5QV devices are useless.

OVERALL CONFIGURATIONCONCEPT FOR THE FPGAFigure 2 illustrates both configurationmethods for one FPGA—the SOTAmethod and the fail-safe configuration(also the initial configuration) method,which only needs one radiation-hard-ened nonvolatile memory (PROM ormagnetoresistive RAM) as an externalcomponent. This component stores theinitial firmware, recognizable as con-tent of the FPGA. We use these two con-

figuration methods in parallel in order tobenefit from the advantages of both.With this approach we also increase thereconfiguration reliability and overcomethe single-point-of-failure problem.

DETAILS OF FAIL-SAFE CONFIGURATION METHODWe designed this fail-safe initial con-figuration method in order to enableself-reconfiguration of the FPGA. TheFPGA configures itself with the initialconfiguration at startup of theonboard processor. We use the serialmaster configuration protocol for thePROM or the Byte Peripheral Interfacefor the magnetoresistive RAM (MRAM).However, only one nonvolatile memoryper FPGA is required for this fail-safeconfiguration method.

We separated the firmware of the ini-tial configuration in dynamic (blue) andstatic (light-red and red) areas as shownin Figure 2. With partial reconfiguration(PR), it is possible to reconfiguredynamic areas while static and otherdynamic areas are still running.

After a power-up sequence, theFPGA automatically configures itselfwith the initial firmware and is readyto receive new partial or even com-plete (only by MRAM) bit files. A trans-mitter on Earth packs these bit filesinto a data stream, encrypts, encodesand modulates it, and sends it over theuplink to the satellite. After signal con-ditioning and digitizing, the first recon-figurable partition (RP0) inside theFPGA receives the digital data stream.RP0 performs a digital demodulationand decoding. The Block RAM (BRAM)shares the packed bit file with theMicroBlaze™ system. The MicroBlazeunpacks the bit file and stores it in the

A fail-safe method of reconfiguring rad-hard Xilinx FPGAs improves reliability in remote space applications.

X C E L L E N C E I N A E R O S PA C E & D E F E N S E

Page 25: Xcell Journal issue 84

X C E L L E N C E I N A E R O S PA C E & D E F E N S E

SRAM or SDRAM. The MicroBlazereconfigures a dynamic reconfigurablepartition (RP) or all RPs with a direct-memory access (DMA) via the InternalConfiguration Access Port (ICAP).The static part of the FPGA should notbe reconfigured.

In case of an overwritten RP0, a high-priority command over the satellite con-trol bus could reboot the onboardprocessor. This command would restorethe initial fail-safe configuration.

There are plenty of new possibleapplications for reconfigurable parti-tions, including digital transmitters,digital receivers and replacement of ahighly reliable communications proto-col, among others. You can also recon-figure and switch to new submoduleslike demappers or decoders for adap-tive coding and modulation.

You can even replace the whole fail-safe configuration bit file with a newdouble-checked initial bit file viaMRAM as storage for the initial config-uration. With a digital transmitter, theonboard processor can also send theconfiguration data back to Earth inorder to verify bit files on the ground.

CONFIGURATION RELIABILITYA major benefit of using both configu-ration methods (SOTA and initial) isan increase of the overall system relia-bility. There is a drawback in the initialfail-safe method regarding the usableFPGA resources. In a geosynchronous

orbit mission with a term of 15 years,reliability of the most important pro-cedures such as configurationdepends on the failure-in-time (FIT)rates of both configuration methods.We assume a typical FIT rate of 500 ina representative configuration proces-sor and its peripherals for the SOTAconfiguration method. The PROMachieves a much better FIT rate of 61(part stress method for failure-rateprediction at operating conditionsregarding MIL-HDBK 217).

Nevertheless, both methods aresingle points of failure. Only the par-

allel operation guarantees a signifi-cant increase in the reliability of theconfiguration procedure. Calculatingthe standalone reliability of theSOTA configuration method resultsin Rsc = 0.9364 and of the initialmethod in Ric = 0.9920, with the

given mission parameters. The over-all configuration reliability for sys-tems in parallel is Rc = 0.9929.

We have to compare Rc and Rsc inorder to obtain the gain of the recon-figuration reliability. The achievedgain (Rc - Rsc) is 5.65 percent. The ini-tial configuration has some restric-tions in terms of FPGA resources.Nevertheless, you can still reconfigurethe whole FPGA anytime with themore complex SOTA configurationmethod if the external configurationprocessor is still operating.

Both methods can correct failures

of the FPGA caused by configurationsingle-event upsets. For the SOTA con-figuration method, the external config-uration processor performs a scrub-bing, as does the self-scrubbedMicroBlaze for the initial configura-tion method.

Third Quarter 2013 Quarter 2013 Xcell Journal 25

Two-Spot-BeamRx

Two-Spot-BeamTxFilter

Filter

LTWTA

Filter

LNA Down-converter

Down-converter

Up-converter

SwitchMatrix

SwitchMatrix

FOBP

Figure 1 – Part of the scientific in-orbit verification payload of the Heinrich Hertz satellite; the gold box in the middle,

labeled FOBP, is the Fraunhofer onboard processor. LTWTA stands for linear traveling-wave tube amplifier.

In case of an overwrite, a high-priority command over thesatellite control bus could reboot

the onboard processor and restorethe initial fail-safe configuration.

Page 26: Xcell Journal issue 84

IMPLEMENTING FAIL-SAFECONFIGURATIONFigure 2 illustrates the detailed archi-tecture of one of the onboard proces-sor’s FPGAs. The FPGA is divided intothree parts: MicroBlaze, static intellec-tual-property (IP) cores and RPs. Anissue of the MicroBlaze system is mem-ory management. Alongside the PROMor MRAM, we attached an SRAM andan SDRAM. These two different memo-ry types enable applications like tempo-rary packet buffers and content storage(for example, video files). The memorymultiplexer block (Mem Mux) enableshigher flexibility. It multiplexes thememory to RPs or to the MicroBlazeregion. Semaphores control the multi-plexer via a general-purpose input/out-put (GPIO) connection. So, only oneregion can use the single-port memoryat any given time. The initial configura-

tion also contains internal memory withBRAM, presenting an opportunity for afast data exchange of both regions.

The build wrapper in XilinxPlatform Studio (XPS) enables a uni-versal integration of the MicroBlazesystem with support and connection ofindependent ports.

We implemented IP cores such asBRAM, RocketIO™ and digital clockmanager (DCM). The RP switchingblock connects the RPs by means ofsettings in the RP switching configura-tion, which flexibly interconnects theRPs. A data flow control (DFC) man-ages the data flow among the RPs. TheFPGAs communicate to each other bymeans of RocketIO and low-voltage dif-ferential signaling (LVDS).

We divided the RPs into two parti-tions with 20 percent configurable logicblock (CLB) resources of the FPGA and

two partitions with 10 percent CLBresources. We also tried to share theDSP48 slices and BRAM equally andmaximally to all partitions.

The main part of the implementationis the system design and floorplanning,which we realized with the PlanAheadtool from Xilinx. We combined theMicroBlaze system, the IP cores, thesoftware design and the partial userhardware description language (HDL)modules, called reconfigurable mod-ules. We also defined the location ofRPs, planned design runs, chose a syn-thesis strategy and constrained thedesign in this step. A BRAM memorymap file defines the BRAM resourcesfor data and instruction memory. Thisenables an integration of the initialMicroBlaze software in the bit file.

Figure 3 shows the implementationresult. Currently we are developing a

26 Xcell Journal Third Quarter 2013

RP0Size 20%

RP1Size 20%

RP2Size 10%

RP3Size 10%

RP Switching Block

BRAMBlock

Clocks

Uplink/Downlink

RIO GTX RIO Cores

FPGA CoreTOP Wrapper (top_wrap.vhd)

Mem Mux

MPMCSDRAM

BlockEnable Bank

GPIO2

GPIO1

MEM Semaphores

EMCSRAM

EXTBC

BRAMBlock

INTBC

INTBC

INTBC

Micro-Blaze

BRAM(Data &Instr.)

ResetModule

CLKGen

GPIOLED

UART

Timer &WD

MDM

ICAPController

8 x GPIO8 x LED

JTAG

ConfigInter-face

SRAM SDRAMPROM/MRAM

External ConfigurationMicroprocessor

External PCB

External Config-uration Flash

TM/TC

Interlinks TOP/UCF

Ë Dynamic Reconfigurable PartitionsË Static Partitions

Ë Same BoardË Static Partitions

Ë Data-PathË PLB MB

Ë BRAM-BusË LMB MB

DCM120 MHz

DCM360 MHz

Other FPGAs LVDS

,

RP Switching Configuration

Figure 2 – Architecture of the first FPGA in the initial configuration system. The static sections are in red and light red,

and the dynamic portions in light blue (left). The yellow lightning bolt represents a SOTA configuration failure.

X C E L L E N C E I N A E R O S PA C E & D E F E N S E

Page 27: Xcell Journal issue 84

X C E L L E N C E I N A E R O S PA C E & D E F E N S E

highly reliable receiver for the firstpartition (RP0), which will be used inthe initial configuration (on the satel-lite) to receive (demodulating, demap-ping, deinterleaving and decoding) theconfiguration data.

Table 1 illustrates target and actualsizes of the RPs. We achieved a slight-ly higher CLB count than we targeted,whereby the DSP slices nearly reachthe maximum. The BRAM resources

correspond to the CLBs because ofthe MicroBlaze’s demand for BRAMs.The overhead of the static area isapproximately 30 percent. Wedesigned this firmware for the com-mercial Virtex-5 FX130 in order toprove the concept with an elegantbreadboard. At this point, we areabout to port this design to the Virtex-5QV and implement further features,such as triple-modular redundancy orerror-correction codes.

PROBLEM SOLVEDA single point of failure regarding FPGAconfiguration is a problem in highly reli-able applications. You can overcomethis issue with an initial configuration,which needs only one additional deviceper FPGA. Partial reconfigurationenables the remote FPGA configurationvia this fail-safe initial configuration.The parallel use of both configurationscombines their main advantages: theincrease of 5.65 percent in configura-tion reliability and the flexibility toreconfigure the complete FPGA.

System design and implementationchallenges require a trade-offbetween complexity regarding therouting or clock resources and theusable design resources. An elegantbreadboard shows the successfulimplementation and verification in areal-world scenario.

Third Quarter 2013Third Quarter 2013 Xcell Journal 27

CLBs DSP Slices BRAM

Target Result Target Result Target Result

4,096 4,400 max. 80 max. 60

20% ≈ 21% max. ≈ 25% max. ≈ 20%

2,048 2,720 max. 76 max. 38

10% ≈ 13% max. ≈ 24% max. ≈ 13%

12,288 14,240 max. 312 max. 196

60% ≈ 70% max. ≈ 98% max. ≈ 66%

RP0/RP1

RP2/RP3

sum of allRP

resources

Figure 3 – PlanAhead view of the implemented design. The blue rectangles represent the RP regions. RP0 is seen as purple, the MicroBlaze as yellow, BRAM as green and other static regions as red.

Table 1 – FPGA resources of the reconfigurable partitions (RPx)

FPGA S OLUTIONS

? 100K Logic Cells + 240 DSP Slices? DDR3 SDRAM + Gigabit Ethernet? SO-DIMM form factor (68 x 30 mm)

Mars AX3 Artix®-7 FPGA Module

from

Mars ZX3 SoC Module

? Xilinx Zynq™-7000 All Programmable SoC (Dual Cortex™-A9 + Xilinx Artix®-7 FPGA)

? DDR3 SDRAM + NAND Flash? Gigabit Ethernet + USB 2.0 OTG? SO-DIMM form factor (68 x 30 mm)

We speak FPGA.

www.enclustra.com

? Xilinx Kintex™-7 FPGA? High-performance DDR3 SDRAM? USB 3.0, PCIe 2.0 + 2 Gigabit Ethernet ports? Smaller than a credit card

Mercury KX1 FPGA Module

FPGA Manager Solution

Simple, fast host-to-FPGA data transfer, for PCI Express, USB 3.0 and Gigabit Ethernet. Supports user applications written in C, C++, C#, VB.net,

MATLAB®, Simulink® and LabVIEW.

Page 28: Xcell Journal issue 84

28 Xcell Journal Third Quarter 2013

XPERIMENT

It’s easy to build an attractivedemonstration platform for students and beginners around a Xilinx Spartan-6 device and a few peripheral components.

XRadio:A Rockin' FPGA-onlyRadio for TeachingSystem Design

XRadio:A Rockin' FPGA-onlyRadio for TeachingSystem Design

Page 29: Xcell Journal issue 84

Third Quarter 2013 Xcell Journal 29

Recently here at the Beijing Institute of Technology, we setout to develop a digital design educational platform thatdemonstrated the practical use of FPGAs in communica-tions and signal processing. The platform needed to beintuitive and easy to use, to help students understand allaspects of digital design. It also had to be easy for studentsto customize for their own unique system designs.

Around that time, we in the electrical engineering depart-ment had been having a lively debate as to whether it waspossible to use an FPGA’s I/O pins as a comparator or a 1-bitA/D converter directly. We decided to test that premise by try-ing to incorporate this FPGA-based comparator in the designof our XRadio platform, an entirely digital FM radio receiverthat we designed using a low-cost Xilinx® Spartan®-6 FPGAand a few general peripheral components. We demonstratedthe working radio to Xilinx University Program (XUP) direc-tor Patrick Lysaght, during his visit last year to the BeijingInstitute of Technology. He was excited by the simplicity ofthe design, and encouraged us to share the story of theXRadio with the academic community worldwide.

SYSTEM ARCHITECTUREWe built the XRadio platform almost entirely in the FPGAand omitted the use of traditional analog components suchas amplifiers or discrete filters, as shown in Figure 1. As afirst step, we created a basic antenna by linking a simplecoupling circuit with a wire connected to the FPGA’s I/Opins. This antenna transmits RF signals to the FPGA, whichimplements the signal processing of the FM receiver as digi-tal downconversion and frequency demodulation. We thenoutput the audio signal through the I/O pins to the head-phones. We added mechanical rotary incremental encodersto control the XRadio’s tuning and volume. We designed thesystem so that the tuning and volume information is dis-played on the seven-segment LED module.

Figure 2 shows a top-level logic diagram of the FPGA.In this design, the RF signals coupled into the input bufferof the FPGA are quantified as a 1-bit digital signal. Thequantified signal is then multiplied by the local oscillation

X P E R I M E N T

Rby Qiongzhi WuLecturerBeijing Institute of [email protected]

Xingran YangMS StudentBeijing Institute of [email protected]

Page 30: Xcell Journal issue 84

signal generated by a numericallycontrolled oscillator (NCO). Themultiplied signal is filtered to obtainan orthogonal IQ (in-phase plusquadrature phase) baseband signal.It then obtains the audio data streamfrom the IQ signal through the frequen-cy demodulator and low-pass filter.

The data stream next goes to thepulse-width modulation (PWM) mod-ule, which generates pulses on itsoutput. By filtering the pulses, weobtain a proportional analog value todrive the headphones. We connecteda controller module to the mechani-cal rotary incremental encoders and

LEDs. This module gets a pulse sig-nal from incremental encoders toadjust the output frequency of theNCO and audio volume controlledby the PWM module.

IMPLEMENTATION DETAILSThe first challenge we had to over-come was how to couple the signalfrom the antenna to the FPGA. In ourfirst experimental design, we config-ured an FPGA I/O as a standard sin-gle-ended I/O; then we used resistorsR1 and R2 to build a voltage divider togenerate a bias voltage between VIHand VIL at the FPGA pin. The signalthat the antenna receives can drivethe input buffer through the couplingcapacitor C1. The signal is sampledby two levels of D flip-flops, whichare driven by an internal 240-MHzclock. The output of the flip-flopsobtains an equal interval 1-bit sam-pling data stream.

To test this circuit, we fed theresult to another FPGA pin and meas-ured it with a spectrum analyzer tosee if the FPGA received the signalaccurately. However, it didn’t workwell because the Spartan-6 FPGA’sinput buffer has a small, 120-millivolthysteresis, according to the analyzer.Although the hysteresis is good for

30 Xcell Journal Third Quarter 2013

X P E R I M E N T

Incremental Encoder

Earphone

LED Module

SimpleAntenna

Antenna BUF

NCO

SIN & COS

Encoder

CICLPFs

I/Q

LEDDisplay

LPF PWM EarphoneFD

Controller

Figure 1 – The XRadio hardware concept

Figure 2 – Top-level diagram of the XRadio FPGA

Page 31: Xcell Journal issue 84

preventing noise generally, in thisapplication it wasn’t wanted. We hadto devise a way to increase thestrength of the signal.

To solve this problem, we foundthat the differential input bufferprimitive, IBUFDS, has a very highsensitivity between its positive andnegative terminals. According to ourtest, a differential voltage as low as

1 mV peak-to-peak is enough tomake IBUFDS swing between 0 and1. Figure 3 shows the resultinginput circuit. In this implementa-tion, the resistors R1, R2 and R3generate a common voltage in boththe P and N terminals of IBUFDS.The received signals feed into the Psidefile:///app/ds/terminal throughcoupling capacitor C1. The AC signal

is filtered out by the capacitor C2 onthe N side, which acts as an AC refer-ence. With this circuit, the FPGA con-verts the FM broadcast signals into a1-bit data stream successfully.

The radio picks up about seven FMchannels with different amplitudes—including 103.9-MHz Beijing TrafficRadio. The design significantlyreduces the signal-to-noise ratiobecause of the quantizing noise gener-ated by 1-bit sampling. But there arefour channels that are still distin-guishable against the noise floor.Therefore, we were able to prove ourtheory that an FPGA’s I/O pins can beused effectively as a comparator or a1-bit A/D converter in our XRadio.

For digital downconversion, weused the DDS Compiler 4.0 IP core tobuild a numerically controlled oscilla-tor with a 240-MHz system clock. Thesine and cosine output frequencyranged from 87 MHz to 108 MHz. Thelocal-oscillator signals that the NCOgenerates are multiplied with the 1-bitsampling stream and travel through a

X P E R I M E N T

Third Quarter 2013 Xcell Journal 31

VCCO

FPGA

P

NC1

R3 R2

R1

C2

CLK

D Q D QIBUFDS

BasebandData ADDR DATA

ROM

Phase Angle

CLK

D Q LPFAudioData

+

The design slashes the signal-to-noise ratio because of the quantizing noise generated by 1-bit sampling. But fourchannels are still distinguishable against the noise floor.

Thus, we were able to prove our theory that an FPGA’s I/O pinscan be used as a comparator or as a 1-bit A/D converter.

Figure 3 – The antenna input method, making use of the differential input buffer primitive (IBUFDS)

Figure 4 – Frequency demodulator, which converts the frequency of the IQ baseband signal to our audio signal

Page 32: Xcell Journal issue 84

low-pass filter to obtain the orthogo-nal baseband signal. Here, we usedthe CIC Compiler 2.0 IP core to builda three-order, low-pass CIC decima-tion filter with a downsampling fac-tor R = 240. The sample rate drops to1 MSPS because the filtered base-band orthogonal signal is a narrow-band signal.

Figure 4 shows the frequencydemodulator, which converts thefrequency of the IQ baseband signalto our audio signal. We extracted theIQ data’s instantaneous phase angleusing a lookup table in ROM. The IQdata is merged and is used as anaddress of the ROM; then the ROMoutputs both the real and the imagi-nary part of the corresponding com-plex angle. Next we have a differ-ence operation in which one tap by a

32 Xcell Journal Third Quarter 2013

X P E R I M E N T

Slice Logic Utilization Used Available Utilization

Slice registers

Slice LUTs

Occupied slices

MUXCYs

RAMB16BWERS

RAMB8Bwers

BUFG/BUFGMUXs

PLL_ADVS

1,181

1,062

406

544

2

1

4

1

11,440

5,720

1,430

2,860

32

64

16

2

10%

18%

28%

19%

6%

1%

25%

50%

Table 1 – Resource utilization on the XC6SLX9 FPGA

Figure 5 – Co-author and grad student Xingran Yang shows off the working model.

Page 33: Xcell Journal issue 84

flip-flop delays the phase-angle data.When the delayed data is subtractedfrom the original data, the result isexactly the wanted audio signal. Toimprove the output signal-to-noiseratio, we filtered the audio signalthrough a low-pass filter using sim-ple averaging.

The frequency demodulator’s 8-bitaudio data stream output is scaledaccording to the volume controlparameter and sent into an 8-bitPWM module. The duty cycle ofPWM pulses represents the ampli-tude of the audio signal. The pulsesoutput at the FPGA’s I/O pins anddrive the headphone through capaci-

tance. Here, the headphone acts as alow-pass filter, eliminating the high-frequency part of the pulses remain-ing in the audio signal.

Two rotary incremental encoderscontrol the frequency and volume ofthe radio. Each of the encoders out-puts two pulse signals. The rotationaldirection and rotational speed canbe determined by the pulse widthand phase. State machines and coun-ters translate the rotation states intoa frequency control word and a vol-ume control word. Meanwhile, thefrequency volume values are decod-ed and then displayed on the seven-segment LEDs.

ROOM TO CUSTOMIZEThe printed-circuit board for theXRadio platform is shown in Figure5. With a Spartan-6 XC6SLX9 FPGA,you can implement a radio programthat consumes only a small part ofthe FPGA’s logic resources, as shownin Table 1. This means there is roomfor students to customize the designand make improvements. For exam-ple, today with the 1-bit sampling,XRadio’s audio quality is not as goodas a standard FM radio. However,this leaves room for students to fig-ure out how to improve the XRadio’saudio quality and how to implementit in stereo.

X P E R I M E N T

Third Quarter 2013 Xcell Journal 33

Debugging Xilinx's ZynqTM -7000 family

with ARM CoreSight

RTOS support, including Linux

kernel and process debugging

SMP/AMP multicore CortexTM-

A9 MPCoreTMs debugging

Up to 4 GByte realtime trace

including PTM/ITM

Profiling, performance and

statistical analysis of Zynq's

multicore CortexTM -A9 MPCoreTM

Page 34: Xcell Journal issue 84

34 Xcell Journal Third Quarter 2013

XPERTS CORNER

by Syed Zahid Ahmed, PhDSenior ResearcherCNRS-ENSEA-UCP ETIS Lab, Paris [email protected]

Sébastien Fuhrmann, MScR&D EngineerCNRS-ENSEA-UCP ETIS Lab (now at Elsys Design), Paris [email protected]

Bertrand Granado, PhDFull ProfessorCNRS-ENSEA-UCP ETIS Lab and UPMC LIP6 Lab, Paris [email protected]

Benchmark: Vivado’sESL Capabilities SpeedIP Design on Zynq SoC

Page 35: Xcell Journal issue 84

Third Quarter 2013 Xcell Journal 35

F PGAs are widely used as prototyping oractual SoC implementation vehicles forsignal-processing applications. Their

massive parallelism and abundant heteroge-neous blocks of on-chip memory along withDSP building blocks allow efficient implemen-tations that often rival standard microproces-sors, DSPs and GPUs. With the emergence ofthe Xilinx® 28-nanometer Zynq®-7000 AllProgrammable SoC, equipped with a hardARM® processor alongside the programmablelogic, Xilinx devices have become even moreappealing for embedded systems.

Despite the devices’ attractive properties,some designers avoid FPGAs because of thecomplexity of programming them—an esotericjob involving hardware description languagessuch as VHDL and Verilog. But the advent ofpractical electronic system-level (ESL) designpromises to ease the task and put FPGA pro-gramming within the grasp of a broader range ofdesign engineers.

ESL tools, which raise design abstraction tolevels above the mainstream register transferlevel (RTL), have a long history in the industry. [1]After numerous false starts, these high-leveldesign tools are finally gaining traction in real-world applications. Xilinx has embraced theESL methodology to solve the programmabilityissue of its FPGAs in its latest design tool offer-ing, the Vivado® Design Suite. Vivado HLS, thehigh-level synthesis tool, incorporates built-inESL to automate the transformation of a designfrom C, C++ or SystemC to HDL.

The availability of Vivado HLS and the ZynqSoC motivated our team to do a cross-analysisof an ESL approach vs. hand-coding as part ofour ongoing research into IP-based image-pro-cessing solutions, undertaken in a collaborativeindustrial-academic project. We hoped that ourwork would provide an unbiased, independentcase study of interest to other design teams,since the tools are still new and not much liter-ature is available on the general user experi-ence. BDTI experiments [2, 3] provide a niceoverview of the quality and potential of ESLtools for FPGAs at a higher level, but the BDTIteam rewrote the source application in an opti-mal manner for implementation with ESL tools.Our work, by contrast, used the original legacysource code of the algorithms, which was writ-ten for the PC.

X P E R T S C O R N E R

Automated methodology delivers results similar to hand-coded RTL for twoimage-processing IP cores

Page 36: Xcell Journal issue 84

In comparing hand-coded blocks ofIP that were previously validated on aXilinx competitor’s 40-nm midrangeproducts to ESL programming on aZynq SoC, we found that the ESLresults rivaled the manual implemen-tations in multiple aspects, with onenotable exception: latency. Moreover,the ESL approach dramaticallyslashed the development time.

Key objectives of our work were:

• Assess the effort/quality of porting existing (C/C++) code to hardware with no or minimalmodification;

• Do a detailed design spaceexploration (ESL vs. hand-codedIP);

• Investigate productivity trade-offs (efficiency vs. design time);

• Examine portability issues(migration to other FPGAs orASICs);

• Gauge the ease of use of ESLtools for hardware, software andsystem designers.

VIVADO-BASED METHODOLOGYWe used Xilinx Vivado HLS 2012.4(ISE® 14.4) design tools and the ZynqSoC-based ZedBoard (XC7z20clg484-1)for prototyping of the presented exper-iments. We chose two IP cores out of

10 from our ongoing Image ProcessingFPGA-SoC collaborative project forscanning applications. The chosenpieces of IP were among the most com-plex in the project. Optimally designedin Verilog and validated on the Xilinxcompetitor’s 40-nm product, they per-form specific image-processing func-tions for scanning products of theindustrial partner, Sagemcom.

Figure 1 outlines the experimenta-tion flow using Xilinx tools. For pur-poses of our experiments, we beganwith the original C source code of thealgorithms. It is worth noting that thecode was written for the PC in a clas-sical way and is not optimal for effi-cient porting to an embedded system.

Our objective was to make mini-mal—ideally, zero—modifications tothe code. While Vivado HLS supportsANSI C/C++, we did have to makesome basic tweaks for issues likedynamic memory allocation and com-plex structure pointers. We accom-plished this task without touching thecore of the coding, as recommendedby Xilinx documentation. We createdthe AXI4 master and slave (AXI-Lite)interfaces on function arguments andprocessed the code within VivadoHLS with varying constraints. Wespecified these constraints usingdirectives to create different solu-tions that we validated on the Zynq

Z20 SoC using the ZedBoard. The sin-gle constraint we had to make forcomparison’s sake involved latency.We had to assume the Zynq SoC’sAXI4 interface would have compara-ble latency at the same frequency asthe bus of the competitor’s FPGA inthe hand-coded design. Porting theRTL code (the core of the IP) to theXilinx environment was straightfor-ward, but redesigning the interfacewas beyond the scope of our work.

We used the Vivado HLS tool’sbuilt-in SystemC co-simulation fea-ture to verify the generated hardwarebefore prototyping on the board. Thefact that Vivado HLS also generatessoftware drivers (access functions)for the IP further accelerated the timeit took to validate and debug the IP.

We used Core-0 of the Zynqdevice’s ARM Cortex™-A9 MPCore at666.7 MHz, with a DDR3 interface at533.3 MHz. An AXI4 timer took careof latency measurements. Figure 2presents the validation flow we usedon the FPGA prototype. The image isprocessed with the original C sourceon ARM to get the “golden” referenceresults. We then processed the sameimage with the IP and compared themto validate the results.

As a standalone project, we alsoevaluated the latency measurementsof the two blocks of IP for a 100-MHzMicroBlaze™ processor with 8+8KBI+D cache, and hardware multiplierand divider units connected directlywith DDR3 via the AXI4 bus. We usedthe latency measurements of the twoIP cores on the MicroBlaze as a refer-ence to compare the speedupachieved by the Vivado HLS-generat-ed IP and the ARM processor, whichhas brought significant software com-pute power to FPGA designers.

EXPERIMENTAL RESULTSLet’s now take a closer look at thedesign space explorations of the twopieces of IP, which we have labeledIP1 and IP2, using the HLS tools.We’ve outlined the IP1 experiment in

36 Xcell Journal Third Quarter 2013

X P E R T S C O R N E R

C Source

• Mallocs• AXI Interfaces• Constraints

Vivado HLS

ARM Cortex-A9Processing System

DDR3Controller

AXI4

Vivado HLS

IPAXI

Timer Zynq Z20SoC

SW D

river

s

HW

Figure 1 – The test hardware system generation for the Zynq Z20 SoC

Page 37: Xcell Journal issue 84

detail but to avoid redundancy, for IP2we’ve just shown the results. Finally,we will also discuss some feedbackwe received from the experts at Xilinxto our query regarding the efficiencyof the AXI4 “burst” implementationused in the Xilinx ESL tools.

Tables 1 and 2 outline the details ofour design space explorations for IP1in the form of steps (we have listedthem as S1, S2, etc.). For our experi-ments, we used a gray-scale testimage of 256 x 256 pixels. It is impor-tant to note that the clock period,latency and power figures are just ref-erence values given by a tool afterquick synthesis for the cross-analysisof multiple implementations. Theactual values can significantly varyafter final implementation, as you willsee in the discussion of the FPGA pro-totype results (Table 2).

S1: RAW CODE TRANSFORMATIONIn the first step, we compiled the codewith no optimization constraints(except the obligatory modificationsdescribed earlier). As you can seefrom Table 1, the hardware is using anenormous amount of resources. Whenwe analyzed the details of resourcedistribution from the Vivado HLSquick-synthesis report and design-viewer GUI tool, it was clear that theprimary contributor to this massiveresource usage was several divisionoperations in the algorithm. Unlikethe multipliers, there are no hard

blocks for division in an FPGA, so thedivision operations consumed a hugeamount of logic. Therefore, the firstthing we had to do was to optimizethis issue.

S2: S1 WITH SHARED DIVIDERIntuitively, our first thought for howto decrease the division hardware wasto make use of resource sharing, withits trade-off of performance loss.Unfortunately, in the version of thetools we used it is not easy or evenpossible to apply custom constraintson operators directly. But theworkaround was quite simple. Wereplaced the division operation (/) bya function, and on functions, con-straints can be applied. By default, thetool uses functions as inline: For eachdivision, it generates complete divi-sion hardware. We applied the“Allocation” directive on the divisionfunction, experimented with differentvalues and monitored their effect onlatency. Finally, we opted for single,shared division hardware. As you cansee in Table 1, this choice drasticallyreduced lookup tables (LUTs), flip-flops(FFs) and other logic resources withjust a negligible impact on latency—acomplete win-win situation.

S3: S2 WITH CONSTANT DIVISIONTO MULTIPLICATION CONVERSION In the next step, we opted to transformthe division operation into multiplica-tion, which was possible because this

application uses division by constantvalues. Our motivation here wastwofold. First, despite deciding to usea single shared divider for all divi-sions, division hardware is still expen-sive (around 2,000 LUTs and flip-flops). Second, this shared dividertakes 34 cycles for division, which isquite slow, and the version of theXilinx tools we used supports only onetype of divider in the HLS library. Butthere are several options for multipli-cation, as we will see in the next steps.

Therefore, in this step we trans-formed the division operations to mul-tiplications (we used the #ifdef direc-tives of C so as to keep the codeportable). The approximation valueswere provided as function parametersvia AXI-Lite slave registers. This iswhy Table 1 shows a significantimprovement in performance andresources at this step, with a slighttrade-off in added DSPs due toincreased multiplication operations.

S4: S3 WITH BUFFER MEMORYPARTITIONINGThe source algorithm uses multiplememory buffers (8, 16 and 32 bits)arranged together in a unified mallocblock. During the initial transforma-tion to make the algorithm VivadoHLS ready (Figure 1), that memoryblock became an internal on-chipmemory (Block RAMs).

While a unified malloc block was anormal choice for shared externalmemory, when using internal on-chipmemories, it is much more interestingto exploit memory partitioning forfaster throughput. In this step, there-fore, we fractured the larger memoryblock into independent small memo-ries (8, 16 and 32 bits). Table 1 showsthe results, highlighting significantperformance improvements with aslight increase in the number ofBRAMs (Table 2) due to fractured use.This technique also yielded a signifi-cant decrease in logic resources,potentially due to simplified controllogic for memory access.

X P E R T S C O R N E R

Input Image

Image-ProcessingSoftware

Image-ProcessingVivado HLS IP

Comparison

Golden Results IP Results

Third Quarter 2013 Xcell Journal 37

Figure 2 – ESL IP validation flow on the FPGA

Page 38: Xcell Journal issue 84

S5: S4 WITH SHARED 32/64-BITSIGNED/UNSIGNED MULTIPLIERSIn this step, motivated by the shared-divider experiments in S2, we focusedon optimizing the DSP blocks usingshared resources. Our first step wasto conduct a detailed analysis of thegenerated hardware using the DesignViewer utility to cross-reference gen-erated hardware with the source Ccode. The analysis revealed that inraw form, the hardware is using 26distinct multiplier units of heteroge-neous types, which can be classifiedas unsigned 16-/32-/64-bit multiplica-tions and 32-bit signed multiplica-tions. As in our S2 experiments, weconstructed a system consisting of 32-bit signed, 32-bit unsigned and 64-bit

unsigned multipliers (by replacing the“*” operation with a function). Wethen applied an Allocation constraintof using only one unit for these threedistinct operations. Table 1 shows theresulting drastic reduction in DSPusage, albeit with some penalty in per-formance. This result motivated us tofurther investigate smart sharing.

S6: S5 WITH SMARTMULTIPLICATION SHARINGTo mitigate the performance loss aris-ing from shared multiplier usage, wemade smart use of multiplier sharing.We created two additional multipliertypes: a 16-bit unsigned multiplier anda 32-bit unsigned multiplier with 64-bit return value. After several itera-

tions of smartly using the multipliersin a mutually exclusive manner andchanging their quantity, our final solu-tion contained two unsigned 64-bitmultipliers, two unsigned 32-bit multi-pliers, one signed 32-bit multiplier,one unsigned 16-bit multiplier and asingle unsigned 32-bit multiplier with64-bit return value. This techniqueimproved performance slightly, as youcan see in Table 1, but with slightlyhigher DSP usage.

S7: S6 WITH MULTIPLIER LATENCYEXPERIMENTS (COMBINATIONALMULTIPLIER )In the final stages of multiplier opti-mization, we conducted two experi-ments for changing the latency value

38 Xcell Journal Third Quarter 2013

X P E R T S C O R N E R

IP1: Zynq Z20 board results for256x256-pixel test image

LUTs FFs BRAMs DSPFmax(MHz)

Fexp(MHz)

Latency(ms)

SpeedupPower**

(mW)

ARM CortexA9 MPCore (Core-0 only) NA NA NA NA NA 666.7 32.5 15 NA

MicroBlaze (with Cache & int. Divider) 2,027 1,479 6 3 NA 100 487.4 1 NA

Hand-Coded IP (Resources for Ref.) 8798 3,693 8 33 54.2 *50(NA) *4 (NA) NA 57

S3: S2 with DivToMult 6,721 6,449 8 72 40.2 40 142.4 3.42 12

S4: S3 with Buffer Mem. Partitioning 4,584 5,361 10 69 84.5 83.3 124.5 3.91 30

S6: S5 with Smart Mult sharing 5,156 4,870 10 32 91.3 90.9 136.9 3.56 26

S8: S6 with Mult Latency exp (1clk) 5,027 4,607 10 33 77.3 76.9 121.2 4.02 22

S9: S4 with Superburst 4,504 5,008 42 70 77.6 76.9 26.5 18.39 NA

S10: S4 with Smart Burst 4,485 4,981 14 73 103 100 22.48 21.68 62

S11: S6 with Smart Burst 4,813 4,438 14 33 101 100 33.7 14.46 42

IP1: Vivado HLS results for Zynq Z20 SoC LUTs FFs BRAMs DSP ClkPr (ns) Latency (clk cycles) Speedup Power*

S1: Raw transform 20,276 18,293 16 59 8.75 8,523,620 NA 1410

S2: S1 with Shared Div 9,158 7,727 16 51 8.75 8,555,668 1 1638

S3: S2 with DivToMult 7,142 6,172 16 84 8.75 3,856,048 2.22 1337

S4: S3 with Buffer Memory Partitioning 4,694 4,773 16 81 8.75 2,438,828 3.51 953

S5: S4 with Shared 32/64bit Multipliers 4,778 4,375 16 32 8.75 3,111,560 2.75 890

S6: S5 with Smart Mult sharing 4,900 4,569 16 40 8.75 2,910,344 2.94 918

S7: S6 with Mult Lat. exp (comb. Mult) 4,838 4,080 16 40 13.56 1,975,952 4.33 893

S8: S6 with Mult Lat. exp (1clk cycle Mult) 4,838 4,056 16 40 22.74 2,181,776 3.92 890

S9: S4 with Superburst 4,025 4,357 80 79 8.75 NA NA NA

S10: S4 with Smart Burst 3,979 4,314 24 80 8.75 NA NA NA

S11: S6 with Smart Burst 4,224 4,079 24 39 8.75 NA NA NA

Table 1 – HLS exploration results for IP1*Power value of tool for iterations comparison, it has no units

Table 2 – FPGA prototype results for IP1 explorations*Reference values, see methodology section

**Dynamic power from XPower Analyzer

Page 39: Xcell Journal issue 84

of the multipliers. By default, VivadoHLS used multipliers with latenciesranging from two to five clock cycles.In hard IP, all multipliers are eithersingle-clock-cycle or combinational.In this step, we set the multipliers’latency to 0 by using the “Resource”directive and choosing the correspon-ding combinational multiplier fromthe library. The results, as seen inTable 1, show an improvement inlatency. However, the design hasbecome much slower from the stand-point of timing, so the increased per-formance may well turn into a lossdue to the slower clock speed of theFPGA design.

S8: S6 WITH MULTIPLIER LATENCYEXPERIMENTS (1-CLOCK-CYCLEMULTIPLIER)In this step, instead of using a combina-tional multiplier we selected a single-clock-cycle multiplier from the library.Results are shown in Table 1. We cansee that while the latency has slightlyincreased, the clock period, surprising-ly, has significantly increased, whichmay lead to slower hardware.

S9, S10, S11: BURST-ACCESSEXPERIMENTSAll the exploration steps presented sofar required very little or no knowledgeof the functionality of the C code. Inthe final experiments, we exploredburst access, which is almost obligato-ry for IP that is based on shared-mem-ory systems due to high latency of thebus for small or random data accesses.Toward this end, we analyzed the algo-rithmic structure of the C code toexplore the possibility of implement-ing burst access, since natively there isno such thing as burst access from ageneral software standpoint. However,it is important to note that even withour burst-access experiments, wenever modified the software structureof the code.

We performed our burst-accessexperiments in two steps. First weused a dump approach (not practical,

but a good way to get some technicalinformation), where the entire imagewas burst into the internal memorybuffer and the full results burst out atthe end of execution. This was easy todo by using a small image and by thepresence of abundant on-chip memoryin the FPGA. We called this iterationstep “superburst” (S9).

In the second phase, we implement-ed a smart, interactive burst (S10, S11)within the limitations of not modifyingthe source code structure. The resultsare shown in Table 1. We implementedthe burst functionality in C using thestandard “memcpy” function, based ontutorial examples from Xilinx.Unfortunately, it is not possible to seelatency values when you insert thememcpy function in the code. TheFPGA implementation (Table 2) willillustrate the obtained behaviors.

FPGA PROTOTYPE OF SELECTED STEPSOne of the greatest benefits of usingESL tools is rapid design space explo-ration. We were able to evaluate allthe exploration steps listed above andthe results presented in Table 1 with-in minutes, avoiding the lengthyimplementation phase of an FPGA foronly a few, final selected iterations.Table 2 presents the final implemen-tation results of some selected stepsof Table 1 on our Zynq Z20 SoC. Italso presents the equivalent results(latency) of implementing the originalsource code on a MicroBlaze proces-sor, as a reference, and on the ARMCortex-A9, showing the tremendouscompute power that the Zynq SoC hasbrought to FPGA designers. The tablealso shows estimated dynamic-powervalues calculated by the XilinxXPower Analyzer.

We obtained the implementation sta-tistics of the optimal hand-coded IP onthe Zynq SoC by implementing theirRTL using Xilinx tools. We had to usethe latency measurements of our previ-ous experiments with a competitiveFPGA as a reference, since porting to

the AXI4 bus was beyond the scope ofour experiments. This further high-lights the potential of HLS tools, whichcan make it easier for designers to portto multiple protocols.

If we compare the results of Table 2with those of Table 1, several interest-ing observations emerge. First, thequick synthesis of ESL gives a relative-ly good estimate compared with theactual implementation (especially forLUTs and flip-flops); for memory andDSP, the synthesizer sometimes cansignificantly change the resourcecount for optimizations.

Second, for final implementation,timing closure can be a big problem.We can see a significant change inachieved frequency after FPGA imple-mentation compared with the HLS esti-mate. Finally, it is surprising that evenwith the burst-access experiments,there is a significant difference in thelatency values from the expected HLSvalues and hand-coded IP. This piquedour curiosity about the quality of DMAthat the HLS tool generated.

NEW IP, SIMILAR RESULTS In a similar fashion, we likewiseexplored our second piece of IP,which we called IP2, in the samedetail. The results, illustrated inTables 3 and 4, are self-explanatory inthe light of our in-depth discussion ofour experiences with IP1. One pointthat’s interesting to note is step 8 inTable 3. Just as with IP1 (see the S2and S3 discussions), the divider isexpensive in terms of resources andalso latency. Unlike IP1, IP2 uses realdivision, so the divider cannot beremoved. Unfortunately, in the cur-rent version of the ESL tools, only onedivider is present, with 34 clockcycles of latency. S8 shows that if adivider with a latency of a single clockcycle were used, like those availablein hand-coded IP, theoretically itcould yield a performance improve-ment of around 30 percent. Table 4shows FPGA prototype results ofselected steps.

Third Quarter 2013 Xcell Journal 39

X P E R T S C O R N E R

Page 40: Xcell Journal issue 84

VIVADO HLS DMA/BURSTEFFICIENCY ISSUE Comparing the results of the hand-coded IP implementation with ESLtechniques, several observations canbe made. Despite using the originalsource code (with almost no modifi-cation), which was not written in anoptimal manner to exploit embeddedsystems, the ESL tools deliveredresults that rivaled optimally hand-coded IP in terms of resource utiliza-tion. The significant difference in theexplored results for ESL vs. hand-coded IP came in latency, which wasfar better in the hand-coded IP.

Looking into this issue, we observedthat there is a major difference in the

way the hand-coded IP and ESL IP aremade. (This is also the reason for thebig difference in resources consumedfor IP2.) Due to the style of the sourcecode, there is only one data bus forinput and output in the ESL IP, com-pared with two for the hand-coded IP.Also, FIFOs handle burst access muchmore optimally in hand-coded IP. Forthe ESL IP, the burst is done withbuffers, and due to the nature of theoriginal code it was quite hard to opti-mally create an efficient burst mecha-nism. The ESL approach entails a penal-ty in BRAM usage, since the ESL hard-ware is using these memories for workbuffers and burst buffers in accordancewith the code structure. In addition, the

burst is done using the memcpy func-tion of C, in accordance with Xilinxtutorials for the AXI4 master interface.

We also conferred with Xilinxexperts, who recommended using theexternal DMAs manually for optimalperformance, as the AXI4 master inter-face generated by the version ofVivado HLS we used is in beta modeand will be upgraded in the future.Such factors might have caused a sig-nificant difference in latency valuesand might be a good starting point forfuture experiments.

PRODUCTIVITY VS. EFFICIENCYThe detailed experiments undertakenin this work revealed several interest-

40 Xcell Journal Third Quarter 2013

X P E R T S C O R N E R

IP2: Zynq Z20 board results for 256x256-pixel test image

LUTs FFs BRAMs DSPFmax(MHz)

FexpMHz)

Latency(ms)

SpeedupPower**

(mW)

ARM CortexA9 MPCore (Core-0 only) NA NA NA NA NA 666.7 37.9 13.39 NA

MicroBlaze (with Cache & int. Divider) 2,027 1,479 6 3 NA 100 507.6 1 NA

Hand-Coded IP (Resources for Ref.) 3,671 2,949 2 16 63.8 *50(NA) *4 (NA) NA 25

S3: S2 + const Mult & Div ops reduction 8,620 8,434 8 81 44.2 40 236.8 2.14 24.4

S4: S3 with Buffer Memory Partitioning 6,523 7,403 10 78 88.5 83.3 124.4 4.08 41.5

S5: S4 with Smart Mult sharing 6,722 7,017 10 41 94.1 90.9 124.2 4.09 36.4

S6: S4 with Smart Burst 6,240 6,486 14 69 96.2 90.9 41.2 12.32 56.4

S7: S5 with Smart Burst 6,435 6,174 14 30 95.1 90.9 45.8 11.08 50.1

IP2: Vivado HLS results for Zynq Z20 SoC LUTs FFs BRAMs DSPClkPr (ns)

Latency (clk cycles)

Speedup Power*

S1: Raw transform 25,746 23,392 16 76 8.75 8,691,844 NA NA

S2: Raw with SharedDiv 9,206 7,900 16 68 8.75 11,069,548 1 1643

S3: S2 + const Mult & Div ops reduction 9,703 8,557 16 96 8.75 4,934,764 2.24 1712

S4: S3 with Buffer Memory Partitioning 7,204 7,249 16 93 8.75 3,539,564 3.13 1338

S5: S4 with Smart Mult sharing 7,240 6,663 16 48 8.75 3,882,604 2.85 1357

S6: S4 with Smart Burst 6,816 6,515 24 78 8.75 NA NA 1240

S7: S5 with Smart Burst 6,846 5,958 24 33 8.75 NA NA 1262

S8: S5 Theoretical Lat. with 1 clk cycle Div. NA NA NA NA NA 2,751,084 4.02 NA

Table 3 – HLS exploration results for IP2*Power value of tool for iterations comparison, it has no units

Table 4 – FPGA prototype results for IP2 explorations*Reference values, see methodology section

**Dynamic power from XPower Analyzer

Page 41: Xcell Journal issue 84

ing aspects of ESL design and alsocleared up some misconceptions. ESLtools have come a long way since theearly iterations. They now supportANSI C/C++, give good competitiveresults and exploit several designdecisions (not just the loop exploita-tions of the past) inspired by hard-ware/system design practices, in theform of constraints directives.

Indeed, we found that the resultsobtained from ESL rivaled the optimalresults in all aspects except one:latency. Furthermore, using accuratebit types with advanced features ofthe tools (instead of the generic Ctypes we used in these experiments)may further improve resources, ascould partial rewriting of code.

Our explorations also gave us anidea about productivity vs. efficiencytrade-offs. Figure 3 illustrates the rela-tive development time of IP blocks (interms of design, software, FPGA imple-mentation and verification). The fasttime-to-solution for ESL-based designis dramatic, especially for verification,with results that rival optimal imple-mentation. It is important to note thatdesign/validation time is highlydependent on the skills of the design-ers. Since the same group of peoplewas involved in both the hand-codeddesign and the ESL-based design,Figure 3 gives quite an unbiasedoverview of how the process plays out.Considering the fact that designers

were well acquainted with the classicalRTL method of IP design and integra-tion, and were new to ESL tools, it’slikely that implementation time couldsignificantly improve in the longerterm, once the tools are more familiar.

WHAT NEXT FOR ESL? From a technology and also a busi-ness standpoint, there should besome EDA standardization amongESL tools’ constraints structure, mak-ing designs easy to port to otherFPGAs or ASICs (flexibility that isinherent in RTL-based design toolsand in principle, even easier toachieve with ESL tools). ESL toolproviders should differentiate in thequality of their tools and results only,as they do for RTL tools.

It’s a long and complex discus-sion, and beyond scope of this shortarticle, to answer questions like:Have ESL tools enabled softwaredesigners to program FPGAs? WillESL eliminate RTL designers/jobs?Who is the biggest winner with ESLtools? In our brief experience withthis work, we have found that ESLtools are beneficial for all, especiallyfor the system designers.

For hardware designers, the toolscan be an interesting way to create orevaluate small portions of theirdesign. They can be helpful in rapidlycreating a bus-interface-ready testinfrastructure. Software designers try-

ing to design hardware are one of thekey targets for ESL tools. And whileESL tools have come a long way forthis purpose, they still have a longway to go. In our project, for example,ELS tools hid the hardware complexi-ties to a great extent from softwaredesigners. Nevertheless, good imple-mentation might still require a hard-ware mind-set when it comes to con-straints and optimization efforts. Thisissue will get better as the ESL toolsfurther evolve.

As for the system designers, whoare becoming more and more com-mon due to the increased integrationand penetration of HW/SW co-designin the world of SoCs, ESL tools per-haps are a win-win situation. Systemdesigners can exploit these tools atmultiple levels.

On the hardware side, one of thekey reasons to choose the Zynq-7000All Programmable SoC among Xilinx’s28-nm 7 series FPGAs was to explorethe potential of having a hard ARMcore onboard the same chip withFPGA logic. As we have seen in theseexperimental results, the dual-coreARM has brought unprecedentedcompute power to FPGA designers.Indeed, the Zynq SoC delivers a mini-supercomputer to embedded design-ers, who can customize it using thetightly integrated FPGA fabric.

Acknowledgements

We are grateful to our industrial partner,

Sagemcom, for funding and technical

cooperation. We would also like to thank

Xilinx experts in Europe, Kester Aernoudt

and Olivier Tremois, for their prompt

technical guidance during this work.

References

1. G. Martin, G. Smith, “High-Level

Synthesis: Past, Present and Future,”

IEEE Design and Test of Computers,

July/August 2009

2. BDTI report, “High-Level Synthesis Tools

for Xilinx FPGAs,” 2010

3. BDTI report, “The AutoESL AutoPilot

High-Level Synthesis Tool,” 2010

Third Quarter 2013 Xcell Journal 41

X P E R T S C O R N E R

20

15

10

5

0

IP1 RTL IP1 ESL IP2 RTL IP2 ESL

15

3

11

1

Development Time (Person Weeks)

Figure – 3: Hand-coded RTL vs. ESL development time

Page 42: Xcell Journal issue 84

42 Xcell Journal Third Quarter 2013

XPLANATION: FPGA 101

One of the major benefits of the Zynq SoC architecture is the ability to speed up processing by building a peripheral within the device’s programmable logic.

How to IncreaseZynq SoC Speed by Creating Your Own Peripheral

by Adam TaylorSenior Specialist EngineerEADS [email protected]

Page 43: Xcell Journal issue 84

O ne of the real advantages of the Xilinx® Zynq™-7000 All Programmable SoC is the ability toincrease the performance of the processing sys-

tem (PS) side of the device by creating a peripheral with-in the programmable logic (PL) side. At first you mightthink this would be a complicated job. However, it is sur-prisingly simple to create your own peripheral.

Adding a peripheral within the PL can be of greathelp when you’re trying to speed up the performanceof your processing system or when you’re using thePS to control the behavior of the design within theprogrammable logic side. For example, the PS mightuse a number of memory-mapped registers to controloperations or options for the design within the pro-grammable logic.

Our example design is a simple module that allowsyou to turn on and off a memory-mapped LED on theZedBoard. To create this module, we will use the XilinxPlanAhead™, XPS and Software Development Kit(SDK) tool flow in a three-step process.

1. Create the module within the EmbeddedDevelopment Kit (EDK) environment.

2. Write the actual VHDL for our module and build the system.

3. Write the software that uses our new custom module.

CREATING THE MODULE IN EDKTo create your own peripheral, you will first need toopen Xilinx Platform Studio (XPS) from the PlanAheadproject containing your Zynq SoC design and select themenu option “hardware->create or import peripheral.”

For our example module, the seven LEDs on theZedBoard should demonstrate a walking display, illu-minating one after the other, except that the most sig-nificant LED will be toggled under control of the soft-ware application. While this is not the most excitingapplication of a custom block, it is a useful simpleexample to demonstrate the flow required. Once you’vegrasped that process, you will know everything youneed to know to implement more-complex modules.

Third Quarter 2013 Xcell Journal 43

X P L A N A T I O N : F P G A 1 0 1

This is the third in a planned series of hands-on

Zynq-7000 All Programmable SoC tutorials by Adam

Taylor. Earlier installments appeared in Xcell Journalissues 82 and 83. A frequent contributor to Xcell Journal,Adam is also a blogger at All Programmable Planet

(www.programmableplanet.com).

Page 44: Xcell Journal issue 84

• A Microprocessor PeripheralDescription, which defines theinterfaces to your module suchthat it can be used with XPS

• A CIP file, which allows recus-tomization of the peripheral ifrequired

• Driver source files for the SDK

• Example source files for the SDK

• A Microprocessor Driver Def-inition, which details the driversrequired for your peripheral

These files will be created underthe PlanAhead sources directory,which in our case is zed_blog.srcs\

sources_1\edk\proc_subsystem.Beneath this you will see the Pcores

directory, which will contain the files(under a directory named after youperipheral) required for XPS andPlanAhead, along with the driversdirectory, which contains the filesrequired for the SDK.

CREATING THE RTLWithin the Pcores directory you will seetwo files, one named after the compo-nent you created (in this case,led_if.vhd) and another called

You create a peripheral by steppingthrough 10 simple windows and select-ing the options you desire.

The first two screens (shown inFigure 1) ask if you want to create orimport an existing peripheral and ifso, to which project you wish to asso-ciate it. The next stage, at screen 3,allows you to name your module andvery usefully lets you define versionsand revisions. (Note that the optionsat the bottom of the peripheral flowenable you to recustomize a moduleyou have already created by readingback in the *.cip file.)

Once you have selected your mod-ule name, you can then choose thecomplexity of the AXI bus required, asdemonstrated in Figure 2. The one weneed for our example design is a sim-ple memory-mapped control register-type interface; hence, we have chosenthe top option, AXI4-Lite. (More infor-mation on the different types of AXIbuses supported is available athttp://www.xilinx.com/support/docu-

mentation/white_papers/wp379_AXI

4_Plug_and_Play_IP.pdf.) This stepwill create a simple number of regis-ters that can be written to and readfrom over the AXI bus via the process-

X P L A N A T I O N : F P G A 1 0 1

ing system. These registers will belocated within the programmable logicfabric on the Zynq SoC and hence arealso accessible by the user logic appli-cation, where you can create the func-tion of the peripheral.

The next six screens offer you theoption to customize an AXI master orslave configuration, and the number ofregisters that you will be using to con-trol your user logic along with the sup-porting files. For this example we areusing just the one register, which willbe functioning as an AXI-Lite slave.The most important way to ease thedevelopment of your system is toselect the “generate template driverfiles” on the peripheral implementationsupport tab. Doing so will deliver anumber of source and header files tohelp communicate within the peripher-al you create.

At the end of this process, XPS willhave produced a number of files tohelp you with creating and using yournew peripheral:

• A top-level VHDL file named afteryour peripheral

• A VHDL file in which you can create your user logic

Figure 1 – The first two of the 10 wizard screens

44 Xcell Journal Third Quarter 2013

Page 45: Xcell Journal issue 84

X P L A N A T I O N : F P G A 1 0 1

user_logic.vhd. It is in this latter filethat you create your design. You willfind that the wizard has already givenyou a helping hand by generating theregisters and interfaces necessary tocommunicate over the AXI bus withinthis file. The user-logic module isinstantiated within led_if.vhd.Therefore, any changes you make tothe entity of user_logic.vhd must beflowed through into the port map. Ifexternal ports are required—as theyare in our example design, to flash theLED—you must also flow through yourchanges in the entity of led_if.vhd.

Once you have created your logicdesign, you may wish to simulate it toensure that it functions as desired. Bus-functional models are provided underthe Pcores/devl/bfmsim directory.

USING YOUR MODULE WITHIN XPSHaving created the VHDL file or filesand ensuring they function, you willwish to use the module within yourXPS project. To correctly do this youwill need to ensure the MicroprocessorPeripheral Description file is up to datewith the ports you may have added dur-ing your VHDL design. You can accom-

plish this task by editing the MPD filewithin XPS. You will find your createdperipheral under the Project LocalPcores->USER->core name. Right-clicking on this name will enable you toview your MPD file.

You will need to add the non-AXIinterfaces you included in your designso that you can see the I/Os and dealwith them correctly within XPS. Youwill find the syntax for the MPD file athttp://www.xilinx.com/support/docu-

mentation/sw_manuals/xilinx14_4/p

sf_rm.pdf, although it is prettystraightforward, as you can see inFigure 3. The highlighted line (line 59,at the bottom of the central list) is theone I added to bring my LED outputport to the module.

When you have updated the MPDfile, you must click on “re-scan user IPrepositories” to ensure the changes areloaded into XPS. At that point, you canthen implement the device into yoursystem just like any other peripheralfrom the IP library. Once you have it inthe address space you want (remem-ber, the minimum is 4K) and have con-nected your outputs as required toexternal ports, if necessary you can runthe design rule checker. Provided there

are no errors, you can close down XPSand return to PlanAhead.

Back within the PlanAhead environ-ment, you need to regenerate the toplevel of HDL; to do this, just right-clickon the processor subsystem you creat-ed and select “create top HDL” (seeXcell Journal issue 82 for informationon creating your Zynq SoC PlanAheadproject). The final stage before imple-mentation is to modify the UCF filewithin PlanAhead if changes havebeen made to the I/O pinout of yourdesign. Once you are happy with thepinout, you can then run the imple-mentation and generate the bitstreamready for use in the SDK.

USING THE PERIPHERAL IN SDKHaving gotten this far, we are ready touse the peripheral within the softwareenvironment. As with the tasks of cre-ation and implementation, this is aneasy and straightforward process. Thefirst step is to export the hardware tothe Software Development Kit using the“export hardware for SDK” option with-in PlanAhead. Next, open the SDK andcheck the MSS file to make sure thatyour created peripheral is now present.It should be listed with the name of the

Figure 2 – The next two windows, covering name, revision and bus standard selection

Third Quarter 2013 Xcell Journal 45

Page 46: Xcell Journal issue 84

X P L A N A T I O N : F P G A 1 0 1

Figure 4 – Importing the software repositories

Figure 3 – Editing the MPD file to include your peripheral

46 Xcell Journal Third Quarter 2013

Page 47: Xcell Journal issue 84

X P L A N A T I O N : F P G A 1 0 1

Third Quarter 2013 Xcell Journal 47

driver next to it; in the first instance thedriver will be named “generic.” Usingthe peripheral within the SDK, thegeneric driver is not an issue. If you wishto use it, you will need to include theheader file #include "xil_io.h." Thisallows you to use the function callsXil_in32(addr) or Xil_out32(addr) toread and write from the peripheral.

To use this method you will alsoneed to know the base address of theperipheral, which is given in xparame-ters.h, and the offset between regis-ters. In this case, because of the 32-bitaddressing in our example design, theregisters are separated by 0x04.However, if you enabled the driver-cre-ation option when you created theperipheral using XPS, then a separatedriver will have been specially built foryour peripheral. This driver too will belocated within the drivers subdirectoryof your PlanAhead sources directory.

You can use this type of speciallybuilt driver in place of the generic driv-ers within your design, but first you

must set the software repositories topoint to the directory to see the newdrivers. You can define the reposito-ries under the Xilinx Tools ->Repositories menu (Figure 4).

Once you have set the repositoriesto point to the correct directory, youcan re-scan the repositories and closethe dialog window. The next step is toupdate the board support packagewith the new driver selection. Openthe board support package settings byright-clicking on the BSP icon underthe Project Explorer. Click on the driv-ers tab and you will see a list of all theperipherals and their drivers. You canchange the driver selection for yourperipheral by selecting the appropri-ate driver from a list of drivers applica-ble to that device (Figure 5).

Select the driver for your peripher-al by name and click OK; your projectwill be recompiled if you have auto-matic compilation selected. Once thecode has been rebuilt, you are thennearly ready to communicate with

the peripheral using the supplied driv-ers. Within your source code you willneed to include the driver header file#include "led_if.h" and use the baseaddress of the peripheral from xpara-meters.h to enable the interface toknow the address within the memoryspace. It is then a simple case of call-ing the commands provided by thedriver within the “include” file. In thecase of our design, one example was

LED_IF_mWriteSlaveReg0(LED_ADDR, LED_IF_SLV_REG0_OFF-SET, 0x01);

With this command written, it isthen time to test the software withinthe hardware environment and con-firm that it functions as desired.

As you can see, making your ownperipheral within a Zynq SoC design isnot as complicated as you may at firstthink. Creation of a peripheral offersseveral benefits for your Zynq SoCprocessing system and, ultimately,your overall solution.

Figure 5 – Selecting the peripheral driver

Page 48: Xcell Journal issue 84

48 Xcell Journal Third Quarter 2013

XPLANATION: FPGA 101

JESD204B: Critical Link in theFPGA-Converter Chainby Jonathan HarrisProduct Applications EngineerHigh-Speed Converter GroupAnalog [email protected]

Page 49: Xcell Journal issue 84

I n most designs today, parallellow-voltage differential signaling(LVDS) is the interface of choice

between data converters and FPGAs.Some FPGA designers still use CMOSas the interface in designs with slowerconverters. However, the latest high-speed data converters and FPGAs aremigrating from parallel LVDS andCMOS digital interfaces to a serial-ized interface called JESD204B, anew standard developed by theJEDEC Solid State TechnologyAssociation, an independent semi-conductor engineering trade organi-zation and standardization body.

As converter resolution and speedhave increased, the demand for a moreefficient interface has grown. Thisinterface is the critical link betweenFPGAs and the analog-to-digital or dig-ital-to-analog converters that residebeside them in most system designs.The JESD204B interface brings thisefficiency to designers, and offers sev-eral advantages over preceding inter-face technologies. New FPGA designsemploying JESD204B enjoy the bene-fits of a faster interface to keep pacewith the faster sampling rates of con-verters. Also, a dramatic reduction inpin count leads to smaller packagesizes and fewer trace routes, makingboard designs a little less complex.

All this does not come for free. Each type of interface—includ-

ing JESD204B—comes with designissues such as timing concerns. Thespecifics differ somewhat, but eachinterface brings its own set ofparameters that designers mustanalyze properly to achieve accept-able system performance. There arealso hardware choices to be made.For example, not all FPGAs andconverters support the JESD204Binterface. In addition, you must beup to date on parameters such asthe lane data rate in order to selectthe appropriate FPGA.

Before exploring the details of thenew JESD2904B interface, let’s takea look at the other two options thatdesigners have long used for FPGA-to-converter links: CMOS and LVDS.

CMOS INTERFACEIn converters with sample rates ofless than 200 megasamples per sec-ond (MSPS), it is common to findthat the interface to the FPGA isCMOS. A high-level look at a typicalCMOS driver shows two transistors,one NMOS and one PMOS, connect-ed between the power supply (VDD)and ground, as shown in Figure 1a.This structure results in an inver-sion in the output, and to avoid it,

Third Quarter 2013 Xcell Journal 49

X P L A N A T I O N : F P G A 1 0 1

Data converters that interface withXilinx FPGAs are fast embracing the new JESD204B high-speed serial link. To use this interface format and protocol, your design needs to take some basic hardware and timing issues into account.

Page 50: Xcell Journal issue 84

The series-damping resistors will helpsuppress this large transient current.This technique will reduce the noisethat the transients generate in the out-puts, and thus help prevent the outputsfrom generating additional noise anddistortion in the A/D converter.

LVDS INTERFACELVDS offers some nice advantages overCMOS technology for the FPGA design-er. The interface has a low-voltage dif-ferential signal of approximately 350mV peak-to-peak. The lower-voltageswing has a faster switching time andreduces EMI concerns. In addition,many of the Xilinx® FPGA families,such as the Spartan®, Virtex®, Kintex®

and Artix® devices, support LVDS.Also, by virtue of being differential,LVDS offers the benefit of common-mode rejection. This means that noisecoupled to the signals tends to becommon to both signal paths and ismostly canceled out by the differen-tial receiver. In LVDS, the load resist-ance needs to be approximately 100 Ωand is usually achieved by a paralleltermination resistor at the LVDSreceiver. In addition, you must routethe LVDS signals using controlled-impedance transmission lines. Figure3 shows a high-level view of a typicalLVDS output driver.

you can use the back-to-back struc-ture in Figure 1b instead. The input ofthe CMOS output driver is highimpedance while the output is lowimpedance. At the input to the driver,the gates of the two CMOS transistorspresent high impedance, which canrange from kilohms to megohms. Atthe output of the driver, the imped-ance is governed by the drain current,ID, which keeps the impedance to lessthan a few hundred ohms. The volt-age levels for CMOS swing fromapproximately VDD to ground and cantherefore be quite large depending onthe magnitude of VDD.

50 Xcell Journal Third Quarter 2013

X P L A N A T I O N : F P G A 1 0 1

Some important things to considerwith CMOS are the typical switchingspeed of the logic levels (~1 V/ns), out-put loading (~10 pF/gate driven) andcharging currents (~10 mA/output). Itis important to minimize the chargingcurrent by using the smallest capaci-tive load possible. In addition, a damp-ing resistor will minimize the chargingcurrent, as illustrated in Figure 2. It isimportant to minimize these currentsbecause of how quickly they can addup. For example, a quad-channel 14-bitA/D converter could have a transientcurrent as high as 14 x 4 x 10 mA,which would be a whopping 560 mA.

Input Output

VDD

Input

VDD

Output

VDD

a) Inverted output b) Noninverted output

Figure 1 – Typical CMOS digital output driver

Input

VDD

Output

R

VDD

i = CxdV/dt

dV/dt = 1V/ns

C

10pF

Figure 2 – CMOS output driver with transient-current and damping resistor

Page 51: Xcell Journal issue 84

X P L A N A T I O N : F P G A 1 0 1

Third Quarter 2013 Xcell Journal 51

A critical item to consider withdifferential signals is proper termi-nations. Figure 4 shows a typicalLVDS driver and the terminationneeded at the receivers. You can usea single differential terminationresistor (RTDIFF) or two single-ended termination resistors (RTSE).Without proper terminations, thesignal quality becomes degraded,which results in data being corrupt-ed during transmission. In addition

to having proper terminations, it isimportant to give attention to thephysical layout of the transmissionlines. You can maintain properimpedance by using line widths thatare appropriate. The fabricationprocess variation should not causewide shifts in the impedance. Avoidabrupt transitions in the spacing ofthe differential transmission lines inorder to reduce the possibility of sig-nal reflections.

JESD204B INTERFACEThe latest development in high-speedconverters and FPGAs is theJESD204B serialized interface, whichemploys current-mode logic (CML)output drivers. Typically, converterswith higher resolutions (≥14 bits),higher speeds (≥200 MSPS) and thedesire for smaller packages with lesspower utilize CML drivers. There issome overlap here in high-speed con-verters around the 200-MSPS speedrange that use LVDS. This means thatdesigners moving to JESD204B canuse a proven design that was likelybased on a high-speed converter withLVDS. The result is to remove a levelof complexity out of the design and tomitigate some of the risk.

Combining CML drivers with serial-ized JESD204B interfaces allows datarates on the converter outputs to go upto 12.5 Gbps. This speed allows roomfor converter sample rates to push intothe gigasample (GSPS) range. In addi-tion, the number of required outputpins plummets dramatically. It’s nolonger necessary to route a separateclock signal, since the clock becomesembedded in the 8b/10b encoded datastream. Since the interface employedwith CML drivers is typically serial,

Input+

Input–

Input–

Output+

Output–

Input+

VDD

IS1

IS2

Input+

Input–

Input–

LVDSReceiver

Output+Output–

Input+

VDD

IS1

RTSE

RTDIFF

RTSE

IS2

Figure 3 – Typical LVDS output driver

Figure 4 – LVDS output driver with termination

Page 52: Xcell Journal issue 84

reduced and the eye is closing, inwhich case the receiver in the FPGAwill have a difficult time interpretingthe data. As you can see, it is veryimportant to use proper terminationsto ensure that you achieve the best pos-sible signal integrity.

A BETTER AND FASTER WAYThe three main types of digital outputsemployed in converters each have theirown advantages and disadvantageswhen connecting to FPGAs. It is impor-tant to keep these pros and cons in mindwhen utilizing converters that employCMOS, LVDS or CML (JESD204B) out-put drivers with an FPGA design. Theseinterfaces are a critical link between theFPGAs and the A/D or D/A converters.Each type of driver has qualities and

there’s a much smaller increase in thenumber of pins required than would bethe case with CMOS or LVDS.

Table 1 illustrates the pin countsfor the three different interfaces in a200-MSPS converter with variouschannel counts and 12-bit resolution.The data assumes a synchronizationclock for each channel’s data, in thecase of the CMOS and LVDS outputs,and a maximum data rate of 4 Gbpsfor data transfer using JESD204B.This lane rate is well under the maxi-mum limits of many of the FPGAs inthe Virtex-6, Virtex-7, Kintex-7 andArtix-7 families available today. Thereasons for the progression toJESD204B become obvious when younote the dramatic reduction in pincount that you can achieve. Theboard routing complexity is reducedand less I/O is required from theFPGA, freeing up those resources forother system functions.

Taking this trend one step furtherand looking at a 12-bit converter run-ning at 1 GSPS, the advantage ofusing JESD204B becomes even moreapparent. This example omits CMOS,because it is just completely imprac-tical to try to use a CMOS outputinterface with a gigasample convert-er. Also, the number of converterchannels is limited to four and thelane rate is limited to 5 Gbps, whichis again compatible with many of theFPGAs in the Virtex-6 and Xilinx 7series FPGA families. As you cansee, JESD204B can significantlyreduce the complexity in the outputrouting due to the reduction in thenumber of output pins. There is a 6xreduction in pin count for a quad-channel converter (Table 2).

52 Xcell Journal Third Quarter 2013

X P L A N A T I O N : F P G A 1 0 1

Just as with the LVDS interface, it isimportant to maintain proper termina-tion impedances and transmission-lineimpedances when using CML with theJESD204B interface. One method ofmeasuring signal quality is the eye dia-gram. An eye diagram is a measure-ment that indicates several parametersabout the signal on a link. Figure 5shows an eye diagram for a properlyterminated CML driver on a 5-GbpsJESD204 link. The eye diagramexhibits nice transitions and has a suf-ficiently open eye such that the receiv-er in the FPGA should have no diffi-cultly interpreting the data. Figure 6shows the result of having an improp-erly terminated CML driver in the same5-Gbps JESD204 link. The eye diagramhas more jitter, the amplitude is

Number of Channels

1 12 13 14 2

2 12 26 28 4

4 12 52 56 8

8 12 104 112 16

Resolution CMOS Pin Count

LVDS Pin Count (DDR)

CML Pin Count (JESD204B)

Number of Channels

1 12 50 8

2 12 100 16

4 12 200 32

8 12 400 64

Resolution LVDS Pin Count (DDR)

CML Pin Count (JESD204B)

Table 1 – Pin count comparison for a 200-MSPS A/D converter

using CMOS, LVDS and JESD204B

Table 2 – Pin count comparison between LVDS and JESD204B for a 1-GSPS converter

The JESD204B serialized data interface offers a better and faster way to transmit data from converters to receiver

devices. The latest converters and FPGAs are being designed with drivers to support JESD204B.

Page 53: Xcell Journal issue 84

X P L A N A T I O N : F P G A 1 0 1

Third Quarter 2013 Xcell Journal 53

requirements that you must take intoaccount when designing a system, suchthat the receiver in the FPGA can prop-erly capture the converter data. It isimportant to understand the loads thatmust be driven; to utilize the correctterminations where appropriate; and toemploy proper layout techniques forthe different types of digital outputs theconverters use. As the speed and reso-lution of converters have increased, sohas the demand for a more efficient dig-ital interface. As this happens, itbecomes even more important to havea properly designed system using opti-mal layout techniques.

The JESD204B serialized data inter-face offers a better and faster way totransmit data from converters toreceiver devices. The latest convertersand FPGAs are being designed withdrivers to support JESD204B. Theneed begins with the system-leveldemand for more bandwidth, resultingin faster converters with higher sam-

ple rates. In turn, the higher samplerates drive the requirement for higheroutput data rates, which has led to thedevelopment of JESD204B. As systemdesigns become more complex andconverter performance pushes higher,the JESD204 standard is poised toadapt and evolve to continue to meetthe new design requirements. Thisability of the JESD204 standard toevolve will increase the use of theinterface, eventually propelling it intothe role of default interface of choicefor converters and FPGAs.

For more information, see the“JESD204B Survival Guide” athttp://www.analog.com/static/import-

ed-files/tech_articles/JESD204B-

Survival-Guide.pdf. Also, Xilinx andAnalog Devices offer a three-partYouTube series on rapid prototypingwith JESD204B and FPGA platforms athttp://www.youtube.com/playlist?list=

PLiwaj4qabLWxRUOHKH2aGMH7kk

Nr0euo3

Figure 5 – Eye diagram (5 Gbps) with proper terminations

Figure 6 – Eye diagram (5 Gbps) with improper terminations

www.trenz-electronic.de

difference by design

Platform Features• 4×5 cm compatible footprint• up to 8 Gbit DDR3 SDRAM• 256 Mbit SPI Flash• Gigabit Ethernet• USB option

All ProgrammableFPGA and SoC modules

4form x factor

5

Available SoMs:

Design Services• Module customization• Carrier board customization• Custom project development

rugged for harsh environmentsextended device life cycle

Page 54: Xcell Journal issue 84

54 Xcell Journal Third Quarter 2013

TOOLS OF XCELLENCE

Designing a Low-LatencyH.264 System Using the Zynq SoC by Allen Vexler

CTOA2e [email protected]

Page 55: Xcell Journal issue 84

Third Quarter 2013 Xcell Journal 55

To create an IP-based streaming-video systemwith a small PCB footprint, designers historicallyhave had to choose between the inflexible archi-tecture of an ASSP or a larger, more flexible sys-tem based on an FPGA-microprocessor combo.Placing a soft-core microprocessor inside theFPGA could eliminate the need for a separateprocessor and DRAM, but the performance of thefinal system might not match that of a solutionbuilt around an external ARM® processor thatwould also contain USB, Ethernet and other use-ful peripherals. With the advent of the Xilinx®

Zynq®-7000 All Programmable SoC along withsmall-footprint H.264 cores, it’s now possible toconstruct a system in a very small-form-factorPCB with only one bank of DRAM, but with thepower of dual ARM cores and high-speed periph-erals that are all interconnected with multiplehigh-speed AXI4 buses (see Figure 1).

Although H.264 cores for FPGAs have existedfor quite some time, up until now none have beenfast enough and small enough to convert 1080p30video and still fit into a small, low-cost device.Using the latest micro-footprint H.264 cores fromA2e Technologies in tandem with the Zynq SoC,it’s possible to build a low-latency system thatcan encode and decode multiple streams ofvideo from 720p to 4K at variable frame ratesfrom 15 to 60 frames per second (fps). Placing anA2e Technologies H.264 core in a Zynq SoCdevice allows for a significant reduction in boardspace and component count, while the dual ARMcores in the Zynq SoC remove the need for a sep-arate microprocessor and the memory bank towhich it must be attached.

T O O L S O F X C E L L E N C E

Micro-footprint H.264cores combine withXilinx’s Zynq SoC in asmall, fast, streaming-video system. T

Page 56: Xcell Journal issue 84

The result is a complete stream-ing-video system, built around Linuxdrivers and a Real Time StreamingProtocol (RTSP) server, that is capa-ble of compressing 1080p30 videofrom two cameras with an end-to-end latency of approximately 10 mil-liseconds. The A2e TechnologiesH.264 core is available in encode-only and encode/decode versions.Additionally, there is a low-latencyversion that reduces encode latencyto less than 5 ms. The 1080p30encode-only version requires 10,000lookup tables (LUTs), or roughly 25percent of a Zynq Z7020’s FPGA fab-ric, while the encode/decode versionrequires 11K LUTs.

SOC-FPGA BASED SYSTEMSProducts based on the Zynq SoC pro-vide significantly more flexibilitythan those built using an application-specific standard product (ASSP).For example, it’s easy to construct asystem supporting four 1080p30inputs by placing four H.264 cores inthe FPGA fabric and hooking eachcamera input to the FPGA. ManyASSPs have only two inputs, forcingdesigners to figure out how to multi-plex several streams into one input.Each A2e Technologies H.264 core iscapable of handling six VGA-resolu-tion cameras or one camera with1080p30 resolution. So, it would bepossible to construct a two-core sys-

tem that could compress video from12 VGA cameras.

ASSPs typically provide for an on-screen display feature only on thedecoded-video side, forcing design-ers either to send the OSD informa-tion as metadata or to use the ARMcore to time video frames and pro-grammatically write the OSD data tothe video buffer. In an FPGA, addingOSD before video compression is assimple as dropping in a block of IP.You can place additional processingblocks such as fisheye-lens correc-tion before the compression enginewith relative ease. FPGAs also allowfor in-field, future upgrades of fea-tures, such as adding H.265 compres-

56 Xcell Journal Third Quarter 2013

T O O L S O F X C E L L E N C E

Camera

Camera

Zynq

DDR

DDR

Flash

DDR Flash

ENETPHY

ENETPHY

USB

SATA

USB

SATA

FPGAUSB

ARMµP

Figure 1 – Video compression systems based on a single-chip Zynq SoC (top) vs. a two-chip processor-FPGA combination

Page 57: Xcell Journal issue 84

sion. Figure 2 is a block diagram ofan H.264 compression engine withinput from two 1080p30 cameras,with OSD applied to the uncom-pressed image.

MANAGING LATENCYCertain applications, such as controlof remotely piloted vehicles (RPVs),are based upon feedback fromstreaming images coming from theremote device. In order to control aremote device, the latency betweenthe time when the sensor sends thevideo to the compression engine andthe point when the decoded image isdisplayed (referred to as “glass toglass”) typically needs to be less than100 ms. Some designers target num-

bers in the 50-ms range. Latency isthe sum the following:

• Video-processing time (ISP, fisheye-lens correction, etc.)

• Delay to fill frame buffer

• Compression time

• Software delay associated with sending out a packet

• Network delay

• Software delay associated with receiving a packet

• Video decode time

Many systems encode video usinghardware but end up decoding itusing a standard video player such

as VLC running on a PC. Even thoughthe buffer delay for media players canbe reduced from the typical 500 to1,000 ms to something significantlyless, the latency is still far greaterthan 50 ms. To truly control latencyand keep it to an absolute minimum,hardware decode of the compressedvideo stream is required, along withminimal buffering.

Latency for H.264 encoders is typi-cally specified in frames, as a full framemust typically be buffered before com-pression begins. Assuming you canspeed up your encoder, you coulddecrease the encode latency simply bydoubling the frame rate—that is, oneframe of latency for 30 fps is 33 ms vs.one frame of latency for 60 fps is 16.5

T O O L S O F X C E L L E N C E

Third Quarter 2013 Xcell Journal 57

SystemClock

Audio InputI2S

1080p 30 VideoInputs

DVI, HDMI, CMOSSensor, etc.

Audio DMA

Video DMAwith raster tomacroblockconversion

Video DMAwith raster tomacroblockconversion

To all modules

AXI Interconnect AXI Interconnect AXI Interconnect

H264Core

H264Core

AXI InterconnectConfiguration Audio Data Uncompressed Video

Xilinx IP

A2e IP

Peripherals(USB, Ethernet,UART, I2C, etc)

PL Clocks

Application Processor Unit

(ARM Cortex-A9 CPUs,Cache, DMA, etc)

DDR 3

GP0 HP0 HP1 HP2

Zynq SoC Processing System (PS)

Memory Controller

Zynq SoC Programmable Logic (PL)

0 1 2 3

Video Into

AXI4-Stream

Video Into

AXI4-Stream

Video On -Screen Display

Video On -Screen Display

Figure 2 – Architecture of a dual-1080p30 H.264 compression engine

Page 58: Xcell Journal issue 84

ms. Many times this option is notavailable due to performance limita-tions of the camera and encoder.Thus, the solution is an encoderspecifically designed for low latency.In the new A2e Technologies low-latency encoder, only 16 video linesneed to be buffered up before com-pression starts. For a 1080p30 videostream, the latency is less than 500 µs.For a 480p30 video stream the latencyis less than 1 ms. This low-latencyencoder will allow designers to con-struct systems with low and pre-dictable latency.

In order to minimize total latency,you must also minimize delays intro-duced by buffers, network stacks and

RTSP servers/clients on both theencoding and decoding sides, as it ispointless to use a low-latency encoderwith a software path that introduces alarge amount of latency. An RTSP serv-er typically is used to establish astreaming-video connection between aserver (camera) and the client (decod-ing/recording) device. After a connec-tion is established, the RTSP serverwill stream the compressed video tothe client for display or storage.

MINIMIZING LATENCYTypically, the software components inthe server and client are designedonly to keep up with the bandwidthrequired to stream compressed video,

nor minimize latency. With a non-real-time operating system likeLinux, it can be difficult to guaranteelatency. The typical solution is to cre-ate a low-latency, custom protocolfor both the server and client.However, this approach has thedownside of not being compatiblewith industry standards. Anotherapproach is to take a standard likeRTSP and make modifications at thelower levels of the software to mini-mize latency, while at the same timeretaining compliance with standards.

However, you can also take steps tominimize copies between kernel anduser space, and the delays associatedwith them. To reduce latency though

58 Xcell Journal Third Quarter 2013

T O O L S O F X C E L L E N C E

Camera

DDR Flash DDR Flash

ZynqSoC

ENETPHY

ENETPHY

DualARMCores

H.264Encoder/Decoder

DualARMCores

DisplayController

H.264Encoder/Decoder

ZynqSoC

Figure 3 – Block diagram of complete encode-decode system

A typical streaming-video system uses an RTSP server to establish a streaming-video connection between a camera and the client (decoding/recording) device. The RTSP server will stream

the compressed video to the client for display or storage.

Page 59: Xcell Journal issue 84

the whole software path, you willneed to decouple the RTSP server andcompressed data forwarding so thatthe Linux driver, not the RTSP server,performs the forwarding.

In the A2e Technologies low-laten-cy RTSP server, we have made twomajor changes to reduce latency.First, we removed the RTSP serverfrom the forwarding path. The RTSPserver continues to maintain statis-tics using the Real Time ControlProtocol (RTCP), and periodically(or asynchronously) updates the ker-nel drive with the changes in the net-work destination (i.e., destination IPor MAC address). Second, the kerneldriver attaches the necessary head-ers (based on the information pro-

vided by the RTSP server) andimmediately forwards the packet bydirectly injecting it into the networkdriver (for example, udp_send),thus eliminating the memory copyto/from user space.

Figure 3 shows a complete H.264IP-based encode/decode systemwith a combined latency of less than50 ms. The system is based upon theZynq SoC, the A2e Technologieslow-latency H.264 encoder-decoderand the A2e Technologies low-laten-cy RTSP server/client. Note that theonly real difference from a hard-ware perspective between theencode and decode systems is thatthe encode side must interface to acamera/sensor and the decode side

must be able to drive a flat-panel dis-play. You can easily design a boardthat has all required hardware fea-tures needed for use in both encodeand decode.

In order to minimize video compres-sion/decompression latency for real-time control applications, designersneed special encoders and optimizedsoftware. Utilizing the Xilinx Zynq SoCand the A2e Technologies low-latencyH.264 encoder and optimized RTSP serv-er/client, you can create a highly config-urable system with extremely low laten-cy in a small PCB footprint.

For more information, visithttp://www.xilinx.com/products/intel-

lectual-property/1-1LK9J1.htm andhttp://www.a2etechnologies.com.

T O O L S O F X C E L L E N C E

Third Quarter 2013 Xcell Journal 59

Page 60: Xcell Journal issue 84

60 Xcell Journal Third Quarter 2013

TOOLS OF XCELLENCE

Designers must consider voltage rails, voltage levels and high-current requirements when planning their power systems. The point-of-load converter will be a key component.

Innovative PowerDesign MaximizesVirtex-7 TransceiverPerformance by Takayuki Mizutori

Field Application EngineerBellnix Co. [email protected]

Page 61: Xcell Journal issue 84

Third Quarter 2013 Xcell Journal 61

W ith semiconductor compa-nies releasing ever-more-advanced devices, circuit

designers are constantly challenged tofigure out the best ways to ensure theirdesigns run at maximum performanceso their systems can reap all the benefitsof these new ICs. An often overlookedbut critical step to getting the best per-formance out of the most advancedFPGAs is to create a robust and reliablepower supply in your system.

The Xilinx® 7 series FPGAs provideunprecedented low power, high per-formance and an easy path for migrat-ing designs. This innovative familyemploys a 28-nanometer TSMC HPL(high-performance, low-power) processand offers a 50 percent reduction inpower consumption compared withcompeting FPGAs.

Yet as FPGAs increase in capacityand performance, FPGA I/O resourcesare becoming a major performance bot-tleneck. While transistor counts effec-tively double every year, the size of anFPGA die stays more or less the same.This is problematic for the I/O carryingsignals on and off the chip, as I/O canonly increase linearly, which presents a

slew of layout, signal integrity andpower challenges for systems designerstrying to create higher-bandwidth sys-tems. To overcome the issue, Xilinxemploys highly sophisticated, multigiga-bit transceiver (MGT) I/O modules in itsstate-of-the-art devices. The Virtex®-7HT FPGA, for example, provides a max-imum bandwidth of 2.78 Tbps, using 96sets of 13.1-Gbps transceivers.

As FPGAs get more sophisticated,the power supply requirements havealso become more complex. For today’sdesigns, teams must consider multiplevoltage rails corresponding to each cir-cuit and functional block, a plurality ofvoltage levels, as well as high-currentrequirements. Let’s examine the powerrequirements associated with theVirtex-7 FPGAs in greater depth.

ACCURACY IN OUTPUTVOLTAGE SETTING With the evolution of semiconductormanufacturing process technology,the density and performance of semi-conductors, especially FPGAs, haveimproved greatly. At the same time,semiconductor companies have beenreducing the core voltages of their

devices. Both factors complicatepower supply design and make it hardfor designers to select the right powercircuitry. Board layout and manufac-turing technologies are not sufficientto manage voltage within a few milli-volts, but the power systems must bemeasurably accurate and reliable. As aresult, it is imperative for design teamsto take power supply characteristicsand features as well as the device’scharacteristics into considerationwhen designing their power systems.For example, a core voltage of 1 V hasa tolerance of +/-30 mV. When a cur-rent changes by 10 A, a 50-mV voltagedrop occurs with a 5-milliohm powerline (see Figure 1). As such, it is noteasy to manage a power supply volt-age within the tolerable values in allthe environments in your boarddesigns. In addition, if design teamsare using a point-of-load (POL) powersupply, they need to be extremelyaccurate, within +/-1 percent, and havea high-speed transient response. Forthese and other reasons, design teamsshould thoroughly investigate thedesign and options available to thembefore building their boards.

T O O L S O F X C E L L E N C E

+3.3V

+2.5V

+1.2V

+1.0V

350nm 250nm 180nm 90nm 65nm 28nm

±300mv

±50mv ±30mv

Cor

e V

olta

ge V

ccin

t

Figure 1 – Evolution of the process and tolerance of the core voltage

Bellnix BPE-37

Page 62: Xcell Journal issue 84

One of the first things design teamsneed to be aware of when selecting cir-cuitry for their power system is thatFPGA designs have strict sequencingrequirements for startup. For example,the Virtex-7 FPGA power-on sequencerequires one of two choices:VCCINT_VMGTAVCC_VMGTAVTT orVMGTAVCC_VCCINT_VMGTAVTT.Meanwhile, the power-off sequence isthe reverse of whichever of thesepower-on sequences you chose. If youpower the device on or off in anysequence other than the ones recom-mended, the slack current from VMG-TAVTT could be more than the specifi-cation during power-up and power-down. To satisfy such sequencingrules, designers need to use either aspecialized IC for sequencing or theP-Good and on/off features of POLs.

A power system for a design usingan FPGA also needs to include circuitrythat allows the team to monitor out-put voltages. Typically, designerscheck the output voltage during aproduct’s fine-tuning stage and trimthe output voltage to manage printedpattern drop voltages. Designers typi-cally trim the voltage with POLs, butof course this means that the POLsyou use in your design need to have afeature for output-voltage adjustment.The power system also needs to allowyou to monitor your design’s voltageand current to check for current timedeviations and power malfunctions,and to better predict failures.

High-speed serdes require a highcurrent because they have several

channels and a high operating fre-quency. The low-dropout (LDO) regu-lators that engineers use for this pur-pose generate a lot of heat due topoor efficiency and thus are not anideal power solution for circuits thatconsume high current. High-speedserdes are quite sensitive to the jittercaused by noise in power circuitry.High jitter will corrupt signals andcause data errors. It is also veryimportant for serdes to have verygood signal integrity so as to achievegood high-speed data communica-tions. This of course means that whenchoosing a POL, designers need toensure their POL isn’t noisy.

SELECTING A HIGH-ACCURACYPOL CONVERTERA crucial step in building a proper powersystem for a Virtex-7 FPGA is ensuringthe POL converters meet a given device’srequirements. Fast-transient-responsePOL converters excel in designs wherethe current is high and shifts in currentare large. Not only are these high-accura-cy POL converters with fast transientresponses helpful in managing theFPGA’s shifting core voltage. They arealso useful for managing the VMGTAVCCand VMGTAVTT of an MGT. Figure 2shows a measurement of voltageresponse for the dynamic-load change ofseveral POL converters. To keep thingsfair, we have used the same conditionsfor all the POL converters. The onlyexception is that we used the capacitorthat each POL converter’s manufacturerrecommends in their literature.

• Input voltage: 5 V fixed

• Output voltage: 1 V fixed (Virtex-7 core voltage)

• Load current: 0~10 A

• Current change slew rate (current change rate per unit of time): 5 A/µs

Streaming data and numeric pro-cessing create changes in dynamiccurrent in FPGAs. Our results showthat some POL converters under-shoot and overshoot the output volt-ages, getting beyond 1 V ±30 mV,which is the VCCINT and VMGTAVCCvoltage that Xilinx recommends forVirtex-7 FPGAs. Of course, the num-ber might vary depending on usage,board layout and decoupling capaci-tors surrounding the POLs and theFPGAs. However, it is clear that weneed a very good power supply tokeep up with the recommended oper-ating-voltage range of the Virtex-7FPGAs, where large current changes(10 A and 5 A/µs) occur rapidly. TheBellnix BSV series fits these needsand also has a fast transientresponse, necessary for high-endFPGAs and other high-speed devices,such as discrete DSPs and CPUs.

PROGRAM FOR OUTPUT VOLTAGEAND SEQUENCINGAs described above, as FPGA processtechnologies shrink, their supply volt-age gets lower but the number of func-tions increases, as does their perform-ance. This means the current changes

62 Xcell Journal Third Quarter 2013

T O O L S O F X C E L L E N C E

Figure 2 – Examples of transient response in competing point-of-load converters; results for the Bellnix POL converter are at left.

Page 63: Xcell Journal issue 84

rapidly, increasing power require-ments. It is getting more difficult fordesign engineers to manage voltagewithin the allowed tolerance thatFPGAs require. This is true not onlyfor the core voltage of the FPGA butalso for the FPGA’s high-speed serdes.There are several important pointsrelated to power supplies you need toconsider to get the maximum per-formance from a Virtex-7 transceiver.For example, you need to adjust thePOL output voltage when a voltagedrop occurs due to power line resist-ance. Margining testing likewiserequires an output-voltage adjust-ment. Designers also need to monitor

and trim the output voltage of thePOL to evaluate systems quickly.

Here is a method to adjust the out-put voltage with a CPU program tosatisfy Virtex-7 power requirements.You can use the circuit diagram inFigure 3 to make a BSV-series POLinto programmable output voltage.

In the figure, Amp 1 detects theoutput voltage. You can control theoutput voltage with high accuracy byremote sensing with a differentialamplifier and cancel out the voltagedrop by connecting the POL converterto the load. Amp 2 compares the out-put voltage detected by Amp 1 withthe D/A converter’s output voltage. We

connected Amp 2’s output to the POLconverter’s trim pin. The output voltageof the POL converter changes with thetrim pin’s input voltage. Thus, the volt-age of the load and that of the D/A con-verter output match.

You can set an output voltage by con-trolling the D/A converter with the CPU.First, you must set the D/A converteroutput to 0 V, and then turn the on/offcontrol to ON and set the D/A converteroutput to the desired value when youturn on the POL converter. If the nonin-verted input voltage is high when theinverted input of Amp 2 is at 0 V (whenthe output is OFF), Amp 2 will try toraise the output voltage at the trim pinand cause an overvoltage when you turnon the POL converter. Also, you mustconnect an RC circuit to the D/A con-verter output to prevent Amp 2’s nonin-verted input pin from rising high rapidlyand creating an overshoot.

At Bellnix, we developed an evalua-tion power module called the BPE-37 forXilinx’s 7 series GTX/GTH FPGAs. BPE-37 uses the circuit described earlier andhas four POL converters required for theMGT. These four POL converters are allBSV-series POLs featuring high accura-cy, fast transient response and low ripplenoise. The BPE-37 uses a microcon-troller to control the D/A converter andto set each output voltage to an adequatelevel. The microcontroller also startsand stops the POL converters using theiron/off pins to control sequencing. Thistechnique allows design teams to modifyoutput voltage, change sequencing andmonitor the output voltage and currentvia PMBus serial communication.

The BPE-37’s circuit configurationmakes it possible to add digital function-ality to the conventional analog-typePOLs to satisfy all the power require-ments for Xilinx’s Virtex-7 series. TheGTX/GTH transceivers in these FPGAsrequire POLs with the fastest transientresponse. Table 1 shows the BPE-37’skey specifications.

For more details on Bellnix’s offering,please refer to http://www.bellnix.com/

news/274.html.

T O O L S O F X C E L L E N C E

Third Quarter 2013 Xcell Journal 63

Vin

GND GND

BSV series

Vsense

Trim

Vsense

Vsense

Vsense

Vout

+

+ +– +

––

3.3V

Amp 1

Amp 2

DAC

Figure 3 – How to make a BSV-series POL into programmable output voltage

Power supplyMaximumallowed

POL converters

Power supply noise

Dynamic response

MGTAVCC 1.0V±30mV BSV-1.5S12R0H 8mVp-p 18mV (voltage dip)

MGTAVTT 1.2V±30mV BSV-3.3S8R0M 8mVp-p 20mV (voltage dip)

MGTVCCAUX 1.8V±50mV BSV-3.3S3R0M 10mVp-p 10mV (voltage dip)

Table 1 – The BPE-37’s key specifications

Page 64: Xcell Journal issue 84

VIVADO DESIGN SUITE:DESIGN EDITION UPDATES

Vivado IP Integrator

IP Integrator is now in full productionrelease. The tool is available withevery seat of Vivado Design Suite.Enhancements include:

• Automated IP upgrade flow for migrat-ing a design to the latest IP version

• Cross-probing now enabled on IPIntegrator-generated errors andwarnings

• Integration with IP issued bySystem Generator

• Synthesis run-times up to 4x fasterfor IP Integrator-generated designs

• Error-correcting code (ECC) nowsupported in MicroBlaze™ withinIP Integrator

Vivado IP Catalog

The Vivado IP flow enables bottom-up synthesis in project-baseddesigns to reduce synthesis run-times (IP that has not changed willnot be resynthesized). Significantenhancements to the “Manage IP”flow include:

• Simplified IP creation and man-agement. An IP project is automat-ically created to manage thenetlisting of IP.

• Easy generation of a synthesizeddesign checkpoint (DCP) for IP, toenable IP usage in black-box flowsusing third-party synthesis tools.

• DCP for IP and the Verilog stubfiles are now co-located with theIP’s XCI file.

• Scripts are delivered for compilingIP simulation sources for behav-ioral simulation using third-partysimulators.

Vivado also offers improved han-dling of IP constraints. IP DCP con-tains the netlist and constraints forthe IP, and constraints are scopedautomatically.

64 Xcell Journal Third Quarter 2013

What’s New in theVivado 2013.2 Release? Xilinx is continually improving its products, IP and design tools as it strives to help designers

work more effectively. Here, we report on the most current updates to Xilinx design tools including

the Vivado® Design Suite, a revolutionary system- and IP-centric design environment built from

the ground up to accelerate the design of Xilinx® All Programmable devices. For more information

about the Vivado Design Suite, please visit www.xilinx.com/vivado.

Product updates offer significant enhancements and new features to the Xilinx design tools.

Keeping your installation up to date is an easy way to ensure the best results for your design.

The Vivado Design Suite 2013.2 is available from the Xilinx Download Center at

www.xilinx.com/download.

XTRA, XTRA

VIVADO DESIGN SUITE 2013.2 RELEASE HIGHLIGHTS

• Full support for Zynq®-7000 All Programmable SoCs

• Vivado IP Integrator now integrated with High-Level Synthesis and SystemGenerator in one comprehensive development environment

• Complete multiprocessor application support, including hardware/softwareco-debug, for detailed embedded-system analysis

• All 7 series FPGA and Zynq-7000 All Programmable SoC devices supported with PCI-SIG®-compliant PCI Express® cores

• Tandem two-stage configuration for PCI Express 100-ms requirements

Device Support

The following devices are production ready:

• Zynq-7000 All Programmable SoC: 7Z010, 7Z020 and 7Z100

• Defense-Grade Zynq All Programmable SoC: 7000Q, 7Z010, 7Z020 and 7Z030

• Defense-Grade Virtex®-7Q FPGA: VX690T and VX980T

• Defense-Grade Artix®-7Q FPGA: A100T and A200T

• XA Artix-7 FPGA: A100

Page 65: Xcell Journal issue 84

• Enhanced caching to speed upsubsequent model compilation.

UPDATES TO EXISTING IP

AXI Ethernet

New IEEE 1588 hardware timestamping for one-step and two-step

AXI Ethernet Lite

Virtex-7 FPGA in production

Trimode Ethernet MAC

Virtex-7 FPGA, Artix-7 FPGA, Zynq-7000 All Programmable SoC ver-sions in production

GMII2RGMII

Zynq-7000 All Programmable SoCversion in production

10G Ethernet MAC, RXAUI, XAUI

Virtex-7 FPGA, Zynq-7000 All Pro-grammable SoC versions in production

PCI32 and PCI64

Artix-7 FPGA in production

LEARN MORE ABOUT THEVIVADO DESIGN SUITE

Vivado QuickTake Video Tutorials

Vivado Design Suite QuickTake videotutorials are how-to videos that takea look inside the features of theVivado Design Suite. Topics includehigh-level synthesis, simulation andIP Integrator, among others. Newtopics are updated regularly. Seewww.xilinx.com/training/vivado.

Vivado Training

For instructor-led training on theVivado Design Suite, please visitwww.xilinx.com/training.

Vivado Debug

Debugging support has been added forthe Zynq-7Z100 All Programmable SoC;7 series XQ package and speed gradechanges; and Virtex-7 HT general ESdevices (7VHT580T and 7VHT870T).

• Target Communication Framework(TCF) agent (hw_server)

• Vivado serial I/O analyzer supportfor IBERT 7 series GTZ 3.0

• Added support for debugging multi-ple FPGAs in the same JTAG chain

Tandem Configuration for Xilinx PCIe® IP

Designers can use tandem configura-tion to meet the fast enumerationrequirements for PCI Express®. Atwo-stage bitstream is delivered to theFPGA. The first stage configures theXilinx PCIe IP and all design elementsto allow this IP to independently func-tion as quickly as possible. Stage twocompletes the device configurationwhile the PCIe link remains function-al. Two variations are available:Tandem PROM loads both stages fromthe same flash device, and TandemPCIe loads stage two over the PCIelink. All PCIe configurations are sup-ported up to X8Gen3.

• Tandem configuration is nowreleased with production status, forboth Tandem PROM and TandemPCIe. Devices with this status includeXC7K325T-FFG900, XC7VX485T-FFG1761 (PCIE X1Y0 locationrequired) and XC7VX690T-FFG1761(PCIE X0Y1 location required).

• Tandem configuration is available asbeta for additional devices, for bothTandem PROM and Tandem PCIe.Hardware testing for these deviceshas been limited. They include theXC7K160T-FFG676, XC7K410T-FFG676, XC7VX415T-FFG1158(PCIE X0Y0 location recommended)and XC7VX550T-FFG1158 (PCIEX0Y1 location recommended).

For more information, see PG054 (ver-sion 2.1) for Gen2 PCIe IP, or PG023(version 2.1) for Gen3 PCIe IP.

VIVADO DESIGN SUITE:SYSTEM EDITION UPDATES

Vivado High-Level Synthesis

• Video libraries with OpenCV inter-faces are enhanced with support for12 additional functions.

• Adders can now be targeted forimplementation in a DSP48 directlyfrom Vivado HLS.

• The Xilinx FFT core can now beinstantiated directly into the C/C++source code. This feature is beta.Contact your local Xilinx FAE fordetails on how to use this feature.

Vivado IP Integrator support inSystem Generator for DSP

• Model packaged as IP with supportfor gateway inputs and outputs to bepackaged as interfaces or raw ports.

• Capability to specify AXI4-Streaminterface in gateway blocks and toinfer.

• AXI4-Lite and AXI4-Stream inter-faces from named gateway blocks.

• System Generator generates exampleRTL Vivado project and example IPIntegrator block diagram to enableeasy evaluation of packaged IP.

• New System Generator IP Integratorintegration tutorial available withSystem Generator tutorials.

Introducing support for MATLAB®

and Simulink® release R2013a

Blockset enhancements include:

• Support for FIR Compiler v7.1 withintroduction of area, speed and cus-tom optimization options.

• Model upgrade flow now includesport and parameter checks to theHTML report.

Third Quarter 2013 Xcell Journal 65

Page 66: Xcell Journal issue 84

66 Xcell Journal Third Quarter 2013

Xpress Yourself in Our Caption Contest

I f you’ve looked at clouds from both sides now, take another gander.Here’s your opportunity to Xercise your funny bone by submitting anengineering- or technology-related caption for this cartoon showing an

engineer with his head—and all the rest of him, actually—in the clouds. Theimage might inspire a caption like “Dave made major strides in cloud com-puting after studying transcendental meditation.”

Send your entries to [email protected]. Include your name, job title, companyaffiliation and location, and indicate that you have read the contest rules atwww.xilinx.com/xcellcontest. After due deliberation, we will print the sub-missions we like the best in the next issue of Xcell Journal. The winner willreceive an Avnet ZedBoard, featuring the Xilinx® Zynq®-7000 AllProgrammable SoC (http://www.zedboard.org/). Two runners-up will gainnotoriety, fame and a cool, Xilinx-branded gift from our swag closet.

The contest begins at 12:01 a.m. Pacific Time on July 15, 2013. All entries mustbe received by the sponsor by 5 p.m. PT on Oct. 1, 2013.

So, look for that silver lining and get writing!

ROBIN FINDLEY, embedded-systemsconsultant/contractor with Findley

Consulting, LLC in Clarkesville, Ga.,won a shiny new Avnet ZedBoard

with this caption for the asteroid car-toon in Issue 83 of Xcell Journal:

Congratulations as well to

our two runners-up:

“Remember when you asked, ‘Whatcould possibly go wrong?’”

– Mark Rives, systems engineer,

Micron Technology, Berthoud, Colo.

“Larry, you know how I said our timing looked great???”

– Dave Nadler, owner,

Nadler & Associates, Acton, Mass.

“Paul watches as Engineering lobsanother design over the wall.”

DA

NIE

LG

UID

ER

AXCLAMATIONS!

NO PURCHASE NECESSARY. You must be 18 or older and a resident of the fifty United States, the District of Columbia, or Canada (excluding Quebec) to enter. Entries must be entirely original. Contest begins on July 15, 2013. Entries must be received by 5:00 pm Pacific Time (PT) on Oct. 1, 2013. Official rules are available online at www.xilinx.com/xcellcontest. Sponsored by Xilinx, Inc. 2100 Logic Drive, San Jose, CA 95124.