SOLUTIONS FOR A PROGRAMMABLE WORLD
Xcell Journal, Issue 84, Third Quarter 2013
Efficient Bitcoin Miner System Implemented on Zynq SoC
Benchmark: Vivado’s ESL Capabilities Speed IP Design on Zynq SoC
How to Boost Zynq SoC Performance by Creating Your Own Peripheral
Xilinx Goes UltraScale at 20 nm and FinFET
XRadio: A Rockin’ FPGA-only Radio for Teaching System Design
The articles, information, and other materials included in this issue are provided solely for the convenience of our readers. Xilinx makes no warranties, express, implied, statutory, or otherwise, and accepts no liability with respect to any such articles, information, or other materials or their use, and any use thereof is solely at the risk of the user. Any person or entity using such information in any way releases and waives any claim it might have against Xilinx for any loss, damage, or expense caused thereby.
Xilinx UltraScale Raises Value of All Programmable
If you’ve been keeping up with developments in the road map for advanced semiconductor process technologies, you’re probably aware that the cost of producing silicon is accelerating. Some have asserted that the cost benefit afforded by Moore’s Law—namely, that the cost stays the same even though the transistor count doubles—is running out of gas as processes shrink below 28 nanometers. Indeed, for the 20-nm planar and 14/16-nm FinFET generation, industry analysts foresee a tandem trend of rising NRE costs and spiraling price tags for each transistor at each new node.
Analysts say that to make a decent profit on a chip implemented at 20 nm, you will first need to clear an estimated lowball cost of $150 million for designing and manufacturing the device. And to hit that lowball mark, you’d better get it right the first time, because mask costs are approaching $5 million to $8 million per mask.
FinFET looks even more challenging. Analysts predict that the average cost for the 16/14-nm FinFET node will be around $325 million—more than double the cost of 20-nm manufacturing! Why so high? Analysts believe companies will encounter excessive retooling, methodology and IP-requalification costs in switching from planar-transistor processes to FinFETs. Tools and methods will have to deal with new complexities in signal integrity, electromigration, width quantization, and resistance and capacitance. The combination of new transistor, lithography and manufacturing techniques and tools—along with revamped and more expensive IP—represents a perfect storm for design teams.
To say the least, it’s hard to justify targeting 16/14-nm FinFET unless the end product will sell in multiple tens of millions of units and have returns counted in billions rather than millions of dollars. As such, eliminating the NRE costs for customers designing on these advanced nodes is a growing sales advantage for all vendors playing in the FPGA space. But if you have been following Xilinx’s progress for the last five years, you’ve seen that the company has clearly demonstrated technical superiority and a unique value proposition. Today’s Xilinx® All Programmable devices offer unmatchable system functionality with high capacity and performance, lower power and reduced BOM costs, paired with more flexibility and greater programmability than any other silicon device on the market.
Xilinx executed several breakout moves at the 28-nm node, becoming the first to market with 28-nm FPGAs, 3D ICs and Zynq®-7000 All Programmable SoCs. At the same time, the company introduced the Vivado® Design Suite, incorporating state-of-the-art EDA technologies. There are more breakthroughs in store at the 20-nm and 16-nm FinFET nodes. In July, Xilinx once again became the first merchant chip company to tape out an IC at a new manufacturing node: 20 nm. These new devices will include numerous architectural innovations in Xilinx’s new UltraScale™ architecture.
As you will read in the cover story, the UltraScale architecture will deliver many benefits for customers, perhaps the most noteworthy of which is a vast increase in system utilization—better than 90 percent. It’s worth noting that the power of the UltraScale architecture could not have been fully realized without Vivado. The tool suite has greatly boosted productivity for Xilinx customers using the 28-nm devices. Now, this sophisticated EDA technology is helping Xilinx unleash the benefits of 20-nm and FinFET silicon. Xilinx will deliver that unmatchable advantage to customers—with no NRE charges.
Building on its breakout move at 28 nanometers, Xilinx® has announced two industry firsts at the 20-nm node. Xilinx is the first merchant chip company to tape out a 20-nm device. What’s more, the new device will be the first Xilinx will bring to market using its UltraScale™ technology—the programmable industry’s first ASIC-class architecture. The UltraScale architecture takes full advantage of the state-of-the-art EDA technologies in the Vivado® Design Suite, enabling customers to quickly create a new generation of All Programmable innovations. The move continues the Generation Ahead advantage over competitors that began at 28 nm, when Xilinx delivered two first-of-their-kind devices: the Zynq®-7000 All Programmable SoC and Virtex®-7 3D ICs.
“Xilinx was the first company to deliver 28-nm devices and at 20 nm, we have maintained the industry’s most aggressive tapeout schedule,” said Steve Glaser, senior vice president of marketing and corporate strategy at Xilinx. “Our hard work is paying off yet again, and we are clearly a year ahead of our competition in delivering high-end devices and half a year ahead in delivering midrange devices.”
But as Xilinx did at 28 nm, the company is also adding some industry-first innovations to its new portfolio. Among the first the company is revealing is the new UltraScale architecture, which it will be implementing at 20-nm, 16-nm FinFET and beyond.

Xilinx’s industry-first 20-nm tapeout signals the dawning of an UltraScale-era, ASIC-class programmable architecture.

“It’s the first programmable architecture to enable users to implement ASIC-class designs using All Programmable devices,” said Glaser. “The UltraScale architecture enables Xilinx to deliver 20-nm and FinFET 16-nm All Programmable devices with massive I/O and memory bandwidth, the fastest packet processing, the fastest DSP processing, ASIC-like clocking, power management and multilevel security.”
AN ARCHITECTURAL ADVANTAGE
The UltraScale architecture includes hundreds of structural enhancements, many of which Xilinx would have been unable to fully implement had it not already developed its Vivado Design Suite, which brings leading-edge EDA tool capabilities to Xilinx customers. For example, Vivado’s advanced placement-and-routing capabilities enable users to take full advantage of UltraScale’s massive data-processing abilities, so that design teams will be able to achieve better than 90 percent utilization in UltraScale devices while also hitting ambitious performance and power targets. This utilization is far beyond what’s possible with competing devices, which today require users to trade off performance for utilization. Customers are thus forced to move to that vendor’s bigger and more expensive device, only to find they have to slow down the clock yet again to meet system power goals.
This problem didn’t plague Xilinx at the 28-nm process node thanks to Xilinx’s routing architecture. Nor will it be an issue at 20 nm, because the UltraScale architecture can implement massive data flow for wide buses to help customers develop systems with multiterabit throughput and with minimal latency. The UltraScale architecture co-optimized with Vivado results in highly optimized critical paths with built-in, high-speed memory cascading to remove bottlenecks in DSP and packet processing. Enhanced DSP subsystems combine critical-path optimization with new 27x18-bit multipliers and dual adders that enable a massive jump in fixed- and IEEE-754 floating-point arithmetic performance and
10 Xcell Journal Third Quarter 2013
COVER STORY
Figure 1 – The UltraScale architecture enables massive I/O and memory bandwidth, fast packet processing and DSP processing, along with ASIC-like clocking, power management and multilevel security.
➤ Monolithic to 3D IC
➤ Planar to FinFET
➤ ASIC-class performance
efficiency. The wide-memory implementation also applies to UltraScale 3D ICs, which will see a step-function improvement in interdie bandwidth for even higher overall performance.
UltraScale devices feature further I/O and memory-bandwidth enhancements, including support for next-generation memory interfacing that offers a dramatic reduction in latency and optimized I/O performance. The architecture offers multiple hardened, ASIC-class IP cores, including 10/100G Ethernet, Interlaken and PCIe®.
The UltraScale architecture also enables multiregion clocking—a feature typically found only in ASIC-class devices. Multiregion clocking allows designers to build high-performance, low-power clock networks with extremely low clock skew in their systems. The co-optimization of the UltraScale architecture and Vivado also allows design teams to employ a wider variety of power-gating techniques across a wider range of functional elements in Xilinx’s 20-nm All Programmable devices to achieve further power savings in their designs.
Last but not least, UltraScale supports advanced approaches to AES bitstream decryption and authentication, key obfuscation and secure device programming to deliver best-in-class system security.
TAILORED FOR KEY MARKETS
Glaser said that UltraScale devices will enable design teams to achieve even greater levels of system integration while maximizing overall system performance, lowering total system power and reducing the overall BOM cost of their systems. “Building on what we did at 28 nm, Xilinx at the 20-nm and FinFET 16/14 nodes is raising the value proposition of our All Programmable technology far beyond the days when FPGAs were considered merely a nice alternative to logic devices,” he said. “Our unique system value is abundantly clear across a broad number of applications.”
Glaser said devices using Xilinx’s 20-nm and FinFET 16-nm UltraScale architecture squarely address a number of growth market segments, such as optical transport networking (OTN), high-performance computing in the network, digital video and wireless communications (see Figure 2). All of these sectors must address increasing requirements for performance, cost, lower power and massive integration.
ULTRASCALE TO SMARTER OTN NETWORKS
The OTN market is currently undergoing a change to smarter networks, as data traffic dictates a massive buildout to create ever-more-advanced wired networking equipment that can move data from 100 Gbps today to 400 Gbps soon and, tomorrow, to 1 Tbps. While the traditional business approach would be to simply throw massive amounts of newer and faster equipment at the problem, today network operators are seeking smarter ways to build out their networks and to improve the networks they have installed while finding better ways to manage costs—all to vastly improve service and profitability.
For example, the data center market is looking to software-defined networking (SDN) as a way to optimize network utilization and cut costs. SDN will allow network managers to leverage the cloud and virtual networks to provide data access to any type of Internet-connected device. The hardware to enable these software-defined networks will have to be extremely flexible, reliable and high performance. Greater pro-
Figure 2 callouts: >400 Gbps wire-speed; massive data flow, >5 Tbps; massive I/O and memory bandwidth.
Figure 2 – The UltraScale architecture will speed the development of multiple types of next-generation systems, especially in wired and wireless communications.
by Mike Santarini
If you’ve been keeping up with the latest news in semiconductor process technology, you’ve likely been reading that the world’s most advanced foundries will manufacture devices using new transistor structures called “FinFETs” as the fundamental building blocks for chips they’ll produce at what’s collectively called the 16/14-nanometer node. But you may be wondering just what FinFETs are, how they differ from standard transistors, and what benefits and challenges they bring.
One of the key inventions in semiconductor manufacturing that has enabled today’s $292 billion semiconductor business was Jean Hoerni’s creation of the planar process at Fairchild Semiconductor back in the 1950s. The planar process makes it possible to implement ever-smaller transistors, arrange them into a vast array of circuits and interconnect them efficiently on a horizontal plane, rather than having them stick up, as discrete components do on printed-circuit boards. The planar process has enabled the extreme miniaturization of ICs. With the planar process, semiconductors are built in layers on top of, and etched into, ultrapure silicon wafers. Figure 1a shows a planar transistor (essentially an elaborate on/off switch) composed of three main features: source, gate and drain, which have been etched and layered into a silicon substrate. Figure 1b shows a FinFET. Notice that here, the gate surrounds the channel on three sides, whereas in the planar transistor the gate only covers the top of the channel.
The source is where electrons enter the transistor. Depending on the voltage at the gate, the transistor’s gate will be either opened or closed in the channel under the gate, similar to the way a light switch can be switched on or off. If the gate allows the signal through the channel, the drain will move the electrons to the next transistor in the circuit. The ideal transistor allows lots of current through when it is “on” and as little current as possible when “off,” and can switch between on and off billions of times per second. The speed of the switching is the fundamental parameter that determines the performance of an IC. Chip design companies arrange and group transistors into an array of circuits, which in turn are arranged and grouped into functional blocks (such as processors, memory and logic blocks). Then, these blocks too are arranged and grouped to create the marvelous ICs in the plethora of electronic marvels we benefit from today.
Since the 1960s, the semiconductor industry has introduced a myriad of silicon innovations to keep pace with Moore’s Law, which states that the number of transistors in an IC will double every two years. This doubling of transistors means that today’s leading-edge ICs contain billions of transistors, crammed into essentially the same die size as 1960s chips but running exponentially faster with exponentially lower power, at 1.2 volts or below. Today, a transistor’s features are so small that some are only dozens of atoms wide.
KEEPING UP WITH MOORE
The industry is always struggling to keep pace with Moore’s Law in the face of what appear to be true physical limitations to further advancement in semiconductors. Over the last 10 years, process specialists have striven to ensure the electrical integrity of the planar transistor. Simply put, the planar transistor’s source, gate and drain features were so small and performance demands so high that these transistors could not adequately control electrons moving through them. The electrons would essentially leak or get pulled through by the drain even if a device was turned off.
For battery-powered devices such as mobile phones, this means the battery would drain charge faster, even if the phone was turned off. For AC-powered devices (those that plug into a wall socket), it means they waste power and run hotter. This heat, if profound and unmanaged by tricks of cooling, can lessen the lifetime of a product, because leakage creates heat and heat increases leakage. Of course, the cooling adds extra cost to products, too. The leakage problem became strikingly noticeable in semiconductors at the 130-nm node. It was in this era of extreme leakage that Microsoft had to recall its Xbox 360 worldwide due to overheating; leakage is one of the reasons why microprocessor companies had to move to less efficient multicore processors over faster but much hotter single-processor architectures.
After 130 nm, several refinements came along to reduce heat and improve planar transistors on all fronts to keep up with Moore’s Law. At 90 nm, the industry moved to low-k-dielectric insulators to improve switching, while the 65- and 40-nm nodes introduced further refinements through stressed silicon. At 28 nm, the industry moved to high-k gate dielectrics and metal gates. Xilinx’s use of the TSMC 28-nm HPL (high-performance, low-power) process provided the ideal mix of performance and power, giving Xilinx a Generation Ahead advantage over competitors at that geometry.
The planar transistor process will be viable at the 20-nm node thanks to a high-k gate dielectric, metal gates and a
THANKS, MR. HOERNI, BUT THE FIN IS IN
Dawning of a new era in transistor technology benefits Xilinx customers.
series of more elaborate manufacturing steps such as double patterning. The 20-nm node provides phenomenal benefits in terms of power as well as performance, but the cost is increasing marginally because of elaborate manufacturing to ensure silicon integrity.
Thanks in large part to remarkable research started by Cal Berkeley professor Chenming Hu under a DARPA contract, the 20-nm process will likely be the last hurrah for the planar transistor (at least as we know it today), as the industry moves to FETs built with fins.
INS AND OUTS OF FINS
In a planar transistor of today, electrical current flows from source to drain through a flat, 2D horizontal channel underneath the gate. The gate voltage controls current flow through the channel. As transistor size shrinks with the introduction of each new silicon process, the planar transistor cannot adequately stop the flow of current when it is in an “off” state, which results in leakage and heat. In a FinFET MOSFET transistor, the gate wraps around the channel on three sides, giving the gate much better electrostatic control to stop the current when the transistor is in the “off” state. Superior gate control in turn allows designers to increase the current and switching speed and, thus, the performance of the IC. Because the gate wraps around three sides of the fin-shaped channel, the FinFET is often called a 3D transistor (not to be confused with 3D ICs, like the Virtex-7 2000T, which Xilinx pioneered with its stacked-silicon technology).
In a three-dimensional transistor (see Figure 1b), gate control of the channel is on three sides rather than just one, as in conventional two-dimensional planar transistors (see Figure 1a). Even better channel control can be achieved with a thinner fin, or in the future with a gate-all-around structure where the channel will be enclosed by a gate on all sides.
The industry believes the 16-nm/14-nm FinFET process will enable a 50 percent performance increase at the same power as a device built at 28 nm. Alternatively, the FinFET device will consume 50 percent less power at the same performance. The performance-per-watt benefits added to the continued increases in capacity make FinFET processes extremely promising for devices at 16 or 14 nm and beyond.
That said, the cost and complexity of designing and manufacturing 3D transistors is going to be higher, at least for the short term, as EDA companies figure out ways to adequately model the device characteristics of these new processes and augment their tools and flows to account for signal integrity, electromigration, width quantization, resistance and capacitance. This complexity makes designing ASICs and ASSPs even riskier and more expensive than before.
Xilinx, however, shields users from the manufacturing details. Customers can reap the benefits of increased performance per watt and Xilinx’s Generation Ahead design flows to bring innovations based on the new UltraScale architecture to market faster.
Figure 1 – The position of the gate differs in the two-dimensional traditional planar transistor (a) vs. the three-dimensional FinFET transistor (b).
grammability coupled more tightly with processing will be mandatory.
While network manufacturing companies develop next-generation software-defined networks, their customers are constantly looking to prolong the life of existing equipment. Companies have implemented single-chip CFP2 optical modules in OTN switches with Xilinx 28-nm Virtex-7 H580T 3D ICs to exponentially increase network bandwidth (see cover story, Xcell Journal issue 80, http://issuu.com/xcelljournal/docs/xcell80/8?e=2232228/2002872). With UltraScale, Xilinx is introducing devices that enable further integration and innovation, allowing companies to create systems capable of driving multiple CFP4 optical modules for yet another exponential gain in data throughput (see Figure 3).
ULTRASCALE TO SMARTER DIGITAL VIDEO AND BROADCAST
Digital video is another market facing a massive buildout. HDTV manufacturers and the supporting broadcast infrastructure are rapidly building TVs and broadcast infrastructure equipment (cameras, communications and production gear) for 1080p, 4k/2k and 8k video. TVs will not only need to conform to these upcoming standards, they must also be feature rich and reasonably priced. The market is extremely competitive, so having devices that can adjust to changing standards quickly and can help manufacturers differentiate their TVs is imperative. And while screen resolution and picture quality will increase greatly with the transition from 1080p to 4k/2k to 8k, TV customers will want ever thinner models. This means new TV designs will require even greater integration while achieving the performance, power and BOM cost requirements the consumer market demands.
While broadcast equipment will need to advance drastically to conform to the requirements for 4k/2k and 8k video, perhaps the biggest customer requirements for these systems will be flexibility and upgradability. Broadcast standards are ever evolving. Meanwhile, the equipment to support these standards is expensive and not easily swapped in and out. As a result, broadcast companies are getting smarter about their buying and are making longevity and upgradability their top requirements. This makes equipment based on All Programmable devices a must-have for all sectors of the next-generation broadcast market, from the camera at the ballgame to the production equipment in the control room to the TV in your living room.
Figure 3 callouts: Virtex UltraScale solution; programmable system integration, 2 devices → 1 device.
Figure 3 – The UltraScale architecture will enable higher-performance, lower-power single-chip CFP4 modules that will vastly improve the performance of existing OTN equipment with a low-cost card-swap upgrade.
Figure 4 illustrates how a midrange Kintex® UltraScale device will enable professional and “prosumer” camera manufacturers to quickly bring to market new cameras with advanced capabilities for 4k/2k and 8k while dramatically cutting cost, size and weight.
ULTRASCALE TO SMARTER WIRELESS COMMUNICATIONS
Without a doubt, the most rapidly expanding business in electronics is mobile communications. With usage of mobile devices increasing tenfold over the last 10 years and projections for even more dramatic growth in the coming 10, carriers are scrambling to devise ways to woo customers to service plans in the face of increased competition. Wireless coverage and network performance are key differentiators and so carriers are always looking for network equipment that is faster and can carry a vast amount of data quickly. They could, of course, buy more and more higher-performance basestations, but the infrastructure costs of building out and maintaining more equipment cut into profitability, especially as competitors try to lure customers away by offering more cost-effective plans. Very much like their counterparts in wired communications, wireless carriers are seeking smarter ways to run their business and are looking for innovative network architectures.
One such architecture, called cloud radio-access networks (C-RAN), will enable carriers to pool all the baseband processing resources that would normally be associated with multiple basestations and perform baseband processing “in the cloud.” Carriers can automatically shift their resources to areas experiencing high demand, such as those near live sporting events, and then reallocate the processing to other areas as the crowd disperses. The result? Better quality of service for customers and a reduction in capital as well as operating expenditures for carriers, with improved profitability.
Creating viable C-RAN architectures requires highly sophisticated and highly flexible systems-on-chip that integrate high-speed I/O with parallel and serial processing. Xilinx’s UltraScale FPGAs and SoCs will play key roles in this buildout.
Xilinx is building UltraScale 20-nm versions of its Kintex and Virtex All Programmable FPGAs, 3D ICs and Zynq All Programmable SoCs. The company is also developing Virtex UltraScale devices in TSMC’s “FinFET 16” technology (see sidebar, page 12). For more information on Xilinx’s UltraScale architecture, visit Xilinx’s UltraScale page and read the backgrounder.
Figure 4 callouts: image processing, video processing and connectivity; CPU; 10G-SDI; 12G-SDI; 10G VoIP; Kintex UltraScale solution; programmable system integration, 4 devices → 2 devices.
Figure 4 – The UltraScale architecture will let professional and consumer-grade camera manufacturers bring 8k cameras to market sooner, saving power, reducing weight, improving performance and lowering BOM cost.
XCELLENCE IN NETWORKING
Efficient Bitcoin Miner System Implemented on Zynq SoC
Bitcoin is a virtual currency that has become increasingly popular over the past few years. As a result, Bitcoin followers have invested some of their assets in support of this currency by either purchasing or “mining” Bitcoins. Mining is the process of implementing mathematical calculations for the Bitcoin Network using computer hardware. In return for their services, Bitcoin miners receive a lump-sum reward, currently 25 Bitcoins, and any included transaction fees. Mining is extremely competitive, since network rewards are divided up according to how much calculation all miners have done.
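The lump-sum reward mentioned above is fixed by the protocol: the block subsidy began at 50 BTC and halves every 210,000 blocks, which is why it stood at 25 Bitcoins when this article was written. A minimal sketch of that schedule (the function name is ours):

```python
# Sketch of Bitcoin's block-subsidy halving schedule, expressed in
# satoshis (1 BTC = 100,000,000 satoshis).

SATOSHIS_PER_BTC = 100_000_000
HALVING_INTERVAL = 210_000  # blocks between halvings

def block_subsidy(height: int) -> int:
    """Return the block subsidy in satoshis at a given block height."""
    halvings = height // HALVING_INTERVAL
    if halvings >= 64:  # guard mirrors the consensus code; the shift is zero long before this
        return 0
    return (50 * SATOSHIS_PER_BTC) >> halvings

print(block_subsidy(0) / SATOSHIS_PER_BTC)        # 50.0 (genesis era)
print(block_subsidy(250_000) / SATOSHIS_PER_BTC)  # 25.0 (mid-2013, after the first halving)
```

Because the reward shrinks geometrically, the total supply converges toward 21 million BTC.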
Bitcoin mining began as a software process on cost-inefficient hardware such as CPUs and GPUs. But as Bitcoin’s popularity has grown, the mining process has made dramatic shifts. Early-stage miners had to invest in power-hungry processors in order to obtain decent hash rates, as mining rates are called. Although CPU/GPU mining is very inefficient, it is flexible enough to accommodate changes in the Bitcoin protocol. Over the years, mining operations have slowly gravitated toward dedicated or semi-dedicated ASIC hardware for optimized and efficient hash rates. This shift to hardware has generated greater efficiency in mining, but at the cost of lower flexibility in responding to changes in the mining protocol.
An ASIC is dedicated hardware that is used in a specific application to perform certain specified tasks efficiently. So although ASIC Bitcoin miners are relatively inexpensive and produce excellent hash rates, the trade-off is their inflexibility to protocol changes.
Like an ASIC, an FPGA is also an efficient miner and is relatively inexpensive. In addition, FPGAs are more flexible than ASICs and can adjust to Bitcoin protocol changes. The current problem is to design a completely efficient and fairly flexible mining system that does not rely on a PC host or relay device to connect with the Bitcoin Network. Our team accomplished this task using a Xilinx® Zynq®-7000 All Programmable SoC.
THE BIG PICTURE
To accomplish our goal of implementing a complete mining system, including a working Bitcoin node and a miner that is both efficient and flexible, we needed a powerful FPGA chip not only for flexibility, but also for performance. In addition to the FPGA, we also needed to use a processing engine for efficiency.
On this complete system-on-chip (SoC), we would need an optimized kernel to run all required Bitcoin activities, including network maintenance and transactions handling. The hardware that satisfied all these conditions was the Zynq-7020 SoC residing on the ZedBoard development board. At around $300 to $400, the ZedBoard is fairly inexpensive for its class (see http://www.zedboard.org/buy). The Zynq-7020 SoC chip contains a pair of ARM® Cortex™-A9 processors embedded in 85,000 Artix®-7 FPGA logic cells. The ZedBoard also comes with 512 Mbytes of DDR3 memory that enabled us to run our SoC design faster. Lastly, the ZedBoard has an SD card slot for mass storage. It enabled us to put the entire updated Bitcoin program onto it.
Using the ZedBoard, we implemented our SoC Bitcoin Miner. It includes a host, a relay, a driver and a miner, as seen in Figure 1. We used the non-graphical version of the original Bitcoin client, known as Bitcoind, as a
Figure 1 – The basic data flow of the miner showing the relationships between the components and the Bitcoin Network
host to interact with the Bitcoin Network. The relay uses the driver to pass work from the host to the miner.
THE DEPTHS OF THE MINING CORE
We started developing the mining core with Vivado® HLS (high-level synthesis), using the U.S. Department of Commerce’s specification for SHA-256. With Vivado HLS’ fast behavioral testing capability, we quickly laid out several prototype mining cores, ranging from a simple single-process to a complex multiprocess system. But before exploring the details of the mining-core structure, it’s useful to look at the basics of the SHA-256 process.
The SHA-256 process pushes 64 bytes through 64 iterations of bitwise shifts, additions and exclusive-ORs, producing a 32-byte hash. During the process, eight 4-byte registers hold the results of each round, and when the process is done these registers are concatenated to produce the hash. Input data that does not fill a 64-byte chunk is padded: a marker byte is appended, followed by zeros, and the final 8 bytes of the last chunk store the length of the input data in bits. More commonly, the input data spans several chunks. In this case, the data is padded out to the nearest multiple of 64 bytes, with the tail of the last chunk reserved for the length. Each 64-byte chunk is then run through the SHA-256 process one after another, using the output of the previous chunk, known as the midstate, as the basis for the next chunk.
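The padding rule can be sketched concretely. The following is a software illustration of the standard FIPS 180-4 padding (the function name is ours), showing why an 80-byte first-version block header expands to exactly two 64-byte chunks:

```python
def sha256_pad(message: bytes) -> bytes:
    """Pad per the SHA-256 spec: append a 0x80 marker byte, zero-fill
    until 8 bytes short of a 64-byte boundary, then append the original
    length in bits as a big-endian 64-bit integer."""
    bit_len = 8 * len(message)
    padded = message + b"\x80"
    padded += b"\x00" * ((56 - len(padded)) % 64)
    return padded + bit_len.to_bytes(8, "big")

header = bytes(80)        # a dummy 80-byte version-1 block header
padded = sha256_pad(header)
print(len(padded) // 64)  # 2 -> the first hashing pass needs two chunk runs
```

Messages up to 55 bytes fit in a single chunk; anything longer spills into additional chunks, each fed through the compression process in turn.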
With Vivado HLS, the implementation of the SHA-256 process is a simple “for” loop; an array to hold the constants required by each round; an additional array to hold the temporary values used by later rounds; eight variables to hold the results of each round; and definitions for the logic operations, as seen in Figure 2.
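The authors’ HLS source is not reproduced in the article, but the loop structure just described can be sketched in plain Python: the round-constant array, the message-schedule array of temporary values and the eight working variables map directly onto the pieces listed above. This is the standard FIPS 180-4 compression function, checked against Python’s hashlib:

```python
import hashlib

MASK = 0xFFFFFFFF

def ror(x, n):
    """32-bit rotate right."""
    return ((x >> n) | (x << (32 - n))) & MASK

# The 64 round constants from the SHA-256 spec.
K = [
    0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5, 0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5,
    0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3, 0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174,
    0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc, 0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da,
    0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7, 0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967,
    0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13, 0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85,
    0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3, 0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070,
    0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5, 0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3,
    0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208, 0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2,
]

# Initial hash values (the starting midstate).
IV = [0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a,
      0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19]

def sha256_compress(state, chunk):
    """Run one 64-byte chunk through the 64-round loop. `state` is eight
    32-bit words; the return value is the new state (the midstate if
    more chunks follow)."""
    w = [int.from_bytes(chunk[i:i + 4], "big") for i in range(0, 64, 4)]
    for t in range(16, 64):  # message-schedule expansion (the temporary values)
        s0 = ror(w[t - 15], 7) ^ ror(w[t - 15], 18) ^ (w[t - 15] >> 3)
        s1 = ror(w[t - 2], 17) ^ ror(w[t - 2], 19) ^ (w[t - 2] >> 10)
        w.append((w[t - 16] + s0 + w[t - 7] + s1) & MASK)
    a, b, c, d, e, f, g, h = state
    for t in range(64):      # the simple "for" loop over 64 rounds
        t1 = (h + (ror(e, 6) ^ ror(e, 11) ^ ror(e, 25))
              + ((e & f) ^ (~e & g)) + K[t] + w[t]) & MASK
        t2 = ((ror(a, 2) ^ ror(a, 13) ^ ror(a, 22))
              + ((a & b) ^ (a & c) ^ (b & c))) & MASK
        h, g, f, e, d, c, b, a = g, f, e, (d + t1) & MASK, c, b, a, (t1 + t2) & MASK
    return [(s + v) & MASK for s, v in zip(state, (a, b, c, d, e, f, g, h))]

# Sanity check: one padded chunk ("abc" plus standard padding) must match hashlib.
chunk = b"abc" + b"\x80" + bytes(52) + (24).to_bytes(8, "big")
digest = b"".join(v.to_bytes(4, "big") for v in sha256_compress(IV, chunk))
assert digest == hashlib.sha256(b"abc").digest()
```

In hardware the same 64-round loop is unrolled or pipelined, but the data dependencies are identical.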
At its simplest, the Bitcoin mining process boils down to a SHA-256 process and a comparator. The SHA-256 process manipulates the block header information twice, and then compares it with the Bitcoin Network's expanded target difficulty, as seen in Figure 3A. This is a simple concept that becomes complicated in hardware, because the first-version Bitcoin block header is 80 bytes long. This means that the initial pass requires two runs of the SHA-256 process, while the second pass requires only a single run, as seen in Figure 3B. This double pass, defining two separate SHA-256 modules, complicates the situation because we want to keep the SHA-256 module as generic as possible to save development time and to allow reuse. This generic requirement on the SHA-256 module defines the inputs and outputs for us, setting them as the 32-byte initial values, 64-byte data and 32-byte hash. Thus, with the core aspect of the SHA-256 process isolated, we can then manipulate these inputs to any layout we desire.
The first of the three prototypes we developed used a single SHA-256 Process module. From the beginning, we knew that this core would be the slowest of the three because of the lack of pipelining; however, we were looking to see how compact we could make the SHA-256 process.
Third Quarter 2013 Xcell Journal 19
Figure 3 – A (top left) is the simplest form of the core, showing the general flow. B splits the single, universal SHA-256 process module into a pair of dedicated modules for each phase. C breaks down the specific SHA-256 processes into a single generic module that can be repeatedly used. D utilizes a quirk in the block header information allowing the removal of the first generic SHA-256 process.
The second prototype contained three separate SHA-256 Process modules chained together, as seen in Figure 3C. This configuration allowed for static inputs, where the first process handles the first 64 bytes of the header information. The second process handles the remaining 16 bytes from the header plus the 48 bytes of required padding. The third process handles the resulting hash from the first two. These three separate processes simplified the control logic and allowed for simple pipelining.
The third and final prototype used two SHA-256 Process modules and took advantage of a quirk in the Bitcoin header information and mining process that the Bitcoin mining community makes use of. The block header information is laid out in such a way that only the first 64 bytes change when the block is modified. This means that once the first 64 bytes are processed, we can save the output data (midstate) and only process the last 16 bytes of the header, for a significant hash rate boost (see Figure 3D).
The comparator is a major component and, properly designed, can significantly increase performance. A solution is a hash that has a numeric value less than the 32-byte expanded target difficulty defined by the Bitcoin system. This means we must compare each hash we produce and wait to see if a solution is found. Due to a quirk with the expanded target difficulty, the first 4 bytes will always be zero regardless of what the difficulty is, and as the network's difficulty increases so will the number of leading zeros. At the time of this writing, the first 13 bytes of the expanded target difficulty were zero. Thus, instead of comparing the whole hash to the target difficulty, we check to see if the first x number of bytes are zero. If they are zero, then we compare the hash to the target difficulty. If they are not zero, then we toss the hash out and start over.
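The two-stage check described above can be written out as a short C function; the big-endian byte ordering and the function name are our illustrative choices:

```c
#include <stdint.h>
#include <string.h>

/* Two-stage comparator: reject quickly on the leading zero bytes of the
   expanded target, and only run the full 32-byte comparison when they
   all match. `zero_bytes` was 13 at the time the article was written. */
int hash_meets_target(const uint8_t hash[32], const uint8_t target[32],
                      int zero_bytes)
{
    for (int i = 0; i < zero_bytes; i++)
        if (hash[i] != 0)
            return 0;                     /* toss the hash, start over */
    return memcmp(hash, target, 32) < 0;  /* full numeric comparison */
}
```

The fast-reject loop is what makes the hardware cheap: almost every candidate hash fails within the first byte, so the wide 32-byte compare is exercised only rarely.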
With the results of our prototypes in hand, we settled on a version of the third core developed by the Bitcoin open-source community specifically for the Xilinx FPGAs.
ISE DEVELOPMENT

Bitcoin mining is a race to find a number between zero and 2^32 - 1 that gives you a solution. Thus, there are really only two ways to improve mining-core performance: faster processing or divide-and-conquer. Using a Spartan®-3 and a Spartan-6 development board, we tested a series of different frequencies, pipelining techniques and parallelization.
For our frequency tests, we started with the Spartan-3E development board. However, it quickly became apparent that nothing could be done beyond 50 MHz and as such, we left the frequency tests to the Spartan-6.
We tested 50, 100 and 150 MHz on the Spartan-6 development board, with predictable results: 0.8, 1.6 and 2.4 megahashes per second (MHps), respectively. We also tried 75 MHz and 125 MHz during the parallelization and pipelining tests, but these frequencies were only a compromise to make the miner fit onto the Spartan-6.
One of the reasons we chose the mining core from the open-source community was that it was designed with a depth variable to control the number of pipelined stages. The depth is an exponential value between 0 and 6, which controls the number of power-of-two iterations the VHDL "generate" command executes, namely 2^0 to 2^6. We performed the initial frequency tests with a depth of 0, that is, only a single round implemented for each process.
On the Spartan-3E we tested depths of 0 and 1 before running into timing-constraint issues. The depth of 0 at 50 MHz gave us roughly 0.8 MHps, while the depth of 1 achieved roughly 1.6 MHps.
On the Spartan-6 we were able to reach a depth of 3 before running into timing-constraint issues. At 50 MHz, the Spartan-6 achieved the same results as the Spartan-3 at a depth of 0 and 1. We noticed an interesting trend at this point: doubling the frequency had the same effect as increasing the depth of the pipeline. The upper limit was set by the availability of routing resources, with our peak averaging out at roughly 3.8 MHps, as seen in Figure 4.

Figure 4 – The hash rate of the mining core over a four-day period of core-optimization testing. The graph is an average over time, giving a general idea of performance of the mining core. The results decreased due to hitting the routing limits of the test platforms.
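The reported numbers fit a simple throughput model, which we infer from the figures rather than take from the design files: the core retires one round per clock, a depth of d implements 2^d rounds, and a full hash costs 64 rounds, so 50 MHz at depth 0 gives 50e6 / 64 ≈ 0.78 MHps, rounding to the 0.8 MHps observed:

```c
/* Back-of-envelope hash-rate model (an assumption inferred from the
   article's measurements, not a formula from the design): hash rate in
   MHps = clock * 2^depth rounds per cycle / 64 rounds per hash. */
double est_mhps(double freq_hz, int depth)
{
    return freq_hz * (double)(1u << depth) / 64.0 / 1e6;
}
```

The model also explains the observed equivalence: doubling the clock and incrementing the depth each double the estimate, which is exactly the trend noted above.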
The SHA-256 Process module has 10 different 32-bit additions. In this implementation, we are attempting to do all of them in a single clock cycle. To shorten the longest path, we tried to divide the adder chains into a series of stages. However, doing so radically changed the control logic for the whole mining core and required a complete rewrite. We dropped these changes to save time and effort.
The final performance improvement we tested was parallelization. With only minor modifications, we doubled and quadrupled the number of SHA-256 sets. A set consists of two SHA-256 Process modules. This modification effectively cut the potential number range for each set into halves, for the doubled number of SHA-256 sets, or fourths for the quadrupled number. To fit these extra SHA processes onto the Spartan-6, we had to reduce the system frequency. Four sets of the SHA-256 Process modules would run at 75 MHz, while two sets would run at 125 MHz. The hash rate improvements were difficult to document. We could easily see the hash rate of an individual SHA-256 set; however, a multiple-set mining core could find a solution faster than the hash rate implied.
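The range splitting works out as follows; the struct and function names are ours, since the actual cores carve up the nonce space in hardware:

```c
#include <stdint.h>

/* Divide-and-conquer: carve the 2^32 nonce space into equal slices,
   one per SHA-256 set. */
typedef struct { uint32_t first, last; } nonce_range;

nonce_range set_range(unsigned set, unsigned num_sets)
{
    uint64_t span = (1ull << 32) / num_sets; /* nonces per set */
    nonce_range r;
    r.first = (uint32_t)(set * span);
    r.last  = (uint32_t)((set + 1) * span - 1);
    return r;
}
```

With two sets each covers half the range, with four sets a quarter, which is why a multi-set core can stumble on a solution sooner than its per-set hash rate alone would suggest.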
USING THE EDK

After FPGA testing, our next step was to tie the mining core into the Zynq SoC's AXI4 bus. Xilinx's Embedded Development Kit (EDK) comes loaded with a configuration utility specifically designed for the Zynq SoC, allowing us to easily configure every aspect. By default, the system has the 512 Mbytes of DDR3, the Ethernet, the USB and the SD interface enabled, which is everything we needed for our Bitcoin SoC.
The twin Cortex-A9 processors feature the AXI4 interface instead of the PLB system used by previous soft-core systems. All of the peripherals are attached to the processors through the AXI4 interface, as shown in Figure 5.
The EDK custom peripheral wizard provides code stubs for the different variants of the AXI4 interface and serves as the basis for the development of the mining core's AXI interface. For simplicity, we used the AXI4-Lite interface, which provides simple read and write functionality to the mining core. Ideally, developers would want to use the regular AXI4 interface to take advantage of the advanced interface controls, such as data bursts.
Keeping with the goal of simplicity, we used three memory-mapped registers to handle the mining-core I/O. The first register feeds the host data packet to the miner. This register keeps track of the amount of data passed and locks itself after 11 words have been transferred. This lock is removed automatically when a solution is found. We can remove the lock manually if necessary through the status register.
We defined the second register as the status register, flagging certain bits for the various states the mining core goes through. With the simplicity of our design, we used only three flags: a loaded flag, a running flag and a solution-found flag. The loaded flag is triggered when the mining core receives 11 words, as mentioned before, and is cleared when we write to it. The start/running flag is set by us to start the mining core, and is cleared by the mining core when a solution is found. The final register is the output register, which is loaded with the solution when one is found.

Figure 5 – The relatively new AXI4 interface connects the ZedBoard's hard peripherals with the ARM processors and Artix FPGA logic.

We tested each stage of the AXI4-Lite interface development before adding the next component. We also tested the AXI4-Lite interface in Xilinx's Software Development Kit using a bare-bones system. The test served two purposes: first, to confirm that the mining core's AXI4-Lite interface was working correctly and second, to ensure that the mining core was receiving the data in the correct endian format.

EMBEDDED LINUX

With the mining core tied into the processors, we moved to software development. Ideally we wanted to build our own firmware, potentially with a Linux core, to maximize performance. For ease of development, we leveraged a full Linux installation. We used Xillinux, developed by Xillybus, an Ubuntu derivative for rapid embedded-system development.

Our first order of business was to compile Bitcoind for the Cortex-A9 architecture. We used the original open-source Bitcoin software to ensure compatibility with the Bitcoin Network.

Testing Bitcoind was a simple matter. We ran the daemon, waited for the sizable block chain to download and then used the command line to start Bitcoind's built-in CPU mining software.

For the Linux driver, you must implement a handful of specific functions and link them to the Linux kernel for correct operation. When the driver is loaded, a specific initialization function runs, preparing the system for interaction with the hardware. In this function, we first confirm that the memory address to which the mining core is tied is available for use, then reserve the address. Unlike a bare-bones system, Linux uses a virtual memory address scheme, which means that before we can use the reserved address we have to request an address remap for the mining core. The kernel gives us a virtual address through the remap, which we can then use to communicate with the mining-core registers.

Now that we have access to the hardware, the initialization function then runs a simple test on the mining core to confirm that the core is operating properly. If it is, we register the major and minor numbers (which are used to identify the device in user programs) with the kernel. Every function has a counter, and the exit function is the counter to the initialization function. In this function, we undo everything done during initialization: specifically, release the major and minor numbers used by the driver, then release the mapped virtual address.

When we first call the mining core from the relay program, the open function runs. All we do here is confirm that the self-test run during the initialization function was successful. If the self-test fails, then we throw a system-error message and fail out. The close function is called when the device is released from the relay program. However, since we only check the self-test results in the open function, there is nothing to do in the close function. The read function looks at the data buffer and determines which port the user is reading, and it then gets the data from the mining core and passes it back. The write function determines which register the user is writing to and passes the data to the mining core.

The final component of our system is a small relay program to pass work from Bitcoind to the mining core through the driver, and return results. As a matter of course, the relay program would check to see that Bitcoind was running and that the mining core was ready and operating properly. Since the relay program would normally sit idle for the majority of the time, we will implement a statistics-compilation subsystem to gather data and organize it in a log file. Ideally, we would use a Web host interface for configuration, statistical display and device status.
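The three-register protocol between the relay and the core can be sketched in C; the register offsets and flag bit positions below are illustrative assumptions, not the actual register map of the design:

```c
#include <stdint.h>

/* Assumed word offsets of the three memory-mapped registers. */
enum { REG_DATA = 0, REG_STATUS = 1, REG_RESULT = 2 };

/* Assumed flag bits of the status register. */
enum { FLAG_LOADED  = 1u << 0,
       FLAG_RUNNING = 1u << 1,
       FLAG_SOLVED  = 1u << 2 };

/* Feed the 11-word work packet; the data register locks itself after
   word 11 and the core raises the loaded flag. */
void miner_load(volatile uint32_t *regs, const uint32_t work[11])
{
    for (int i = 0; i < 11; i++)
        regs[REG_DATA] = work[i];
}

/* Poll the status register; returns 1 and copies out the solution
   once the solution-found flag is set. */
int miner_poll(volatile uint32_t *regs, uint32_t *solution)
{
    if (!(regs[REG_STATUS] & FLAG_SOLVED))
        return 0;
    *solution = regs[REG_RESULT];
    return 1;
}
```

In the real system, `regs` would be the virtual base address the kernel hands back from the remap; here the `volatile` qualifier stands in for the uncached AXI4-Lite access.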
COMPLETE, EFFICIENT MINING SYSTEM

Our solution for an efficient and complete Bitcoin mining system is the Xilinx Zynq-7000 All Programmable SoC on the ZedBoard. This board is flexible to changes in the Bitcoin protocol and also offers a high-performance FPGA solution along with SoC capability. To improve on our current design without jeopardizing the adaptability, we could add more mining cores that connect to the main Bitcoin node. These additions could boost the performance significantly and provide a higher overall hash rate.
Another improvement that could further optimize this design is to have dedicated firmware. Our current design is running Ubuntu Linux 12.04, which has many unnecessary processes running concurrently with the Bitcoin program, like SSH. It is a waste of the board's resources to run these processes while Bitcoin is running. In a future build, we would eliminate these processes by running our own custom firmware dedicated to only Bitcoin tasks.
XCELLENCE IN AEROSPACE & DEFENSE
Ensuring FPGA Reconfiguration in Space

by Robért Glein
System Engineer
Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), Information Technology (Communication Electronics)
[email protected]

Florian Rittner
Firmware Designer
Fraunhofer IIS, RF and Microwave Design Department

Alexander Hofmann
Telecommunications Engineer
Fraunhofer IIS, RF and Microwave Design Department
In 2007, the German Federal Ministry of Economics and Technology commissioned the German Aerospace Center to perform a feasibility study of an experimental satellite to explore communications methods. Government as well as academic researchers will use the satellite, called the Heinrich Hertz communications satellite, to perform experiments in the Ka and K band. It is scheduled for launch in 2017. A key element of this system is the Fraunhofer-designed onboard processor, which we implemented on Xilinx® Virtex®-5QV FPGAs.

The Fraunhofer onboard processor, which is involved in the in-orbit verification of modules in space, is the essential module of the satellite's regenerative transponder. Because the Heinrich Hertz satellite will be used for a broad number of communications experiments, we chose to leverage the FPGAs' capability to be reconfigured for various tasks. This saves space, reduces weight and enables the use of future communications protocols. Figure 1 illustrates this onboard processor and other parts of the scientific in-orbit verification payload of the satellite.
But because space is perhaps the harshest environment for electronics, we had to take a few key steps to ensure the system could be reconfigured reliably. The radiation-hardened Xilinx FPGAs offer a perfect solution for reconfigurable space applications regarding total ionizing dose, single-event effects, vibration and thermal cycles. The reconfiguration of the digital hardware of these FPGAs is an essential function in this geostationary satellite mission in order to ensure system flexibility. To achieve a robust reconfiguration without a single point of failure, we devised a two-pronged methodology that employs two configuration methods in parallel. Besides a state-of-the-art (SOTA) method, we also introduced a fail-safe reconfiguration method via partial reconfiguration, which enables a self-reconfiguration of the FPGAs.
A straightforward way to configure FPGAs is to use an external radiation-hardened configuration processor with a rad-hard mass memory (for example, flash). In this case, the (partial) bit file is stored in the memory and the processor configures the FPGA with the bitstream. This approach offers good flexibility if your design requires a full FPGA reconfiguration and has many bit files onboard. You can update or verify bit files via a low-data-rate telemetry/telecommand channel or a digital communication up- or downlink with a higher data rate. This SOTA configuration method is nonredundant. Let us assume a failure of the processor or of the configuration interface, as shown in Figure 2. Now it is impossible to reconfigure the FPGA and all Virtex-5QV devices are useless.
OVERALL CONFIGURATION CONCEPT FOR THE FPGA

Figure 2 illustrates both configuration methods for one FPGA: the SOTA method and the fail-safe configuration (also the initial configuration) method, which only needs one radiation-hardened nonvolatile memory (PROM or magnetoresistive RAM) as an external component. This component stores the initial firmware, recognizable as content of the FPGA. We use these two configuration methods in parallel in order to benefit from the advantages of both. With this approach we also increase the reconfiguration reliability and overcome the single-point-of-failure problem.
DETAILS OF FAIL-SAFE CONFIGURATION METHOD

We designed this fail-safe initial configuration method in order to enable self-reconfiguration of the FPGA. The FPGA configures itself with the initial configuration at startup of the onboard processor. We use the serial master configuration protocol for the PROM or the Byte Peripheral Interface for the magnetoresistive RAM (MRAM). Either way, only one nonvolatile memory per FPGA is required for this fail-safe configuration method.
We separated the firmware of the initial configuration into dynamic (blue) and static (light-red and red) areas, as shown in Figure 2. With partial reconfiguration (PR), it is possible to reconfigure dynamic areas while static and other dynamic areas are still running.
A fail-safe method of reconfiguring rad-hard Xilinx FPGAs improves reliability in remote space applications.

After a power-up sequence, the FPGA automatically configures itself with the initial firmware and is ready to receive new partial or even complete (only by MRAM) bit files. A transmitter on Earth packs these bit files into a data stream, encrypts, encodes and modulates it, and sends it over the uplink to the satellite. After signal conditioning and digitizing, the first reconfigurable partition (RP0) inside the FPGA receives the digital data stream. RP0 performs digital demodulation and decoding. The Block RAM (BRAM) shares the packed bit file with the MicroBlaze™ system. The MicroBlaze unpacks the bit file and stores it in the SRAM or SDRAM. The MicroBlaze then reconfigures a dynamic reconfigurable partition (RP), or all RPs, with direct-memory access (DMA) via the Internal Configuration Access Port (ICAP). The static part of the FPGA should not be reconfigured.
In case of an overwritten RP0, a high-priority command over the satellite control bus could reboot the onboard processor. This command would restore the initial fail-safe configuration.
There are plenty of new possible applications for reconfigurable partitions, including digital transmitters, digital receivers and replacement of a highly reliable communications protocol, among others. You can also reconfigure and switch to new submodules like demappers or decoders for adaptive coding and modulation.
You can even replace the whole fail-safe configuration bit file with a new double-checked initial bit file, using the MRAM as storage for the initial configuration. With a digital transmitter, the onboard processor can also send the configuration data back to Earth in order to verify bit files on the ground.
CONFIGURATION RELIABILITY

A major benefit of using both configuration methods (SOTA and initial) is an increase in the overall system reliability. There is a drawback in the initial fail-safe method regarding the usable FPGA resources. In a geosynchronous-orbit mission with a term of 15 years, the reliability of the most important procedures, such as configuration, depends on the failure-in-time (FIT) rates of both configuration methods. We assume a typical FIT rate of 500 in a representative configuration processor and its peripherals for the SOTA configuration method. The PROM achieves a much better FIT rate of 61 (part-stress method for failure-rate prediction at operating conditions regarding MIL-HDBK-217).

Nevertheless, both methods are single points of failure. Only the parallel operation guarantees a significant increase in the reliability of the configuration procedure. Calculating the standalone reliability of the SOTA configuration method results in Rsc = 0.9364, and of the initial method in Ric = 0.9920, with the given mission parameters. The overall configuration reliability for systems in parallel is Rc = 0.9929.

We have to compare Rc and Rsc in order to obtain the gain of the reconfiguration reliability. The achieved gain (Rc - Rsc) is 5.65 percent. The initial configuration has some restrictions in terms of FPGA resources. Nevertheless, you can still reconfigure the whole FPGA anytime with the more complex SOTA configuration method if the external configuration processor is still operating.
Both methods can correct failures of the FPGA caused by configuration single-event upsets. For the SOTA configuration method, the external configuration processor performs a scrubbing, as does the self-scrubbed MicroBlaze for the initial configuration method.
Figure 1 – Part of the scientific in-orbit verification payload of the Heinrich Hertz satellite; the gold box in the middle, labeled FOBP, is the Fraunhofer onboard processor. LTWTA stands for linear traveling-wave tube amplifier.
IMPLEMENTING FAIL-SAFE CONFIGURATION

Figure 2 illustrates the detailed architecture of one of the onboard processor's FPGAs. The FPGA is divided into three parts: MicroBlaze, static intellectual-property (IP) cores and RPs. An issue of the MicroBlaze system is memory management. Alongside the PROM or MRAM, we attached an SRAM and an SDRAM. These two different memory types enable applications like temporary packet buffers and content storage (for example, video files). The memory multiplexer block (Mem Mux) enables higher flexibility. It multiplexes the memory to the RPs or to the MicroBlaze region. Semaphores control the multiplexer via a general-purpose input/output (GPIO) connection, so only one region can use the single-port memory at any given time. The initial configuration also contains internal memory with BRAM, presenting an opportunity for a fast data exchange between the two regions.
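The ownership rule the Mem Mux semaphores enforce can be expressed as a small behavioral sketch; the types and function names are ours, and the real arbitration of course runs in fabric over GPIO rather than in software:

```c
/* Single-port memory arbitration: the port has at most one owner at a
   time, either the MicroBlaze region or a reconfigurable partition. */
typedef enum { OWNER_NONE, OWNER_MICROBLAZE, OWNER_RP } mem_owner;

typedef struct { mem_owner owner; } mem_mux;

/* Try to take the port; fails while the other region holds it. */
int mem_acquire(mem_mux *m, mem_owner who)
{
    if (m->owner != OWNER_NONE)
        return 0;      /* busy: request denied */
    m->owner = who;
    return 1;
}

void mem_release(mem_mux *m)
{
    m->owner = OWNER_NONE;
}
```

The BRAM mentioned above is the escape hatch from this serialization: being dual-ported internal memory, it lets both regions exchange data without waiting for the single-port SRAM or SDRAM.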
The build wrapper in Xilinx Platform Studio (XPS) enables a universal integration of the MicroBlaze system with support and connection of independent ports.
We implemented IP cores such as BRAM, RocketIO™ and the digital clock manager (DCM). The RP switching block connects the RPs by means of settings in the RP switching configuration, which flexibly interconnects the RPs. A data flow control (DFC) manages the data flow among the RPs. The FPGAs communicate with each other by means of RocketIO and low-voltage differential signaling (LVDS).
We divided the RPs into two partitions with 20 percent of the FPGA's configurable logic block (CLB) resources and two partitions with 10 percent of the CLB resources. We also tried to share the DSP48 slices and BRAM equally and maximally among all partitions.
The main part of the implementation is the system design and floorplanning, which we realized with the PlanAhead tool from Xilinx. We combined the MicroBlaze system, the IP cores, the software design and the partial user hardware description language (HDL) modules, called reconfigurable modules. We also defined the location of RPs, planned design runs, chose a synthesis strategy and constrained the design in this step. A BRAM memory map file defines the BRAM resources for data and instruction memory. This enables an integration of the initial MicroBlaze software in the bit file.
Figure 3 shows the implementation result.

Figure 2 – Architecture of the first FPGA in the initial configuration system. The static sections are in red and light red, and the dynamic portions in light blue (left). The yellow lightning bolt represents a SOTA configuration failure.

Currently we are developing a highly reliable receiver for the first partition (RP0), which will be used in the initial configuration (on the satellite) to receive (demodulating, demapping, deinterleaving and decoding) the configuration data.
Table 1 illustrates target and actual sizes of the RPs. We achieved a slightly higher CLB count than we targeted, whereby the DSP slices nearly reach the maximum. The BRAM resources correspond to the CLBs because of the MicroBlaze's demand for BRAMs. The overhead of the static area is approximately 30 percent. We designed this firmware for the commercial Virtex-5 FX130 in order to prove the concept with an elegant breadboard. At this point, we are about to port this design to the Virtex-5QV and implement further features, such as triple-modular redundancy or error-correction codes.
PROBLEM SOLVED

A single point of failure regarding FPGA configuration is a problem in highly reliable applications. You can overcome this issue with an initial configuration, which needs only one additional device per FPGA. Partial reconfiguration enables remote FPGA configuration via this fail-safe initial configuration. The parallel use of both configurations combines their main advantages: the increase of 5.65 percent in configuration reliability and the flexibility to reconfigure the complete FPGA.
System design and implementation challenges require a trade-off between complexity regarding the routing or clock resources and the usable design resources. An elegant breadboard shows the successful implementation and verification in a real-world scenario.
Figure 3 – PlanAhead view of the implemented design. The blue rectangles represent the RP regions. RP0 is seen as purple, the MicroBlaze as yellow, BRAM as green and other static regions as red.

Table 1 – FPGA resources of the reconfigurable partitions (RPx)

                      CLBs                DSP Slices          BRAM
                      Target    Result    Target    Result    Target    Result
20% RP (each of 2)    4,096     4,400     max.      80        max.      60
                      20%       ≈ 21%     max.      ≈ 25%     max.      ≈ 20%
10% RP (each of 2)    2,048     2,720     max.      76        max.      38
                      10%       ≈ 13%     max.      ≈ 24%     max.      ≈ 13%
Sum of all RPs        12,288    14,240    max.      312       max.      196
                      60%       ≈ 70%     max.      ≈ 98%     max.      ≈ 66%
It's easy to build an attractive demonstration platform for students and beginners around a Xilinx Spartan-6 device and a few peripheral components.
XRadio: A Rockin' FPGA-only Radio for Teaching System Design
Recently here at the Beijing Institute of Technology, we set out to develop a digital design educational platform that demonstrated the practical use of FPGAs in communications and signal processing. The platform needed to be intuitive and easy to use, to help students understand all aspects of digital design. It also had to be easy for students to customize for their own unique system designs.
Around that time, we in the electrical engineering department had been having a lively debate as to whether it was possible to use an FPGA's I/O pins as a comparator or a 1-bit A/D converter directly. We decided to test that premise by trying to incorporate this FPGA-based comparator in the design of our XRadio platform, an entirely digital FM radio receiver that we designed using a low-cost Xilinx® Spartan®-6 FPGA and a few general peripheral components. We demonstrated the working radio to Xilinx University Program (XUP) director Patrick Lysaght during his visit last year to the Beijing Institute of Technology. He was excited by the simplicity of the design, and encouraged us to share the story of the XRadio with the academic community worldwide.
SYSTEM ARCHITECTURE

We built the XRadio platform almost entirely in the FPGA and omitted the use of traditional analog components such as amplifiers or discrete filters, as shown in Figure 1. As a first step, we created a basic antenna by linking a simple coupling circuit with a wire connected to the FPGA's I/O pins. This antenna transmits RF signals to the FPGA, which implements the signal processing of the FM receiver as digital downconversion and frequency demodulation. We then output the audio signal through the I/O pins to the headphones. We added mechanical rotary incremental encoders to control the XRadio's tuning and volume. We designed the system so that the tuning and volume information is displayed on the seven-segment LED module.
Figure 2 shows a top-level logic diagram of the FPGA. In this design, the RF signals coupled into the input buffer of the FPGA are quantified as a 1-bit digital signal. The quantified signal is then multiplied by the local oscillation signal generated by a numerically controlled oscillator (NCO). The multiplied signal is filtered to obtain an orthogonal IQ (in-phase plus quadrature-phase) baseband signal. It then obtains the audio data stream from the IQ signal through the frequency demodulator and low-pass filter.

The data stream next goes to the pulse-width modulation (PWM) module, which generates pulses on its output. By filtering the pulses, we obtain a proportional analog value to drive the headphones. We connected a controller module to the mechanical rotary incremental encoders and LEDs. This module gets a pulse signal from the incremental encoders to adjust the output frequency of the NCO and the audio volume controlled by the PWM module.
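The NCO at the heart of this chain is a phase accumulator: each clock it adds a tuning word to a running phase register whose top bits address a sine/cosine lookup table. A minimal C sketch of the idea follows; the 32-bit accumulator width and 8-bit table address here are our illustrative choices, not the parameters of the actual DDS core:

```c
#include <stdint.h>

/* One NCO clock tick: advance the phase accumulator and return the
   lookup-table address taken from its top 8 bits. */
uint32_t nco_step(uint32_t *phase, uint32_t tuning_word)
{
    *phase += tuning_word;   /* wraps naturally modulo 2^32 */
    return *phase >> 24;     /* 8-bit sine-table address */
}

/* Tuning word for a desired output frequency at a given clock:
   M = f_out / f_clk * 2^32. */
uint32_t nco_tuning_word(double f_out, double f_clk)
{
    return (uint32_t)(f_out / f_clk * 4294967296.0);
}
```

With the XRadio's 240-MHz system clock, tuning across the 87-to-108-MHz FM band amounts to nothing more than the controller module loading a new tuning word.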
IMPLEMENTATION DETAILS

The first challenge we had to overcome was how to couple the signal from the antenna to the FPGA. In our first experimental design, we configured an FPGA I/O as a standard single-ended I/O; then we used resistors R1 and R2 to build a voltage divider to generate a bias voltage between VIH and VIL at the FPGA pin. The signal that the antenna receives can drive the input buffer through the coupling capacitor C1. The signal is sampled by two levels of D flip-flops, which are driven by an internal 240-MHz clock. The output of the flip-flops yields an equal-interval 1-bit sampling data stream.
To test this circuit, we fed theresult to another FPGA pin and meas-ured it with a spectrum analyzer tosee if the FPGA received the signalaccurately. However, it didn’t workwell because the Spartan-6 FPGA’sinput buffer has a small, 120-millivolthysteresis, according to the analyzer.Although the hysteresis is good for
30 Xcell Journal Third Quarter 2013
XPERIMENT
Figure 1 – The XRadio hardware concept
Figure 2 – Top-level diagram of the XRadio FPGA
preventing noise generally, in this application it wasn’t wanted. We had to devise a way to increase the strength of the signal.
To solve this problem, we found that the differential input buffer primitive, IBUFDS, has a very high sensitivity between its positive and negative terminals. According to our test, a differential voltage as low as
1 mV peak-to-peak is enough to make IBUFDS swing between 0 and 1. Figure 3 shows the resulting input circuit. In this implementation, the resistors R1, R2 and R3 generate a common voltage in both the P and N terminals of IBUFDS. The received signals feed into the P-side terminal through coupling capacitor C1. The AC signal
is filtered out by the capacitor C2 on the N side, which acts as an AC reference. With this circuit, the FPGA converts the FM broadcast signals into a 1-bit data stream successfully.
The radio picks up about seven FM channels with different amplitudes—including 103.9-MHz Beijing Traffic Radio. The design significantly reduces the signal-to-noise ratio because of the quantizing noise generated by 1-bit sampling. But there are four channels that are still distinguishable against the noise floor. Therefore, we were able to prove our theory that an FPGA’s I/O pins can be used effectively as a comparator or a 1-bit A/D converter in our XRadio.
For digital downconversion, we used the DDS Compiler 4.0 IP core to build a numerically controlled oscillator with a 240-MHz system clock. The sine and cosine output frequency ranged from 87 MHz to 108 MHz. The local-oscillator signals that the NCO generates are multiplied with the 1-bit sampling stream and travel through a
Figure 3 – The antenna input method, making use of the differential input buffer primitive (IBUFDS)
Figure 4 – Frequency demodulator, which converts the frequency of the IQ baseband signal to our audio signal
low-pass filter to obtain the orthogonal baseband signal. Here, we used the CIC Compiler 2.0 IP core to build a third-order, low-pass CIC decimation filter with a downsampling factor R = 240. The sample rate drops to 1 MSPS because the filtered baseband orthogonal signal is a narrowband signal.
Figure 4 shows the frequency demodulator, which converts the frequency of the IQ baseband signal to our audio signal. We extracted the IQ data’s instantaneous phase angle using a lookup table in ROM. The IQ data is merged and is used as an address of the ROM; then the ROM outputs both the real and the imaginary part of the corresponding complex angle. Next we have a difference operation in which one tap by a
Table 1 – Resource utilization on the XC6SLX9 FPGA
Figure 5 – Co-author and grad student Xingran Yang shows off the working model.
flip-flop delays the phase-angle data. When the delayed data is subtracted from the original data, the result is exactly the wanted audio signal. To improve the output signal-to-noise ratio, we filtered the audio signal through a low-pass filter using simple averaging.
The frequency demodulator’s 8-bit audio data stream output is scaled according to the volume control parameter and sent into an 8-bit PWM module. The duty cycle of PWM pulses represents the amplitude of the audio signal. The pulses output at the FPGA’s I/O pins and drive the headphone through capacitance. Here, the headphone acts as a low-pass filter, eliminating the high-frequency part of the pulses remaining in the audio signal.
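The duty-cycle relationship reduces to a comparison against a free-running counter. A behavioral sketch of such an 8-bit PWM (our model, not the article's HDL):

```c
#include <assert.h>
#include <stdint.h>

/* 8-bit PWM model: the output is high while a free-running counter is
   below the audio sample, so the average level (after the headphone's
   low-pass behavior) is proportional to sample/256. */
int pwm_out(uint8_t counter, uint8_t sample)
{
    return counter < sample;
}

/* High time accumulated over one full 256-tick PWM period. */
int pwm_high_ticks(uint8_t sample)
{
    int high = 0;
    for (int c = 0; c < 256; c++)
        high += pwm_out((uint8_t)c, sample);
    return high;
}
```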
Two rotary incremental encoders control the frequency and volume of the radio. Each of the encoders outputs two pulse signals. The rotational direction and rotational speed can be determined by the pulse width and phase. State machines and counters translate the rotation states into a frequency control word and a volume control word. Meanwhile, the frequency and volume values are decoded and then displayed on the seven-segment LEDs.
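The direction-detection state machine reduces to a small transition table. A sketch in C (the hardware version would clock the same logic on the sampled A/B inputs; the sign convention is an assumption, since the article does not specify the encoder's phasing):

```c
#include <assert.h>

/* Quadrature decoding for one rotary encoder: prev and curr are 2-bit
   (A << 1) | B states. The Gray sequence 00 -> 01 -> 11 -> 10 -> 00 is
   taken as "clockwise" here. */
int quad_step(int prev, int curr)
{
    static const int cw_next[4] = {1, 3, 0, 2}; /* next state going CW */
    if (curr == cw_next[prev]) return +1;  /* one step clockwise */
    if (prev == cw_next[curr]) return -1;  /* one step counterclockwise */
    return 0;  /* no movement, or an invalid (bounce) transition */
}
```

A counter fed by this function directly yields the frequency or volume control word the text describes.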
ROOM TO CUSTOMIZE
The printed-circuit board for the XRadio platform is shown in Figure 5. With a Spartan-6 XC6SLX9 FPGA, you can implement a radio program that consumes only a small part of the FPGA’s logic resources, as shown in Table 1. This means there is room for students to customize the design and make improvements. For example, today with the 1-bit sampling, XRadio’s audio quality is not as good as a standard FM radio. However, this leaves room for students to figure out how to improve the XRadio’s audio quality and how to implement it in stereo.
FPGAs are widely used as prototyping or actual SoC implementation vehicles for signal-processing applications. Their massive parallelism and abundant heterogeneous blocks of on-chip memory along with DSP building blocks allow efficient implementations that often rival standard microprocessors, DSPs and GPUs. With the emergence of the Xilinx® 28-nanometer Zynq®-7000 All Programmable SoC, equipped with a hard ARM® processor alongside the programmable logic, Xilinx devices have become even more appealing for embedded systems.
Despite the devices’ attractive properties, some designers avoid FPGAs because of the complexity of programming them—an esoteric job involving hardware description languages such as VHDL and Verilog. But the advent of practical electronic system-level (ESL) design promises to ease the task and put FPGA programming within the grasp of a broader range of design engineers.
ESL tools, which raise design abstraction to levels above the mainstream register transfer level (RTL), have a long history in the industry. After numerous false starts, these high-level design tools are finally gaining traction in real-world applications. Xilinx has embraced the ESL methodology to solve the programmability issue of its FPGAs in its latest design tool offering, the Vivado® Design Suite. Vivado HLS, the suite’s high-level synthesis tool, automates the transformation of a design from C, C++ or SystemC to HDL.
The availability of Vivado HLS and the Zynq SoC motivated our team to do a cross-analysis of an ESL approach vs. hand-coding as part of our ongoing research into IP-based image-processing solutions, undertaken in a collaborative industrial-academic project. We hoped that our work would provide an unbiased, independent case study of interest to other design teams, since the tools are still new and not much literature is available on the general user experience. BDTI experiments [2, 3] provide a nice overview of the quality and potential of ESL tools for FPGAs at a higher level, but the BDTI team rewrote the source application in an optimal manner for implementation with ESL tools. Our work, by contrast, used the original legacy source code of the algorithms, which was written for the PC.
XPERTS CORNER
Automated methodology delivers results similar to hand-coded RTL for two image-processing IP cores
In comparing hand-coded blocks of IP that were previously validated on a Xilinx competitor’s 40-nm midrange products to ESL programming on a Zynq SoC, we found that the ESL results rivaled the manual implementations in multiple aspects, with one notable exception: latency. Moreover, the ESL approach dramatically slashed the development time.
Key objectives of our work were:
• Assess the effort/quality of porting existing (C/C++) code to hardware with no or minimal modification;
• Do a detailed design space exploration (ESL vs. hand-coded IP);
• Investigate productivity trade-offs (efficiency vs. design time);
• Examine portability issues (migration to other FPGAs or ASICs);
• Gauge the ease of use of ESL tools for hardware, software and system designers.
VIVADO-BASED METHODOLOGY
We used Xilinx Vivado HLS 2012.4 (ISE® 14.4) design tools and the Zynq SoC-based ZedBoard (XC7Z020-CLG484-1) for prototyping of the presented experiments. We chose two IP cores out of
10 from our ongoing Image Processing FPGA-SoC collaborative project for scanning applications. The chosen pieces of IP were among the most complex in the project. Optimally designed in Verilog and validated on the Xilinx competitor’s 40-nm product, they perform specific image-processing functions for scanning products of the industrial partner, Sagemcom.
Figure 1 outlines the experimentation flow using Xilinx tools. For purposes of our experiments, we began with the original C source code of the algorithms. It is worth noting that the code was written for the PC in a classical way and is not optimal for efficient porting to an embedded system.
Our objective was to make minimal—ideally, zero—modifications to the code. While Vivado HLS supports ANSI C/C++, we did have to make some basic tweaks for issues like dynamic memory allocation and complex structure pointers. We accomplished this task without touching the core of the coding, as recommended by Xilinx documentation. We created the AXI4 master and slave (AXI-Lite) interfaces on function arguments and processed the code within Vivado HLS with varying constraints. We specified these constraints using directives to create different solutions that we validated on the Zynq
Z20 SoC using the ZedBoard. The single constraint we had to make for comparison’s sake involved latency. We had to assume the Zynq SoC’s AXI4 interface would have comparable latency at the same frequency as the bus of the competitor’s FPGA in the hand-coded design. Porting the RTL code (the core of the IP) to the Xilinx environment was straightforward, but redesigning the interface was beyond the scope of our work.
We used the Vivado HLS tool’s built-in SystemC co-simulation feature to verify the generated hardware before prototyping on the board. The fact that Vivado HLS also generates software drivers (access functions) for the IP further accelerated the time it took to validate and debug the IP.
We used Core-0 of the Zynq device’s ARM Cortex™-A9 MPCore at 666.7 MHz, with a DDR3 interface at 533.3 MHz. An AXI4 timer took care of latency measurements. Figure 2 presents the validation flow we used on the FPGA prototype. The image is processed with the original C source on ARM to get the “golden” reference results. We then processed the same image with the IP and compared the two outputs to validate the results.
As a standalone project, we also evaluated the latency measurements of the two blocks of IP for a 100-MHz MicroBlaze™ processor with 8+8 KB I+D cache, and hardware multiplier and divider units connected directly with DDR3 via the AXI4 bus. We used the latency measurements of the two IP cores on the MicroBlaze as a reference to compare the speedup achieved by the Vivado HLS-generated IP and the ARM processor, which has brought significant software compute power to FPGA designers.
EXPERIMENTAL RESULTS
Let’s now take a closer look at the design space explorations of the two pieces of IP, which we have labeled IP1 and IP2, using the HLS tools. We’ve outlined the IP1 experiment in
Figure 1 – The test hardware system generation for the Zynq Z20 SoC
detail, but to avoid redundancy, for IP2 we’ve just shown the results. Finally, we will also discuss some feedback we received from the experts at Xilinx to our query regarding the efficiency of the AXI4 “burst” implementation used in the Xilinx ESL tools.
Tables 1 and 2 outline the details of our design space explorations for IP1 in the form of steps (we have listed them as S1, S2, etc.). For our experiments, we used a gray-scale test image of 256 x 256 pixels. It is important to note that the clock period, latency and power figures are just reference values given by the tool after quick synthesis for the cross-analysis of multiple implementations. The actual values can vary significantly after final implementation, as you will see in the discussion of the FPGA prototype results (Table 2).
S1: RAW CODE TRANSFORMATION
In the first step, we compiled the code with no optimization constraints (except the obligatory modifications described earlier). As you can see from Table 1, the hardware is using an enormous amount of resources. When we analyzed the details of resource distribution from the Vivado HLS quick-synthesis report and design-viewer GUI tool, it was clear that the primary contributor to this massive resource usage was several division operations in the algorithm. Unlike the multipliers, there are no hard
blocks for division in an FPGA, so the division operations consumed a huge amount of logic. Therefore, the first thing we had to do was to optimize this issue.
S2: S1 WITH SHARED DIVIDER
Intuitively, our first thought for how to decrease the division hardware was to make use of resource sharing, with its trade-off of performance loss. Unfortunately, in the version of the tools we used it is not easy or even possible to apply custom constraints on operators directly. But the workaround was quite simple. We replaced the division operation (/) by a function, and on functions, constraints can be applied. By default, the tool uses functions as inline: For each division, it generates complete division hardware. We applied the “Allocation” directive on the division function, experimented with different values and monitored their effect on latency. Finally, we opted for single, shared division hardware. As you can see in Table 1, this choice drastically reduced lookup tables (LUTs), flip-flops (FFs) and other logic resources with just a negligible impact on latency—a complete win-win situation.
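The S2 workaround is trivial at the C level. A minimal sketch with hypothetical function and argument names (the Allocation limit itself is applied in the tool, via a directive along the lines of `set_directive_allocation -limit 1 -type function`; treat that syntax as indicative, quoted from memory of Vivado HLS):

```c
#include <assert.h>
#include <stdint.h>

/* Wrap "/" in a function so the HLS tool has something a directive can
   target. With an Allocation limit of one instance on div32, every call
   site time-shares a single hardware divider. */
uint32_t div32(uint32_t num, uint32_t den)
{
    return num / den;
}

/* Two divisions that would otherwise each synthesize their own
   divider (the tool inlines functions by default). */
uint32_t scale_pixel(uint32_t p, uint32_t gain, uint32_t norm)
{
    return div32(p * gain, norm) + div32(p, 2u);
}
```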
S3: S2 WITH CONSTANT-DIVISION-TO-MULTIPLICATION CONVERSION
In the next step, we opted to transform the division operation into multiplication, which was possible because this
application uses division by constant values. Our motivation here was twofold. First, despite deciding to use a single shared divider for all divisions, division hardware is still expensive (around 2,000 LUTs and flip-flops). Second, this shared divider takes 34 cycles for division, which is quite slow, and the version of the Xilinx tools we used supports only one type of divider in the HLS library. But there are several options for multiplication, as we will see in the next steps.
Therefore, in this step we transformed the division operations to multiplications (we used the #ifdef directives of C so as to keep the code portable). The approximation values were provided as function parameters via AXI-Lite slave registers. This is why Table 1 shows a significant improvement in performance and resources at this step, with a slight trade-off in added DSPs due to increased multiplication operations.
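The transformation replaces each division by a constant d with a multiply by a precomputed reciprocal and a shift: x / d becomes (x * M) >> S, with M = round(2^S / d). A self-contained sketch (divide-by-10 is our example; the article's actual constants aren't given):

```c
#include <assert.h>
#include <stdint.h>

/* Constant division as multiply-and-shift. The article feeds such
   approximation constants in through AXI-Lite slave registers; here
   M and S are ordinary parameters. */
uint32_t div_by_const(uint32_t x, uint32_t M, unsigned S)
{
    return (uint32_t)(((uint64_t)x * M) >> S);
}

/* M = round(2^35 / 10) = 3435973837: gives exact x/10 for all
   32-bit x, a well-known pair also used by C compilers. */
#define DIV10_M 3435973837u
#define DIV10_S 35u
```

A DSP-based multiply plus shift takes a cycle or two, versus the 34-cycle shared divider the text describes, which is where the performance gain in Table 1 comes from.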
S4: S3 WITH BUFFER MEMORY PARTITIONING
The source algorithm uses multiple memory buffers (8, 16 and 32 bits) arranged together in a unified malloc block. During the initial transformation to make the algorithm Vivado HLS ready (Figure 1), that memory block became an internal on-chip memory (block RAMs).
While a unified malloc block was a normal choice for shared external memory, when using internal on-chip memories it is much more interesting to exploit memory partitioning for faster throughput. In this step, therefore, we fractured the larger memory block into independent small memories (8, 16 and 32 bits). Table 1 shows the results, highlighting significant performance improvements with a slight increase in the number of BRAMs (Table 2) due to fractured use. This technique also yielded a significant decrease in logic resources, potentially due to simplified control logic for memory access.
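In C terms, the change is simply from one byte pool with manual offsets to distinct arrays, which the tool can then map to independent block RAMs. A sketch with made-up buffer sizes and a toy kernel:

```c
#include <assert.h>
#include <stdint.h>

#define N 64

/* Three separate arrays instead of one unified malloc block: each maps
   to its own BRAM, so all three reads can be scheduled in the same
   cycle (a similar effect is available through the HLS array_partition
   directive). Names and sizes are illustrative only. */
uint32_t sum3(const uint8_t a[N], const uint16_t b[N], const uint32_t c[N])
{
    uint32_t s = 0;
    for (int i = 0; i < N; i++)
        s += a[i] + b[i] + c[i];  /* one read per memory per cycle */
    return s;
}
```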
X P E R T S C O R N E R
Third Quarter 2013 Xcell Journal 37
Figure 2 – ESL IP validation flow on the FPGA
S5: S4 WITH SHARED 32/64-BIT SIGNED/UNSIGNED MULTIPLIERS
In this step, motivated by the shared-divider experiments in S2, we focused on optimizing the DSP blocks using shared resources. Our first step was to conduct a detailed analysis of the generated hardware using the Design Viewer utility to cross-reference generated hardware with the source C code. The analysis revealed that in raw form, the hardware is using 26 distinct multiplier units of heterogeneous types, which can be classified as unsigned 16-/32-/64-bit multiplications and 32-bit signed multiplications. As in our S2 experiments, we constructed a system consisting of 32-bit signed, 32-bit unsigned and 64-bit
unsigned multipliers (by replacing the “*” operation with a function). We then applied an Allocation constraint of using only one unit for these three distinct operations. Table 1 shows the resulting drastic reduction in DSP usage, albeit with some penalty in performance. This result motivated us to further investigate smart sharing.
S6: S5 WITH SMART MULTIPLICATION SHARING
To mitigate the performance loss arising from shared multiplier usage, we made smart use of multiplier sharing. We created two additional multiplier types: a 16-bit unsigned multiplier and a 32-bit unsigned multiplier with 64-bit return value. After several iterations of smartly using the multipliers in a mutually exclusive manner and changing their quantity, our final solution contained two unsigned 64-bit multipliers, two unsigned 32-bit multipliers, one signed 32-bit multiplier, one unsigned 16-bit multiplier and a single unsigned 32-bit multiplier with 64-bit return value. This technique improved performance slightly, as you can see in Table 1, but with slightly higher DSP usage.
S7: S6 WITH MULTIPLIER LATENCY EXPERIMENTS (COMBINATIONAL MULTIPLIER)
In the final stages of multiplier optimization, we conducted two experiments for changing the latency value
IP1: Zynq Z20 board results for 256 x 256-pixel test image

LUTs  FFs  BRAMs  DSP  Fmax (MHz)
ARM Cortex-A9 MPCore (Core-0 only)  NA  NA  NA  NA  NA  666.7  32.5  15  NA
MicroBlaze (with Cache & int. Divider)  2,027  1,479  6  3  NA  100  487.4  1  NA
Hand-Coded IP (Resources for Ref.)  8,798  3,693  8  33  54.2  *50 (NA)  *4 (NA)  NA  57
S6: S5 with Smart Mult sharing  4,900  4,569  16  40  8.75  2,910,344  2.94  918
S7: S6 with Mult Lat. exp (comb. Mult)  4,838  4,080  16  40  13.56  1,975,952  4.33  893
S8: S6 with Mult Lat. exp (1-clk-cycle Mult)  4,838  4,056  16  40  22.74  2,181,776  3.92  890
S9: S4 with Superburst  4,025  4,357  80  79  8.75  NA  NA  NA
S10: S4 with Smart Burst  3,979  4,314  24  80  8.75  NA  NA  NA
S11: S6 with Smart Burst  4,224  4,079  24  39  8.75  NA  NA  NA

Table 1 – HLS exploration results for IP1
*Power value of tool for iterations comparison; it has no units

Table 2 – FPGA prototype results for IP1 explorations
*Reference values, see methodology section
**Dynamic power from XPower Analyzer
of the multipliers. By default, Vivado HLS used multipliers with latencies ranging from two to five clock cycles. In the hand-coded IP, all multipliers are either single-clock-cycle or combinational. In this step, we set the multipliers’ latency to 0 by using the “Resource” directive and choosing the corresponding combinational multiplier from the library. The results, as seen in Table 1, show an improvement in latency. However, the design has become much slower from the standpoint of timing, so the increased performance may well turn into a loss due to the slower clock speed of the FPGA design.
S8: S6 WITH MULTIPLIER LATENCY EXPERIMENTS (1-CLOCK-CYCLE MULTIPLIER)
In this step, instead of using a combinational multiplier we selected a single-clock-cycle multiplier from the library. Results are shown in Table 1. We can see that while the latency has slightly increased, the clock period, surprisingly, has significantly increased, which may lead to slower hardware.
S9, S10, S11: BURST-ACCESS EXPERIMENTS
All the exploration steps presented so far required very little or no knowledge of the functionality of the C code. In the final experiments, we explored burst access, which is almost obligatory for IP that is based on shared-memory systems due to the high latency of the bus for small or random data accesses. Toward this end, we analyzed the algorithmic structure of the C code to explore the possibility of implementing burst access, since natively there is no such thing as burst access from a general software standpoint. However, it is important to note that even with our burst-access experiments, we never modified the software structure of the code.
We performed our burst-access experiments in two steps. First we used a dump approach (not practical,
but a good way to get some technical information), where the entire image was burst into the internal memory buffer and the full results burst out at the end of execution. This was easy to do by using a small image and by the presence of abundant on-chip memory in the FPGA. We called this iteration step “superburst” (S9).
In the second phase, we implemented a smart, interactive burst (S10, S11) within the limitations of not modifying the source code structure. The results are shown in Table 1. We implemented the burst functionality in C using the standard “memcpy” function, based on tutorial examples from Xilinx. Unfortunately, it is not possible to see latency values when you insert the memcpy function in the code. The FPGA implementation (Table 2) will illustrate the obtained behaviors.
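In plain C, the memcpy-style burst pattern looks like the sketch below: copy a block from the bus-mapped argument into a local buffer, work locally, copy the results back. In Vivado HLS the memcpy calls on an AXI4 master argument become burst transfers; the line size, names and the inversion kernel here are illustrative, not taken from the article's IP.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LINE 256

void process_line(const uint8_t *bus_in, uint8_t *bus_out)
{
    uint8_t buf[LINE];               /* on-chip (BRAM) work buffer */
    memcpy(buf, bus_in, LINE);       /* burst read                 */
    for (int i = 0; i < LINE; i++)
        buf[i] = (uint8_t)(255 - buf[i]);  /* toy per-pixel kernel */
    memcpy(bus_out, buf, LINE);      /* burst write                */
}
```

The only "hardware" idiom is staging through the local buffer, which is also why the burst variants in the tables pay for the technique in BRAM usage.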
FPGA PROTOTYPE OF SELECTED STEPS
One of the greatest benefits of using ESL tools is rapid design space exploration. We were able to evaluate all the exploration steps listed above, with the results presented in Table 1, within minutes, reserving the lengthy FPGA implementation phase for only a few, final selected iterations. Table 2 presents the final implementation results of some selected steps of Table 1 on our Zynq Z20 SoC. It also presents the equivalent results (latency) of implementing the original source code on a MicroBlaze processor, as a reference, and on the ARM Cortex-A9, showing the tremendous compute power that the Zynq SoC has brought to FPGA designers. The table also shows estimated dynamic-power values calculated by the Xilinx XPower Analyzer.
We obtained the implementation statistics of the optimal hand-coded IP on the Zynq SoC by implementing their RTL using Xilinx tools. We had to use the latency measurements of our previous experiments with a competitive FPGA as a reference, since porting to
the AXI4 bus was beyond the scope of our experiments. This further highlights the potential of HLS tools, which can make it easier for designers to port to multiple protocols.
If we compare the results of Table 2 with those of Table 1, several interesting observations emerge. First, the quick synthesis of ESL gives a relatively good estimate compared with the actual implementation (especially for LUTs and flip-flops); for memory and DSP, the synthesizer sometimes can significantly change the resource count for optimizations.
Second, for final implementation, timing closure can be a big problem. We can see a significant change in achieved frequency after FPGA implementation compared with the HLS estimate. Finally, it is surprising that even with the burst-access experiments, there is a significant difference in the latency values from the expected HLS values and hand-coded IP. This piqued our curiosity about the quality of DMA that the HLS tool generated.
NEW IP, SIMILAR RESULTS
We explored our second piece of IP, which we called IP2, in the same detail. The results, illustrated in Tables 3 and 4, are self-explanatory in the light of our in-depth discussion of our experiences with IP1. One point that’s interesting to note is step 8 in Table 3. Just as with IP1 (see the S2 and S3 discussions), the divider is expensive in terms of resources and also latency. Unlike IP1, IP2 uses real division, so the divider cannot be removed. Unfortunately, in the current version of the ESL tools, only one divider is present, with 34 clock cycles of latency. S8 shows that if a divider with a latency of a single clock cycle were used, like those available in hand-coded IP, theoretically it could yield a performance improvement of around 30 percent. Table 4 shows FPGA prototype results of selected steps.
VIVADO HLS DMA/BURST EFFICIENCY ISSUE
Comparing the results of the hand-coded IP implementation with ESL techniques, several observations can be made. Despite using the original source code (with almost no modification), which was not written in an optimal manner to exploit embedded systems, the ESL tools delivered results that rivaled optimally hand-coded IP in terms of resource utilization. The significant difference in the explored results for ESL vs. hand-coded IP came in latency, which was far better in the hand-coded IP.
Looking into this issue, we observed that there is a major difference in the
way the hand-coded IP and ESL IP are made. (This is also the reason for the big difference in resources consumed for IP2.) Due to the style of the source code, there is only one data bus for input and output in the ESL IP, compared with two for the hand-coded IP. Also, FIFOs handle burst access much more optimally in hand-coded IP. For the ESL IP, the burst is done with buffers, and due to the nature of the original code it was quite hard to optimally create an efficient burst mechanism. The ESL approach entails a penalty in BRAM usage, since the ESL hardware is using these memories for work buffers and burst buffers in accordance with the code structure. In addition, the
burst is done using the memcpy function of C, in accordance with Xilinx tutorials for the AXI4 master interface.
We also conferred with Xilinx experts, who recommended using the external DMAs manually for optimal performance, as the AXI4 master interface generated by the version of Vivado HLS we used is in beta mode and will be upgraded in the future. Such factors might have caused a significant difference in latency values and might be a good starting point for future experiments.
IP2: Zynq Z20 board results for 256 x 256-pixel test image

LUTs  FFs  BRAMs  DSP  Fmax (MHz)
ARM Cortex-A9 MPCore (Core-0 only)  NA  NA  NA  NA  NA  666.7  37.9  13.39  NA
MicroBlaze (with Cache & int. Divider)  2,027  1,479  6  3  NA  100  507.6  1  NA
Hand-Coded IP (Resources for Ref.)  3,671  2,949  2  16  63.8  *50 (NA)  *4 (NA)  NA  25
S3: S2 + const Mult & Div ops reduction  8,620  8,434  8  81  44.2  40  236.8  2.14  24.4
S5: S4 with Smart Mult sharing  7,240  6,663  16  48  8.75  3,882,604  2.85  1,357
S6: S4 with Smart Burst  6,816  6,515  24  78  8.75  NA  NA  1,240
S7: S5 with Smart Burst  6,846  5,958  24  33  8.75  NA  NA  1,262
S8: S5 Theoretical Lat. with 1-clk-cycle Div.  NA  NA  NA  NA  NA  2,751,084  4.02  NA

Table 3 – HLS exploration results for IP2
*Power value of tool for iterations comparison; it has no units

Table 4 – FPGA prototype results for IP2 explorations
*Reference values, see methodology section
**Dynamic power from XPower Analyzer

PRODUCTIVITY VS. EFFICIENCY
The detailed experiments undertaken in this work revealed several interesting aspects of ESL design and also cleared up some misconceptions. ESL tools have come a long way since the early iterations. They now support ANSI C/C++, give good competitive results and exploit several design decisions (not just the loop exploitations of the past) inspired by hardware/system design practices, in the form of constraints directives.
Indeed, we found that the results obtained from ESL rivaled the optimal results in all aspects except one: latency. Furthermore, using accurate bit types with advanced features of the tools (instead of the generic C types we used in these experiments) may further improve resources, as could partial rewriting of code.
Our explorations also gave us an idea about productivity vs. efficiency trade-offs. Figure 3 illustrates the relative development time of IP blocks (in terms of design, software, FPGA implementation and verification). The fast time-to-solution for ESL-based design is dramatic, especially for verification, with results that rival optimal implementation. It is important to note that design/validation time is highly dependent on the skills of the designers. Since the same group of people was involved in both the hand-coded design and the ESL-based design, Figure 3 gives quite an unbiased overview of how the process plays out. Considering the fact that designers
were well acquainted with the classical RTL method of IP design and integration, and were new to ESL tools, it’s likely that implementation time could significantly improve in the longer term, once the tools are more familiar.
WHAT NEXT FOR ESL?
From a technology and also a business standpoint, there should be some EDA standardization among ESL tools’ constraints structure, making designs easy to port to other FPGAs or ASICs (flexibility that is inherent in RTL-based design tools and, in principle, even easier to achieve with ESL tools). ESL tool providers should differentiate in the quality of their tools and results only, as they do for RTL tools.
It’s a long and complex discussion, and beyond the scope of this short article, to answer questions like: Have ESL tools enabled software designers to program FPGAs? Will ESL eliminate RTL designers/jobs? Who is the biggest winner with ESL tools? In our brief experience with this work, we have found that ESL tools are beneficial for all, especially for system designers.
For hardware designers, the tools can be an interesting way to create or evaluate small portions of their design. They can be helpful in rapidly creating a bus-interface-ready test infrastructure. Software designers trying to design hardware are one of the key targets for ESL tools. And while ESL tools have come a long way for this purpose, they still have a long way to go. In our project, for example, ESL tools hid the hardware complexities to a great extent from software designers. Nevertheless, good implementation might still require a hardware mind-set when it comes to constraints and optimization efforts. This issue will get better as the ESL tools further evolve.
As for the system designers, who are becoming more and more common due to the increased integration and penetration of HW/SW co-design in the world of SoCs, ESL tools perhaps are a win-win situation. System designers can exploit these tools at multiple levels.
On the hardware side, one of the key reasons to choose the Zynq-7000 All Programmable SoC among Xilinx’s 28-nm 7 series FPGAs was to explore the potential of having a hard ARM core onboard the same chip with FPGA logic. As we have seen in these experimental results, the dual-core ARM has brought unprecedented compute power to FPGA designers. Indeed, the Zynq SoC delivers a mini-supercomputer to embedded designers, who can customize it using the tightly integrated FPGA fabric.
We are grateful to our industrial partner, Sagemcom, for funding and technical cooperation. We would also like to thank Xilinx experts in Europe, Kester Aernoudt and Olivier Tremois, for their prompt technical guidance during this work.
REFERENCES
1. G. Martin and G. Smith, “High-Level Synthesis: Past, Present and Future,” IEEE Design and Test of Computers.
2. BDTI report, “High-Level Synthesis Tools for Xilinx FPGAs,” 2010.
3. BDTI report, “The AutoESL AutoPilot High-Level Synthesis Tool,” 2010.
Figure 3 – Hand-coded RTL vs. ESL development time in person-weeks (IP1 RTL, IP1 ESL, IP2 RTL, IP2 ESL)
XPLANATION: FPGA 101
One of the major benefits of the Zynq SoC architecture is the ability to speed up processing by building a peripheral within the device’s programmable logic.
How to Increase Zynq SoC Speed by Creating Your Own Peripheral
One of the real advantages of the Xilinx® Zynq™-7000 All Programmable SoC is the ability to increase the performance of the processing system (PS) side of the device by creating a peripheral within the programmable logic (PL) side. At first you might think this would be a complicated job. However, it is surprisingly simple to create your own peripheral.
Adding a peripheral within the PL can be of great help when you're trying to speed up the performance of your processing system, or when you're using the PS to control the behavior of the design within the programmable logic side. For example, the PS might use a number of memory-mapped registers to control operations or options for the design within the programmable logic.
Our example design is a simple module that allows you to turn on and off a memory-mapped LED on the ZedBoard. To create this module, we will use the Xilinx PlanAhead™, XPS and Software Development Kit (SDK) tool flow in a three-step process.
1. Create the module within the Embedded Development Kit (EDK) environment.
2. Write the actual VHDL for our module and build the system.
3. Write the software that uses our new custom module.
CREATING THE MODULE IN EDK
To create your own peripheral, you will first need to open Xilinx Platform Studio (XPS) from the PlanAhead project containing your Zynq SoC design and select the menu option "hardware->create or import peripheral."
For our example module, the seven LEDs on the ZedBoard should demonstrate a walking display, illuminating one after the other, except that the most significant LED will be toggled under control of the software application. While this is not the most exciting application of a custom block, it is a useful, simple example to demonstrate the flow required. Once you've grasped that process, you will know everything you need to know to implement more-complex modules.
This is the third in a planned series of hands-on Zynq-7000 All Programmable SoC tutorials by Adam Taylor. Earlier installments appeared in Xcell Journal issues 82 and 83. A frequent contributor to Xcell Journal, Adam is also a blogger at All Programmable Planet.
You create a peripheral by stepping through 10 simple windows and selecting the options you desire.

The first two screens (shown in Figure 1) ask whether you want to create or import an existing peripheral and, if so, to which project you wish to associate it. The next stage, at screen 3, allows you to name your module and, very usefully, lets you define versions and revisions. (Note that the options at the bottom of the peripheral flow enable you to recustomize a module you have already created by reading back in the *.cip file.)

Once you have selected your module name, you can then choose the complexity of the AXI bus required, as demonstrated in Figure 2. The one we need for our example design is a simple memory-mapped, control-register-type interface; hence, we have chosen the top option, AXI4-Lite. (More information on the different types of AXI buses supported is available at http://www.xilinx.com/support/docu…4_Plug_and_Play_IP.pdf.) This step will create a set of simple registers that can be written to and read from over the AXI bus via the processing system. These registers will be located within the programmable logic fabric on the Zynq SoC and hence are also accessible by the user logic application, where you can create the function of the peripheral.

The next six screens offer you the option to customize an AXI master or slave configuration, and the number of registers that you will be using to control your user logic, along with the supporting files. For this example we are using just the one register, which will be functioning as an AXI4-Lite slave. The most important way to ease the development of your system is to select "generate template driver files" on the peripheral implementation support tab. Doing so will deliver a number of source and header files to help you communicate with the peripheral you create.

At the end of this process, XPS will have produced a number of files to help you with creating and using your new peripheral:

• A top-level VHDL file named after your peripheral

• A VHDL file in which you can create your user logic

• A Microprocessor Peripheral Description, which defines the interfaces to your module such that it can be used with XPS

• A CIP file, which allows recustomization of the peripheral if required

• Driver source files for the SDK

• Example source files for the SDK

• A Microprocessor Driver Definition, which details the drivers required for your peripheral

These files will be created under the PlanAhead sources directory, which in our case is zed_blog.srcs\sources_1\edk\proc_subsystem. Beneath this you will see the Pcores directory, which will contain the files (under a directory named after your peripheral) required for XPS and PlanAhead, along with the drivers directory, which contains the files required for the SDK.

CREATING THE RTL
Within the Pcores directory you will see two files, one named after the component you created (in this case, led_if.vhd) and another called user_logic.vhd. It is in this latter file that you create your design. You will find that the wizard has already given you a helping hand by generating the registers and interfaces necessary to communicate over the AXI bus within this file. The user-logic module is instantiated within led_if.vhd. Therefore, any changes you make to the entity of user_logic.vhd must be flowed through into the port map. If external ports are required (as they are in our example design, to flash the LED), you must also flow through your changes in the entity of led_if.vhd.
Once you have created your logic design, you may wish to simulate it to ensure that it functions as desired. Bus-functional models are provided under the Pcores/devl/bfmsim directory.
USING YOUR MODULE WITHIN XPS
Having created the VHDL file or files and ensured they function, you will wish to use the module within your XPS project. To do this correctly, you will need to ensure the Microprocessor Peripheral Description file is up to date with the ports you may have added during your VHDL design. You can accomplish this task by editing the MPD file within XPS. You will find your created peripheral under Project Local Pcores->USER->core name. Right-clicking on this name will enable you to view your MPD file.
You will need to add the non-AXI interfaces you included in your design so that you can see the I/Os and deal with them correctly within XPS. You will find the syntax for the MPD file at http://www.xilinx.com/support/docu…sf_rm.pdf, although it is pretty straightforward, as you can see in Figure 3. The highlighted line (line 59, at the bottom of the central list) is the one I added to bring my LED output port to the module.
When you have updated the MPD file, you must click on "re-scan user IP repositories" to ensure the changes are loaded into XPS. At that point, you can implement the device into your system just like any other peripheral from the IP library. Once you have it in the address space you want (remember, the minimum is 4K) and have connected your outputs as required to external ports, you can then, if necessary, run the design rule checker. Provided there are no errors, you can close down XPS and return to PlanAhead.
Back within the PlanAhead environment, you need to regenerate the top level of HDL; to do this, just right-click on the processor subsystem you created and select "create top HDL" (see Xcell Journal issue 82 for information on creating your Zynq SoC PlanAhead project). The final stage before implementation is to modify the UCF file within PlanAhead if changes have been made to the I/O pinout of your design. Once you are happy with the pinout, you can then run the implementation and generate the bitstream ready for use in the SDK.
USING THE PERIPHERAL IN SDK
Having gotten this far, we are ready to use the peripheral within the software environment. As with the tasks of creation and implementation, this is an easy and straightforward process. The first step is to export the hardware to the Software Development Kit using the "export hardware for SDK" option within PlanAhead. Next, open the SDK and check the MSS file to make sure that your created peripheral is now present. It should be listed with the name of the driver next to it; in the first instance the driver will be named "generic." When simply using the peripheral within the SDK, the generic driver is not an issue. If you wish to use it, you will need to include the header file xil_io.h (#include "xil_io.h"). This allows you to use the function calls Xil_In32(addr) and Xil_Out32(addr, data) to read from and write to the peripheral.

Figure 2 – The next two windows, covering name, revision and bus standard selection

Figure 3 – Editing the MPD file to include your peripheral
To use this method you will also need to know the base address of the peripheral, which is given in xparameters.h, and the offset between registers. In this case, because of the 32-bit addressing in our example design, the registers are separated by 0x04. However, if you enabled the driver-creation option when you created the peripheral using XPS, then a separate driver will have been specially built for your peripheral. This driver, too, will be located within the drivers subdirectory of your PlanAhead sources directory.
You can use this type of specially built driver in place of the generic drivers within your design, but first you must set the software repositories to point to the directory so that the SDK can see the new drivers. You can define the repositories under the Xilinx Tools -> Repositories menu (Figure 4).
Once you have set the repositories to point to the correct directory, you can re-scan the repositories and close the dialog window. The next step is to update the board support package with the new driver selection. Open the board support package settings by right-clicking on the BSP icon under the Project Explorer. Click on the drivers tab and you will see a list of all the peripherals and their drivers. You can change the driver selection for your peripheral by selecting the appropriate driver from a list of drivers applicable to that device (Figure 5).
Select the driver for your peripheral by name and click OK; your project will be recompiled if you have automatic compilation selected. Once the code has been rebuilt, you are then nearly ready to communicate with the peripheral using the supplied drivers. Within your source code you will need to include the driver header file (#include "led_if.h") and use the base address of the peripheral from xparameters.h so that the interface knows the address within the memory space. It is then a simple case of calling the commands provided by the driver within the "include" file. In the case of our design, one example was:
With this command written, it is then time to test the software within the hardware environment and confirm that it functions as desired.
As you can see, making your own peripheral within a Zynq SoC design is not as complicated as you may at first think. Creation of a peripheral offers several benefits for your Zynq SoC processing system and, ultimately, your overall solution.
Figure 5 – Selecting the peripheral driver
JESD204B: Critical Link in the FPGA-Converter Chain
by Jonathan Harris
Product Applications Engineer, High-Speed Converter Group
Analog Devices
[email protected]

Data converters that interface with Xilinx FPGAs are fast embracing the new JESD204B high-speed serial link. To use this interface format and protocol, your design needs to take some basic hardware and timing issues into account.

In most designs today, parallel low-voltage differential signaling (LVDS) is the interface of choice between data converters and FPGAs. Some FPGA designers still use CMOS as the interface in designs with slower converters. However, the latest high-speed data converters and FPGAs are migrating from parallel LVDS and CMOS digital interfaces to a serialized interface called JESD204B, a new standard developed by the JEDEC Solid State Technology Association, an independent semiconductor engineering trade organization and standardization body.

As converter resolution and speed have increased, the demand for a more efficient interface has grown. This interface is the critical link between FPGAs and the analog-to-digital or digital-to-analog converters that reside beside them in most system designs. The JESD204B interface brings this efficiency to designers, and offers several advantages over preceding interface technologies. New FPGA designs employing JESD204B enjoy the benefits of a faster interface to keep pace with the faster sampling rates of converters. Also, a dramatic reduction in pin count leads to smaller package sizes and fewer trace routes, making board designs a little less complex.

All this does not come for free. Each type of interface, including JESD204B, comes with design issues such as timing concerns. The specifics differ somewhat, but each interface brings its own set of parameters that designers must analyze properly to achieve acceptable system performance. There are also hardware choices to be made. For example, not all FPGAs and converters support the JESD204B interface. In addition, you must be up to date on parameters such as the lane data rate in order to select the appropriate FPGA.

Before exploring the details of the new JESD204B interface, let's take a look at the other two options that designers have long used for FPGA-to-converter links: CMOS and LVDS.

CMOS INTERFACE
In converters with sample rates of less than 200 megasamples per second (MSPS), it is common to find that the interface to the FPGA is CMOS. A high-level look at a typical CMOS driver shows two transistors, one NMOS and one PMOS, connected between the power supply (VDD) and ground, as shown in Figure 1a. This structure results in an inversion in the output; to avoid it, you can use the back-to-back structure in Figure 1b instead. The input of the CMOS output driver is high impedance while the output is low impedance. At the input to the driver, the gates of the two CMOS transistors present high impedance, which can range from kilohms to megohms. At the output of the driver, the impedance is governed by the drain current, ID, which keeps the impedance to less than a few hundred ohms. The voltage levels for CMOS swing from approximately VDD to ground and can therefore be quite large depending on the magnitude of VDD.

Some important things to consider with CMOS are the typical switching speed of the logic levels (~1 V/ns), output loading (~10 pF per gate driven) and charging currents (~10 mA per output). It is important to minimize the charging current by using the smallest capacitive load possible. In addition, a damping resistor will minimize the charging current, as illustrated in Figure 2. It is important to minimize these currents because of how quickly they can add up. For example, a quad-channel 14-bit A/D converter could have a transient current as high as 14 x 4 x 10 mA, which would be a whopping 560 mA. The series-damping resistors will help suppress this large transient current. This technique will reduce the noise that the transients generate in the outputs, and thus help prevent the outputs from generating additional noise and distortion in the A/D converter.

Figure 1 – Typical CMOS digital output driver: (a) inverted output; (b) noninverted output

Figure 2 – CMOS output driver with transient current (i = C x dV/dt, with dV/dt = 1 V/ns) and damping resistor

LVDS INTERFACE
LVDS offers some nice advantages over CMOS technology for the FPGA designer. The interface has a low-voltage differential signal of approximately 350 mV peak-to-peak. The lower voltage swing has a faster switching time and reduces EMI concerns. In addition, many of the Xilinx® FPGA families, such as the Spartan®, Virtex®, Kintex® and Artix® devices, support LVDS. Also, by virtue of being differential, LVDS offers the benefit of common-mode rejection. This means that noise coupled to the signals tends to be common to both signal paths and is mostly canceled out by the differential receiver. In LVDS, the load resistance needs to be approximately 100 Ω and is usually achieved by a parallel termination resistor at the LVDS receiver. In addition, you must route the LVDS signals using controlled-impedance transmission lines. Figure 3 shows a high-level view of a typical LVDS output driver.

A critical item to consider with differential signals is proper terminations. Figure 4 shows a typical LVDS driver and the termination needed at the receivers. You can use a single differential termination resistor (RTDIFF) or two single-ended termination resistors (RTSE). Without proper terminations, the signal quality becomes degraded, which results in data being corrupted during transmission. In addition to having proper terminations, it is important to give attention to the physical layout of the transmission lines. You can maintain proper impedance by using line widths that are appropriate. The fabrication process variation should not cause wide shifts in the impedance. Avoid abrupt transitions in the spacing of the differential transmission lines in order to reduce the possibility of signal reflections.

Figure 3 – Typical LVDS output driver

Figure 4 – LVDS output driver with termination

JESD204B INTERFACE
The latest development in high-speed converters and FPGAs is the JESD204B serialized interface, which employs current-mode logic (CML) output drivers. Typically, converters with higher resolutions (≥14 bits), higher speeds (≥200 MSPS) and the desire for smaller packages with less power utilize CML drivers. There is some overlap here in high-speed converters around the 200-MSPS speed range that use LVDS. This means that designers moving to JESD204B can use a proven design that was likely based on a high-speed converter with LVDS. The result is to remove a level of complexity from the design and to mitigate some of the risk.

Combining CML drivers with serialized JESD204B interfaces allows data rates on the converter outputs to go up to 12.5 Gbps. This speed allows room for converter sample rates to push into the gigasample (GSPS) range. In addition, the number of required output pins plummets dramatically. It's no longer necessary to route a separate clock signal, since the clock becomes embedded in the 8b/10b-encoded data stream. Since the interface employed with CML drivers is typically serial, there's a much smaller increase in the number of pins required than would be the case with CMOS or LVDS.

Table 1 illustrates the pin counts for the three different interfaces in a 200-MSPS converter with various channel counts and 12-bit resolution. The data assumes a synchronization clock for each channel's data, in the case of the CMOS and LVDS outputs, and a maximum data rate of 4 Gbps for data transfer using JESD204B. This lane rate is well under the maximum limits of many of the FPGAs in the Virtex-6, Virtex-7, Kintex-7 and Artix-7 families available today. The reasons for the progression to JESD204B become obvious when you note the dramatic reduction in pin count that you can achieve. The board routing complexity is reduced and less I/O is required from the FPGA, freeing up those resources for other system functions.

Table 1 – Pin count comparison for a 200-MSPS A/D converter using CMOS, LVDS and JESD204B

  Channels  Resolution  CMOS pins  LVDS pins (DDR)  CML pins (JESD204B)
     1          12          13            14                  2
     2          12          26            28                  4
     4          12          52            56                  8
     8          12         104           112                 16

Taking this trend one step further and looking at a 12-bit converter running at 1 GSPS, the advantage of using JESD204B becomes even more apparent. This example omits CMOS, because it is completely impractical to use a CMOS output interface with a gigasample converter. Also, the number of converter channels is limited to four and the lane rate is limited to 5 Gbps, which is again compatible with many of the FPGAs in the Virtex-6 and Xilinx 7 series FPGA families. As you can see, JESD204B can significantly reduce the complexity of the output routing due to the reduction in the number of output pins. There is a 6x reduction in pin count for a quad-channel converter (Table 2).

Table 2 – Pin count comparison between LVDS and JESD204B for a 1-GSPS converter

  Channels  Resolution  LVDS pins (DDR)  CML pins (JESD204B)
     1          12            50                  8
     2          12           100                 16
     4          12           200                 32
     8          12           400                 64

Just as with the LVDS interface, it is important to maintain proper termination impedances and transmission-line impedances when using CML with the JESD204B interface. One method of measuring signal quality is the eye diagram, a measurement that indicates several parameters about the signal on a link. Figure 5 shows an eye diagram for a properly terminated CML driver on a 5-Gbps JESD204 link. The eye diagram exhibits nice transitions and has a sufficiently open eye that the receiver in the FPGA should have no difficulty interpreting the data. Figure 6 shows the result of having an improperly terminated CML driver in the same 5-Gbps JESD204 link. The eye diagram has more jitter, the amplitude is reduced and the eye is closing, in which case the receiver in the FPGA will have a difficult time interpreting the data. As you can see, it is very important to use proper terminations to ensure that you achieve the best possible signal integrity.

Figure 5 – Eye diagram (5 Gbps) with proper terminations

Figure 6 – Eye diagram (5 Gbps) with improper terminations

A BETTER AND FASTER WAY
The three main types of digital outputs employed in converters each have their own advantages and disadvantages when connecting to FPGAs. It is important to keep these pros and cons in mind when utilizing converters that employ CMOS, LVDS or CML (JESD204B) output drivers with an FPGA design. These interfaces are a critical link between the FPGAs and the A/D or D/A converters. Each type of driver has qualities and requirements that you must take into account when designing a system, such that the receiver in the FPGA can properly capture the converter data. It is important to understand the loads that must be driven; to utilize the correct terminations where appropriate; and to employ proper layout techniques for the different types of digital outputs the converters use. As the speed and resolution of converters have increased, so has the demand for a more efficient digital interface. As this happens, it becomes even more important to have a properly designed system using optimal layout techniques.

The JESD204B serialized data interface offers a better and faster way to transmit data from converters to receiver devices. The latest converters and FPGAs are being designed with drivers to support JESD204B. The need begins with the system-level demand for more bandwidth, resulting in faster converters with higher sample rates. In turn, the higher sample rates drive the requirement for higher output data rates, which has led to the development of JESD204B. As system designs become more complex and converter performance pushes higher, the JESD204 standard is poised to adapt and evolve to continue to meet new design requirements. This ability of the JESD204 standard to evolve will increase the use of the interface, eventually propelling it into the role of default interface of choice for converters and FPGAs.

For more information, see the "JESD204B Survival Guide" at http://www.analog.com/static/import…Survival-Guide.pdf. Also, Xilinx and Analog Devices offer a three-part YouTube series on rapid prototyping with JESD204B and FPGA platforms at http://www.youtube.com/playlist?list=…
To create an IP-based streaming-video system with a small PCB footprint, designers historically have had to choose between the inflexible architecture of an ASSP or a larger, more flexible system based on an FPGA-microprocessor combo. Placing a soft-core microprocessor inside the FPGA could eliminate the need for a separate processor and DRAM, but the performance of the final system might not match that of a solution built around an external ARM® processor that would also contain USB, Ethernet and other useful peripherals. With the advent of the Xilinx® Zynq®-7000 All Programmable SoC along with small-footprint H.264 cores, it's now possible to construct a system in a very small-form-factor PCB with only one bank of DRAM, but with the power of dual ARM cores and high-speed peripherals that are all interconnected with multiple high-speed AXI4 buses (see Figure 1).

Although H.264 cores for FPGAs have existed for quite some time, up until now none have been fast enough and small enough to convert 1080p30 video and still fit into a small, low-cost device. Using the latest micro-footprint H.264 cores from A2e Technologies in tandem with the Zynq SoC, it's possible to build a low-latency system that can encode and decode multiple streams of video from 720p to 4K at variable frame rates from 15 to 60 frames per second (fps). Placing an A2e Technologies H.264 core in a Zynq SoC device allows for a significant reduction in board space and component count, while the dual ARM cores in the Zynq SoC remove the need for a separate microprocessor and the memory bank to which it must be attached.
TOOLS OF XCELLENCE

Micro-footprint H.264 cores combine with Xilinx's Zynq SoC in a small, fast, streaming-video system.
The result is a complete streaming-video system, built around Linux drivers and a Real Time Streaming Protocol (RTSP) server, that is capable of compressing 1080p30 video from two cameras with an end-to-end latency of approximately 10 milliseconds. The A2e Technologies H.264 core is available in encode-only and encode/decode versions. Additionally, there is a low-latency version that reduces encode latency to less than 5 ms. The 1080p30 encode-only version requires 10,000 lookup tables (LUTs), or roughly 25 percent of a Zynq Z7020's FPGA fabric, while the encode/decode version requires 11K LUTs.
SOC-FPGA-BASED SYSTEMS
Products based on the Zynq SoC provide significantly more flexibility than those built using an application-specific standard product (ASSP). For example, it's easy to construct a system supporting four 1080p30 inputs by placing four H.264 cores in the FPGA fabric and hooking each camera input to the FPGA. Many ASSPs have only two inputs, forcing designers to figure out how to multiplex several streams into one input. Each A2e Technologies H.264 core is capable of handling six VGA-resolution cameras or one camera with 1080p30 resolution. So, it would be possible to construct a two-core system that could compress video from 12 VGA cameras.
ASSPs typically provide for an on-screen display feature only on the decoded-video side, forcing designers either to send the OSD information as metadata or to use the ARM core to time video frames and programmatically write the OSD data to the video buffer. In an FPGA, adding OSD before video compression is as simple as dropping in a block of IP. You can place additional processing blocks such as fisheye-lens correction before the compression engine with relative ease. FPGAs also allow for in-field, future upgrades of features, such as adding H.265 compression.

Figure 1 – Video compression systems based on a single-chip Zynq SoC (top) vs. a two-chip processor-FPGA combination

Figure 2 is a block diagram of an H.264 compression engine with input from two 1080p30 cameras, with OSD applied to the uncompressed image.
MANAGING LATENCY
Certain applications, such as control of remotely piloted vehicles (RPVs), are based upon feedback from streaming images coming from the remote device. In order to control a remote device, the latency between the time when the sensor sends the video to the compression engine and the point when the decoded image is displayed (referred to as "glass to glass") typically needs to be less than 100 ms. Some designers target numbers in the 50-ms range. Latency is the sum of the following:
• Video-processing time (ISP, fisheye-lens correction, etc.)
• Delay to fill frame buffer
• Compression time
• Software delay associated with sending out a packet
• Network delay
• Software delay associated with receiving a packet
• Video decode time
Many systems encode video using hardware but end up decoding it using a standard video player such as VLC running on a PC. Even though the buffer delay for media players can be reduced from the typical 500 to 1,000 ms to something significantly less, the latency is still far greater than 50 ms. To truly control latency and keep it to an absolute minimum, hardware decode of the compressed video stream is required, along with minimal buffering.
Latency for H.264 encoders is typically specified in frames, as a full frame must typically be buffered before compression begins. Assuming you can speed up your encoder, you could decrease the encode latency simply by doubling the frame rate; that is, one frame of latency at 30 fps is 33 ms, vs. 16.5 ms at 60 fps. Many times this option is not available due to performance limitations of the camera and encoder. Thus, the solution is an encoder specifically designed for low latency. In the new A2e Technologies low-latency encoder, only 16 video lines need to be buffered up before compression starts. For a 1080p30 video stream, the latency is less than 500 µs; for a 480p30 video stream, the latency is less than 1 ms. This low-latency encoder will allow designers to construct systems with low and predictable latency.

Figure 2 – Architecture of a dual-1080p30 H.264 compression engine
In order to minimize total latency, you must also minimize delays introduced by buffers, network stacks and RTSP servers/clients on both the encoding and decoding sides, as it is pointless to use a low-latency encoder with a software path that introduces a large amount of latency. An RTSP server typically is used to establish a streaming-video connection between a server (camera) and the client (decoding/recording) device. After a connection is established, the RTSP server will stream the compressed video to the client for display or storage.
MINIMIZING LATENCY
Typically, the software components in the server and client are designed only to keep up with the bandwidth required to stream compressed video, not to minimize latency. With a non-real-time operating system like Linux, it can be difficult to guarantee latency. The typical solution is to create a low-latency, custom protocol for both the server and client. However, this approach has the downside of not being compatible with industry standards. Another approach is to take a standard like RTSP and make modifications at the lower levels of the software to minimize latency, while at the same time retaining compliance with standards.
However, you can also take steps tominimize copies between kernel anduser space, and the delays associatedwith them. To reduce latency though
58 Xcell Journal Third Quarter 2013
T O O L S O F X C E L L E N C E
DDR Flash DDR Flash
Figure 3 – Block diagram of complete encode-decode system
A typical streaming-video system uses an RTSP server to establish a streaming-video connection between a camera and the client (decoding/recording) device. The RTSP server will stream the compressed video to the client for display or storage.
the whole software path, you will need to decouple the RTSP server and compressed-data forwarding so that the Linux driver, not the RTSP server, performs the forwarding.

In the A2e Technologies low-latency RTSP server, we have made two major changes to reduce latency. First, we removed the RTSP server from the forwarding path. The RTSP server continues to maintain statistics using the Real Time Control Protocol (RTCP), and periodically (or asynchronously) updates the kernel driver with changes in the network destination (i.e., destination IP or MAC address). Second, the kernel driver attaches the necessary headers (based on the information provided by the RTSP server) and immediately forwards the packet by directly injecting it into the network driver (for example, udp_send), thus eliminating the memory copy to/from user space.
Figure 3 shows a complete H.264 IP-based encode/decode system with a combined latency of less than 50 ms. The system is based upon the Zynq SoC, the A2e Technologies low-latency H.264 encoder-decoder and the A2e Technologies low-latency RTSP server/client. Note that the only real difference from a hardware perspective between the encode and decode systems is that the encode side must interface to a camera/sensor and the decode side must be able to drive a flat-panel display. You can easily design a board that has all the hardware features needed for use in both encode and decode.
In order to minimize video compression/decompression latency for real-time control applications, designers need special encoders and optimized software. Utilizing the Xilinx Zynq SoC and the A2e Technologies low-latency H.264 encoder and optimized RTSP server/client, you can create a highly configurable system with extremely low latency in a small PCB footprint.
For more information, visit http://www.xilinx.com/products/intel-
With semiconductor companies releasing ever-more-advanced devices, circuit designers are constantly challenged to figure out the best ways to ensure their designs run at maximum performance so their systems can reap all the benefits of these new ICs. An often overlooked but critical step to getting the best performance out of the most advanced FPGAs is to create a robust and reliable power supply in your system.
The Xilinx® 7 series FPGAs provide unprecedented low power, high performance and an easy path for migrating designs. This innovative family employs a 28-nanometer TSMC HPL (high-performance, low-power) process and offers a 50 percent reduction in power consumption compared with competing FPGAs.
Yet as FPGAs increase in capacity and performance, FPGA I/O resources are becoming a major performance bottleneck. While transistor counts effectively double every year, the size of an FPGA die stays more or less the same. This is problematic for the I/O carrying signals on and off the chip, as I/O can only increase linearly, which presents a slew of layout, signal integrity and power challenges for systems designers trying to create higher-bandwidth systems. To overcome the issue, Xilinx employs highly sophisticated, multigigabit transceiver (MGT) I/O modules in its state-of-the-art devices. The Virtex®-7 HT FPGA, for example, provides a maximum bandwidth of 2.78 Tbps, using 96 sets of 13.1-Gbps transceivers.
As FPGAs get more sophisticated, the power supply requirements have also become more complex. For today's designs, teams must consider multiple voltage rails corresponding to each circuit and functional block, a plurality of voltage levels, as well as high-current requirements. Let's examine the power requirements associated with the Virtex-7 FPGAs in greater depth.
ACCURACY IN OUTPUT VOLTAGE SETTING
With the evolution of semiconductor manufacturing process technology, the density and performance of semiconductors, especially FPGAs, have improved greatly. At the same time, semiconductor companies have been reducing the core voltages of their devices. Both factors complicate power supply design and make it hard for designers to select the right power circuitry. Board layout and manufacturing technologies are not sufficient to manage voltage within a few millivolts, so the power systems must be measurably accurate and reliable. As a result, it is imperative for design teams to take power supply characteristics and features, as well as the device's characteristics, into consideration when designing their power systems. For example, a core voltage of 1 V has a tolerance of +/-30 mV. When a current changes by 10 A, a 50-mV voltage drop occurs across a 5-milliohm power line (see Figure 1). As such, it is not easy to manage a power supply voltage within the tolerable values in all the environments in your board designs. In addition, if design teams are using a point-of-load (POL) power supply, it needs to be extremely accurate, within +/-1 percent, and have a high-speed transient response. For these and other reasons, design teams should thoroughly investigate the design and options available to them before building their boards.
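The arithmetic behind this example is simple Ohm's law, and it shows why a 5-milliohm distribution path already blows the ±30-mV budget on a 1-V rail:

```python
# The voltage-drop example from the text: a 10 A load change across a
# 5-milliohm power line drops 50 mV, well outside the +/-30 mV
# tolerance on a 1 V core rail.
delta_i   = 10.0    # load current change, A
r_path    = 0.005   # power-line resistance, ohms
tolerance = 0.030   # allowed deviation on the 1 V rail, V

drop = delta_i * r_path  # Ohm's law: V = I * R
print(f"drop = {drop * 1000:.0f} mV")          # 50 mV
print("within tolerance:", drop <= tolerance)  # False
```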
Figure 1 – Evolution of the process and tolerance of the core voltage
One of the first things design teams need to be aware of when selecting circuitry for their power system is that FPGA designs have strict sequencing requirements for startup. For example, the Virtex-7 FPGA power-on sequence requires one of two choices: VCCINT_VMGTAVCC_VMGTAVTT or VMGTAVCC_VCCINT_VMGTAVTT. Meanwhile, the power-off sequence is the reverse of whichever of these power-on sequences you chose. If you power the device on or off in any sequence other than the ones recommended, the slack current from VMGTAVTT could be more than the specification during power-up and power-down. To satisfy such sequencing rules, designers need to use either a specialized IC for sequencing or the P-Good and on/off features of POLs.
A power system for a design using an FPGA also needs to include circuitry that allows the team to monitor output voltages. Typically, designers check the output voltage during a product's fine-tuning stage and trim the output voltage to manage printed-pattern drop voltages. Designers typically trim the voltage with POLs, but of course this means that the POLs you use in your design need to have a feature for output-voltage adjustment. The power system also needs to allow you to monitor your design's voltage and current to check for current time deviations and power malfunctions, and to better predict failures.
High-speed serdes require high current because they have several channels and a high operating frequency. The low-dropout (LDO) regulators that engineers use for this purpose generate a lot of heat due to poor efficiency, and thus are not an ideal power solution for circuits that consume high current. High-speed serdes are also quite sensitive to the jitter caused by noise in power circuitry. High jitter will corrupt signals and cause data errors. It is also very important for serdes to have very good signal integrity so as to achieve good high-speed data communications. This of course means that when choosing a POL, designers need to ensure their POL isn't noisy.
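A quick power-budget calculation shows why LDOs fall short here. The input and load values below are illustrative assumptions (a 5-V input feeding a 1-V rail at 10 A), not figures from the article:

```python
# Why an LDO is a poor fit for high-current rails: everything above the
# output voltage is dropped across the pass element as heat.
# Example numbers are illustrative: 5 V input, 1 V core rail, 10 A load.
v_in, v_out, i_load = 5.0, 1.0, 10.0

p_out  = v_out * i_load           # 10 W delivered to the FPGA
p_diss = (v_in - v_out) * i_load  # 40 W burned in the LDO pass element
eff    = p_out / (p_out + p_diss) # 20 percent efficiency
print(f"dissipation {p_diss:.0f} W, efficiency {eff:.0%}")
```

A switching POL converter at the same operating point typically runs at 85 to 95 percent efficiency, which is why the article steers high-current rails away from LDOs.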
SELECTING A HIGH-ACCURACY POL CONVERTER
A crucial step in building a proper power system for a Virtex-7 FPGA is ensuring the POL converters meet a given device's requirements. Fast-transient-response POL converters excel in designs where the current is high and shifts in current are large. Not only are these high-accuracy POL converters with fast transient responses helpful in managing the FPGA's shifting core voltage; they are also useful for managing the VMGTAVCC and VMGTAVTT of an MGT. Figure 2 shows a measurement of voltage response for the dynamic-load change of several POL converters. To keep things fair, we have used the same conditions for all the POL converters. The only exception is that we used the capacitor that each POL converter's manufacturer recommends in its literature.
• Input voltage: 5 V fixed
• Output voltage: 1 V fixed (Virtex-7 core voltage)
• Load current: 0 to 10 A
• Current change slew rate (current change rate per unit of time): 5 A/µs
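These conditions also hint at how hard the decoupling network must work: until the converter's control loop reacts, the output capacitors alone carry the load step. The sketch below uses an assumed 2-µs loop response time (an illustrative figure, not a spec for any of the converters tested) to estimate the minimum bulk capacitance:

```python
# Rough decoupling-capacitor sizing for the test conditions above.
# ASSUMPTION: the POL control loop takes about 2 us to respond; real
# converters differ, so treat this as an order-of-magnitude sketch.
i_step = 10.0   # load step, A (0 to 10 A)
slew   = 5e6    # 5 A/us expressed in A/s
t_loop = 2e-6   # assumed converter response time, s
dv_max = 0.030  # allowed droop on a 1 V rail, V

t_step = i_step / slew    # 2 us: time the load takes to ramp
charge = i_step * t_loop  # charge the caps must deliver, coulombs
c_min  = charge / dv_max  # C = Q / dV, about 667 uF
print(f"step time {t_step * 1e6:.1f} us, C_min ~ {c_min * 1e6:.0f} uF")
```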
Streaming data and numeric processing create changes in dynamic current in FPGAs. Our results show that some POL converters undershoot and overshoot the output voltages, going beyond 1 V ±30 mV, which is the VCCINT and VMGTAVCC voltage that Xilinx recommends for Virtex-7 FPGAs. Of course, the number might vary depending on usage, board layout and the decoupling capacitors surrounding the POLs and the FPGAs. However, it is clear that we need a very good power supply to keep up with the recommended operating-voltage range of the Virtex-7 FPGAs, where large current changes (10 A at 5 A/µs) occur rapidly. The Bellnix BSV series fits these needs and also has the fast transient response necessary for high-end FPGAs and other high-speed devices, such as discrete DSPs and CPUs.
PROGRAM FOR OUTPUT VOLTAGE AND SEQUENCING
As described above, as FPGA process technologies shrink, their supply voltage gets lower but the number of functions increases, as does their performance. This means the current changes
Figure 2 – Examples of transient response in competing point-of-load converters; results for the Bellnix POL converter are at left.
rapidly, increasing power requirements. It is getting more difficult for design engineers to manage voltage within the allowed tolerance that FPGAs require. This is true not only for the core voltage of the FPGA but also for the FPGA's high-speed serdes. There are several important points related to power supplies you need to consider to get the maximum performance from a Virtex-7 transceiver. For example, you need to adjust the POL output voltage when a voltage drop occurs due to power-line resistance. Margining testing likewise requires an output-voltage adjustment. Designers also need to monitor and trim the output voltage of the POL to evaluate systems quickly.
Here is a method to adjust the output voltage with a CPU program to satisfy Virtex-7 power requirements. You can use the circuit diagram in Figure 3 to give a BSV-series POL a programmable output voltage.
In the figure, Amp 1 detects the output voltage. You can control the output voltage with high accuracy by remote sensing with a differential amplifier, which cancels out the voltage drop along the connection between the POL converter and the load. Amp 2 compares the output voltage detected by Amp 1 with the D/A converter's output voltage. We connected Amp 2's output to the POL converter's trim pin. The output voltage of the POL converter changes with the trim pin's input voltage. Thus, the voltage of the load and that of the D/A converter output match.

You can set an output voltage by controlling the D/A converter with the CPU. When you turn on the POL converter, you must first set the D/A converter output to 0 V, and then turn the on/off control to ON and set the D/A converter output to the desired value. If the noninverted input voltage is high when the inverted input of Amp 2 is at 0 V (when the output is OFF), Amp 2 will try to raise the output voltage at the trim pin and cause an overvoltage when you turn on the POL converter. For the same reason, you must connect an RC circuit to the D/A converter output to prevent Amp 2's noninverted input pin from rising rapidly and creating an overshoot.
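The power-on ordering described above maps naturally onto firmware. The sketch below is hypothetical: the Dac and Pol classes stand in for real driver code, and the software ramp simply mirrors the slow rise that the RC circuit provides in hardware:

```python
# Hypothetical firmware sketch of the startup ordering described above:
# DAC to 0 V first, then enable the POL, then ramp the DAC to the target.
# Dac and Pol are stand-ins for real driver code, not a real API.

class Dac:
    def __init__(self):
        self.volts = None
    def write(self, volts):
        self.volts = volts

class Pol:
    def __init__(self):
        self.enabled = False
    def enable(self):
        self.enabled = True

def power_on(dac, pol, target_volts, step=0.05):
    dac.write(0.0)           # 1) DAC at 0 V so Amp 2 cannot drive trim high
    pol.enable()             # 2) turn the POL converter on via its on/off pin
    v = 0.0
    while v < target_volts:  # 3) ramp gently toward the target to avoid overshoot
        v = min(v + step, target_volts)
        dac.write(v)

dac, pol = Dac(), Pol()
power_on(dac, pol, 1.0)      # bring a 1 V core rail up
print(dac.volts, pol.enabled)
```

Power-off would run the same steps in reverse, consistent with the sequencing rules discussed earlier.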
At Bellnix, we developed an evaluation power module called the BPE-37 for Xilinx's 7 series GTX/GTH FPGAs. The BPE-37 uses the circuit described earlier and has the four POL converters required for the MGT. These four POL converters are all BSV-series POLs featuring high accuracy, fast transient response and low ripple noise. The BPE-37 uses a microcontroller to control the D/A converter and to set each output voltage to an adequate level. The microcontroller also starts and stops the POL converters using their on/off pins to control sequencing. This technique allows design teams to modify output voltage, change sequencing and monitor the output voltage and current via PMBus serial communication.

The BPE-37's circuit configuration makes it possible to add digital functionality to conventional analog-type POLs to satisfy all the power requirements for Xilinx's Virtex-7 series. The GTX/GTH transceivers in these FPGAs require POLs with the fastest transient response. Table 1 shows the BPE-37's key specifications.
For more details on Bellnix's offering, please refer to http://www.bellnix.com/
Figure 3 – How to give a BSV-series POL a programmable output voltage
IP Integrator is now in full production release. The tool is available with every seat of the Vivado Design Suite. Enhancements include:
• Automated IP upgrade flow for migrating a design to the latest IP version

• Cross-probing now enabled on IP Integrator-generated errors and warnings

• Integration with IP issued by System Generator

• Synthesis run-times up to 4x faster for IP Integrator-generated designs

• Error-correcting code (ECC) now supported in MicroBlaze™ within IP Integrator
Vivado IP Catalog
The Vivado IP flow enables bottom-up synthesis in project-based designs to reduce synthesis run-times (IP that has not changed will not be resynthesized). Significant enhancements to the "Manage IP" flow include:

• Simplified IP creation and management. An IP project is automatically created to manage the netlisting of IP.

• Easy generation of a synthesized design checkpoint (DCP) for IP, to enable IP usage in black-box flows using third-party synthesis tools.

• DCP for IP and the Verilog stub files are now co-located with the IP's XCI file.

• Scripts are delivered for compiling IP simulation sources for behavioral simulation using third-party simulators.

Vivado also offers improved handling of IP constraints. The IP DCP contains the netlist and constraints for the IP, and constraints are scoped automatically.
What's New in the Vivado 2013.2 Release?

Xilinx is continually improving its products, IP and design tools as it strives to help designers work more effectively. Here, we report on the most current updates to Xilinx design tools including the Vivado® Design Suite, a revolutionary system- and IP-centric design environment built from the ground up to accelerate the design of Xilinx® All Programmable devices. For more information about the Vivado Design Suite, please visit www.xilinx.com/vivado.
Product updates offer significant enhancements and new features to the Xilinx design tools.
Keeping your installation up to date is an easy way to ensure the best results for your design.
The Vivado Design Suite 2013.2 is available from the Xilinx Download Center at
VIVADO DESIGN SUITE 2013.2 RELEASE HIGHLIGHTS
• Full support for Zynq®-7000 All Programmable SoCs
• Vivado IP Integrator now integrated with High-Level Synthesis and System Generator in one comprehensive development environment

• Complete multiprocessor application support, including hardware/software co-debug, for detailed embedded-system analysis
• All 7 series FPGA and Zynq-7000 All Programmable SoC devices supported with PCI-SIG®-compliant PCI Express® cores
• Tandem two-stage configuration for PCI Express 100-ms requirements
The following devices are production ready:
• Zynq-7000 All Programmable SoC: 7Z010, 7Z020 and 7Z100
• Defense-Grade Zynq-7000Q All Programmable SoC: 7Z010, 7Z020 and 7Z030
• Defense-Grade Virtex®-7Q FPGA: VX690T and VX980T
• Enhanced caching to speed up subsequent model compilation.
UPDATES TO EXISTING IP

New IEEE 1588 hardware timestamping for one-step and two-step

AXI Ethernet Lite
Virtex-7 FPGA version in production

Trimode Ethernet MAC
Virtex-7 FPGA, Artix-7 FPGA and Zynq-7000 All Programmable SoC versions in production

Zynq-7000 All Programmable SoC version in production

10G Ethernet MAC, RXAUI, XAUI
Virtex-7 FPGA and Zynq-7000 All Programmable SoC versions in production

PCI32 and PCI64
Artix-7 FPGA version in production
LEARN MORE ABOUT THE VIVADO DESIGN SUITE
Vivado QuickTake Video Tutorials
Vivado Design Suite QuickTake video tutorials are how-to videos that take a look inside the features of the Vivado Design Suite. Topics include high-level synthesis, simulation and IP Integrator, among others. New topics are added regularly. See www.xilinx.com/training/vivado.

For instructor-led training on the Vivado Design Suite, please visit www.xilinx.com/training.
Debugging support has been added for the Zynq-7Z100 All Programmable SoC; 7 series XQ package and speed-grade changes; and Virtex-7 HT general ES devices (7VHT580T and 7VHT870T).
• Target Communication Framework(TCF) agent (hw_server)
• Vivado serial I/O analyzer support for IBERT 7 series GTZ 3.0

• Added support for debugging multiple FPGAs in the same JTAG chain
Tandem Configuration for Xilinx PCIe® IP
Designers can use tandem configuration to meet the fast enumeration requirements for PCI Express®. A two-stage bitstream is delivered to the FPGA. The first stage configures the Xilinx PCIe IP and all design elements needed to allow this IP to function independently as quickly as possible. Stage two completes the device configuration while the PCIe link remains functional. Two variations are available: Tandem PROM loads both stages from the same flash device, and Tandem PCIe loads stage two over the PCIe link. All PCIe configurations are supported up to X8Gen3.
• Tandem configuration is now released with production status, for both Tandem PROM and Tandem PCIe. Devices with this status include the XC7K325T-FFG900, XC7VX485T-FFG1761 (PCIE X1Y0 location required) and XC7VX690T-FFG1761 (PCIE X0Y1 location required).

• Tandem configuration is available as beta for additional devices, for both Tandem PROM and Tandem PCIe. Hardware testing for these devices has been limited. They include the XC7K160T-FFG676, XC7K410T-FFG676, XC7VX415T-FFG1158 (PCIE X0Y0 location recommended) and XC7VX550T-FFG1158 (PCIE X0Y1 location recommended).

For more information, see PG054 (version 2.1) for Gen2 PCIe IP, or PG023 (version 2.1) for Gen3 PCIe IP.
VIVADO DESIGN SUITE: SYSTEM EDITION UPDATES
Vivado High-Level Synthesis
• Video libraries with OpenCV interfaces are enhanced with support for 12 additional functions.

• Adders can now be targeted for implementation in a DSP48 directly from Vivado HLS.

• The Xilinx FFT core can now be instantiated directly into the C/C++ source code. This feature is beta. Contact your local Xilinx FAE for details on how to use this feature.
Vivado IP Integrator support in System Generator for DSP
• Model packaged as IP, with support for gateway inputs and outputs to be packaged as interfaces or raw ports.

• Capability to specify an AXI4-Stream interface in gateway blocks, and to infer AXI4-Lite and AXI4-Stream interfaces from named gateway blocks.

• System Generator generates an example RTL Vivado project and an example IP Integrator block diagram to enable easy evaluation of packaged IP.

• New System Generator IP Integrator integration tutorial available with the System Generator tutorials.
Introducing support for MATLAB® and Simulink® release R2013a
Blockset enhancements include:
• Support for FIR Compiler v7.1, with the introduction of area, speed and custom optimization options.

• The model upgrade flow now includes port and parameter checks in the HTML report.
If you've looked at clouds from both sides now, take another gander. Here's your opportunity to Xercise your funny bone by submitting an engineering- or technology-related caption for this cartoon showing an engineer with his head—and all the rest of him, actually—in the clouds. The image might inspire a caption like "Dave made major strides in cloud computing after studying transcendental meditation."
Send your entries to [email protected]. Include your name, job title, company affiliation and location, and indicate that you have read the contest rules at www.xilinx.com/xcellcontest. After due deliberation, we will print the submissions we like the best in the next issue of Xcell Journal. The winner will receive an Avnet ZedBoard, featuring the Xilinx® Zynq®-7000 All Programmable SoC (http://www.zedboard.org/). Two runners-up will gain notoriety, fame and a cool, Xilinx-branded gift from our swag closet.

The contest begins at 12:01 a.m. Pacific Time on July 15, 2013. All entries must be received by the sponsor by 5 p.m. PT on Oct. 1, 2013.
So, look for that silver lining and get writing!
ROBIN FINDLEY, embedded-systems consultant/contractor with Findley Consulting, LLC in Clarkesville, Ga., won a shiny new Avnet ZedBoard with this caption for the asteroid cartoon in Issue 83 of Xcell Journal:

"Paul watches as Engineering lobs another design over the wall."

Congratulations as well to our two runners-up:

"Remember when you asked, 'What could possibly go wrong?'"
– Mark Rives, systems engineer, Micron Technology, Berthoud, Colo.

"Larry, you know how I said our timing looked great???"
– Dave Nadler, owner, Nadler & Associates, Acton, Mass.
NO PURCHASE NECESSARY. You must be 18 or older and a resident of the fifty United States, the District of Columbia, or Canada (excluding Quebec) to enter. Entries must be entirely original. Contest begins on July 15, 2013. Entries must be received by 5:00 pm Pacific Time (PT) on Oct. 1, 2013. Official rules are available online at www.xilinx.com/xcellcontest. Sponsored by Xilinx, Inc. 2100 Logic Drive, San Jose, CA 95124.