Xcell Journal issue 77

www.xilinx.com/xcell/

Solutions for a Programmable World

Xcell Journal, Issue 77, Fourth Quarter 2011

Robotic-Assisted Surgical System Uses Xilinx FPGAs

Zynq-7000 EPP Makes Flexible Software-Defined Radio Platform

ISE Design Suite 13.3 Now Available for Download


Xilinx Ships World's Highest-Capacity FPGA With SSI Technology

How to Mount a Full-Frontal Assault on Power, page 52

Avnet, Inc. 2011. All rights reserved. AVNET is a registered trademark of Avnet, Inc.

Compact, Easy-to-Use Kit Demonstrates the Versatility of Spartan-6 FPGAs

The low-cost Spartan-6 FPGA LX9 MicroBoard is the perfect solution for designers interested in exploring the MicroBlaze soft processor or Spartan-6 FPGAs in general. The kit comes with several pre-built MicroBlaze systems, allowing users to start software development just as they would with any standard off-the-shelf microprocessor. The included Software Development Kit (SDK) provides a familiar Eclipse-based environment for writing and debugging code. Experienced FPGA users will find the MicroBoard a valuable tool for general-purpose prototyping and testing. The included peripherals and expansion interfaces make the kit ideal for a wide variety of applications.

Xilinx Spartan-6 FPGA LX9 MicroBoard Features

• Avnet Spartan-6 FPGA LX9 MicroBoard
• ISE WebPACK software with device-locked SDK and ChipScope licenses
• Micro-USB and USB extension cables
• Price: $89

To purchase this kit, visit www.em.avnet.com/s6microboard or call 800.332.8638.

Add it up: 10 GbE, 40 GbE and 100 GbE ports offer more capacity and versatility than any high-speed serial I/O board ever. The Xilinx Virtex-6 HXT FPGA provides user programmability and data rates to 11 GB/s. ASIC designers, network engineers and IP developers can use this tool to prototype, emulate, model and test their most challenging projects. The board may be used stand-alone, PCIe hosted, or plugged into any Dini Group ASIC prototyping board. Users will appreciate the scope of interfaces:

CFP Module: 100 GbE or Single/Dual 40 GbE

2-QSFP+ Modules: 40 GbE

4-SFP+ Modules: 40 GbE

8-SFP+ Modules: 8 GbE

All FPGA resources are user available and supported by a 240-pin DDR3 UDIMM bulk memory. Daughter cards are accommodated for expansion, customization, and FMC (Vita-57). Put your high-speed serial designs to work. Here is a Dini board just for you.

www.dinigroup.com 7469 Draper Avenue La Jolla, CA 92037 (858) 454-3419 e-mail: [email protected]

LETTER FROM THE PUBLISHER

Xilinx, Inc.
2100 Logic Drive
San Jose, CA 95124-3400
Phone: 408-559-7778
FAX: 408-879-4780
www.xilinx.com/xcell/

2011 Xilinx, Inc. All rights reserved. XILINX, the Xilinx Logo, and other designated brands included herein are trademarks of Xilinx, Inc. All other trademarks are the property of their respective owners.

The articles, information, and other materials included in this issue are provided solely for the convenience of our readers. Xilinx makes no warranties, express, implied, statutory, or otherwise, and accepts no liability with respect to any such articles, information, or other materials or their use, and any use thereof is solely at the risk of the user. Any person or entity using such information in any way releases and waives any claim it might have against Xilinx for any loss, damage, or expense caused thereby.

PUBLISHER Mike [email protected]

EDITOR Jacqueline Damian

ART DIRECTOR Scott Blair

DESIGN/PRODUCTION Teie, Gelwicks & Associates, 1-800-493-5551

ADVERTISING SALES Dan [email protected]

INTERNATIONAL Melissa Zhang, Asia [email protected]

Christelle Moraga, Europe/Middle East/[email protected]

Miyuki Takegoshi, [email protected]

REPRINT ORDERS 1-800-493-5551


2 Million Logic Cell FPGA Available Today!

When I first started covering IC design methodology and the semiconductor business back in 1995, the biggest FPGA available, Xilinx's XC5215, boasted a then-mammoth 1,936 logic cells, roughly the equivalent of 10k ASIC gates. Though the XC5215 was the monster FPGA for its day and a very cool device, its use in an end product was circumscribed. Back then, an FPGA would likely either serve as glue logic to help two disparate devices communicate or hold functionality that the system required in the 11th hour.

If a company wanted to build custom functionality into its end products, it built or commissioned an ASIC. Up to that point, ASICs were relatively easy to design and cost-effective to build. But that changed very quickly. Within a year, the ASIC industry had to confront how it was going to fill the insanely generous 100,000 gates now afforded to it with the introduction of cutting-edge 0.5-micron processes.

Alas, after much debate, the logic-chip industry collectively opted to fill all those gates by means of IP blocks and design reuse. Over the next 15 years, as silicon process geometries shrank, the challenges mounted and the industry had to deal with a succession of other issues such as timing closure, design-for-manufacturability and gate leakage. The ingenuity of the folks in this industry never ceases to amaze me, and, over time, good old engineering came up with ways to adequately address all these issues.

But while engineering has made it possible for ASIC designs to still be built, the economics of IC manufacturing haven't been so kind. Indeed, multimillion-dollar mask costs added to the tool costs and complexity of ASICs have made ASIC design feasible for fewer and fewer products. Today a company has to sell millions of ASICs simply to recoup the costs of building them in the latest process technologies.

Thankfully, good old engineering isn't concentrated only on the ASIC business. The folks at Xilinx have been busy making FPGAs not just a viable alternative to ASICs, but a better one. This issue's cover story details a huge engineering accomplishment: Xilinx is now shipping to customers its 2 million logic cell, or 20 million ASIC-gate-equivalent, Virtex-7 2000T FPGA. This new device is not only the highest-capacity FPGA in the industry, it is a much better alternative, in terms of both engineering and economics, to ASICs for a growing number of applications. This chip is also the first commercially released device that uses Xilinx's stacked-silicon interconnect technology, which I described in detail in the cover story in Xcell Journal Issue 74.

As a refresher, SSI technology connects several silicon slices (essentially, dice) side-by-side on top of a silicon interposer. The dice are then interconnected by traces running through the interposer, much in the way that disparate ICs communicate by means of traces in a printed-circuit board. But with guidance from Xilinx tools, it doesn't look to the designer like multiple chips, just one humongous one. In pioneering this device, Xilinx's engineering staff has taken the semiconductor industry a monumental first step into the era of 3D IC technology.

I'm certain that those folks at Xilinx who have been working on SSI for the last five years will tell you that while it has been a monumental challenge, it has also been extremely rewarding as this device makes its way into customer end products. Their innovation in turn allows you, the designer, to innovate new products, penetrate new markets and achieve higher levels of prosperity.

Mike Santarini
Publisher

CONTENTS

FOURTH QUARTER 2011, ISSUE 77

VIEWPOINTS

Letter From the Publisher: 2 Million Logic Cell FPGA Available Today  4

Cover Story: Xilinx Ships World's Highest-Capacity FPGA Using Stacked-Silicon Interconnect Technology  8

XCELLENCE BY DESIGN APPLICATION FEATURES

Xcellence in Aerospace & Defense: Zynq-7000 EPP Makes Flexible Software-Defined Radio Platform  14

Xcellence in Networking: FPGAs Synchronize Next-Generation Networks  22

Xcellence in New Applications: Accelerating Texture Mapping with Spartan-6 FPGAs  30

THE XILINX XPERIENCE FEATURES

Profiles in Xcellence: Xilinx FPGAs Guide Robotic-Assisted Surgical System  38

Xplanation: FPGA 101: How to Build a Better DC/DC Regulator Using FPGAs  42

Xplanation: FPGA 101: Using the Clock Period Constraint to Your Advantage  46

Ask FAE-X: Optimizing FPGAs for Power: A Full-Frontal Attack  50

Xperts Corner: Dual-MicroBlaze Xilkernel System Eyes Automotive Apps  58

XTRA READING

Xtra, Xtra: The latest Xilinx tool updates and patches, as of October 2011  64

Xclamations! Share your wit and wisdom by supplying a caption for our techy cartoon, for a chance to win an Avnet Spartan-6 LX9 MicroBoard  66

Excellence in Magazine & Journal Writing: 2010, 2011

Excellence in Magazine & Journal Design and Layout: 2010, 2011

8 Xcell Journal Fourth Quarter 2011

COVER STORY

Xilinx Ships World's Highest-Capacity FPGA Using SSI Technology

by Mike Santarini
Publisher, Xcell Journal
Xilinx, [email protected]

It's here. Xilinx is now shipping to customers the highest-capacity FPGA in the world: the Virtex-7 2000T. The 6.8 billion-transistor device has 1,954,560 logic cells, double the capacity of the largest competing 28-nanometer FPGAs. This is the third FPGA released by Xilinx using TSMC's 28-nm HPL process, but more important, it is the first commercial offering that makes use of a stacked-silicon interconnect (SSI), Xilinx's approach to 3D IC technology (see cover story, Xcell Journal Issue 74).

"The Virtex-7 2000T FPGA marks a major milestone in Xilinx's history of innovation and industry collaboration," said Victor Peng, senior vice president of programmable platforms development at Xilinx. "Stacked-silicon interconnect technology offers transistor capacities that otherwise wouldn't be possible in an FPGA for at least another process generation, and it puts the largest of our 28-nm devices into customer hands at least a year earlier than is typical in the rollout of a new generation. This is especially important for those doing ASIC and ASSP emulation and prototyping."

Equipped with 6.8 billion transistors and 1,954,560 logic cells, and built in a near-3D IC construction, the Virtex-7 2000T is available to customers today.


Traditionally, FPGA vendors have implemented their new architectures on the latest manufacturing process technologies to take advantage of Moore's Law, which holds that transistor counts double every 22 months with the introduction of each new silicon process. Following Moore's Law over the last two decades has allowed FPGA vendors to consistently offer new FPGAs that double the capacity of their previous-generation devices.

However, for the Virtex-7 2000T and a few other members of the Virtex-7 family, Xilinx moved beyond Moore's Law to create SSI. This technology connects several silicon slices (active dice) side-by-side on top of a passive silicon interposer. The dice are then interconnected by metal interconnect running through the interposer, much in the same way that disparate ICs communicate by means of metal interconnect in a printed-circuit board (see Figure 1). In this way, Xilinx was able to make devices that exceed the pace of Moore's Law: Virtex-7 2000T FPGAs are twice the size of the nearest competitor's largest 28-nm device and 2.5x larger than Xilinx's largest Virtex-6 FPGA. "But the beauty of the architecture," said Panch Chandrasekaran, product line manager for Virtex-7 FPGAs at Xilinx, "is that despite being composed of four dice, the 2000T preserves the traditional FPGA use model."


Figure 1: Xilinx's Virtex-7 2000T takes a giant step into the era of 3D ICs. The graphic shows, from top to bottom, the package cover, slices, silicon interposer and substrate.

Figure 2: Capacity is so ample that the Virtex-7 2000T can accommodate complex designs of up to 20 million ASIC gates while saving drastically on NRE costs. Here is an example of how one such design lays out in a Virtex-7 2000T. (Blocks shown: Core, CPU1, CPU2, I/O, Memory, User.)

"Designers will program it as one extremely large FPGA using the Xilinx tool flow and methodology."

In addition to having 1,954,560 logic cells, the Virtex-7 2000T includes configurable logic blocks totaling 305,400 CLB slices and max distributed RAM of 21,550 kbits. It has 2,160 DSP slices, 46,512 kbits of block RAM, 24 clock management tiles, four PCIe blocks and 36 GTX transceivers (each capable of 12.5 Gbits/second). It also has 24 I/O banks and a total of 1,200 user I/Os.
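As a quick sanity check on these figures, the short Python sketch below derives two numbers the article implies but does not state: the aggregate one-direction line rate of all 36 GTX transceivers, and the ratio of ASIC gates to logic cells behind the "20 million ASIC gates" claim. The function names are ours, chosen for illustration, and the aggregate rate is a raw peak that ignores encoding and protocol overhead.

```python
# Figures quoted in the article for the Virtex-7 2000T.
LOGIC_CELLS = 1_954_560
GTX_TRANSCEIVERS = 36
GTX_RATE_GBPS = 12.5                   # per transceiver, in Gbits/second
ASIC_GATE_EQUIVALENT = 20_000_000      # "20 million ASIC gates"


def aggregate_serial_rate_gbps(lanes, rate_gbps):
    """Raw peak line rate summed over all lanes (one direction, no overhead)."""
    return lanes * rate_gbps


def gates_per_logic_cell(asic_gates, logic_cells):
    """Implied ASIC-gate equivalence of one logic cell."""
    return asic_gates / logic_cells


if __name__ == "__main__":
    total = aggregate_serial_rate_gbps(GTX_TRANSCEIVERS, GTX_RATE_GBPS)
    ratio = gates_per_logic_cell(ASIC_GATE_EQUIVALENT, LOGIC_CELLS)
    print(f"Aggregate GTX line rate: {total} Gbits/s")      # 450.0
    print(f"ASIC gates per logic cell: {ratio:.1f}")        # 10.2
```

The roughly 10-to-1 ratio is consistent with the rule of thumb used elsewhere in this issue, where 1,936 logic cells are equated to about 10k ASIC gates.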

While the release of the Virtex-7 2000T marks a big achievement for Xilinx and a bold step into the era of the 3D IC for the semiconductor industry, Chandrasekaran said the real value is the doors it opens for user innovation and the new design capabilities it brings to customers looking for the highest-capacity devices. "Those who want to speed product development and supply silicon emulations to their software developers, those who want to condense multiple chips into a single device and those who can't justify doing ASICs for their designs: they will all greatly benefit from this fantastic technology," he said. By implementing the device in SSI technology, Xilinx is putting next-generation capacity in designers' hands today.

ASIC AND IP EMULATION AND PROTOTYPING

"Today, an average high-end ASIC or ASSP design has 420 million gates," said Gary Smith, design-tool analyst with Gary Smith EDA and an ASIC methodology expert. "The largest I've heard of was 1.1 billion," he said. Because of these high gate counts, more than 90 percent of ASIC design teams use some form of hardware-assisted verification system, whether a commercial emulation system or a do-it-yourself ASIC-prototyping board.

Traditionally, companies creating commercial emulation systems or design groups prototyping their designs have been first-in-line customers for the biggest FPGAs a vendor can produce. Suppliers of commercial emulation systems are looking for FPGAs with the highest possible capacity. "This market will especially benefit from having the Virtex-7 2000T's more-than-Moore's Law capacity at their disposal," said Chandrasekaran. "It allows them to offer next-generation-capacity emulation systems today, and ultimately allows their customers to cut design cycles and bring new, more innovative products to market faster."

Most of these commercial emulation systems include two or more boards and up to several racks of FPGAs, depending on the size of the ASIC, IP or even system a customer wants to emulate. Meanwhile, customers of emulation systems use them to speed up verification and ensure their designs function properly, and to get the hardware versions of their designs to software groups so they can get a leg up on software development and have it mostly completed by the time the foundry delivers the real, silicon ASIC. The idea, of course, is to speed time-to-market.

In a typical use model of a commercial emulation system, users first design and functionally verify their ASIC or IP using traditional EDA verification software. After they've done that sufficiently, they then implement the register transfer-level (RTL) version of their design in a commercial emulator for further verification of the design. Each emulator vendor typically offers its own software that works in conjunction with Xilinx's design software to synthesize the RTL and partition the ASIC design into blocks that can be optimally distributed among the FPGAs in the emulator. The emulation vendor's software then provides an interface to a workstation or PC running various EDA verification tools to test the design as it runs on the emulator.

Emulation vendors also offer lower-cost variants (sometimes called "replicates" or, generically, "prototyping systems") of their emulators. These variants simply emulate the ASIC's functionality. Companies give these systems to their software groups to get an early jump on developing drivers, firmware and applications that will run on the design.

Chandrasekaran said that larger FPGAs allow emulation vendors to either field higher-capacity emulation systems, or to build mid- and lower-capacity systems with fewer FPGAs, cutting power and the bill of materials while raising overall clock speeds of the designs running on them. "The Virtex-7 2000T is so large that these vendors will even be able to offer emulators built on a single FPGA chip," said Chandrasekaran. "Because the designs are running on fewer chips or even just one chip, the overall performance of the system will be faster."
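To put the "fewer FPGAs" argument in numbers, here is a rough Python sketch that estimates how many devices an emulator needs for a given ASIC gate count. It is a back-of-the-envelope model only: the function `fpgas_needed` and its `utilization` parameter are our own illustrative assumptions, not part of any vendor's partitioning software, and the 8-million-gate figure for the previous generation is simply the 20-million-gate capacity divided by the 2.5x ratio quoted earlier.

```python
import math

# Capacity figures taken or derived from the article.
V7_2000T_GATES = 20_000_000                 # ASIC-gate equivalent of the Virtex-7 2000T
PREV_GEN_GATES = V7_2000T_GATES / 2.5       # derived: the 2000T is "2.5x larger"
AVERAGE_ASIC_GATES = 420_000_000            # Gary Smith's average high-end design


def fpgas_needed(design_gates, gates_per_fpga, utilization=1.0):
    """Minimum device count, assuming only `utilization` of each FPGA is usable.

    `utilization` is a hypothetical knob: real partitioned designs rarely fill
    a device completely, so values below 1.0 give a more conservative count.
    """
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    usable_gates = gates_per_fpga * utilization
    return math.ceil(design_gates / usable_gates)


if __name__ == "__main__":
    print(fpgas_needed(AVERAGE_ASIC_GATES, V7_2000T_GATES))        # 21
    print(fpgas_needed(AVERAGE_ASIC_GATES, PREV_GEN_GATES))        # 53
    print(fpgas_needed(AVERAGE_ASIC_GATES, V7_2000T_GATES, 0.5))   # 42
```

Even at full utilization, the average high-end design drops from 53 previous-generation devices to 21, which is where the savings in power, bill of materials and inter-chip delay come from.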

The Virtex-7 2000T will also be ideal for design groups that perhaps can't afford off-the-shelf commercial emulation systems, which can cost upwards of a million dollars. "Many design groups build their own custom boards to prototype and/or emulate an ASIC, or an entire system's functionality, and get a jump on software development," said Chandrasekaran. "And even those who use emulation systems to develop their IC might create their own FPGA variants for their software groups."

Chandrasekaran said the device will also be attractive to IP vendors. Not only can they use the FPGAs to develop new blocks of IP, they can also use them to demonstrate the functionality of their cores to potential customers.

SYSTEM ARCHITECTURE CONSOLIDATION AND POWER SAVINGS

In addition to being attractive for ASIC and IP emulation and prototyping, the new Virtex-7 2000T will also be attractive to system architects looking for ways to lower the power consumption of their systems, while increasing performance and capabilities.

"There are many end products on the market that use multiple FPGAs," said Chandrasekaran. "With the Virtex-7 2000T, they can integrate the functionality of several FPGAs into a single FPGA." This system integration improves performance, because all these functions are on a single chip. System integration lowers power by eliminating the I/O interfaces between different ICs on the board. I/O interfaces increase power consumption proportionately to the number of I/Os. Thus, the higher the performance of the design and the greater the number of ICs in a system, the greater the power consumption.

Moreover, partitioning system functionality among multiple ICs is a complex job and can lead to longer development cycles and higher test costs. Consolidating the number of devices in the system reduces these partitioning challenges as well as the costs associated with verification and test. "At more than twice the capacity of competing FPGAs, the Virtex-7 2000T lets customers integrate more to reduce power fourfold compared with multiple-chip solutions. They can also increase system performance by eliminating I/O bottlenecks, and reduce system complexity by eliminating unnecessary design partitioning," said Chandrasekaran. "Architects can save a lot of board space to add other functions, or can simply shrink the size of the product."

As with other devices in the 7 series, Xilinx implemented the Virtex-7 2000T in TSMC's FPGA-specific 28-nm High Performance Low Power (HPL) process technology (detailed in the cover story of Xcell Journal Issue 76). "Thanks to the HPL process, the Virtex-7 2000T's transistors experience less leakage than those of competing devices implemented in 28-nm high-performance processes," Chandrasekaran said. "This means the device has power consumption comparable to that of competing devices that are half its capacity."

ASIC DISPLACEMENT

Last but not least, the Virtex-7 2000T will also be attractive to the growing number of design groups that simply can't justify the cost and risk involved in developing an ASIC or ASSP at the 28-nm process node. As silicon processes have advanced, the costs of designing and manufacturing ASICs have rapidly risen. At 28 nm, ASIC or ASSP nonrecurring-engineering (NRE) expenses exceed $50 million, and the probability of an ASIC respin increases to nearly 50 percent. A single overlooked mistake during the design cycle can severely damage profitability; multiple mistakes can lead to project cancellations and missed market windows, and can even destroy companies.

The Virtex-7 2000T can replace 10 million- to 20 million-gate ASICs without the NRE costs associated with ASICs. "Instead of worrying about every little mistake leading to a catastrophic mask spin, designers can instead focus on designing," said Chandrasekaran. "What's more, the Virtex-7 2000T is reprogrammable, so if they do make a mistake they can simply reprogram the device."
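The economics can be framed as a simple break-even calculation, sketched below in Python. Only the $50 million NRE figure comes from the article; the per-unit prices in the example are purely hypothetical placeholders chosen to make the arithmetic easy to follow, and the model ignores respin risk, tool costs and time-to-market.

```python
import math

ASIC_NRE_USD = 50_000_000   # 28-nm NRE figure quoted in the article


def break_even_units(nre_usd, fpga_unit_usd, asic_unit_usd):
    """Volume at which an ASIC's lower unit cost has paid back its NRE.

    Below this volume the FPGA is the cheaper option overall.
    """
    saving_per_unit = fpga_unit_usd - asic_unit_usd
    if saving_per_unit <= 0:
        raise ValueError("ASIC must be cheaper per unit for a break-even to exist")
    return math.ceil(nre_usd / saving_per_unit)


if __name__ == "__main__":
    # Hypothetical unit prices, for illustration only.
    units = break_even_units(ASIC_NRE_USD, fpga_unit_usd=150.0, asic_unit_usd=100.0)
    print(f"Break-even volume: {units:,} units")   # 1,000,000
```

With a hypothetical $50 per-unit saving, the ASIC only pays for itself after a million units, which matches the publisher's observation that a company must sell millions of ASICs just to recoup its costs.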

METHODOLOGY STAYS THE SAME

While the Virtex-7 2000T is extremely large, programming the device won't require an extreme methodology change.

"Over the last few years Xilinx has been optimizing its design tools with ultralarge-capacity designs in mind," said Chandrasekaran. "Today, customers can efficiently partition, floorplan and optimize for power and performance." Chandrasekaran explains that most if not all large FPGAs typically require designers to perform some amount of partitioning, and to place timing-critical functions in close proximity to each other. For design groups placing large designs in the Virtex-7 2000T, Xilinx tools will assist them in floorplanning and partitioning their designs to help them achieve optimal timing and performance.

The latest release of Xilinx design tools supports the Virtex-7 2000T. "Users can get started today designing with Virtex-7 2000T," said Chandrasekaran. In the coming year, Xilinx is scheduled to release other Virtex-7 FPGAs in monolithic as well as SSI configurations.

For more information and to see the Virtex-7 2000T in action, visit www.xilinx.com/7.


Versatile FPGA Platform

www.techway.eu

The Versatile FPGA Platform provides a cost-effective way of undertaking intensive calculations and high-speed communications in an industrial environment.

PCI Express 4x Short Card
Xilinx Virtex Families
I/O enabled through an FMC site (VITA 57)
Development kit and drivers optimized for Windows and Linux

Optical-Mez


XCELLENCE IN AEROSPACE & DEFENSE



Today's tactical and commercial software-defined radios must have the flexibility and processing power to support a growing number of wideband and broadband waveforms, including an extensive library of legacy waveforms. The secure Xilinx Zynq-7000 Extensible Processing Platform (EPP) provides an ideal solution for these applications, not only because it features a high-performance processing system that leverages ARM technology, but also because it provides a large programmable logic unit that supports partial reconfiguration and the Xilinx Isolation Design Flow, all within a single device.

Let's take a closer look at a software-defined radio (SDR) project built on the Zynq-7000 EPP, focusing special attention on how to utilize the partial reconfiguration and Isolation Design Flow capabilities of the programmable logic unit to support various waveforms, reduce part count and save power. The Zynq-7000 EPP can cut the parts count for our example design from five devices to one, while providing plenty of flexibility to support future waveforms using the same hardware platform. Add in its power-saving features and it's clear that putting this device at the center of an SDR can significantly improve the size, weight, power and cost (SWAP-C) of the system and provide a flexible, programmable platform.

Figure 1 shows an example block diagram of a tactical SDR. As you can see, the plain-text portion of the radio (sometimes referred to as the "red" side, since the information could be classified) contains a general-purpose processor (GPP) and "red" FPGA. The plain-text information is then encrypted and transformed into cipher text. The "black" side of the radio, which contains a "black" FPGA, another GPP and a modem FPGA for waveform processing, then processes the cipher-text information. To ensure the security of the information, the design isolates and separates the plain-text and cipher-text portions of the radio to prevent an information leak of classified or sensitive data out of the system in plain text.

Therefore, a typical SDR implementation may use three different FPGAs as well as two separate GPPs, making the device count for these functions as high as five. Designers must size the modem FPGA appropriately to process all of the various waveforms the radio supports. The modem is often required to have all the functions available simultaneously even if only one is needed at a time. For example, processing the Soldier Radio Waveform needs roughly 120K logic cells, 8 Mbits of RAM and 800 DSP slices, while the Mobile User Objective System waveform requires more than 200K logic cells, 10 Mbits of RAM and 900 DSP slices, requiring the modem FPGA to be quite large. The red FPGA in the crypto block must also be large enough to contain all of the various cryptographic algorithms for the associated waveforms. The required number of devices, the amount of I/O signaling between them (which increases power dissipation) and the high logic density of the devices (which increases static current) make this a nonoptimal solution in terms of SWAP-C.

This Xilinx FPGA equipped with ARM processors integrates all the logic needed to handle multiple waveforms.

by Anita Schreiber
Senior Staff Applications Engineer
Defense Applications and Systems Architecture
Xilinx, Inc.
Albuquerque, N.M.
[email protected]

With the Xilinx Zynq-7000 Extensible Processing Platform, the modem FPGA, black FPGA, red FPGA, red GPP and black GPP can all be combined into one device, as Figure 2 shows. The Zynq-7000 EPP enables the combination and integration of these devices and provides a flexible, secure SDR solution that reduces SWAP-C.

The Xilinx Zynq-7000 EPP combines an industry-standard ARM processor-based system with Xilinx 28-nanometer low-power, high-performance programmable logic in a single device. The processing system (PS) provides dual ARM Cortex-A9 processors, Level 2 cache and on-chip memory, as well as a rich peripheral set sufficient for general-purpose and waveform processing. The programmable logic (PL) provides ample logic cells, configurable memory (dual-port RAM, FIFOs, shift registers, BRAM) and hardware multipliers for DSP, which can be utilized for high-speed parallel processing of needed functions.

PROCESSING SYSTEM

The dual ARM A9 cores each have 32 kbytes of instruction and 32 kbytes of data L1 cache, and share 512 kbytes of L2 cache. Additionally, 256 kbytes of on-chip memory are available to provide low-latency memory to both processors. A snoop control unit maintains cache coherency. Logic within the PL portion of the device and the ARM A9 processors can share memory, allowing fine-grained interaction between the processors and user logic. An Accelerator Coherency Port (ACP) allows coherent sharing of information between the processor and the PL by giving logic within the programmable logic access to both the L2 cache and the on-chip memory. Direct access to the on-chip memory from the PL is also available. In addition, many industry-standard peripherals and memory interfaces are available to support general and waveform processing, as Figure 3 shows.

Figure 1: Block diagram of an example software-defined radio showing the "red" (for classified information) and "black" FPGAs

Figure 2: The Zynq-7000 EPP integrates multiple FPGAs into one device for a simple tactical SDR.

SECURE PROCESSING

With the secure processing features of the Zynq-7000 EPP and the current push on the part of the government to build secure products by composing solutions from commercial parts, the Zynq-7000 EPP is a good bet to replace both the red-side and the black-side GPPs in some applications. The Zynq-7000 EPP provides both a secure boot process and a secure run-time environment.

Unlike many ASSPs, the Zynq-7000 provides a Master Secure Boot Mode for configuration of secure, encrypted designs using the internal hardware AES decryption engine and SHA-256-based authentication engine (HMAC) components within the programmable logic. The secure boot sequence is shown in Figure 3.

The master ARM A9 processor will boot first from the on-chip ROM. It then reads the processing system boot image from the external boot device specified by the bootstrap pins. At the same time, the A9 configures the device configuration block to push the processing system master image through the AES decrypt and HMAC authentication engines. The configuration logic loops back the decrypted, authenticated image immediately, without internal buffering, for storage in on-chip memory within the PS. The master A9 polls the final authentication status from the configuration logic to ensure the image was properly authenticated. If it fails, the master A9 will trigger a system secure reset. Once the PS image has successfully loaded, control is turned over to the first-stage boot loader. Based on the user application, the boot loader could then either start processing, configure the programmable logic, load additional software or wait for further instruction from an external source.
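The authenticate-then-hand-over decision at the heart of this sequence can be modeled in a few lines of Python. This is a software illustration under stated assumptions, not the device's boot ROM: the function names and return strings are invented for the example, the AES decryption step is omitted because Python's standard library has no AES primitive, and the real engines are hardware blocks that stream the image rather than operating on a buffer.

```python
import hashlib
import hmac


def sign_image(key: bytes, image: bytes) -> bytes:
    """Produce an HMAC-SHA-256 tag for a boot image (done offline at build time)."""
    return hmac.new(key, image, hashlib.sha256).digest()


def secure_boot(key: bytes, image: bytes, tag: bytes) -> str:
    """Model the boot decision: authenticate the image, then hand over or reset.

    Returns a hypothetical state name: "first_stage_boot_loader" when the tag
    matches, "secure_reset" when authentication fails.
    """
    expected = hmac.new(key, image, hashlib.sha256).digest()
    # compare_digest runs in constant time, avoiding timing side channels.
    if not hmac.compare_digest(expected, tag):
        return "secure_reset"
    return "first_stage_boot_loader"


if __name__ == "__main__":
    key = b"example-device-key"              # placeholder key for illustration
    image = b"processing-system boot image"
    tag = sign_image(key, image)

    print(secure_boot(key, image, tag))              # first_stage_boot_loader
    print(secure_boot(key, image + b"!", tag))       # secure_reset
```

The important property is that no code from the image runs unless the tag check passes; any tampering with the image or its tag leads to the reset path.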



Figure 3: Zynq-7000 EPP block diagram details the many industry-standard peripherals and memory interfaces that are available.

SECURE RUNTIME ENVIRONMENT

The Zynq-7000 EPP provides a secure runtime environment by incorporating ARM TrustZone technology throughout the device. TrustZone provides content protection by enabling software tasks to run in memory areas that are segregated, keeping the content safe from unauthorized reading or writing. Eliminating access by other tasks creates and maintains a trusted processing environment. TrustZone is built into the ARM A9 processors and each element within the PS.

In addition, the AMBA TrustZone signals are extended into the programmable logic to allow the development of trusted master and slave devices within the PL. Since complete, secure programmable logic configuration support is provided, these user-developed master/slave devices can be completely trusted, as is any other hardened block within the PS. When used with TrustZone software, the Zynq-7000 EPP can enable a secure system capable of handling keys, private data and encrypted information.

PROGRAMMABLE LOGIC

The programmable logic within the Zynq-7000 EPP is built using Xilinx's 28-nm high-performance, low-power process technology. Four devices are available within the Zynq-7000 family, with logic cells ranging from 28,000 to 235,000. Like the other devices in the Xilinx 7 series family, the Zynq-7000 EPP's programmable logic provides configurable block RAM, programmable DSP functions, hardened Gen2 PCIe (larger devices) and Agile Mixed Signal (AMS) technology. AMS technology provides dual 12-bit 1-Msample/second ADCs, dual independent track-and-hold amplifiers, on-chip voltage reference, thermal and supply sensors, and external analog input channels.

As seen in Figure 2, the programmable logic of the Zynq-7000 EPP can combine the digital RF and the logic of the modem, black and red FPGAs. This is possible not because of a large programmable logic fabric, but through the use of partial reconfiguration of the PL and the Xilinx Isolation Design Flow.

PARTIAL RECONFIGURATION

Partial reconfiguration of the Zynq-7000 EPP PL takes the flexibility provided by normal FPGA technology a step further by allowing the modification of an operating design by reconfiguring portions of the programmable logic to perform a different function. After you have fully configured the PL with a complete configuration file, you can download partial configuration files to modify reconfigurable regions.

The programmable logic consists of static logic and reconfigurable logic. The static logic remains functioning and is completely unaffected by the loading of a partial configuration file. The reconfigurable logic is replaced by the contents of the partial configuration file. Partial reconfiguration files download without compromising the integrity of the applications running on those parts of the device not being reconfigured. Partial reconfiguration allows time-multiplexing of different hardware functions dynamically on a single Zynq-7000 EPP device. You can identify, isolate and implement time-independent functions as reconfigurable modules and swap them in and out of a single device as needed. This reduces the need to have a Zynq-7000 EPP with programmable logic large enough for all of the required functions that your design may need at different times. The programmable logic need only be large enough for the largest function needed at any one time. Software-defined radio, by dint of its mutually exclusive functionality, can directly benefit from this technology and see a dramatic improvement in flexibility and resource usage, and a reduction in static power.
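A back-of-the-envelope comparison makes the sizing argument concrete. The Python sketch below uses invented logic-cell counts (the module names and all numbers are assumptions for illustration only, not Xilinx data) to compare a fabric that must hold every function at once with one that holds the static logic plus only the largest reconfigurable module.

```python
# Hypothetical logic-cell counts for three mutually exclusive waveform modules.
WAVEFORM_CELLS = {"waveform_a": 40_000, "waveform_b": 55_000, "waveform_c": 32_000}
STATIC_CELLS = 20_000  # assumed size of the always-present static logic


def cells_without_pr(static_cells: int, modules: dict) -> int:
    """All modules are resident at once: the sizes add up."""
    return static_cells + sum(modules.values())


def cells_with_pr(static_cells: int, modules: dict) -> int:
    """Only one module is resident at a time: size for the largest."""
    return static_cells + max(modules.values())


if __name__ == "__main__":
    print("without partial reconfiguration:", cells_without_pr(STATIC_CELLS, WAVEFORM_CELLS))
    print("with partial reconfiguration:   ", cells_with_pr(STATIC_CELLS, WAVEFORM_CELLS))
```

With these made-up numbers the requirement drops from 147,000 to 75,000 logic cells. A real design would also budget for the routing and fence overhead of the reconfigurable region, which this sketch ignores.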

The Zynq-7000 EPP architecture further extends the flexibility of partial reconfiguration by allowing software running in the PS to reprogram portions of the programmable logic via partial reconfiguration. Since, for secure designs, the PS and PL images are initially authenticated and trusted, the partial reconfiguration file loaded under PS or PL control can be either encrypted or unencrypted. So, not only can you swap the software processing algorithms in and out for different waveforms, but you can do the same for the corresponding programmable logic to implement those algorithms as well.

ISOLATION DESIGN FLOW

With the development of the Isolation Design Flow (IDF), red and black processing logic can now reside on the same FPGA device, allowing designers of the cryptographic portions of SDRs to realize the full capability of programmable logic. Xilinx developed the Isolation Design Flow to allow independent red and black functions to operate on a single device and to eliminate the requirement for separate red and black FPGAs. Examples of such single-chip applications include redundant Type-I encryptors, resident red and black data, and functionality operating on multiple independent levels of security.

The Isolation Design Flow makes it possible to implement multiple physically isolated or independent functions within a single FPGA device, utilizing a fence of unused device components between each function. Each isolated function is fenced in, generating isolated regions within the device. The flow uses early floorplanning, modular design, modular synthesis and adherence to a set of guidelines and considerations to guarantee isolation between desired functions. Once a design is implemented, you can use the Xilinx Isolation Verification Tool (IVT) to visualize the modules and fence, along with verifying that you have successfully implemented the design rules for isolation.

Figure 4 -- Zynq-7000 EPP's secure boot process controls the device's security. Step 1: CPU boots from ROM. Step 2: PS boot image decryption and authentication. Step 3: Decrypted, authenticated image stored in on-chip memory. Step 4: PL configuration (optional).

By defining each of the waveform processing blocks as reconfigurable logic, the programmable logic within the Zynq-7000 EPP only has to be large enough to contain one waveform processing block at a time. The use of the Isolation Design Flow allows the encryption and decryption (red and black) processing logic to coexist within each waveform processing block. Partial reconfiguration allows different waveform processing blocks to be dynamically swapped in and out of the Zynq-7000 EPP PL under software control as shown in Figure 5.

Depending on the application, this same SDR can support future waveforms by developing the necessary software and waveform processing blocks that can be swapped into the Zynq-7000 EPP's PS and PL as reconfigurable logic, extending the overall product life cycle of the SDR.

POWER-SAVING FEATURES

The Zynq-7000 EPP supports many features to lower the overall system power consumption, such as the processing system's power-only mode, sleep mode and peripheral independent clock domains. Designers can use these features to significantly reduce the dynamic power consumption of the device during idle periods.

The Zynq-7000 EPP's processing system and programmable logic are on two independent power rails, each with its own dedicated power supply pins. It supports a PS power-on only mode, but does not support a PL-only mode. User software in the processing system can turn the programmable logic fabric power on or off at any point to reduce the static power required. When the PS is powered off, it holds the PL in a permanent reset condition until the PS comes out of reset.

Figure 5 -- Use of Zynq-7000 EPP partial reconfiguration and the Isolation Design Flow to support multiple waveforms

In sleep mode (wake from interrupt, wake from exception), a single ARM A9 processor is running at approximately 10 MHz, interconnects are shut down, DDR is in self-refresh mode, all DDR termination is off and processing system peripherals are in clock-gating mode except for the selected wakeup device (CAN, Ethernet or GPIO). The dynamic power consumption in sleep mode will come from only a small part of the CPU circuit used to monitor the wakeup interrupt, the snoop control unit and the wakeup peripheral device. You can also shut down the PL power during sleep mode, resulting in further reductions in static power.

The Zynq-7000 EPP supports many clock domains, each with independent clock-gating control. You can shut down the unused clock domains in order to reduce dynamic power. Each of the peripherals within the processing system (except the GPIO, due to its small size and low speed) is on an independent clock domain, each with clock-gating control. You can turn them off via software through the system-level control registers.

SECURE INTEGRATION

Continuing the tradition of secure integration, a single Xilinx Zynq-7000 EPP now provides the processing power and logic formerly supplied by five or more separate devices for the processing and logic functions in tactical radios and commercial SDRs. The device delivers the capability and flexibility to reuse the same logic resources to process various waveforms by means of dynamically redefining some or all of the logic functions within the programmable logic, therefore enabling support of multiple waveforms and future waveforms.

Merging several devices into one Zynq-7000 EPP can result in a significant reduction in overall size, weight, power and cost of the SDR. Furthermore, the Zynq-7000 EPP provides several features that enable additional reductions in both static and dynamic power consumption, making it an ideal solution for a flexible, secure commercial or tactical SDR.





XCELLENCE IN NETWORKING

FPGAs Synchronize Next-Generation Networks

Flexibility and feature sets make FPGA devices ideal for designing timing and synchronization subsystems in advanced networking equipment.

by Dejan Habic
Chief Engineer
Stratum [email protected]

Traditional telecom networks are fundamentally configured to transport voice, such as phone service. Internet services piggyback on this legacy platform. Currently, designers are creating the next-generation network (NGN) to transport data, voice and video simultaneously, providing transparency and scalability while minimizing total operational cost. The NGN is considered a logical evolution from separate network infrastructures into a unified multiservice, secure, packet-based network for electronic communications offering quality of service (QoS) and ease of access for end users. Large telecom providers have already started the transition to the NGN, in a move toward new Ethernet packet-based core infrastructures. The process will gradually replace and upgrade networks to support new and existing service offerings.


This migration creates many technical challenges, uppermost among them the synchronization of network requirements. Traditionally, circuit-switched networks such as Sonet and SDH distribute a high-quality clock and timing source throughout the network, while Ethernet networks do not require such a strict clock hierarchy. However, the need for a method to synchronize these networks is becoming a specific requirement for carriers. The key challenge in realizing the NGN will be to standardize, implement and deploy a quality solution that enables interworking of all existing and new networks. Time and frequency alignment, also called synchronization, is critical for ensuring QoS for applications such as wireless, voice, real-time video and data over a converged network. Embedding the sync-and-timing function along with the other hardware in an FPGA creates a flexible, programmable and low-cost solution that meets the highest telecommunication equipment standards.

THREE PROMISING CHOICES

At present, the three most promising technologies that distribute accurate timing and synchronization signals throughout the new networks are Synchronous Ethernet, the Precision Time Protocol (IEEE-1588) and a GPS clock. These technologies are well covered in a range of international recommendations and standards promulgated by the ITU-T and IEEE. The list is long, and includes ITU-T G.8261 (Timing and Synchronization Aspects in Packet Networks), ITU-T G.8262 (Timing Characteristics of Synchronous Ethernet Equipment Slave Clock) and ITU-T G.811 (Timing Characteristics of Primary Reference Clocks), as well as IEEE 1588 and many others.

The broadband wireless backhaul network represents the most challenging application for synchronization over packet-switched networks, due to stringent mobile synchronization requirements driven by the need to make optimal use of scarce radio resources under high-speed mobility. Two duplexing schemes are the most prevalent in broadband wireless networks: Frequency-division duplex (FDD) technologies require only network frequency synchronization, while time-division duplex (TDD) technologies require phase (and frequency) synchronization. Table 1 summarizes the mobile network requirements at the radio interface for different technologies. Video, circuit-emulation services over Ethernet and many other applications also require tight synchronization.

Most existing designs use specific silicon devices to perform synchronization and timing functions. If GPS clock synchronization is required, designers can find on the market only OEM modules with specific functionality. All these components carry a high unit cost, high integration cost and single-vendor dependency, and sometimes lack the features and performance to meet the application requirements.

Great flexibility, IP core availability and rich clocking resources make Xilinx FPGA devices ideal for the implementation of synchronization and timing subsystems. In addition, most network equipment already has an FPGA on board with enough spare resources and logic to implement a synchronization subsystem that fully meets the requirements of the particular application. Designers can now embed complex synchronization circuitry in a single FPGA, along with existing hardware, by combining general and proprietary logic IP cores. The circuitry can potentially combine Synchronous Ethernet, IEEE-1588 and a GPS clock, providing the highest possible timing performance for the lowest cost.

SYNCHRONOUS ETHERNET

Over the past decade, with the emerging prominence of Ethernet in telecommunications networks, carriers have been evolving their legacy circuit-switched systems to Ethernet packet-based systems. However, Ethernet was not designed for the transport of synchronization signals, which are key requirements for some of the existing and future applications such as TDM emulation, mobile backhaul and next-generation mobile networks. Synchronous Ethernet (SyncE), described in ITU-T G.8262, represents one of the key developments in the evolution of Ethernet into a carrier-grade technology suitable for telecommunication wide-area networks.

In traditional Ethernet, data is transmitted continuously. The physical-layer transmitter clock is derived from an inexpensive 100-ppm crystal, and the receiver locks onto it. There is no need for long-term frequency stability and consistency between the frequencies of different links, as the data is packetized and buffered.

Table 1 -- Wireless technologies and their synchronization requirements

Radio system      Frequency accuracy   Time accuracy
GSM               50 ppb               NA
UMTS FDD          50 ppb               NA
UMTS TDD          50 ppb               2.5 µs
CDMA2000          50 ppb               3 µs
TD-SCDMA          50 ppb               3 µs
LTE FDD           50 ppb               NA
LTE TDD           50 ppb               3 µs
WiMAX 802.16e     8 ppm                0.5-25 µs

Designed to distribute a frequency reference within an IEEE 802.3 Ethernet network, SyncE is based on a hierarchical synchronization method using a synchronous physical layer, similar to the synchronous optical networks (Sonet/SDH). In SyncE, the physical-layer transmitter clock derives a signal from a high-quality frequency reference by replacing the crystal with a frequency source traceable to a primary reference clock (Figure 1). The receiver at the other end of the link automatically locks onto the physical-layer clock of the received signal, thus itself gaining access to a highly accurate and stable frequency reference. This process does not affect the operation of any of the Ethernet layers.

In essence, supplying one Ethernet network element with a primary reference clock (PRC) and employing an Ethernet PHY with well-engineered timing-recovery circuitry similar to those used in Sonet/SDH networks, we could set up a fully time-synchronized network. This methodology provides access to a highly accurate and stable frequency reference to applications that need it. Although it requires hardware changes in equipment, SyncE is not influenced by impairments introduced by the higher levels of networking technology such as packet loss or packet delay variation. That's a big advantage over other methods that rely on sending timing information in packets over an unlocked physical layer. Hence, the SyncE frequency accuracy and stability may be expected to exceed those of networks with unsynchronized physical layers.

The SyncE timing-and-synchronization device is a digital phase-locked loop (PLL) that supports free-run, locked and holdover modes and generates multiple synchronous clocks. An internal state engine controls mode selection automatically; alternatively, you could set it externally. The free-run mode occurs when the device is not locked to any of the inputs, and output accuracy depends on the local oscillator, specified as 4.6 ppm. The holdover mode occurs when the SyncE device has lost its reference inputs and utilizes stored timing data to control the output frequency, typically while the network synchronization is temporarily disrupted.

Figure 1 -- Synchronous Ethernet network element and clock distribution

Figure 2 -- Synchronous Ethernet clock

The locked mode is when the output of the SyncE is phase-locked to any of the selected input references. Locked mode is typically used when a slave clock source is synchronized to the network. To provide sufficient filtering of jitter and wander at reference inputs, the digital PLL is designed as a programmable narrow-loop-band device. During mode switching, its output phase must be precisely controlled to avoid disruption and signal loss.
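The three operating modes can be summarized as a small state engine. The Python sketch below is a deliberately simplified model, not the logic of any particular SyncE device: it assumes a single boolean "reference valid" input and assumes that holdover history becomes available as soon as the device has been locked once. A real device qualifies references over time and needs a settling period before it declares lock. The class and attribute names are hypothetical.

```python
from enum import Enum


class Mode(Enum):
    FREE_RUN = "free-run"
    LOCKED = "locked"
    HOLDOVER = "holdover"


class StateEngine:
    """Simplified automatic mode selection for a timing device."""

    def __init__(self) -> None:
        self.mode = Mode.FREE_RUN
        self.history_available = False  # stored timing data for holdover

    def update(self, reference_valid: bool) -> Mode:
        if reference_valid:
            # A qualified reference is present: track it and record history.
            self.mode = Mode.LOCKED
            self.history_available = True
        elif self.history_available:
            # Reference lost, but stored timing data can steer the output.
            self.mode = Mode.HOLDOVER
        else:
            # No reference and no history: output follows the local oscillator.
            self.mode = Mode.FREE_RUN
        return self.mode
```

Starting from power-up without a reference, the engine stays in free-run mode, moves to locked mode when a reference appears, and falls back to holdover (not free-run) if the reference is later lost.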

A SyncE device continuously monitors all input references for their presence and quality. If a system requires redundant operation, the SyncE device should support master-slave configuration of two devices. Depending on the application, the SyncE devices should support different output frequencies.

You can build a fully integrated universal synchronization solution for Synchronous Ethernet that complies with ITU-T G.8262 using Xilinx FPGA devices with minimum external components. The newer Xilinx families with an improved clocking feature set, such as Spartan-6 or Virtex-6, are ideal for the application. A typical FPGA implementation of SyncE comprises a digital PLL, MicroBlaze system and frequency output generator, as shown in Figure 2. Using an external VCXO will help in achieving a very low-jitter output signal. The VCXO is running in a fractional PLL that designers lock to a selected input reference. They achieve this lock by changing the fractional ratio of the high-resolution sigma-delta modulator.
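To see how a modulator can produce a fractional division ratio from an integer divider, consider the simplest possible case. The Python sketch below models a first-order sigma-delta (accumulator) modulator that switches a multimodulus divider between N and N+1. This is an illustration only: the article does not specify the modulator order, and practical designs typically use higher-order modulators to shape the quantization noise. The function and parameter names are assumptions.

```python
from typing import List


def divider_sequence(n: int, frac_num: int, frac_den: int, cycles: int) -> List[int]:
    """Return the per-cycle division ratios for an average ratio of n + frac_num/frac_den.

    Requires 0 <= frac_num < frac_den.
    """
    if not 0 <= frac_num < frac_den:
        raise ValueError("fraction must satisfy 0 <= frac_num < frac_den")
    acc = 0
    ratios = []
    for _ in range(cycles):
        acc += frac_num
        if acc >= frac_den:
            # Accumulator overflow: divide by n + 1 for this cycle.
            acc -= frac_den
            ratios.append(n + 1)
        else:
            ratios.append(n)
    return ratios


if __name__ == "__main__":
    seq = divider_sequence(10, 1, 4, 8)
    print(seq, sum(seq) / len(seq))
```

For N = 10 and a fraction of 1/4, the divider uses N+1 once in every four cycles, so the long-term average ratio is 10.25. Changing the fraction word by one step changes the average ratio by a correspondingly small amount, which is what lets the control loop steer the VCXO finely.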

For its part, the MicroBlaze system contains a CPU, block RAM, timer, SPI or UART, I/O ports, high-resolution digital phase detector and a frequency output generator. The MicroBlaze is running application software with multiple tasks including SyncE state engine, reference monitoring, communications and so on. The frequency output generator takes the output of the digital PLL and generates multiple additional frequencies using FPGA clocking resources (CMT) and programmable dividers. Multiple CMT blocks can synthesize a wide range of frequencies including standard T1, E1, Sonet/SDH and Ethernet. An external TCXO or OCXO serves as the local reference.

THE PRECISION TIME PROTOCOL (IEEE-1588)

The IEEE-1588 Precision Time Protocol (PTP) is a standard for synchronizing independent clocks running on separate nodes of a distributed packet networked system to a high degree of accuracy and precision. The protocol is independent of the networking technology, and the system topology is self-configuring. Originally, the PTP was designed for applications in industrial automation and instrumentation or relatively small and local networks that require precise synchronization. With the emergence of a new generation of telecom networks in the past couple of years, designers are taking a new look at the PTP. The new version 2 of the standard provides better performance over wide-area packet-based networks. Among the many improvements are concepts of a transparent clock, boundary clock, larger maximum message rate and unicast support. The IEEE-1588 clocks, located in network nodes, are organized into a master-slave hierarchy.

A master clock exchanges two-way timing packets over Ethernet with the slave clocks embedded in the equipment that requires synchronization. Each slave synchronizes to its master based on four messages exchanged between them: sync, follow up, delay request and delay response. The master is sending multicast messages to the slaves, while slaves are responding to the master by unicast. In each node the messages are time-stamped on both the receiving and transmitting paths, and time stamps are embedded in the subsequent messages. To calculate offset and delay (that is, time relative to the master), the slave uses four time stamps and assumes that delays for messages traveling between master and slave are equal.
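The offset and delay calculation from the four time stamps can be written out directly. In the Python sketch below, the names t1 to t4 are labels introduced here for illustration: t1 is when the master sends the sync message (master time), t2 is when the slave receives it (slave time), t3 is when the slave sends the delay request (slave time) and t4 is when the master receives it (master time). The sketch relies on the symmetric-path assumption stated above, and the offset is expressed as slave time minus master time.

```python
from typing import Tuple


def offset_and_delay(t1: float, t2: float, t3: float, t4: float) -> Tuple[float, float]:
    """Return (offset, one_way_delay) in the same unit as the time stamps.

    With equal path delays d and slave offset o:
        t2 - t1 = d + o   (master-to-slave)
        t4 - t3 = d - o   (slave-to-master)
    """
    master_to_slave = t2 - t1
    slave_to_master = t4 - t3
    offset = (master_to_slave - slave_to_master) / 2
    delay = (master_to_slave + slave_to_master) / 2
    return offset, delay


if __name__ == "__main__":
    # Hypothetical time stamps in nanoseconds: the slave runs 500 ns ahead
    # of the master and the one-way path delay is 200 ns.
    print(offset_and_delay(1000, 1700, 2000, 1700))
```

Any asymmetry between the two directions appears directly as an offset error of half the difference, which is why packet delay variation matters so much in practice.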

Generally, there are two types of IEEE-1588 implementations: software only and hardware-assisted software. In software-only implementations, the whole protocol executes at the application level, including the processing of the time stamps and control of the clock. Since the operating system and applications introduce additional time errors into the time stamps, this implementation provides accuracy on the order of hundreds of microseconds to milliseconds, which is often not enough for telecom applications.

For higher accuracy, a hardware-assisted software implementation is the way to go. Since the hardware here performs the time-stamping close to the physical interface, this method can provide accuracies on the order of hundreds of nanoseconds. A typical hardware-assisted implementation connects a time-stamp unit (TSU) at the Media Independent Interface (MII) between PHY and MAC (Figure 3). The TSU is responsible for identifying PTP messages, time-stamping each of them at the start-of-frame delimiter and, finally, recording time stamps along with some relevant attributes into the memory buffer or registers.

In addition to the TSU, this approach also implements the IEEE-1588 clock block in hardware. The clock maintains and distributes the current time in seconds (48-bit value) and subseconds (32-bit value). The IEEE-1588 clock is updated each cycle of the local oscillator, with the PTP software that's running in the processor controlling the update rate. The PTP software runs several tasks. Among other things, it processes time stamps, controls the clock synchronization, receives and sends PTP messages, and services the protocol stack and state engine.

A critical part of the IEEE-1588 implementation is selection of the local oscillator. You must specify the stability of the oscillator so that its drift caused by temperature and aging effects is tolerable between two corrections of the PTP clock. Often the oscillator specifications boil down to a trade-off between cost and performance.
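A rough feel for this trade-off comes from simple arithmetic: a constant fractional frequency error of 1 ppb accumulates 1 ns of time error per second. The Python sketch below applies that relation. It is a first-order estimate only, assuming the frequency error stays constant between corrections, and the numbers in the example are hypothetical rather than taken from any oscillator data sheet.

```python
def time_error_ns(frequency_error_ppb: float, interval_s: float) -> float:
    """Time error accumulated over interval_s at a constant frequency error."""
    # 1 ppb = 1e-9, so ppb * seconds gives nanoseconds directly.
    return frequency_error_ppb * interval_s


def max_frequency_error_ppb(error_budget_ns: float, interval_s: float) -> float:
    """Largest constant frequency error that stays within the time-error budget."""
    return error_budget_ns / interval_s


if __name__ == "__main__":
    # Hypothetical case: corrections every 2 s with a 100-ns drift budget.
    print(max_frequency_error_ppb(100, 2), "ppb allowed")
    print(time_error_ns(100, 1), "ns accumulated in 1 s at 100 ppb")
```

Lengthening the interval between corrections tightens the oscillator requirement proportionally, which is one reason a more stable (and more expensive) oscillator is needed when timing packets are sparse or heavily filtered.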

Today's telecom networks contain many switches, routers or other network equipment, which increase the size and complexity of the network. Each of these network elements introduces delays and fluctuations of the PTP message delivery. These fluctuations, called packet delay variation (PDV), directly affect the accuracy of the clock synchronization, since the synchronization algorithm includes the propagation delay calculation. Some routers can introduce large PDV on the order of milliseconds and more, considerably decreasing synchronization accuracy. The system must employ sufficient filtering of the PDV if the PTP is to fully meet requirements for telecommunication applications. Generally, packet delay variation can be divided into three components: transmission delay, processing delay and buffering delay.

Figure 3 -- Diagram of the IEEE-1588/PTP clock

Figure 4 -- IEEE-1588/PTP clock (left) and packet-recognition (right) hardware blocks

Transmission delay is the result of the signal speed between two ports, while processing delay is the result of the processing of a timing packet within a network element. The buffering or queuing delay is the total amount of waiting time of the timing packet within different buffers or queues in a given network element before being processed. These three components have different PDV ranges. The transmission delay generates a PDV in the range of submicroseconds, while the processing delay is on the order of 1 to 10 µs. The last component, buffering delay, is often the main generator of PDV, with a 10- to 10,000-µs range.

Different filtering techniques are available at the slave clock to minimize PDV. The two most common ones use the moving average or the exponentially weighted moving average. Both techniques are based on estimating average packet time arrival, with the assumption that the PDV distribution is fixed, independent and equally distributed. The efficiency of these filtering algorithms greatly depends on network traffic load and topology. To make the PDV filtering more effective, designers use other network management techniques as well, such as increasing timing-packet rate and randomizing message transmission around a given mean value.
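Both filters are easy to express in code. The Python sketch below shows a minimal version of each, applied to a list of delay or offset measurements. The window length and the smoothing factor alpha are tuning parameters chosen here purely for illustration; the article does not recommend specific values.

```python
from collections import deque
from typing import List, Sequence


def moving_average(samples: Sequence[float], window: int) -> List[float]:
    """Average of the most recent `window` samples (fewer during start-up)."""
    recent = deque(maxlen=window)
    filtered = []
    for sample in samples:
        recent.append(sample)  # the oldest sample drops out automatically
        filtered.append(sum(recent) / len(recent))
    return filtered


def ewma(samples: Sequence[float], alpha: float) -> List[float]:
    """Exponentially weighted moving average with smoothing factor 0 < alpha <= 1."""
    filtered = []
    estimate = None
    for sample in samples:
        if estimate is None:
            estimate = sample  # initialize with the first measurement
        else:
            estimate += alpha * (sample - estimate)
        filtered.append(estimate)
    return filtered
```

The moving average needs a buffer of past samples, while the exponentially weighted form keeps only one state variable, which makes it attractive on a small soft processor. A smaller alpha (or a longer window) filters more PDV but responds more slowly to genuine changes in the path delay.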

From a hardware perspective, FPGA implementation of the IEEE-1588 protocol is relatively straightforward. The system consists of a MicroBlaze CPU, RAM, timer, MAC, packet-recognition block, time-stamping unit and clock driven by an external oscillator. The MicroBlaze runs the PTP stack software including the filtering and clock-disciplining algorithm.

Three hardware blocks central to this architecture are not part of the standard Xilinx IP LogiCORE library: the PTP clock, packet-recognition block and time-stamping unit (TSU). The PTP clock is implemented as three cascaded accumulators, as shown in Figure 4. The RCNT register accumulates fractional subseconds based on the value of the RATE register, which defines the PTP clock's average speed. With each clock cycle, the SUBS register accumulates the fixed value of the PHS register plus the carry-out of the RCNT accumulator. The carry-out of the SUBS (30-bit) register accumulates into the SEC register (32 bits).
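A behavioral model helps clarify how the three accumulators interact. The Python sketch below follows the description above, with two assumptions that the article does not state: the RCNT accumulator is taken to be 32 bits wide, and each register carries out on binary overflow. The class is a software illustration, not the author's HDL.

```python
class PtpClock:
    """Behavioral model of three cascaded accumulators: RCNT -> SUBS -> SEC."""

    RCNT_BITS = 32  # assumed width; not given in the article
    SUBS_BITS = 30
    SEC_BITS = 32

    def __init__(self, phs: int, rate: int) -> None:
        self.phs = phs    # fixed subsecond increment per clock cycle
        self.rate = rate  # fractional increment, set by the PTP software
        self.rcnt = 0
        self.subs = 0
        self.sec = 0

    def tick(self) -> None:
        """Advance the clock by one cycle of the local oscillator."""
        self.rcnt += self.rate
        rcnt_carry = self.rcnt >> self.RCNT_BITS
        self.rcnt &= (1 << self.RCNT_BITS) - 1

        self.subs += self.phs + rcnt_carry
        subs_carry = self.subs >> self.SUBS_BITS
        self.subs &= (1 << self.SUBS_BITS) - 1

        self.sec = (self.sec + subs_carry) & ((1 << self.SEC_BITS) - 1)
```

In this model the average increment per cycle is PHS + RATE / 2^32 subsecond units, so the software can trim the clock's speed in very fine steps by adjusting RATE while PHS sets the nominal increment.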

The packet-recognition block monitors the MII bus between the MAC and the PHY to identify IEEE-1588 packets and send signals to the TSU. Since packet monitoring is performed on both the receiving and transmitting paths, the design implements two identical packet-recognition blocks.

To detect start-of-frame and generate the time-stamp signal, the block's state engine counts and compares subsequent bytes and determines if the message is valid. The block also generates an ID related to the specific IEEE-1588 message. Finally, the packet-recognition block generates the DONE signal to the MCU. The TSU block's task is to capture the time of the PTP clock when the time-stamp signal is asserted. For the best performance, the local oscillator should be either a TCXO or OCXO.
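The byte comparisons performed by the packet-recognition block can be mimicked in software. The Python sketch below is a simplified illustration under several assumptions: PTP is carried directly over Ethernet (EtherType 0x88F7) rather than over UDP, frames carry no VLAN tag, and the byte string starts at the destination MAC address (after the preamble and start-of-frame delimiter). The field offsets follow the IEEE 1588-2008 common header, and only the four message types named in the text are recognized. The function name is hypothetical.

```python
import struct
from typing import Optional, Tuple

PTP_ETHERTYPE = 0x88F7
ETH_HEADER_LEN = 14   # destination MAC, source MAC, EtherType
PTP_HEADER_LEN = 34   # IEEE 1588-2008 common header
MESSAGE_TYPES = {0x0: "Sync", 0x1: "Delay_Req", 0x8: "Follow_Up", 0x9: "Delay_Resp"}


def recognize_ptp(frame: bytes) -> Optional[Tuple[str, int]]:
    """Return (message name, sequence ID) for a recognized PTP frame, else None."""
    if len(frame) < ETH_HEADER_LEN + PTP_HEADER_LEN:
        return None
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    if ethertype != PTP_ETHERTYPE:
        return None
    ptp = frame[ETH_HEADER_LEN:]
    # The message type is the low nibble of the first PTP header byte.
    name = MESSAGE_TYPES.get(ptp[0] & 0x0F)
    if name is None:
        return None
    # The 16-bit sequence ID sits at byte offset 30 of the PTP header.
    (sequence_id,) = struct.unpack_from("!H", ptp, 30)
    return name, sequence_id
```

In hardware the same checks are made on the fly as bytes arrive over the MII, so the time stamp can be latched at the start of the frame and kept or discarded once the frame has been classified.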

THE GPS CLOCK

For some time, telecommunication systems around the world have been using the Global Positioning System for accurate timing and synchronization. GPS has 24 satellites positioned in six earth-centered orbital planes, each satellite carrying four atomic clocks. Each satellite transmits information about its location. This data is modulated onto the carrier frequency and repeated at very precisely controlled intervals regulated by the atomic clocks. The GPS receiver on the ground receives and decodes these signals, effectively synchronizing itself to the atomic clocks on the satellites. A GPS clock that is used in telecommunication synchronization combines a GPS receiver with an antenna, digital PLL with disciplining software and high-quality stable oscillator. The disciplining software controls and calibrates the oscillator to remove small biases in the frequency.

The GPS clock can synchronize both system timing and transceiver frequency, and is virtually fail-safe. It generates timing signals whenever there is power, and never needs to be recalibrated.




This makes it possible to integrate GPS into many installations requiring high-quality timing subsystems or to simply use it as a primary reference source. In the past the cost of using GPS was still relatively high, especially considering antenna installation. However, as prices came down, the GPS clock became a more viable solution for timing and synchronization in next-generation network systems.

In addition, there are other global navigation satellite systems (GNSS) that can be effectively used in time and synchronization applications, such as the Russian GLONASS system, the European Union's Galileo and China's Compass. There are already a few GPS receivers available on the market supporting multiple GNS systems. These multisystem receivers track a large number of satellites effectively, thus increasing timing accuracy.

Figure 5 shows a typical implementation of the GPS clock. Essentially a digital PLL, it utilizes application-specific software using a 1-pulse/second (pps) signal from the GPS receiver as a timing reference to discipline a local oscillator and provide a high-quality frequency and timing reference. The GPS clock operates in three modes: locked, free run and holdover. In free-run mode, the unit is unlocked to the reference input. Free-run mode is typically used when the GPS signal and history of timing data are not available, or immediately following system power-up, before achieving network synchronization. In this mode, output signals are not synchronized to the reference signal and their accuracy is based on the local oscillators. In the locked mode, the output is phase-locked to the GPS 1-pps reference signal and the output frequency tracks the input reference.

The digital PLL has very low bandwidth, typically in the millihertz range. It filters a major part of the noise of the GPS 1-pps signal. A second- or even third-order loop filter with optimized gain, bandwidth and sometimes time-varying loop parameters is usually a good choice for stable tracking and sufficient for filtering of the GPS 1-pps reference signal. In holdover mode, the device has either lost or disqualified the GPS 1-pps signal and uses stored timing data, called history, to control the output frequency. Holdover mode occurs when the GPS signal is temporarily disrupted and no valid reference is available.
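The three operating modes can be summarized as a small decision function. This is a minimal sketch: the names `Mode` and `select_mode` and the two boolean inputs are assumptions for illustration, and a real unit would also wait for the loop to settle before declaring lock.

```python
from enum import Enum


class Mode(Enum):
    FREE_RUN = "free run"
    LOCKED = "locked"
    HOLDOVER = "holdover"


def select_mode(pps_valid: bool, history_available: bool) -> Mode:
    """Pick the operating mode from the reference state and the stored history."""
    if pps_valid:
        return Mode.LOCKED      # phase-lock the output to the 1-pps reference
    if history_available:
        return Mode.HOLDOVER    # steer the oscillator from stored timing data
    return Mode.FREE_RUN        # no reference, no history: rely on the local oscillator
```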

To minimize frequency and phase drift caused mostly by aging and temperature performance of the oscillator, this mode relies on an adaptive algorithm. Using an oscillator model created from data collected during locked mode, this algorithm calculates a prediction of the frequency drift. The data set consists of ambient temperature, frequency and time values. Designers can use some variation of Kalman filtering or a recursive implementation of linear regression for the adaptive algorithm.
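One way to picture the recursive linear regression option is a least-squares line that is updated one sample at a time from running sums. The sketch below models only the aging term (frequency offset versus elapsed time); a temperature regressor would be added in the same way. The class name, the units (ppb and hours) and the synthetic data are assumptions made for this example.

```python
class RunningLinearFit:
    """Least-squares line y = slope * x + intercept, updated sample by sample."""

    def __init__(self):
        self.n = 0
        self.sum_x = self.sum_y = self.sum_xx = self.sum_xy = 0.0

    def update(self, x, y):
        # Only running sums are stored, so memory use stays constant.
        self.n += 1
        self.sum_x += x
        self.sum_y += y
        self.sum_xx += x * x
        self.sum_xy += x * y

    def coefficients(self):
        denom = self.n * self.sum_xx - self.sum_x ** 2
        if self.n < 2 or denom == 0:
            raise ValueError("need at least two samples with distinct x values")
        slope = (self.n * self.sum_xy - self.sum_x * self.sum_y) / denom
        intercept = (self.sum_y - slope * self.sum_x) / self.n
        return slope, intercept

    def predict(self, x):
        slope, intercept = self.coefficients()
        return slope * x + intercept


# Locked mode: learn the frequency offset (ppb) versus elapsed hours.
fit = RunningLinearFit()
for hour in range(24):
    fit.update(hour, 10.0 + 0.5 * hour)   # synthetic data: 0.5 ppb/hour of aging

# Holdover: extrapolate the offset at hour 30, after the reference was lost.
predicted_offset_ppb = fit.predict(30)
```

The predicted offset is what the holdover logic would subtract from the oscillator control value while no valid reference is available.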

Although the GPS clock uses the adaptive algorithm to compensate for oscillator drift, the stability of the output signal in holdover greatly depends on the stability of the onboard oscillator. Therefore, you should take special care in selecting the oscillator. For most NGN requirements, you will need an oven-controlled crystal oscillator (OCXO), although sometimes you may be able to use a lower-cost oscillator. Also, take special care in selecting the GPS receiver, since not all receivers provide a sufficiently accurate signal of 1 pps.

Figure 5: Typical implementation of a GPS clock. Labels in the diagram: GPS antenna, GPS receiver, clock engine, DAC, OCXO, T; FPGA SoC with MCU, memory, UART x2, I/O, digital PLL and Ethernet (optional); 1-pps out, 10-MHz out, UART, PTP/NTP.

An FPGA implementation of the GPS clock is straightforward. A Xilinx FPGA implements all digital hardware as a system-on-chip (SoC) complemented by a GPS receiver, high-stability local oscillator (OCXO) and additional analog circuitry, including DAC, temperature sensor and others. The MicroBlaze system comprises the CPU, BRAM, timer, SPI, two UARTs, I/O ports and high-resolution digital phase/frequency detector. To achieve high resolution of the phase detector, use a DCM generating a 300-MHz clock. Such high frequency ensures resolution of 1.66 ns when using both edges. You can achieve higher resolutions, if needed, by implementing a time-to-digital converter (TDC) with a tapped delay line method in the FPGA.
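The resolution figure follows directly from the clock period: a 300-MHz clock has a period of about 3.33 ns, and sampling on both edges halves the step to the 1.66 ns quoted above (up to rounding). The helper names below are assumptions for illustration only.

```python
def phase_resolution_ns(clock_hz: float, both_edges: bool = True) -> float:
    """Smallest time step a counter-based phase detector can resolve, in ns."""
    period_ns = 1e9 / clock_hz
    return period_ns / 2 if both_edges else period_ns


def quantize_interval(interval_ns: float, resolution_ns: float) -> int:
    """Number of whole resolution steps counted within a measured interval."""
    return int(interval_ns // resolution_ns)


resolution = phase_resolution_ns(300e6)        # roughly 1.67 ns
steps = quantize_interval(9.0, resolution)     # a 9-ns interval spans 5 whole steps
```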

The SPI provides an interface to the external DAC and digital temperature sensor. The first UART controls the GPS receiver while the second is used for communication with a host. The control software with the adaptive algorithm for the MicroBlaze system was written in C/C++.

As the transmission of telecommunications data is increasingly dependent upon a new generation of packet network transport, designers are coming up with new methods of time and frequency transfer. Three synchronization techniques (Synchronous Ethernet, the Precision Time Protocol and a GPS clock) are widely accepted and standardized by the industry.

Designing synchronization and timing subsystems for next-generation network equipment requires understanding and defining system requirements, selecting appropriate sync technology and implementing the solution. Using a Xilinx FPGA with a new clocking infrastructure, combined with a set of general and proprietary IP logic cores, makes the job much easier.





Once an application for custom ASIC cores, this demanding computer graphics process is now the province of low-cost FPGAs.

XCELLENCE IN NEW APPLICATIONS

Accelerating Texture Mapping with Spartan-6 FPGAs

A handheld box that generates special video effects for parties and concerts without the help of a computer is built around an FPGA rather than a specialized multimedia system-on-chip. In our Milkymist One, a Spartan-6 FPGA implements almost the entire digital portion of the system. What's more, the FPGA is robust enough to handle texture mapping, a high-end graphics function that represents the most intensive data-processing task our system must perform. Texture mapping is the traditional realm of ASIC graphics-processing units and, before they existed, high-end workstations.

by Sébastien [email protected]

Disc jockeys, VJs and other event organizers can use the Milkymist One (Figure 1) at concerts, during festivals and in clubs to create an entertaining video installation. Connect a camera and a video projector, press the power button and seconds later, everything you film becomes live psychedelic effects of color and light. Point the camera at a dancer onstage, at people attending your party, at toys or other objects, and dazzle your audience with the effects. If no camera setup is available, the Milkymist One can produce purely generative effects that react to the ambient sound, making it an ideal option for bands, clubs and party organizers who want a turnkey solution for simple visual effects.

The device supports inputs from many sources: MIDI keyboards, USB computer keyboards, DMX desks and OpenSoundControl (OSC) clients. You can even use a smartphone to interact with the visual performance wirelessly, by connecting a WiFi router to the Ethernet port. Another option is to use the popular Arduino board, with its many sensor interfaces, to control the Milkymist One over MIDI.

We had to overcome significant challenges to design such a device. Our processing algorithm requires a considerable amount of computing power and memory bandwidth to process the video with a high frame rate and a low latency. Further, our device has to interface with multiple I/O protocols. For this application, many engineers would choose a multimedia system-on-chip that included a CPU and graphics acceleration. They would then need a number of external chips to handle all the interfaces. But by leveraging the power and flexibility of the Xilinx devices, we were able to implement almost all the digital portion of our system in one Spartan-6 FPGA. This reduced the cost and chip count while greatly improving flexibility.

THE MILKYMIST ONE HARDWARE

Our Milkymist One system board (see Figure 2) is centered around a Xilinx XC6SLX45. This FPGA contains all the digital logic of our system, including a soft-core CPU, a memory controller, hardware accelerators and I/O peripherals.

Figure 1: Digital system functions of the handheld Milkymist One are under FPGA control.

The FPGA reads its configuration data from a NOR flash chip, using the Master BPI mode of the Spartan-6. The boot loader then runs from the same flash chip using an execute-in-place scheme, in which processor instructions are fetched from the NOR flash while they are being executed. The boot loader brings up the SDRAM and loads the application software. The same flash chip stores this application software and keeps user data using YAFFS2, a flash-optimized file system that provides wear leveling and journaling.

Our application software can download FPGA bitstream updates from the Internet and write them to the flash. Thanks to the MultiBoot feature of Spartan-6 FPGAs, if a failed Internet update should result in a corrupt bitstream, the system can fall back to a rescue "golden" bitstream programmed at our factory.

A pair of DDR SDRAM chips, directly connected to the FPGA, provide 128 Mbytes of system memory. To assist in meeting the timing requirements of this demanding interface, the Spartan-6 FPGA supplies double-data-rate I/O registers; runtime programmable delay-locked loops (with the DCMs); and I/O delay elements.

Our device supports two full-speed USB host ports. Again, the FPGA absorbs most of the hardware here. The Spartan-6 directly drives analog transceiver chips that simply convert LVCMOS 3.3-volt levels into perfectly USB-compliant signals. The serial interface engine and the host controller logic are implemented in the FPGA fabric. During prototyping, we were even able to successfully connect USB devices directly to the FPGA, using just resistors and USB connectors that we wired to the I/O expansion connector of a Xilinx ML401 development board.

For video output, the FPGA drives a triple digital-to-analog converter that generates the RGB components of the VGA port. The flexibility of the DCM_CLKGEN primitive contained in the Spartan-6 allows the synthesis of many different frequencies for the pixel clock, enabling our device to support a large number of video modes.
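To see why a multiply/divide clock synthesizer covers so many video modes, consider a brute-force search for the ratio M/D that brings an input clock closest to a target pixel clock. This is only a sketch: the function name is hypothetical, the 50-MHz input and the ranges used for M (2 to 256) and D (1 to 256) are assumptions, and the device's frequency limits are ignored, so check the Spartan-6 clocking documentation before relying on them.

```python
def best_clkgen_ratio(f_in_hz, f_target_hz,
                      m_range=range(2, 257), d_range=range(1, 257)):
    """Find multiplier M and divider D so that f_in * M / D is closest to the
    target. Returns (M, D, absolute_error_hz). Ranges are assumptions."""
    best = None
    for m in m_range:
        for d in d_range:
            error = abs(f_in_hz * m / d - f_target_hz)
            if best is None or error < best[2]:
                best = (m, d, error)
    return best


# Common VGA pixel clocks, synthesized from an assumed 50-MHz input.
svga = best_clkgen_ratio(50e6, 40e6)      # 800 x 600 at 60 Hz
xga = best_clkgen_ratio(50e6, 65e6)       # 1024 x 768 at 60 Hz
vga = best_clkgen_ratio(50e6, 25.175e6)   # 640 x 480 at 60 Hz (approximated)
```

The first two targets are reached exactly (4/5 and 13/10), while the 25.175-MHz clock can only be approximated, which monitors generally tolerate.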

Also, we are currently looking into the synthesis of a composite video (CVBS) signal out of the VGA port. There are already some computer graphics cards on the market that use a low-cost passive adapter to connect CVBS devices to their VGA output. This is perfectly doable as well on a system that uses an FPGA to generate the raw color components. We would only need to implement a CVBS signal generator using digital signal-processing techniques, and feed the produced data into the VGA DAC. This would enable our device to easily connect to legacy video projectors and video mixing consoles that are still popular in the music and live-performance scenes.



Figure 2: The Spartan-6 FPGA resides at the center of the Milkymist One's printed-circuit board.


Our design connects the Spartan-6 to a pair of RS485 transceivers to provide DMX512 support. This protocol, which is used onstage to control lights, allows the device to synchronize the light ambiance with the visual effects. Here again, the complete DMX512 signaling system is within the FPGA, and the external components are basically analog.

To interact with popular controllers and sensors, our system also supports MIDI. Our implementation is similar to that of DMX512, with only analog external components. We also support Ethernet (using a PHY chip only), audio (through a common AC97 codec) and PAL, SECAM and NTSC video input.

Most of these peripherals are clocked from the FPGA, which synthesizes the necessary frequencies from a single 50-MHz source using its digital clock managers (DCMs). We have just two additional crystals on our board, and to reduce costs further, we are thinking about replacing them with more FPGA-generated clocks in a future PCB revision.

WHAT IS TEXTURE MAPPING?

Among all the data-processing tasks that the FPGA of the Milkymist device must perform, texture mapping is the most intensive. Texture mapping is a common computer graphics operation found in accelerated 3D APIs like OpenGL and DirectX. It is typically used to draw textured 3D polygons on the screen. It can also distort an image (see Figure 3), and we use it for this purpose.

Common graphics processing units perform texture mapping on triangles, and break down more-complex polygons into a series of triangles. The inputs to the algorithm are the 2D (possibly projected from the original 3D coordinates) positions of the three vertices of the triangle to be filled, and the 2D texture coordinates for these three vertices. The algorithm then draws a textured triangle pixel by pixel, by interpolating linearly the texture coordinates of the vertices for each pixel and then copying the texture pixel (also called texel) at these coordinates.
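The linear interpolation of texture coordinates can be illustrated with barycentric weights, a standard way to express a point inside a triangle as a weighted mix of its vertices. This is a software sketch with made-up names and data; it is not the incremental method used by the hardware described in this article.

```python
def barycentric_weights(p, a, b, c):
    """Weights of vertices a, b, c at point p (all 2D tuples); they sum to 1."""
    det = (b[1] - c[1]) * (a[0] - c[0]) + (c[0] - b[0]) * (a[1] - c[1])
    if det == 0:
        raise ValueError("degenerate triangle")
    w_a = ((b[1] - c[1]) * (p[0] - c[0]) + (c[0] - b[0]) * (p[1] - c[1])) / det
    w_b = ((c[1] - a[1]) * (p[0] - c[0]) + (a[0] - c[0]) * (p[1] - c[1])) / det
    return w_a, w_b, 1.0 - w_a - w_b


def texture_coords_at(p, vertices, tex_coords):
    """Linearly interpolate the per-vertex texture coordinates at screen point p."""
    weights = barycentric_weights(p, *vertices)
    u = sum(w * t[0] for w, t in zip(weights, tex_coords))
    v = sum(w * t[1] for w, t in zip(weights, tex_coords))
    return u, v


def inside(p, vertices):
    """A point lies in the triangle when none of its weights is negative."""
    return all(w >= 0 for w in barycentric_weights(p, *vertices))


# Hypothetical triangle on screen and the texture coordinates at its vertices.
triangle = [(0, 0), (8, 0), (0, 8)]
texture_uv = [(0, 0), (64, 0), (0, 64)]
u, v = texture_coords_at((4, 2), triangle, texture_uv)   # texel to copy for pixel (4, 2)
```

Moving a vertex or changing its texture coordinates changes the mapping for every pixel in the triangle, which is how zooming, rotation and distortion are obtained.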

Texture mapping can implement image-processing operations like zooming, rotating or scaling by simply changing the positions of the vertices or the texture coordinates at each vertex. More often than not, the results of the linear interpolation are not integer, which means that the texture should be sampled among four adjacent pixels (see Figure 4). In this case, for a better rendering, the four pixels should be read and their colors should be averaged (with different weights depending on the fractional parts) in a process called bilinear filtering. Our application requires bilinear filtering to obtain a good visual result.

Figure 3: Texture mapping, a common computer graphics operation found in accelerated 3D APIs, is typically used to draw textured 3D polygons. It can also distort an image, as seen here.

Figure 4: In texture mapping, the results of the linear interpolation are usually not integer. For this reason, the texture should be sampled among four adjacent pixels and their colors averaged, a process called bilinear filtering. Legend: 1, 2, 3, 4 are real texture pixels; the grayed box is the wanted texture pixel with noninteger coordinates. The color of the resulting pixel is proportional to the surface of each real texture pixel it covers.
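Bilinear filtering is easy to express in software, which helps when checking a hardware implementation against a reference. The sketch below works on a single-channel (grayscale) texture stored as a list of rows; for color images the same weights would be applied to each channel. The function name and the clamping at the texture edge are assumptions for this example.

```python
import math


def bilinear_sample(texture, u, v):
    """Sample a grayscale texture (list of rows) at non-integer coordinates (u, v)."""
    height, width = len(texture), len(texture[0])
    x0, y0 = math.floor(u), math.floor(v)
    fx, fy = u - x0, v - y0                    # fractional parts set the weights
    x1 = min(x0 + 1, width - 1)                # clamp the neighbors at the edge
    y1 = min(y0 + 1, height - 1)
    top = texture[y0][x0] * (1 - fx) + texture[y0][x1] * fx
    bottom = texture[y1][x0] * (1 - fx) + texture[y1][x1] * fx
    return top * (1 - fy) + bottom * fy


texture = [[0, 100],
           [200, 300]]
center = bilinear_sample(texture, 0.5, 0.5)    # equal mix of all four texels
```

Each output pixel therefore costs four texel reads and a handful of multiplications, which is why the operation is demanding in both memory bandwidth and arithmetic.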

Texture mapping, especially when bilinear filtering is desired, is a very compute- and memory-intensive process that precludes software implementations when performance is needed.

FPGA IMPLEMENTATION

We expected the memory latency for reading the frame buffer to be a performance-limiting factor. Instead of trying to alleviate its effects by using complex and potentially resource-intensive techniques such as advanced prefetching, we simply use a direct-mapped texel cache, for simplicity and fast hit times, and design the rest of the texture-mapping unit so that the memory read latency becomes the only limiting factor.

With a direct-mapped texel cache having a hit rate of 90 percent, a hit time of one cycle and a miss penalty of nine cycles, the average memory access time is 0.9 × 1 + 0.1 × 9 = 1.8 cycles. With an 80-MHz system clock, such a cache has a throughput of 80/1.8, or about 44 megapixels per second, sufficient for our application.
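The arithmetic behind these figures can be checked in a few lines. The sketch assumes that the nine cycles quoted for a miss are the total cost of a missed access, which is the reading that reproduces the 1.8-cycle average; the function names are illustrative.

```python
def average_access_cycles(hit_rate, hit_cycles, miss_cycles):
    """Weighted average of the cost of a hit and the total cost of a miss."""
    return hit_rate * hit_cycles + (1.0 - hit_rate) * miss_cycles


def throughput_mpixels(clock_mhz, cycles_per_access):
    """Accesses per second, in millions, at a given clock frequency."""
    return clock_mhz / cycles_per_access


cycles = average_access_cycles(0.90, 1, 9)     # about 1.8 cycles
rate = throughput_mpixels(80, cycles)          # about 44 megapixels per second
```

The same functions make it easy to see how sensitive the throughput is to the hit rate: at 80 percent, the average rises to 2.6 cycles and the rate falls to roughly 31 megapixels per second.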

To make sure that the memory access time is the only limiting factor, we designed the rest of the system to support a throughput of approximately one output pixel per clock cycle. This corresponds to a spatial implementation of the algorithm (that is, with little or no time-based resource sharing of the hardware components) but does not require resource-intensive duplication of large hardware units. A spatial implementation requires more area than a time-shared one, but it is simpler to understand, needs fewer multiplexers and is less prone to routing congestion, making it easier to achieve timing closure in FPGAs.

For this reason, we chose a deeply pipelined implementation of the texture-mapping algorithm. Figure 5 depicts a block diagram of this scheme.

The first stages of the pipeline fetch low-bandwidth vertex information from the memory, and then compute the interpolated texture and destination coordinates using a variant of the Bresenham algorithm. We implemented those stages using behavioral Verilog HDL, which the free XST synthesizer (part of the ISE WebPACK design suite) processes to produce an optimized netlist. The address generator can take advantage of the hardware multipliers present in the DSP48A1 slices of the Spartan-6 FPGA to efficiently compute the memory addresses within the texture frame buffer that correspond to the interpolated coordinates. The XST synthesizer automatically infers hardware multipliers from the use of the * operator in the HDL source code, which makes them very easy and convenient to use.
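A Bresenham-style interpolator avoids a division per pixel by carrying an integer error term from one step to the next. The Python model below shows the idea, together with the multiplication used to turn interpolated coordinates into a frame-buffer address. It is a generic sketch, not the variant used in the actual design; the function names and the 2-byte pixel size are assumptions.

```python
def interpolate_int(start, end, steps):
    """Yield steps + 1 integers from start to end, using only additions and
    comparisons inside the loop (steps must be positive)."""
    delta = end - start
    sign = 1 if delta >= 0 else -1
    quotient, remainder = divmod(abs(delta), steps)   # computed once per span
    value, error = start, 0
    for _ in range(steps + 1):
        yield value
        value += sign * quotient
        error += remainder
        if error >= steps:          # accumulated remainder is worth one more unit
            error -= steps
            value += sign


def texel_address(base, x, y, h_res, bytes_per_pixel=2):
    """Byte address of texel (x, y) in a linear frame buffer (assumed layout)."""
    return base + (y * h_res + x) * bytes_per_pixel


span = list(interpolate_int(0, 10, 4))           # texture coordinate along one span
address = texel_address(0x1000, 3, 2, 640)       # one multiply and two additions
```

In hardware, the multiplication in the address calculation is the part that would map onto a DSP slice.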

Things get more complicated when it comes to fetching the texel data from the memory. At each clock cycle, we need to fetch four different pixels from the cache. It would not make sense to have four separate caches, since different channels of the bilinear filter often use data from the same cache line. We therefore need a quadruple-port SRAM, which may seem difficult in an FPGA.

Figure 5: Block diagram shows our deeply pipelined implementation of the texture-mapping algorithm. Labels in the diagram: vertex fetch engine, Wishbone, vertical interpolator, horizontal interpolator, control interface, clamping/wrapping, address generator, texel cache, bilinear filter, write buffer, FastMemoryLink, CSR.

Fortunately, the true dual-port SRAMs of the Spartan-6 FPGAs offer an elegant solution. We can implement quadruple-port SRAM at a moderate cost by using two primitive dual-port SRAMs in which we replicate the data. During normal operation (hits), each port serves one channel. When refilling the cache on a miss, reading is disabled and two of the ports (one per primitive dual-port SRAM) are used to feed the data into the memories.
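The replication trick can be modeled behaviorally in a few lines: two dual-port memories hold identical contents, so four reads can be served at once, while a write uses one port of each memory. The class names below are hypothetical and the model ignores timing; it only illustrates the data organization.

```python
class DualPortRam:
    """Behavioral stand-in for a true dual-port block RAM."""

    def __init__(self, depth):
        self.mem = [0] * depth

    def read(self, addr_a, addr_b):
        return self.mem[addr_a], self.mem[addr_b]   # one read per port

    def write(self, addr, data):
        self.mem[addr] = data                       # uses a single port


class QuadReadRam:
    """Four read ports built from two dual-port RAMs holding the same data."""

    def __init__(self, depth):
        self.banks = (DualPortRam(depth), DualPortRam(depth))

    def write(self, addr, data):
        for bank in self.banks:                     # refill: one port per primitive
            bank.write(addr, data)

    def read4(self, a0, a1, a2, a3):
        r0, r1 = self.banks[0].read(a0, a1)         # channels 0 and 1
        r2, r3 = self.banks[1].read(a2, a3)         # channels 2 and 3
        return r0, r1, r2, r3


ram = QuadReadRam(16)
for addr in range(16):
    ram.write(addr, addr * 10)
texels = ram.read4(1, 2, 5, 6)
```

The cost is a doubling of the storage, in exchange for four independent read addresses per cycle.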

Figure 6 shows a simplified block diagram of the texel cache. At each clock cycle, the texel cache processes, in a pipelined manner, four memory addresses (one from each channel) if they all hit the cache. In that case, the hit signal is kept high and the pipeline keeps running.

In case of a miss, the hit signal goes low (stalling the pipeline), and the priority encoder and the multiplexer (mux) select one of the missed addresses (there can be one or many). The memory bus master issues a memory transaction to retrieve the data from the system memory, replaces the contents of the cache line and rewrites the tag. The address now becomes a cache hit. If no other address misses the cache, the texel cache has successfully handled the four-channel transaction and the hit signal goes high again to proceed to the next cycle. Otherwise, the process repeats until all addresses hit the cache.
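The hit-and-refill sequence described above can be modeled in software as a loop that resolves one missed address at a time until all four channels hit. The sketch makes several simplifying assumptions that are not taken from the design: one texel per cache line, a Python list standing in for system memory, and an error raised when two channels of the same access compete for one line.

```python
class TexelCache:
    """Direct-mapped cache, one texel per line, serving four channels per access."""

    def __init__(self, num_lines, memory):
        self.num_lines = num_lines
        self.memory = memory                  # stands in for the system memory
        self.tags = [None] * num_lines
        self.data = [None] * num_lines
        self.refills = 0

    def _hits(self, address):
        return self.tags[address % self.num_lines] == address // self.num_lines

    def access(self, addresses):
        """Return the requested texels, refilling one missed line at a time."""
        lines = {}
        for a in addresses:
            if lines.setdefault(a % self.num_lines, a) != a:
                raise ValueError("two channels map different addresses to one line")
        while True:
            missed = [a for a in addresses if not self._hits(a)]
            if not missed:                    # hit signal high: the pipeline proceeds
                return [self.data[a % self.num_lines] for a in addresses]
            address = missed[0]               # priority encoder picks one miss
            index = address % self.num_lines
            self.data[index] = self.memory[address]      # memory transaction
            self.tags[index] = address // self.num_lines  # rewrite the tag
            self.refills += 1


memory = list(range(100, 164))
cache = TexelCache(8, memory)
first = cache.access([0, 1, 2, 3])     # cold cache: four refills
second = cache.access([0, 1, 2, 3])    # all hits: no refill
third = cache.access([8, 9, 2, 3])     # two lines are replaced
```

Counting refills against accesses in such a model gives a quick estimate of the hit rate for a given access pattern.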

As you can see, it is possible to implement a decent four-port caching system in a modern FPGA by only doubling the number of block RAMs used for storage and with a very reasonable amount of control logic.

Following the texel cache, the bilinear filter blends together the contributions of the four fetched texels. Here again, our design takes full advantage of the DSP48A1 slices of the Spartan-6 to quickly compute the weighted sums. Finally, the result is stored into the SDRAM-based system memory using a write buffer.

Once integrated in our soft-core system-on-chip, our texture-mapping unit uses only a fraction of the resources of the low-cost Spartan-6 FPGA but provides a peak fill rate of 70 million pixels per second and an average fill rate of 37 million pixels/s. This is much faster than what software alone would provide, even when running on a high-performance (and power-consuming) ASIC CPU, and meets the demands of our application.

HIGHLY FLEXIBLE SINGLE CHIP

High-performance reconfigurable FPGAs make it possible to combine heavy graphics processing (often thought to necessitate ASICs) with very specific I/O interfaces in a highly flexible single chip.

The Milkymist system takes advantage of many features of Spartan-6 FPGAs: I/O delay elements, DDR registers, large true dual-port block RAMs, DSP slices, flexible DCM_CLKGEN elements, the ability to configure from NOR flash and MultiBoot. Our complete design uses only about half of the FPGA resources, which leaves plenty of space for future improvements and features. This is remarkable in a chip as low in cost as the XC6SLX45.

Regarding those future improvements, our whole FPGA design is open source and is available under the same lic
