68
R Issue 60 Second Quarter 2007 THE AUTHORITATIVE JOURNAL FOR PROGRAMMABLE LOGIC USERS Xcell journal Xcell journal THE AUTHORITATIVE JOURNAL FOR PROGRAMMABLE LOGIC USERS INSIDE Programmable Logic in High-Volume Applications The Power of FPGA Architectures Inkjet Power Unleashed Low-Cost Security Solutions with Spartan-3A and Spartan-3AN Platforms Implementing Low-Cost DDR2 Interfaces with Spartan-3A FPGAs INSIDE Programmable Logic in High-Volume Applications The Power of FPGA Architectures Inkjet Power Unleashed Low-Cost Security Solutions with Spartan-3A and Spartan-3AN Platforms Implementing Low-Cost DDR2 Interfaces with Spartan-3A FPGAs www.xilinx.com/xcell/ Targeting the Edge of the Network Targeting the Edge of the Network

Xcell - Xilinx · get your system running fast. Whether you are an expert or just starting out, only Xilinx delivers this complete solution to simplify high-speed serial design. Lowest-power,

  • Upload
    others

  • View
    10

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Xcell - Xilinx · get your system running fast. Whether you are an expert or just starting out, only Xilinx delivers this complete solution to simplify high-speed serial design. Lowest-power,

R

Issue 60Second Quarter 2007

T H E A U T H O R I T A T I V E J O U R N A L F O R P R O G R A M M A B L E L O G I C U S E R S

Xcell journalXcell journalT H E A U T H O R I T A T I V E J O U R N A L F O R P R O G R A M M A B L E L O G I C U S E R S

INSIDE

Programmable Logic in High-Volume Applications

The Power of FPGA Architectures

Inkjet Power Unleashed

Low-Cost Security Solutionswith Spartan-3A and Spartan-3AN Platforms

Implementing Low-Cost DDR2 Interfaces withSpartan-3A FPGAs

INSIDE

Programmable Logic in High-Volume Applications

The Power of FPGA Architectures

Inkjet Power Unleashed

Low-Cost Security Solutionswith Spartan-3A and Spartan-3AN Platforms

Implementing Low-Cost DDR2 Interfaces withSpartan-3A FPGAs

www.xilinx.com/xcell/

Targetingthe Edge of theNetwork

Targetingthe Edge of theNetwork

Page 2: Xcell - Xilinx · get your system running fast. Whether you are an expert or just starting out, only Xilinx delivers this complete solution to simplify high-speed serial design. Lowest-power,

Avnet Electronics Marketing designs, manufactures, sells and

supports a wide variety of hardware evaluation, development

and reference design kits for developers looking to get a

quick start on a new project.

With a focus on embedded processing, communications

and networking applications, this growing set of modular

hardware kits allows users to evaluate, experiment,

benchmark, prototype, test and even deploy complete

designs for field trial.

By providing a stable hardware platform that enhances

system development, design kits from Avnet Electronics

Marketing help original equipment manufacturers (OEMs)

bring differentiated products to market quickly and in the

most cost-efficient way possible.

For a complete listing of available boards, visit

www.em.avnet.com/drc

Support Across The Board.™

Design Kits Fuel Feature-Rich Applications

Build your own system bymixing and matching:

Processors

FPGAs

Memory

Networking

Audio

Video

Mass storage

Bus interface

High-speed serial interface

Available add-ons:

Software

Firmware

Drivers

Third-party development tools

Enabling success from the center of technology™

1 800 332 8638

em.avnet.com

© Avnet, Inc. 2007. All rights reserved. AVNET is a registered trademark of Avnet, Inc.

Page 3: Xcell - Xilinx · get your system running fast. Whether you are an expert or just starting out, only Xilinx delivers this complete solution to simplify high-speed serial design. Lowest-power,
Page 4: Xcell - Xilinx · get your system running fast. Whether you are an expert or just starting out, only Xilinx delivers this complete solution to simplify high-speed serial design. Lowest-power,

Low-Power TransceiversUltimate Connectivity . . .

Reduce serial I/O power, cost and complexity with the world’sfirst 65nm FPGAs.

With a unique combination of up to 24 low-power transceivers,

and built-in PCIe™ and Ethernet MAC blocks, Virtex-5 LXT FPGAs

get your system running fast. Whether you are an expert or

just starting out, only Xilinx delivers this complete solution to

simplify high-speed serial design.

Lowest-power, most area-efficient serial I/O solutionRocketIO™ GTP transceivers deliver up to 3.2 Gbps connectivity

at less than 100 mW to help you beat your power budget. The

embedded PCI Express® endpoint block ensures easy implemen-

tation and reduced development time. Embedded Ethernet MAC

blocks enable a single-chip UNH-verified implementation. And

the Xilinx solution is fully supported by development tools,

design kits, IP, characterization reports, and more.

Visit our website today, view the Webcast, and order your free

eval CD to give your next design the ultimate in connectivity.

www.xilinx.com/virtex5

Pow

er (

Wat

ts)

User Logic

PCIe

User Logic

PCIeStatic Power

Virtex-5 LXT FPGAs (65nm)

Nearest Competitor

(90nm)

Virtex-5 LXT FPGAs (65nm)

Nearest Competitor

(90nm)

5VLX30T vs. 2SGX60D. Target Frequency = 200 MHz. Worst-case process. 25K LUTs, 17K Flip-Flops,1 Mbit On-Chip RAM, 64 DSP Blocks, 128 2.5V I/Os.Based on Xilinx tool v8.2 and competitor tool v6.0.1

3.09

6.22

Are

a (L

UTs

)

Power consumption and area required to implement a typical design including 8-lane PCIe endpoint

25,100

34,600

The Ultimate System Integration Platform

©2007 Xilinx, Inc. All rights reserved. XILINX, the Xilinx logo, and other designated brands included herein are trademarks of Xilinx, Inc. All other trademarks are the property of their respective owners.

Page 5: Xcell - Xilinx · get your system running fast. Whether you are an expert or just starting out, only Xilinx delivers this complete solution to simplify high-speed serial design. Lowest-power,

NNothing distinguishes your work and brings more integrity to your magazine than recognitionfrom your peers.

Recently, the League of American Communications Professionals (LACP) honored issue 59 ofXcell Journal as the Bronze Winner in the 2006 Inspire Awards Newsletter & MagazineCompetition, “Overall” category. The competition was judged by a field of communicationprofessionals, a forum within the public relations industry that facilitates discussion of best-in-class practices within the profession.

“The Xcell Journal entry was remarkable in light of tremen-dous competition,” said Christie Kennedy, LACP ManagingDirector. “We received more than 425 entries for the 2006Inspire Awards, comprising newsletters and magazines fromseven countries.”

Xcell Journal was judged in several categories, including firstimpression, artwork, readability, creativity, message clarity,variety of features, audience focus, perceived relevance, andease of navigation. In the end, the magazine received a score of92 out of a possible 100 points, resulting in a “Superb –among the very best judged” rating.

“We’re very proud that Xcell Journal has been selected as awinner of this year’s LACP Inspire Awards,” said Sandeep Vij, vice president of worldwidemarketing at Xilinx. “Xcell Journal plays an important role in our overall marketing strategy.We work hand-in-hand with our staff and partners to develop editorial content that is rele-vant, compelling, and strategically focused so that the magazine continues to provide maxi-mum value to our readers.”

The global Xcell Publications team is setting the industry standard for innovative customdigital, online, and print publications. I’d like to personally thank all of the authors – fromXilinx, our partners, and our customers – who have contributed their time and effort overthe years to make Xcell Journal an award-winning publication for programmable logic usersaround the world.

L E T T E R F R O M T H E P U B L I S H E R

Xilinx, Inc.2100 Logic DriveSan Jose, CA 95124-3400Phone: 408-559-7778FAX: 408-879-4780www.xilinx.com/xcell/

© 2007 Xilinx, Inc. All rights reserved. XILINX, the Xilinx Logo, and other designated brands includedherein are trademarks of Xilinx, Inc. All other trade-marks are the property of their respective owners.

The articles, information, and other materials includedin this issue are provided solely for the convenience ofour readers. Xilinx makes no warranties, express,implied, statutory, or otherwise, and accepts no liabilitywith respect to any such articles, information, or othermaterials or their use, and any use thereof is solely atthe risk of the user. Any person or entity using suchinformation in any way releases and waives any claim itmight have against Xilinx for any loss, damage, orexpense caused thereby.

Forrest CouchPublisher

PUBLISHER Forrest [email protected]

EDITOR Charmaine Cooper Hussain

ART DIRECTOR Scott Blair

DESIGN/PRODUCTION Teie, Gelwicks & Associates1-800-493-5551

ADVERTISING SALES Dan Teie1-800-493-5551

TECHNICAL COORDINATOR Kevin Kitagawa

INTERNATIONAL Dickson Seow, Asia [email protected]

Andrea Barnard, Europe/Middle East/[email protected]

Yumi Homura, [email protected]

SUBSCRIPTIONS All Inquirieswww.xcellpublications.com

REPRINT ORDERS 1-800-493-5551

Xcell journal

www.xilinx.com/xcell/

Xcell Journal Wins PrestigiousLACP 2006 Inspire Award

Page 6: Xcell - Xilinx · get your system running fast. Whether you are an expert or just starting out, only Xilinx delivers this complete solution to simplify high-speed serial design. Lowest-power,

O N T H E C O V E R

2121N O N - V O L A T I L I T Y

2525S E C U R I T Y

1212P O W E R O P T I M I Z A T I O N

The Power of FPGA ArchitecturesThe present and future of low-power FPGA design.

2929M E M O R Y I N T E R F A C E S

Implementing Low-Cost DDR2 Interfaces with Spartan-3A FPGAsXilinx provides complete memory interface solutions to help you get to market faster.

Inkjet Power UnleashedThe Spartan-3AN FPGA activates digital ubiquitous marking and printing.

Low-Cost Security Solutions with Spartan-3A and Spartan-3AN PlatformsCould your low-cost FPGA design be more secure?

Programmable Logic in High-Volume ApplicationsSpartan FPGAs and CoolRunner-II CPLDs meet the security, differentiation, and flexibility needs of the high-volume market.

Cover Story

1010

Page 7: Xcell - Xilinx · get your system running fast. Whether you are an expert or just starting out, only Xilinx delivers this complete solution to simplify high-speed serial design. Lowest-power,

S E C O N D Q U A R T E R 2 0 0 7, I S S U E 6 0 Xcell journalXcell journal

VIEWPOINTS

Letter from the Publisher .....................................................................................................5

View from the Top – Xilinx PLDs: On the Go and On the Road ................................................8

COVER STORY

Programmable Logic in High-Volume Applications ..................................................................10

FEATURES

Power Optimization

Energy-Efficient System Design ...........................................................................................11

The Power of FPGA Architectures ........................................................................................12

Optimizing FPGA Power with ISE Design Tools ......................................................................16

Non-Volatility

Inkjet Power Unleashed.....................................................................................................21

Security

Low-Cost Security Solutions with Spartan-3A and Spartan-3AN Platforms..................................25

Memory Interfaces

Implementing Low-Cost DDR2 Interfaces with Spartan-3A FPGAs .............................................29

GENERAL

Building a 3-Gbps eSATA/SATA Hardware RAID 5 Solution......................................................33

Implementing a Real-Time Beamformer on an FPGA Platform .................................................36

Using CRC Hard Blocks in Virtex-5 FPGAs .............................................................................42

Implementing Incremental Changes Efficiently ......................................................................46

Picking the Right PROM for Your System .............................................................................50

Accelerating Bilateral Filtering in Xilinx FPGAs .......................................................................53

Ultra-High-Speed Spectral Analysis in Xilinx FPGAs.................................................................56

Spartan-DSP Takes Aim at Affordable DSP Performance .........................................................60

REFERENCE

High-Volume Design Application Notes .................................................................................63

A Quicker Quick Start........................................................................................................64

Educational Opportunities ..................................................................................................65

Page 8: Xcell - Xilinx · get your system running fast. Whether you are an expert or just starting out, only Xilinx delivers this complete solution to simplify high-speed serial design. Lowest-power,

In 2006 Xilinx experienced phenomenalsuccess in the consumer and automotivesegments, with a record 40% growth. Oursuccess in these highly competitive mar-kets can be attributed to broad customeradoption of our Spartan™ FPGAs,CoolRunner™ CPLDs, and Xilinx®

Automotive (XA) product lines in high-volume applications.

We began our market diversificationefforts in the late 1990s, when close to 80%of PLD revenues were generated from thecommunications sector. We set our sightson the consumer market with the intro-duction of Spartan Generation FPGAs(1998) and CoolRunner CPLDs (1999).

At the time, FPGAs in high-volumeapplications were a very rare commodi-ty. This was a market typically served bystandard, fixed-function semiconductordevices such as ASSPs and ASICs.FPGAs and CPLDs were considered tooexpensive in comparison to ASSPs andASICs on a cost-per-unit basis to havean impact in the cost-sensitive con-sumer marketplace.

Xilinx believed otherwise. We exploitedMoore’s Law and increasingly finer semi-conductor manufacturing process geome-tries to dramatically drive down ourper-unit cost. At the same time, globalcompetition quickened the pace at whichnew products hit the market, so the con-

cept of time to market became increasinglyimportant to consumer product manufactur-ers. As a result, product developers focusedon a business dynamic that saw the majorityof their profit and market share being madein a short period of time at and shortly afterproduct launch. The inherent flexibility ofour FPGAs and CPLDs provided an addedtime-to-market advantage over the longdevelopment time of ASICs and ASSPs.

As advanced process technology contin-ues to drive down the cost of PLDs, we arebecoming more and more competitive inhigh-volume consumer applications, espe-cially in digital displays and consumerhandsets. In the last quarter of CY 2006,we shipped more than one million devicesin a single month to one of our top hand-set manufacturers.

In 2004, we continued our diversificationefforts by pursuing the automotive market.We introduced the industry’s first PLD linespecifically developed to meet the stringentrequirements of the automotive industry –the Xilinx Automotive (XA) family. Sincethat time, we have continued to makeinroads in the automotive market. In 2006, aleading luxury vehicle debuted with 18devices that enable infotainment and driver-assistance functions.

Today, Xilinx high-volume products rep-resent 35% of the company’s overall revenues,a compound annual growth rate of 38%.

The Year 2007 and BeyondXilinx pioneered the industry’s transforma-tion towards a focus on vertical marketsand engaging with customers at the systemarchitecture level. Today, this transforma-tion helps us address our customers’ com-plex design challenges by providing them

with innovative, flexible, and compellingsolutions that help them achieve theirobjectives of cost management, time tomarket, and leadership.

Industry analysts project furthergrowth for the PLD industry in both theconsumer electronics and automotivemarkets. New requirements across theconsumer and automotive electronicindustry are also forcing the need for flex-ible architectures that can cope with notonly current applications but future andpossible unknown features. According toiSuppli, the market for PLDs and ASICsin consumer electronics will grow to morethan $5.3 billion by 2010.

To further accelerate the adoption ofPLD designs in the consumer marketplace,we must now compete on more than justcost – our customers are demanding moreapplication-specific solutions. In the faceof rapidly changing consumer productrequirements, we need to deliver a morecomplete solution. Xilinx has respondedby offering domain-optimized PLDs alongwith high-value hard and soft IP cores andmarket-specific development platforms tobring a much more value-added and cus-tomer-specific set of solutions.

Xilinx continues to win designs in amyriad of emerging applications andmarkets. Our PLDs can be found in avariety of consumer electronic and auto-motive applications, including digital dis-plays, mobile phones, PDAs, rear-seatentertainment, and satellite navigationsystems. As it did in 2006, the consumerand automotive segments promise furthergrowth for Xilinx in 2007 as these high-volume application design wins moveinto production.

8 Xcell Journal Second Quarter 2007

Xilinx PLDs: On the Go and On the RoadThe broad adoption of high-volume solutions in CY06 results in record growth.

by Wim RoelandtsCEO and Chairman of the BoardXilinx, Inc.

Viewfrom the top

Page 9: Xcell - Xilinx · get your system running fast. Whether you are an expert or just starting out, only Xilinx delivers this complete solution to simplify high-speed serial design. Lowest-power,

Break Through

Leading the way in high-performance, configurable DSP

Outperform your competition with the high-performance signal

processing delivered by XtremeDSP. With the new Spartan-DSP

series, low-cost implementations are now possible without compro-

mising performance. That opens up an exciting world of innovative

applications in wireless, video, imaging and more.

Jump start your design with XtremeDSP solutions

• XtremeDSP Device Portfolio: Configurable, scalable, high-performance

devices: Spartan-DSP and Virtex-DSP series

• XtremeDSP Design Environment: Flexible, easy-to-use design flow

with Xilinx Signal Processing Libraries, Xilinx System Generator™ for

DSP, and AccelDSP™ high-level MATLAB® language-based tool

• XtremeDSP Development Tools: Application-specific reference

designs, development boards, kits and IP blocks

• XtremeDSP Support: World-class support from ecosystem of partners

and dedicated Xilinx teams

Visit www.xilinx.com/dsp today, order your FREE evaluation soft-

ware, and take your next DSP design to the Xtreme.

www.xilinx.com/dsp

Innovation to the Xtreme

©2007 Xilinx, Inc. All rights reserved. XILINX, the Xilinx logo, and other designated brands included herein are trademarks of Xilinx, Inc. All other trademarks are the property of their respective owners.

Now it’s your turn to innovate...

Page 10: Xcell - Xilinx · get your system running fast. Whether you are an expert or just starting out, only Xilinx delivers this complete solution to simplify high-speed serial design. Lowest-power,

by Kevin KitagawaDirector, Worldwide Marketing – High-Volume ProductsXilinx, [email protected]

Programmable logic plays a key role in high-volume applications such as consumer, auto-motive, and industrial market segments, asOEMs demand lower system costs, highersystem integration, and faster time to mar-ket while also trying to create innovative anddifferentiated products. Customer require-ments also vary widely as OEMs balancecost, performance, and product features totarget different end customers. AlthoughASICs and ASSPs have traditionally playedin these markets, the high initial costs andrisks associated with ASICs and the inflexi-bility of ASSPs are not conducive to timelyand cost-effective products.

Xilinx has addressed OEM needs by pro-viding domain-optimized platforms such asXilinx® Spartan™ series FPGAs andCoolRunner™-II CPLDs. The productfamilies are tailored toward very differentcustomer requirements in a wide variety ofhigh-volume market segments. For exam-ple, CoolRunner-II products are optimizedfor battery-operated portable applicationsthat value power consumption, while thenon-volatile Spartan-3AN platform is opti-mized for applications that require high sys-tem integration or design security. Theflexibility and quick time-to-market advan-tages of programmable logic devices mini-mize the risk and maximize the probabilityof timely and successful products.

Design SecurityAs companies are pressed to reduce over-all product costs, the design and manu-facturing of high-volume products ismoving to areas where lower labor costssignificantly drive down manufacturingcosts. Although OEMs have enjoyed thebenefits of lower manufacturing costs,they have also experienced new prob-lems, as these same contract manufactur-ers sought to clone the OEM’s designs oroverbuild additional units. These unau-thorized manufactured devices are lostrevenue when sold in the marketplaceand can also create additional costs ifdefective units are returned to the com-pany for service or replacement.

As a result, design security has becomeessential. Xilinx products offer a wide rangeof security methods in products uniquelysuited for high-volume applications.

Product Differentiation and FlexibilityOEMs are constantly adapting to stan-dards that have evolved for better per-formance, more flexibility, or to addressexisting limitations. In addition, the rat-ification of a standard can take a longtime, especially as more and more stake-holders participate in the approvalprocess. OEMs must make hard deci-

sions and trade-offs between time to mar-ket and feature set when considering theadoption of these standards.

Field upgrades allow an OEM to bringdevices to market with the flexibility toupdate them after being released to massproduction. This extends the device lifecy-cle tremendously, and without a lot ofinvestment from the OEM. An OEM cannow release devices without fear that anewly endorsed standard will render theirdevice obsolete in the marketplace.

Spartan FPGAs and CoolRunner-IICPLDs allow OEMs to quickly adapt tochanges. Decisions such as protocols,buses, and number of interfaces can all bechanged to quickly address new marketrequirements.

ConclusionThe dynamic nature of high-volume applica-tions is causing unprecedented challenges fordesign engineers. OEMs must deliver prod-ucts that are not only feature-rich and power-efficient, but cost-efficient and quick tomarket as well. The comprehensive portfolioof domain-optimized Spartan series FPGAand CoolRunner-II CPLD products givesdesigners the best choice for their designs.

10 Xcell Journal Second Quarter 2007

Programmable Logic in High-Volume ApplicationsSpartan FPGAs and CoolRunner-II CPLDs meet the security, differentiation, and flexibility needs of the high-volume market.

C O V E R S T O R Y

Page 11: Xcell - Xilinx · get your system running fast. Whether you are an expert or just starting out, only Xilinx delivers this complete solution to simplify high-speed serial design. Lowest-power,

by Ivo BolsensCTOXilinx, [email protected]

In recent years, power consumption as afigure of merit for FPGAs has grown inimportance. Because of the increasing useof FPGAs in high-volume and embeddedapplications, power dissipation and energyefficiency are at the forefront of the systemdesigner’s concerns. Lower power con-sumption simplifies thermal managementand results in smaller (or no) heat sinks,cheaper packages, and lower airflowrequirements, which in turn lead to lowercost and more compact system solutions.Lower current requirements simplify powersupply design, resulting in fewer systemcomponents and less PCB area.

As high-volume applications arebecoming more battery operated, theimportance of energy efficiency (powerconsumed over time) is growing to allowfor longer battery life. In many applica-

tions, an important factor of energy effi-ciency is the static power consumptionthat increases as technology moves tosmaller feature sizes. Moreover, the cost ofenergy consumption plays a growing rolein the total cost and OPEX (operatingexpenditure) of end-customer solutions.

It is well known that, in terms of gigafloating-point operations per second(GFLOPS)/Watt, the FPGA programmablearchitecture is more power-efficient than aprocessor architecture. A high-end Xilinx®

Virtex™-5 component (such as anXC5VLX330) can deliver 3 to 12 times moreGFLOPS (GFLOPS being 64-bit multiply,multiply/add, or add) and consumes 2 to 3times less power than a leading-edge proces-sor architecture (such as today’s dual-corex86 architectures). Therefore, depending onthe application, the high-end Virtex-5 familywill deliver 3 to 30 times betterGFLOPS/Watt than leading-edge processors.Comparing 32-bit floating-point, fixed-point, or integer computations per Wattwould even further benefit the FPGA.

For a given system application, therelentless focus on power optimization hasresulted in Virtex-5 devices consumingbetween 30% to 80% less power thanVirtex-4 devices (depending on whichresources are being used). In the low-costSpartan-3 family, the focus has been fur-ther on ease of use of the power manage-ment and on improving energy efficiencyby providing different power-saving modesduring periods of inactivity.

In contrast to other figures of merit,such as cost and performance, the efficien-cy of power consumption cannot be cap-tured by one single comparative number.Different application characteristics anduse models can result in different benefitsfrom all of the possible power optimiza-tions. “The Power of FPGA Architectures”in this issue of the Xcell Journal demon-strates the effectiveness of design tech-niques such as sleep modes, power andclock gating, voltage scaling, and how spe-cialized hardware depends heavily on thesystem application at hand. “OptimizingFPGA Power with ISE Design Tools,” alsoin this issue, highlights the importance ofintelligent design tools and sophisticatedplace and route algorithms that can exploitthe characteristics of an application to opti-mize power consumption of the finalimplementation on an FPGA.

Designing power-efficient systems intoan FPGA encompasses a holistic optimiza-tion process that includes an understandingof all system constraints, the selection of theoptimal FPGA architecture, and the opti-mization of the design during all steps inthe design flow, from architecture to finalplace and route. Aggressive scaling and costreduction will continue to move FPGAsfurther into higher volume and lower powerproducts. This will result in a constantimprovement of FPGA architectures anddesign tools to meet the requirements oflower power and energy efficiency.

Second Quarter 2007 Xcell Journal 11

Energy-Efficient System DesignPower optimization techniques in silicon and software have become mainstream considerations in FPGA system design.

P O W E R O P T I M I Z A T I O N

Page 12: Xcell - Xilinx · get your system running fast. Whether you are an expert or just starting out, only Xilinx delivers this complete solution to simplify high-speed serial design. Lowest-power,

by Tim Tuan Staff Research Engineer Xilinx, [email protected]

Steve TrimbergerDistinguished EngineerXilinx, [email protected]

Reducing power consumption in FPGAsdelivers numerous benefits such as betterreliability, lower cooling cost, simpler powersupply and delivery, and longer battery lifein portable systems. Designing for lowpower consumption without compromisingperformance requires a power-efficientFPGA architecture and good design prac-tices to leverage the architectural features.

In this article, we’ll present a briefoverview of FPGA power consumption,current low-power features, user choicesthat impact power, and look at recent low-

power research for insight into futuretrends on power-efficient FPGAs.

Power ComponentsFPGA power consumption has two com-ponents: dynamic power and static power.Dynamic power is dissipated when signalscharge capacitive nodes. These capacitivenodes may be inside logic blocks, routingwires in the interconnect fabric, externalpackage pins, or board-level traces drivenby chip outputs. Total FPGA dynamicpower is the combined power from charg-ing all capacitive nodes.

Static power, on the other hand, hasnothing to do with the activity of the cir-cuit. It is dissipated either as transistorleakage current or as bias current. Totalstatic power is the combined total of eachtransistor’s leakage power and all bias cur-rents in the FPGA. As transistor dimen-sions shrink, dynamic power improvesbecause it depends on the side of driven

capacitances. However, static powerincreases because smaller transistors leakmore. As a result, static power is becomingan increasingly large fraction of the overallpower consumption of integrated circuits.

Power consumption is highly dependenton supply voltage and temperature, as shownin Figure 1. Reducing the FPGA supply volt-age results in a quadratic reduction indynamic power and an exponential reduc-tion in leakage power. Increasing temperatureresults in an exponential increase in leakagepower. For example, an increase from 85 °Cto 100 °C increases leakage power by 25%.

Power BreakdownLet’s examine the breakdown of totalFPGA power to understand where most ofthe power is consumed. FPGA power isdesign-dependent; namely, it depends onthe part family, clock frequency, togglerate, and resource utilization. In thisanalysis, we’ll use the Xilinx® Spartan™-3

12 Xcell Journal Second Quarter 2007

The Power of FPGAArchitecturesThe Power of FPGAArchitectures The present and future of

low-power FPGA design.The present and future oflow-power FPGA design.

P O W E R O P T I M I Z A T I O N

Page 13: Xcell - Xilinx · get your system running fast. Whether you are an expert or just starting out, only Xilinx delivers this complete solution to simplify high-speed serial design. Lowest-power,

XC3S1000 FPGA as an example andassume that clock frequency is 100 MHz,the toggle rate is 12.5%, and resource uti-lization is typical as determined by bench-marking many real designs.

Figure 2 shows the breakdown of theXC3S1000 FPGA in terms of activepower and standby power. Active power isreported as the power of an active designat high temperature, which comprisesboth dynamic and static power. Standbypower is the power of an idle design,

comprising static power at nominal tem-peratures. Not surprisingly, CLBs makeup the largest component in both activepower and standby power, but otherblocks contribute significant power aswell. I/Os and clock circuitries make up athird of total active power, which wouldbe even higher if you used high-poweredI/O standards.

of a multiplier built from FPGA fabric. Asmanufacturing variations can lead to largespreads in leakage current, Spartan-3LFPGAs screen for low-leakage parts toeffectively provide a part with 60% lesscore leakage power.

Beyond what is designed into theFPGA, many design choices also affectFPGA power consumption. Let’s examinesome of these choices.

Power EstimationA key step in low power design is powerestimation. Although the most accurateway to determine FPGA power consump-tion is through hardware measurements,estimation can help identify high-powermodules and is useful for power budgetingearly in the design cycle.

Several tools are available to aid powerestimation: Xilinx Power Estimator (XPE)and Xilinx Power Analyzer (XPA). XPEgives quick power estimates through a sim-ple user interface. Its estimation is based onhigh-level statistics such as the number oflogic cells, number of block RAMs, andaverage switching activity. XPA givesdetailed power estimates based on simulat-ed switching activity and exact utilizationstatistics generated post-place and route.Choosing the right tool depends on whatdesign information is available and whatlevel of accuracy is needed.

As shown in Figure 1, some external fac-tors have exponential effects on power con-sumption; slight changes in environmentcan cause large changes in estimated power.Exact accuracy is difficult to achievethrough power estimation tools, but theynevertheless provide excellent guidance forpower optimization by identifying high-power blocks.

Voltage and Temperature ControlAs shown in Figure 1, both lower voltageand lower temperature can lower leakagecurrent significantly. Just a 5% lower supplyvoltage can lead to 10% lower power.Supply voltages can be easily adjusted bychanging the power supply configuration.Current FPGAs do not support wide volt-age adjustment; the recommended voltagerange is typically +/- 5%. Junction temper-

Configuration and clock circuitries con-sume nearly half of total standby power,largely because of bias currents. Therefore,total chip power reduction must comefrom multiple solutions that address allmajor power-consuming parts.

Designing for Low PowerMany power-driven design techniques areemployed in the design of an FPGA.Because configuration memory cells canmake up a third of the transistors in an

FPGA, Xilinx uses a low-leakage “midox”transistor in Virtex™ families to reduceleakage in memory cells. Transistors withlonger channel length and higher thresh-olds are also used throughout to reducestatic power. Dynamic power consumptionis addressed with low capacitance circuitsand custom blocks. A multiplier in a DSPblock consumes less than 20% of the power

Second Quarter 2007 Xcell Journal 13

0 10 203 04 0 506 07 0 80 90 100

Temperature (°C)

Rela

tive P

ow

er

Dynamic Power

Leakage Power

1.3V1.2V

1.1V

1.0V

XC3S1000 Active Power

CLB53%

Block RAM2%

I/O22%

Clock16%

Config7%

CLB

39%

Block RAM

5%9%

16%

31%

XC3S1000 Standby Power

Config

Clock

I/O

Figure 1 – Power dependency on voltage and temperature

Figure 2 – Breakdown of a Spartan-3 XC3S1000 FPGA’s typical power consumption

P O W E R O P T I M I Z A T I O N

Page 14: Xcell - Xilinx · get your system running fast. Whether you are an expert or just starting out, only Xilinx delivers this complete solution to simplify high-speed serial design. Lowest-power,

ature can be reduced by using cooling solu-tions such as heat sinks and airflow. A 20 °Creduction in temperature can lead to a morethan 25% reduction in leakage power.Lower temperature also exponentiallyimproves chip reliability. Studies haveshown that a 20 °C reduction can lead to a10x improvement in overall chip life.

Suspend and Hibernate ModesSpartan-3A FPGAs provide two low-poweridle states. In suspend mode, the circuitryon the VCCAUX power supply is disabledto reduce leakage power and eliminate biascurrents, reducing static power consump-tion by more than 40%. Chip configura-tion and circuit state are retained. Exitingfrom suspend mode is triggered by assert-ing an awake pin. The process takes lessthan 1 ms.

The hibernate mode in Spartan-3Adevices allows all power regulators to beswitched off to achieve zero power. Torestart, the part must be re-powered andreconfigured, which takes tens of millisec-onds. While powered off, all I/Os are in ahigh impedance state. If any I/Os need tobe actively driven during hibernate mode,the corresponding I/O bank must remainpowered, consuming a small amount ofstandby power.

I/O Standard ChoicesDifferent I/O standards have considerablydifferent levels of power consumption. Youcan achieve significant reduction by choos-ing a lower power I/O standard at theexpense of speed or logic utilization. Forexample, LVDS is a high power consumer,with 3 mA per input pair and 9 mA peroutput pair. Thus, from a power perspec-tive, LVDS should only be used when it isrequired by system specifications or whenthe highest performance is needed.

A lower power, higher performancealternative to LVDS is HSTL or SSTL, butthese still consume 3 mA per input. Whenpossible, we recommend LVCMOS inputsinstead. Lastly, DCI standards are highpower consumers. When connecting tomemory devices such as RLDRAM, con-sider using ODT on the memory andLVDCI on the FPGA to save power.

Embedded BlocksUsing embedded blocks instead of pro-grammable fabric can save a substantialamount of power. Xilinx FPGAs have anumber of embedded blocks such as thePowerPC™ hard-core processor, DSPslices, ChipSync™ technology, embeddedEthernet MAC(s), FIFOs, and SRL16.Embedded blocks are custom-designed;hence they are smaller and have less switch-ing capacitance than programmable logic.These blocks have between 5x and 12xlower power than equivalent programma-ble logic implementations. Using embed-ded blocks can lead to static powerreduction if the design becomes smallerand fits into a smaller part. A potential pit-fall is that very simple functions may not bemore efficiently implemented using largeembedded blocks. This can be easily avoid-ed by checking both implementationsusing XPE.

Clock GeneratorsPower considerations in clock generationcan save power. Digital clock managers arewidely used to generate clocks with differ-ent frequencies or phases. However,DCMs consume a non-trivial amount ofpower off VCCAUX; therefore, their useshould be limited when possible. A singleDCM can often generate multiple clocksby using multiple outputs such as CLK2X,CLKDV, and CLKFX. This is a lowerpower solution than using multiple DCMsfor the same function.

Block RAM ConstructionMultiple block RAMs are often combinedto create a single large RAM. How this isdone can have significant power implica-tions. The timing-driven method is toaccess all RAMs in parallel. For example, a2k x 36 RAM would be constructed out offour 2k x 9 RAMs. The access time of thislarger RAM is the same as that of a singleblock RAM; however, the power consump-tion per access is that of four block RAMs.

A low-power solution is to construct thesame 2k x 36b RAM out of four 512 x 36bRAMs. Each access would be pre-decodedto select one of the four block RAMs toaccess. Although the access time is

increased because of pre-decoding, thepower-per-access of the larger RAM isapproximately the same as that of a singleblock RAM.

Low-Power ResearchIn recent years, Xilinx Research Labs hasstudied various techniques for low-powerFPGA design. These research items are notcurrently implemented in commercialFPGAs, but they do offer an insight intowhat FPGAs might look like in the future.

Voltage ScalingAs mentioned earlier, reducing supply volt-age is one of the most effective ways toreduce power, and the associated perform-ance degradation is acceptable to manydesigns that do not need peak performance.However, current FPGAs operate in a smallrange, and one of the limitations is foundin some voltage-sensitive circuits.

At Xilinx Research Labs, we redesignedthe CLB circuits to operate at a much lowervoltage, enabling a substantial trade-off ofperformance headroom for lower power.For example, for a 90-nm process, a 200-mV reduction could bring a 40% powerreduction at the expense of 25% peak per-formance; a 400-mV reduction could bringa 70% power reduction at the expense of55% peak performance.

Fine-Grain Power SwitchingOne of the unique overheads of program-mable logic design is that not all on-chipresources are used in any given design.However, they remain powered and con-tribute to total power in the form of leak-age power. Block-level power switching canshut off power to individual unused blocks.Each block is coupled to the power supplythrough a power switch. When the switchis closed, the block is functional. When theswitch is open, the block is effectively dis-connected from the power supply, and soconsumes 50-100x less leakage power. Thegranularity of the power switches may be assmall as individual CLBs and block RAMs.In our design, these power switches can beprogrammable through configuration bit-streams or controlled by the user directly orthrough an access port. Benchmarking real

14 Xcell Journal Second Quarter 2007

P O W E R O P T I M I Z A T I O N

Page 15: Xcell - Xilinx · get your system running fast. Whether you are an expert or just starting out, only Xilinx delivers this complete solution to simplify high-speed serial design. Lowest-power,

Second Quarter 2007 Xcell Journal 15

designs shows that fine-grain power switch-ing can reduce leakage power by 30%.

Deep-Sleep ModeOne of the key requirements in portableelectronics is to consume little or no powerwhen the device is idle. InSpartan-3A FPGAs, you canachieve this by entering hiber-nate mode, which requiresexternal control, is slow towake up, and cannot restorethe FPGA state. In our design,we dynamically control thefine-grain power switchesdescribed previously to powerdown all internal blocks whileleaving the configuration andcircuit state storage elementspowered. The resulting state isa deep-sleep mode where leak-age power is 1%-2% of nomi-nal, the FPGA state is saved,and exiting this mode takesmicroseconds.

Heterogeneous FabricThe maximum clock frequency of a circuitis determined by the delays of its timing-critical paths. Non-critical paths can beslower without affecting total chip per-formance. In large systems, a few blocksmay be speed-critical, such as the data pathin a processor, while other blocks may benon-critical, such as caches.

Today, FPGAs are homogeneous interms of power and speed; every CLB hasthe same power and speed characteristics.Power can be saved in a heterogeneousarchitecture, where some blocks are lowpower (and slower), by implementing non-critical blocks in the low power blocks.Doing so does not impact overall chip per-formance, as the timing-critical blocks arenot compromised.

One method of creating a heteroge-neous architecture is to distribute two corepower rails, a high-voltage rail (VDDH)and a low-voltage rail (VDDL). Each partof the FPGA selects from one or the otherusing embedded power switches and corre-spondingly takes on high-speed or low-power characteristics. Voltage selection is

complete once you have a detailed timingof the design, so only the non-criticalblocks should operate from VDDL.

Another method of creating a hetero-geneous architecture is to divide theFPGA into different regions that are pre-

fabricated to be high speed and lowpower. Regions can be implemented withdifferent supply voltages, different thresh-olds, or through a number of other designtrade-offs. To avoid performance degrada-tion, design tools must map timing-criti-cal parts of the design into the high-speedregions and non-critical parts into the lowpower regions.

Low-Swing SignalingAs FPGAs increase in capacity, more andmore power is consumed in on-chip pro-grammable interconnect. An effective way toreduce this communication power is to uselow-swing signaling, where the voltage swingon the wire is much lower than the supplyvoltage. Low-swing signaling is commonlyfound today in scenarios where communica-tion takes place over high-capacitance wiressuch as buses or off-chip links. Low-swingdrivers and receivers are more complex thanCMOS buffers, so they consume more sili-con area. However, as on-chip interconnectgrows to be a larger portion of overall power,the power benefits of low-swing signaling

will justify the added design complexity. Ofcourse, FPGA users never see the differentvoltages of their internal signals.

Figure 3 depicts an FPGA architecturewith some of the preceding concepts. Theprogrammable fabric is heterogeneous,

comprising both high-speed and low-powerregions. An on-chip power-mode controllermanages various power-down modes, be itdeep sleep, suspend, or hibernate. Withinthe fabric, each logic tile can be powered offusing dedicated power switches.Communication through the routing fabricgoes through low-swing drivers andreceivers to reduce interconnect power.

ConclusionBeyond the power optimizations currentlyused in the design of modern FPGAs, anumber of user design decisions can yieldsignificant power benefits. Looking for-ward, more aggressive architectural solu-tions are available to contain powerconsumption in future technology genera-tions and enable new FPGA applications.

In addition to the solutions described inthis paper, Xilinx is also engaged in varioussoftware power optimization research anddevelopment work. These efforts aredescribed in the article, “OptimizingFPGA Power with ISE Design Tools,” alsoin this issue of the Xcell Journal.

Low-Swing

Drivers/Receivers

Power

Rails

Power Switches

Logic

Tiles

Power Mode

Controller

Heterogeneous

Fabric

High Speed

Low Power

Figure 3 – Concept architecture with various solutions for power reduction

P O W E R O P T I M I Z A T I O N

Page 16: Xcell - Xilinx · get your system running fast. Whether you are an expert or just starting out, only Xilinx delivers this complete solution to simplify high-speed serial design. Lowest-power,

by Subodh Gupta, Ph.D.Staff Software Engineer, Design Software DivisionXilinx, [email protected]

Jason Anderson, Ph.D.Principal Engineer, Design Software DivisionXilinx, [email protected]

In the more than 20 years since Xilinxinvented the FPGA, research and develop-ment has produced dramatic improve-ments in FPGA speed and area efficiency,narrowing the gap between FPGAs andASICs and making FPGAs the platform ofchoice for implementing digital circuits.Today, power consumption is a rising con-cern for FPGA vendors and their cus-tomers. Reducing the power of FPGAs iskey to lowering packaging and coolingcosts, improving device reliability, andopening the door to new markets such asmobile electronics.

Xilinx is taking the lead in offering low-power FPGA solutions. In this article, we’llillustrate the application of computer-aideddesign (CAD) techniques, such as thoseincorporated into Xilinx® ISE™ version9.2i software, for effective power reduction.

16 Xcell Journal Second Quarter 2007

Optimizing FPGA Power with ISE Design ToolsOptimizing FPGA Power with ISE Design ToolsYou can reduce the power dissipated in Virtex-4, Virtex-5, and Spartan-3 designs through new power-driven back-end flows.You can reduce the power dissipated in Virtex-4, Virtex-5, and Spartan-3 designs through new power-driven back-end flows.

P O W E R O P T I M I Z A T I O N

Page 17: Xcell - Xilinx · get your system running fast. Whether you are an expert or just starting out, only Xilinx delivers this complete solution to simplify high-speed serial design. Lowest-power,

Power dissipation in CMOS circuitscomprises both static (leakage) power anddynamic power. Dynamic power is causedby transitions on the signals of a circuit andis governed by the equation:

where Ci represents the capacitance of a sig-nal, i; fi is referred to as “switching activity”and represents the rate of transitions on sig-nal i; and V is the supply voltage.

Static power, on the other hand, is con-sumed when a circuit is in a quiescent, idlestate. Static power results from leakage cur-rents in off transistors, primarily sub-threshold and gate-oxide leakage. Anoff-MOS transistor is an imperfect insula-tor allowing sub-threshold leakage currentto flow between its drain and source termi-nals. Gate-oxide leakage is caused by tun-neling current through the gate terminal ofa transistor to its body, drain, and source.

Technology scaling, such as the recentmove to 65 nm, means lower supply volt-ages and smaller transistor dimensions,leading to shorter wire lengths, less capaci-tance, and an overall reduction in dynamicpower. Smaller process geometries alsomean shorter transistor channel lengths andthinner gate oxides, producing an increasein static power as technology scales.

Power Dissipation in FPGAsThe programmability and flexibility ofFPGAs make them less power-efficient thancustom ASICs for implementing a givenlogic circuit. The FPGA configuration cir-cuitry and configuration memory consumesilicon area, producing longer wire lengthsand higher interconnect capacitance.Programmable routing switches attach tothe pre-fabricated metal wire segments inthe FPGA interconnect and add to thecapacitive load incurred by signals.

Most dynamic power in an FPGA isconsumed in the programmable routingfabric. Likewise, static power is proportion-al to total transistor width. The intercon-nect comprises a considerable fraction of anFPGA’s transistors and therefore dominates

∑∈

××=signalsi

iidynamic VfCP 2

2

1

ment of a design and then uses a forceabstraction to move logic blocks away fromhighly congested regions, ultimately pro-ducing a feasible, overlap-free placement.Once analytical placement is finished,swap-based local optimization is run on theplaced design to further refine the place-ment. The traditional cost function used inour placer considers wire length and timingin this equation:

Cost = a × W + b × T

where W and T are the wire length andtiming costs, respectively, and a and b arescalar weights. The values of a and b can beset according to the relative priority of tim-ing versus wire length. The placer costingscheme is illustrated in Figure 1.

As actual routes are not available duringplacement, the wire length cost depends onwire length estimation. Likewise, the tim-ing cost depends on the user-supplied con-straints and estimates of connection delays.To optimize power, we have extended theanalytical placement and local optimiza-tions by adding a power component to thecost function, as shown on the right-handside of Figure 1. The modified cost func-tion is as follows:

Cost = a × W + b × T + c × Pdynamic

where Pdynamic is the estimated dynamicpower (as defined previously) and c is ascalar weight. You can extract signal switch-ing activity data from simulation and supply

leakage. Consequently, the interconnectfabric should be a prime target for FPGApower optimization.

Certainly, power can be addressedthrough process technology, hardwarearchitecture, or circuit-level changes.Virtex™-5 FPGAs, for example, contain“diagonal” interconnect resources, allowingconnections to be made with fewer routingconductors and reduced interconnectcapacitance. At the transistor level, Virtex-4 and Virtex-5 FPGAs employ triple-oxideprocess technology for leakage mitigation.Three possible oxide thicknesses are avail-able for each transistor, depending on itsspeed, power, and reliability requirements.The proliferation and increased use of hardIP blocks such as DSPs and processors canalso lower power consumption versusimplementing such functionality in thestandard FPGA fabric.

It is also possible to reduce power withoutincurring any costly hardware changes. Youcan tackle power issues through new power-driven CAD algorithms and design flows,such as those in ISE 9.2i software.

Power Optimization in ISE 9.2i Design ToolsISE software version 9.2i incorporatespower optimization in placement, routing,and through a post-routing technique thatreduces power within logic blocks.

PlacementThe core algorithm in the Xilinx placeruses analytical (mathematical) techniques.It begins with an initial overlapped place-

Second Quarter 2007 Xcell Journal 17

Timing

Cost

Placer

User

Constraints

Delay

Estimator

Power

Cost

Capacitance

Estimator

Switching

Activity

Data

Wirelength

Estimator

Wirelength

Cost

Figure 1 – Placer costing scheme incorporating power

P O W E R O P T I M I Z A T I O N

Page 18: Xcell - Xilinx · get your system running fast. Whether you are an expert or just starting out, only Xilinx delivers this complete solution to simplify high-speed serial design. Lowest-power,

it to the tool. Alternately, if you do not pro-vide any switching activity data, then thetool assumes a default switching activity forthe primary inputs and sequential outputsand propagates activity to the rest of thesignals, considering the logic functionality.For best results, user-supplied switchingactivity data is required.

During placement, the capacitance of asignal is not known and must therefore beestimated. We have constructed an empiri-cally derived capacitance estimation modelbased on signal parameters available duringplacement:

Ci = f (FOi, XSi, YSi)

where f represents a generic mathematicalfunction, Ci is the capacitance of signal i,FOi is the number of fanouts of signal i, andXSi is the X-span and YSi is the Y-span ofsignal i in the placement, respectively. Theseparameters are architecture-independentand are readily available during placement.

To build the model, we extracted thecapacitance, number of fanouts, X-span,and Y-span for each signal in a suite ofdesigns collected from Xilinx customers.We then used least-squares regressionanalysis to fit the capacitance to a quadrat-ic function of the model parameters. Onaverage, the analytical equation gives a30% error for a wide range of designs.

RoutingOnce the logic blocks are assigned to physi-cal locations on the FPGA, we must routethe connections between the blocks. Therouter uses a negotiated congestion-routingalgorithm, where during initial iterationsshorts between signals are permitted. In lateriterations, the penalty for creating shorts isincreased, until no routing conductor isused by more than one signal. Timing-criti-cal connections are routed to minimize theirdelay, which involves computationallyintensive RC delay calculations. Most con-nections, however, are not timing-critical.

In the power-aware router, we choose tooptimize the capacitance of such non-criti-cal connections. To achieve this, wechanged the router’s cost function for non-timing critical connections to considercapacitance, as opposed to the priorapproach of basing cost on other factorssuch as estimated delay or scarcity. To illus-trate the algorithm, consider the routinggraph of Figure 2.

Each node in the graph represents arouting conductor or logic block pin and

each edge represents a programmable rout-ing switch. The router must select a pathbetween the source and target pins. Theoriginal cost and capacitance cost of eachnode is shown internal to each node in thefigure. To minimize the original costs, therouting between the source and target pinwould have taken the path shown in blue.However, in the power-aware flow, therouter will use the path shown in green,which has lower overall capacitance.

Results from Power-Aware Placement and RoutingA set of industrial designs were placed androuted using both the traditional place androute flow, as well as the power flowdescribed earlier. The designs were aug-

mented with built-in automatic input vec-tor generation by attaching a linear feed-back shift register-based (LFSR-based)pseudo-random vector generator to the pri-mary inputs. This permitted us to performboard-level measurements of dynamicpower without requiring a large number ofexternally applied waveforms.

The industrial designs were mapped intoSpartan-3, Virtex-4, and Virtex-5 devices.Results showed that dynamic power wasreduced as much as 14% for Spartan-3

FPGAs, 11% for Virtex-4 FPGAs, and 12%for Virtex-5 FPGAs. On average, across alldesigns, dynamic power was reduced by12% for Spartan-3 FPGAs, 5% for Virtex-4 FPGAs, and 7% for Virtex-5 FPGAs. Forall families, on average, the speed perform-ance hit was between 3% and 4%, which issmall and we believe acceptable in power-conscious designs. Considering that theseare initial results coming from softwarechanges alone, we consider the power ben-efits quite encouraging.

Reducing Power within Logic BlocksThe place and route optimizations dis-cussed in this article aim to lower power inthe interconnect fabric. We also devised anapproach to reduce power within logic

18 Xcell Journal Second Quarter 2007

1/10

5/6

5/3

1/4

1/3

2/3

1/5

10/5

1/2 1/2

Source Pin

Original Cost/

Capacitance Cost

Target Pin

Figure 2 – Routing graph with capacitance costs on nodes

The place and route optimizations discussed in this article aim to lower power in the interconnect fabric.

P O W E R O P T I M I Z A T I O N

Page 19: Xcell - Xilinx · get your system running fast. Whether you are an expert or just starting out, only Xilinx delivers this complete solution to simplify high-speed serial design. Lowest-power,

Second Quarter 2007 Xcell Journal 19

blocks, specifically in look-up-tables(LUTs) when not all LUT inputs are used(Figure 3). A K-input LUT is a small mem-ory capable of implementing any logicfunction with no more than K inputs.Figure 3 shows how a hypothetical three-input LUT (having inputs A1, A2, and A3)implements a two-input logical-ANDfunction. The truth table for the logical-AND is shown in the LUT SRAM con-tents on the left side of the multiplexer tree.

Note that input A3 is unused in Figure 3.Normally, unused inputs are treated as “don’tcares” and assumed to be either 0 or 1.Consequently, to account for this in the caseof Figure 3, the Xilinx software “replicates”

the logic function in both the top and bot-tom half of the LUT SRAM memory con-tents. Unused LUT inputs occur frequentlyin customer designs, especially in Virtex-5designs, where LUTs have six inputs.

In the Virtex-5 hardware, unused LUTinputs are pulled high to logic-1; thisproperty is the root of our optimization. IfA3 is pulled to logic-1, the bottom inputto the deepest two-input multiplexer inthe tree is never selected. However, becausethe logic function is replicated in both thetop and bottom half of the LUT memorycontents, toggling can occur on internalmultiplexer nodes n1 and n2 based onchanges in signal inputs A1 and A2. Thistoggling consumes dynamic power need-lessly, as transitions on n1 and n2 are never

propagated to the LUT output.This optimization entails detecting

unused LUT inputs at the post-routingstage and setting LUT memory contents tobe logic-0 so as to eliminate unnecessaryinternal toggling and without disruptingthe logic functionality. Returning to theexample of Figure 3, the bottom half of theLUT memory contents would be set tologic-0. No toggling would occur on inter-nal nodes n1 and n2, and therefore nodynamic power would be consumed bycharging or discharging n1 and n2.

We conducted board-level power meas-urements to evaluate this optimization onindustrial designs and observed that it

provides several percentage points ofdynamic power savings. These results areespecially promising, as this optimizationis “no cost” in the sense that it can be car-ried out post-routing and causes no areaor performance penalty.

ConclusionMany future directions exist for furtherreducing power in design tools. In front-end HDL synthesis, proven optimizationsfrom the ASIC domain can be adapted forFPGAs, such as clock gating and operandisolation. As well, FPGA-specific poweroptimizations can be applied, such as map-ping logic into available block RAMs(which can be used as large ROMs) insteadof using LUTs and the general fabric.

Power-aware logic synthesis and activity-driven technology mapping to LUTs are well-studied in the literature and will yieldconsiderable power reductions for XilinxFPGAs. In placement, improved capacitanceestimation accuracy will yield even higherpower reductions.

Two areas that we feel are particularlyfertile are glitch and leakage optimization.Glitches are spurious transitions that occuron signals caused by unequal path delays incircuits. Such transitions are unnecessary,yet they play a significant role in dynamicpower. CAD techniques to mitigate glitchesinclude equalizing path delays or insertingregisters along the most “glitchy” paths.

Leakage in a digital CMOS circuit dependsstrongly on the circuit’s applied input state.Therefore, one approach to leakage reduc-tion in CAD is to automatically alter the cir-cuit so that its signal values spend more timein low-leakage states.

The results show that significant strideshave already been made in reducing powerthrough ISE design tools. We are optimisticthat the future is bright for achieving addi-tional power reductions in software. Power-conscious solutions comprising power-awareCAD algorithms and power-optimizeddevices such as Virtex-5 FPGAs form a com-pelling story. Continued advances in low-power software and hardware will open thedoor for Xilinx FPGAs entering new power-sensitive markets.

A1

3-Input

LUTA3 O

O

SRAM CellA1

n1

n2

Never selectedbecause unused A3 pulledto logic-1 in hardware

UnusedA3 assumed

to be 'X'logic function

duplicatedby today's

software

A2 A3 (unused)

A3

1

0

0

0

1

0

0

0

Figure 3 – Optimizing power within a LUT

P O W E R O P T I M I Z A T I O N

Page 20: Xcell - Xilinx · get your system running fast. Whether you are an expert or just starting out, only Xilinx delivers this complete solution to simplify high-speed serial design. Lowest-power,

SystemBIST™ enables FPGA Configuration that is less filling for your PCB area, less filling for your BOM budget and less filling for your prototypeschedule. All the things you want less of and more of the things you do want for your PCB – like embedded JTAG tests and CPLD reconfiguration.

Typical FPGA configuration devices blindly “throw bits” at your FPGAs at power-up. SystemBIST is different – so different it has three US patentsgranted and more pending. SystemBIST’s associated software tools enable you to develop a complex power-up FPGA strategy and validate it.Using an interactive GUI, you determine what SystemBIST does in the event of a failure, what to program into the FPGA when that daughterboardis missing, or which FPGA bitstreams should be locked from further updates. You can easily add PCB 1149.1/JTAG tests to lower your down-stream production costs and enable in-the-field self-test. Some capabilities:

� User defined FPGA configuration/CPLD re-configuration

� Run Anytime-Anywhere embedded JTAG tests

� Add new FPGA designs to your products in the field

� “Failsafe” configuration – in the field FPGA updates without risk

� Small memory footprint offers lowest cost per bit FPGA configuration

� Smaller PCB real-estate, lower parts cost compared to other methods

� Industry proven software tools enable you to get-it-right before you embed

� FLASH memory locking and fast re-programming

� New: At-speed DDR and RocketIO™ MGT tests for V4/V2

If your design team is using PROMS, CPLD & FLASH or CPU and in-house software to configure FPGAs please visit our website at http://www.intellitech.com/xcell.asp to learn more.

Copyright © 2006 Intellitech Corp. All rights reserved. SystemBIST™ is a trademark of Intellitech Corporation. RocketIO™ is a registered trademark of Xilinx Corporation.

Page 21: Xcell - Xilinx · get your system running fast. Whether you are an expert or just starting out, only Xilinx delivers this complete solution to simplify high-speed serial design. Lowest-power,

by Alex BretonCEO and Founder PrintDreams [email protected]

Marking products and packaging is cur-rently a cumbersome process that involvesgood planning and several work steps. Inmost cases, it occurs in production lines asproducts pass through a fixed printing sta-tion located aside the conveyor line.

Post-production marking occurs mainlythrough labeling, which requires label rollsor sheets to be pre-printed. The labels mustthen be applied to the product or packaging.

RMPT (random movement printingtechnology) from PrintDreams makes mark-ing more flexible, easier, and even fun.Powered by Xilinx® Spartan™-3 FPGAs,our printing technology acts as a magicwand – whatever it touches gets marked. Allyou need to do is to move the device acrossthe area to be marked and the technologywill take care of the rest. Figure 1 shows astorage box marked with RMPT technology.

Second Quarter 2007 Xcell Journal 21

Inkjet Power UnleashedInkjet Power UnleashedThe Spartan-3AN FPGA activates digital ubiquitous marking and printing.The Spartan-3AN FPGA activates digital ubiquitous marking and printing.

N O N - V O L A T I L I T Y

Page 22: Xcell - Xilinx · get your system running fast. Whether you are an expert or just starting out, only Xilinx delivers this complete solution to simplify high-speed serial design. Lowest-power,

The secret resides in advanced sensortechnology in combination with effectivecontrol algorithms to determine what,when, and where to print. We cancontrol all inkjet nozzles as a col-umn. We can also control every noz-zle independently if movement withthree degrees of freedom is required.By using the Spartan-3AN FPGAplatform, we can configure newparameters very quickly. The FPGAaccesses the content to be printeddirectly from its internal block RAMif content size is limited.

RMPT printers are hand-helddevices that are portable and bat-tery operated. The unique dual-mode power management (withstatic power reductions of asmuch as 99%) combined withgreat performance is somethingthat our engineers and customershighly appreciate.

Unlimited Sizes and FormatsThere are basically no size or formatlimits for what you can achieve withRMPT technology. The greatreconfigurable nature of XilinxFPGAs makes it possible to move into newformats very fast. We can get them to workas separate controllers, adding new entitiesto achieve larger print swaths.

In this sense, another useful feature inthe Spartan-3AN platform is the built-inmulti-boot feature, which allows us to usethe same hardware design to run differentconfigurations in the geometry of print-head nozzles and print-head blocks. Thissaves us money and new design efforts.

When we invented RMPT, we brokeone of the holiest unspoken rules in theprinter industry: “The width of the printerdevice by default always supersedes thewidth of the media to be printed.”Suddenly, the physical limit went all the

to become the bottleneck of further sizereduction, while in the old technologicalapproach there was always plenty of space

in the large housings. The on-chipflash of the Spartan-3AN platform istherefore a feature we welcome verymuch, as it helps us to additionallyreduce footprint size and externalcomponents.

The extremely compact size sur-prises users, who immediately won-der how they can insert a letter-sizesheet of paper through such a smalldevice. Instead, now it is the printerthat goes to the paper (or anythingelse, for that matter). One of our ini-tial customers came up with a sloganthat summarizes the technologyquite well: “Take the printer to theproject.”

Affordable and Environmentally Friendly TechnologyOur technology is very competitiveeven when compared to currentsolutions, both because of the afford-ability of Spartan-3 FPGAs and the

fact that RMPT is basically solidstate, without complex

mechanical setups.

way down to the print-head cartridge size,a battery, and an electronic board. Withour new approach, the PCB footprint starts

The extremely compact size surprises users, who immediately wonder howthey can insert a letter-size sheet of paper through such a small device. Instead, now it is the printer that goes to the paper (or anything else, for that matter).

Figure 1 – PrintDreams RMPT technology deployed on a storage box

Figure 2 – PrintDreams printing devices feature just a small PCB with a Xilinx Spartan-3 XC3S400AN FPGA as the main component.

22 Xcell Journal Second Quarter 2007

N O N - V O L A T I L I T Y

Page 23: Xcell - Xilinx · get your system running fast. Whether you are an expert or just starting out, only Xilinx delivers this complete solution to simplify high-speed serial design. Lowest-power,

Second Quarter 2007 Xcell Journal 23

N O N - V O L A T I L I T Y

If you take apart any desktop orportable printer device, you will noticehow many mechanical parts are involved –sometimes more than 200 pieces. It makesthe design and manufacture of printerdevices a complex, energy-consuming, andcostly process to set up. Every part requiresspecial tooling, plus a step in a long assem-bly line. The industry eats up hundreds oftons of steel and plastic every year for themanufacture of printing devices.

In our case, all we basically need is onesmall PCB and a print head. Our devicescan weigh about as much as a cell phoneand are extremely easy to recycle. Theyeliminate the need for labels and are anexcellent choice for marking, coding, andprinting specific tasks. The only area whereRMPT printing is not competitive is forprinting larger print jobs (more than 5-10pages at a time). We do not aim to competewith very high-quality printing either.Instead, we want to compete with ourcompact size, affordability, and flexibility.

Too Much Flexibility?Our technology is not dependent on cer-tain types of substrates. Provided that asuitable ink is available, there are no limitsfor the types of substrates on which RMPTcan print. Cardboard, wood, fabrics, andplastics are the most common materials,but we even have solutions to print ontometal and glass if required.

One of the first segments where RMPTwas introduced was the crafts industry,through a product called the Xyron DesignRunner. It is amazing to see the amountand variety of materials that scrapbookersuse for their projects. We have been sur-prised to see some projects done in corkand others in very rough fabrics.

For some of our upcoming products –this time in the toy industry – the greatflexibility of our technology becomes evena concern. We would not want childrenprinting “digital graffiti” on floors andwalls. The most hair-raising scenario(mainly for parents than for anyone else)would be children painting each otherusing our devices, because skin is justanother one of the various substrates whereour technology has no problem printing.

In other areas, this flexibility is wel-come. Because we can control not onlycommon inkjet print heads but also othertypes of equipment such as laser heads, wecan easily arrange more advanced and per-manent marking of metal parts or otherhard substrates. Again, the flexibility ofXilinx FPGAs is indispensable to us.

Pushing Technology to its LimitsThe most critical issue during technicaldevelopment of the advanced version ofRMPT has been the real-time aspects. Inmany industries, such as telecom and net-working, the delays that define real time aremeasured in milliseconds. We measure realtime on a microsecond level.

Our system must be able to handle asmany as 2.5 million decisions every second,deciding whether to print from one of eachof the print-head nozzles. In a typical 600-dpi inkjet print head, this corresponds tomore than 300 nozzles. That exerts animmense pressure on both the hardwareand software components.

It has been a bottleneck eliminationprocess all along. Whereas we once used aDSP to serially run more than 300 calcula-tions, we have now enhanced our system tooperate at high capacity and with great par-allelization, without losing the ability toreconfigure new types of print heads. AnASIC would have locked us into probablyone single type of print-head architecture.

Being the first with a technology likethis, we are of course very keen on protect-ing our IP. Therefore, the new Spartan-3AN platform security features that protectagainst cloning and reverse engineeringeliminated our outstanding concerns dur-ing the selection of computing technologyfor our devices.

ConclusionHigh flexibility and performance withgood security at an affordable price are thefeatures that led us to choose the XilinxSpartan-3AN platform. Even better powermanagement and a smaller footprint makethe platform simply irresistible.

For more information aboutPrintDreams and its RMPT technology,visit www.printdreams.com.

GetPublished

Would you like to be published

in XcellPublications?

It's easier than you think!

Submit an article draft for our Web-based

or printed Xcell Publications and we will

assign an editor and a graphic artist

to work with you to make your work

look as good as possible.

For more information on this

exciting and highly rewarding program,

please contact:

Forrest Couch

Publisher, Xcell Publications

[email protected]

Page 24: Xcell - Xilinx · get your system running fast. Whether you are an expert or just starting out, only Xilinx delivers this complete solution to simplify high-speed serial design. Lowest-power,

HAPS – HARDI ASIC Prototyping Systema high performance, high capacity FPGA platform for ASIC prototyping and emulation composed of multi-FPGA boards and standard or custom-made daughter boards

HapsTraka set of rules for pinout and mechanical characteristics, which guarantees compatibility with previous and futuregeneration HAPS motherboards and daughter boards

HARDI Electronics Inc., 26831 Magdalena Lane, Mission Viejo, CA 92691, (949) 202-5572

www.hardi.com, [email protected]

Virtex-II, Virtex-II Pro, Virtex-4, and Virtex-5 are registered trademarks of Xilinx Inc.

Hap

sTrak

2003co

mpat i b

le

Hap

sTrak

2004co

mpat i b

le

Hap

sTrak2005c

o

mpat i b

le

Hap

sTrak

2007co

mpat i b

le

HAPS-10

HAPS-30

HAPS-20

HAPS-50

Page 25: Xcell - Xilinx · get your system running fast. Whether you are an expert or just starting out, only Xilinx delivers this complete solution to simplify high-speed serial design. Lowest-power,

by Maureen SmerdonStrategic Marketing ManagerXilinx, [email protected]

Security has become a hot topic: whetherboarding a plane, closing the front door, orbeginning a next-generation circuit design,security is a significant issue. For designers,the greatest threat comes from the alarmingamount of counterfeited products on themarket as a result of design thefts.According to the Anti-CounterfeitingCoalition, the estimated dollar exchangeassociated with counterfeiting throughoutthe United States in 2003 was $287 billion,or 63% of the total $456 billion of coun-terfeit products sold annually worldwide.

In this article, I’ll describe Xilinx® secu-rity measures that can protect your low-cost FPGA designs.

The Top Three Security ThreatsThe most common security breach in elec-tronics design is reverse engineering. Thisoccurs when a thief attempts to recreate orrebuild a product with the intent of sellingit on the open market at a lower cost.Through reverse engineering, thieves canbuild designs much faster without incur-ring the expense of R&D, quite probably ata lower cost.

Today, as companies have moved to out-source manufacturing, they are subject tonew security breaches called overbuildingand cloning. In overbuilding, the out-sourced manufacturer simply manufacturesmore units than the OEM (original equip-

ment manufactur-er) ordered. Theseadditional units are soldwithout authorization fromthe OEM.

Cloning is when a thief createsa duplicate of your design, IP, orproduct under the same (or another) label.Again, cloners do not incur any R&Dcosts. Both overbuilt and cloned productshave a drastically reduced time to market.

What remains unknown are the intangi-bles associated with such security breaches.Whether a product is reverse-engineered,overbuilt, or cloned, it means a substantialrevenue loss for the OEM. In addition to theloss of revenue, there is also a cost associatedwith quality in the form of returns. Thiscould affect the brand image and potentiallyadd to the OEM’s bottom line due toincreased RMAs (return materials authoriza-tions) or technical support to determinewhat the issue is and how to resolve the endcustomer’s problem. Ultimately, it may be

difficult to deter-mine if the product is real

or counterfeit. These losses arepermanent and unrecoverable.

Security Using DeviceDNATraditionally, FPGAs use some kind of bit-stream encryption to guard against reverseengineering and cloning. In yesteryear’smodel, this worked well. It will not, how-ever, protect you in today’s world againstoverbuilding.

As a designer, how can you protect yourdesign against all three security thefts? Xilinxhas a few solutions, recently introducing theSpartan™-3A and Spartan-3AN device fam-ilies with DeviceDNA to help you fend offcloners, overbuilders, and reverse engineers.

The DeviceDNA design-level securityprotects the design, the IP, and the embed-ded code. DeviceDNA is a specific 57-bitID that is unique to each device. This 57-bit ID is contained in an area on the FPGAand is fixed or set at the Xilinx factory; it

Second Quarter 2007 Xcell Journal 25

Low-Cost Security Solutions withSpartan-3A and Spartan-3AN Platforms Could your low-cost FPGA design be more secure? Check out what Xilinx has to offer with the newest additions to the Spartan-3 generation.

S E C U R I T Y

Page 26: Xcell - Xilinx · get your system running fast. Whether you are an expert or just starting out, only Xilinx delivers this complete solution to simplify high-speed serial design. Lowest-power,

cannot be altered. Both Spartan-3A andSpartan-3AN FPGAs contain a unique IDin each device shipped.

This ID is then combined with a design-er’s personalized algorithm and stored on theFPGA. The algorithm is basically an arith-metic equation that defines how to take theDeviceDNA and create a result. The resultcan then be stored anywhere, such as in theexternal memory or in flash. The algorithmis the secret to the security because only thedesigner knows it. Although it is stored onthe FPGA, to an onlooker it just looks likepart of the bitstream.

Spartan-3A SecurityFor Spartan-3A devices, the algorithmcompares the result using DeviceDNA tothe result stored in the flash after the devicehas been configured. If they match, thedesign is authorized. If they do not match,then the design can be set up to behave ina variety of ways, from slight to severe func-tional impairment.

Let’s look at an example of authentica-tion in our daily life. Say you stop at a localfast food restaurant for a snack. You are outof cash and are forced to use your ATMcard (DeviceDNA), which only you areauthorized to use. You place your order andthen swipe the card. The machine asks foryour PIN number (personalized algo-rithm). The system then compares the PINnumber you entered with the numberstored at the bank. If it matches, you getyour snack. If not, you will go hungry.

The potential weakness is if someonehas both your ATM card and your PINnumber. The PIN authorization algorithmnumber, once learned, is easily cloned. Thisis why the authorization algorithm is incor-porated into the design itself. The algo-rithm is placed in the most secret locationinside of programmable logic, with mil-lions of configuration options.

Spartan-3AN SecurityFor the Spartan-3AN platform, our newnon-volatile FPGA, the process is almost thesame – with a few enhancements. The firstsecurity enhancement is that the bitstream ishidden inside the FPGA. This makes itmore difficult for someone to monitor.

not breached. Because the algorithm isunknown, it is the key to the design-levelsecurity. The algorithm is implemented inthe fabric of the FPGA; therefore, itbecomes a handful of bits within the mil-lions of configuration bits in the FPGA.Unless you know how the bits fit togetheror what the algorithm is, it looks like just amass of numbers. Figure 1 outlines a possi-ble flow using Spartan-3AN devices.

The Spartan-3AN design-level securityshown in Figure 2 is a completely self-contained security solution. The flashcontains both the FPGA configurationbitstream and a previously generated

The second security enhancement thatthe Spartan-3AN FPGA has are twounique serial numbers, the DeviceDNAand a factory flash ID, found in flashmemory. The two unique IDs give morethan 70 bytes of serial numbers, resultingin a large number of algorithmic possibil-ities and therefore increasing the amountof time it would take to breach theauthentication algorithm. Now thedesign is specifically tied to both theFPGA and the flash IDs.

Applying the earlier analogy, having twounique IDs is like requiring two differentATM cards to get a snack.

The third improvement is in the storedauthorization code. On the Spartan-3ANplatform, the authorization code can bestored on-chip in a special one-time pro-grammable 64-byte register called the FlashUser Field. This allows the complete secu-rity system to be self-contained. With noneed for external interfaces or storage, over-all security increases and reverse engineer-ing is more difficult.

The authentication algorithm is user-defined, allowing you to implement theright level of security within your designbudget. The authentication algorithm isalso the primary secret in the security sys-tem. Something in the authenticationprocess must be a secret so that security is

64-Byte

Flash ID

Personalized

Algorithm

Compare

Authorized

Calculated

Authorization

Code

Stored

Authorization

CodeMat

chNo M

atch

Unauthorized

57-Bit

DeviceDNA

64-Byte Flash

User ID

Device DNA

Algorithm

Flash IDStored

AuthorizationCode

FPGA

Flash

Figure 1 – Possible security setup with Spartan-3AN FPGAs

Figure 2 – Spartan-3AN device with security

26 Xcell Journal Second Quarter 2007

S E C U R I T Y

Page 27: Xcell - Xilinx · get your system running fast. Whether you are an expert or just starting out, only Xilinx delivers this complete solution to simplify high-speed serial design. Lowest-power,

Second Quarter 2007 Xcell Journal 27

authorization code. This code is stored inthe one-time programmable Flash UserField by a trusted/secured manufactureror registration process.

At power up, the FPGA configuresnormally. Once configured, the FPGAapplication includes circuitry that vali-dates the design authorized to operate onthe associated Spartan-3AN FPGA. TheDeviceDNA and factory flash ID will beread by the authentication algorithm,which in turn generates an active author-ization code that is compared to the pre-viously generated authorization codestored in the Flash User Field. If bothcodes are equal, the device is authenticat-ed. Otherwise, the device is illegitimateand unauthorized.

Access DeniedThe handling of failed authentication isanother one of the strengths of theDeviceDNA design-level approach.Authentication can be completely inte-grated into the design. Thus, multipleresponses can result from an unauthorizeddesign, such as:

• No functionality – the design com-pletely stops functioning

• Limited functionality – primary or keycircuits are disabled or bypassed

• Time bomb – full functionality foronly a limited period of time

• Active defense – the system monitorsactivities and defends against attack

• Permanent self-destruction – erases allflash content and permanently lock-downs flash to all zeros

The design-level security described hereis the basic level of security that can beachieved within the Spartan-3A andSpartan-3AN platforms.

ConclusionThe security measures in Spartan-3A andSpartan-3AN platforms provide many waysto protect from reverse engineering, over-building, and cloning. To learn more aboutsecuring your low-cost FPGA designs, seethe Spartan Generation Configuration User Guide at www.xilinx.com/bvdocs/userguides/ug333.pdf.

S E C U R I T Y

Page 28: Xcell - Xilinx · get your system running fast. Whether you are an expert or just starting out, only Xilinx delivers this complete solution to simplify high-speed serial design. Lowest-power,
Page 29: Xcell - Xilinx · get your system running fast. Whether you are an expert or just starting out, only Xilinx delivers this complete solution to simplify high-speed serial design. Lowest-power,

by Adrian CosoroabaMarketing ManagerXilinx, [email protected]

Applications can generally be classified intotwo categories: high performance, wheregetting the highest bandwidth is para-mount; and low cost, where the cost of thesystem is more important. For high-endapplications, Xilinx offers Virtex™-5FPGAs, which are capable of meeting thehighest bandwidth requirements.

However, not all systems are pushingmemory performance limits. DDRSDRAMs and DDR2 SDRAMs, with datarates running below 400 Mbps per pin, areadequate for most low-cost systems. Forthese applications, Xilinx offers theSpartan™-3 Generation, comprisingSpartan-3, Spartan-3E, Spartan-3A, andSpartan-3AN FPGAs.

In this article, I’ll explain the architectureof a DDR2 SDRAM memory interfaceimplementation with Spartan-3 GenerationFPGAs and how Xilinx® tools and hard-ware-verified reference designs can help youbuild your own memory interface.

Second Quarter 2007 Xcell Journal 29

Implementing Low-Cost DDR2Interfaces with Spartan-3A FPGAsXilinx provides complete memory interface solutions to help you get to market faster.

M E M O R Y I N T E R F A C E S

Page 30: Xcell - Xilinx · get your system running fast. Whether you are an expert or just starting out, only Xilinx delivers this complete solution to simplify high-speed serial design. Lowest-power,

Memory Controller and DDR2 Interface Three fundamental building blocks com-prise a controller and memory interface foran FPGA-based design: the read and writedata interface; the memory controller statemachine; and the user interface, whichbridges the memory interface design to therest of the FPGA design (Figure 1).

In Spartan-3 Generation FPGAs, theuser interface is a handshaking-type inter-face. When you send a command (read orwrite) along with the address and data forwrite, the user-interface logic respondswith the “user_cmd-ack” signal when thenext command can follow.

The functional blocks implemented inthe fabric are clocked by the output of thedigital clock manager (DCM), which alsodrives the LUT (look-up table) delay cali-bration monitor. This calibration circuit isused to select the number of LUT-basedelements required to delay the read datastrobe (DQS) with respect to read data(DQ), in order to properly align it for cap-ture inside the FPGA.

During a read transaction, the DDR2SDRAM device sends the DQS and associ-

falling-edge (FIFO_1) FIFOs. These 16-entry-deep FIFOs are asynchronous, withindependent read and write ports.

The transfer of DQ from the DQS clockdomain to the memory controller clockdomain occurs through these asynchronousFIFOs. Data can be simultaneously readout of both FIFO_0 and FIFO_1 in thememory controller clock domain. FIFOread pointers are generated in the FPGAinternal clock domain. The generation ofwrite enable signals (FIFO_0 WE andFIFO1_WE) occurs using the DQS and anexternal loop-back or normalization signal.

ated data to the FPGA edge-aligned withthe DQ. Capturing the DQ is a challeng-ing task, because data changes at every edgeof the non-free-running DQS strobe.

DQ capture is implemented using theLUTs in the configurable logic blocks(CLBs). The DQ capture implementationuses a tap-delay mechanism based on LUTs.The DQS clocking signal is delayed to pro-vide enough of a timing margin. The captureof DQ is implemented in dual-port LUT-based RAM (Figure 2). The LUT RAM isconfigured as a pair of FIFOs; each data bitis input into the rising-edge (FIFO_0) and

DCMInput_clock FPGA_Clock

DQS

DQ

DM

and Control

Clocks all modules in fabricUser_data_valid

User_output_data

User_input_data

User_cmd_ack

User_burst_done

User_command

User_address

User_data_mask

LUT Delay

Calibration

Monitor

Write Datapath

Read

Capture

User Interface

Controller

DDR2

SDRAM

Spartan-3 Generation

FPGA

FIFO

FIFO

LUT

Delay

LUT

Delay

Delayed DQS

FIFO 0

FIFO 1 User Data

FIFO_1_WE

FIFO_1_WE

FIFO_0_WE

FIFO_0_WE

DQ

Normalization

Signal

CLK

IOB CLB

Spartan-3

GenerationDDR2

SDRAMAddress

Data/DQS

Figure 1 – DDR 2 SDRAM interface implementation for Spartan-3 Generation FPGAs

Figure 2 – DQ capture FIFO implementation for Spartan-3 Generation FPGAs

30 Xcell Journal Second Quarter 2007

M E M O R Y I N T E R F A C E S

Page 31: Xcell - Xilinx · get your system running fast. Whether you are an expert or just starting out, only Xilinx delivers this complete solution to simplify high-speed serial design. Lowest-power,

Second Quarter 2007 Xcell Journal 31

The external normalization signal is driv-en to an input/output block (IOB) as anoutput; it is then taken as an input throughthe input buffer. This technique compen-sates for the IOB, device, and trace delaysbetween the FPGA and the memory device.

The write data interface generates andcontrols write data commands and tim-ings. The write data interface uses IOBflip-flops and the DCM’s 90-, 180-, and270-degree outputs to transmit a DQSstrobe properly aligned to the commandand data bits, per DDR and DDR2SDRAM timing requirements.

There are other aspects of the design,including the overall controller statemachine logic generation and user inter-face. To make the complete solution easierfor designers, Xilinx developed theMemory Interface Generator (MIG) tool.

Controller Design and Integration with the MIG Software ToolIntegrating all of your building blocksincluding the memory controller statemachine is essential for the completeness ofyour design. Controller state machines varywith memory architecture and systemparameters. State machine code can also becomplicated – and a function of many vari-ables – such as:

• Architecture (DDR, DDR2)

• Number of banks (external or internalto the memory device)

• Data bus width

• Memory device width and depth

• Bank and row access algorithms

Finally, parameters like data-to-stroberatios (DQ/DQS) can add further com-plexity to the design. The controller statemachine must issue the commands in thecorrect order while considering the timingrequirements of the memory device.

The complete design can be generatedwith the MIG software tool, freely avail-able from Xilinx as part of the ISE™ soft-ware CORE Generator™ suite ofreference designs and IP. The MIG designflow is very similar to the traditionalFPGA design flow. The benefit for design-

ers is that with this software tool, there isno need to generate RTL code fromscratch for the physical layer interface orthe memory controller.

To set system and memory parameters,you use the MIG’s graphical user interface(GUI). For example, after selecting theFPGA device, package, and speed grade, youcan select the memory architecture and evenpick the actual memory device or DIMM(dual inline memory module). This sameGUI provides a selection of bus widths andclock frequencies. Other options providecontrol of the CAS (column access strobe)latency, burst length, and pin assignments.

In less than a minute, the MIG toolgenerates RTL and UCF files (the HDLcode and constraints files, respectively).These files are generated using a library ofhardware-verified reference designs, withmodifications based on your inputs. The

output files are categorized in modules thatapply to different building blocks of thedesign, such as user interface, physicallayer, and controller state machine.

You also have complete flexibility to fur-ther modify the RTL code. Unlike othersolutions that offer “black-box” implemen-tations, the code is not encrypted, provid-ing complete flexibility to change andfurther customize the design. After anyoptional code changes, you can performadditional simulations to verify the func-tionality of your overall design.

The MIG tool also generates a synthe-sizable test bench with memory checkercapability. The test bench is a designexample used in the functional simula-tion and hardware verification of theXilinx reference design. The test benchissues a series of writes and read-backs tothe memory controller. You can also use itas a template to generate your own cus-tom test bench.

The final stage of the design is to importthe MIG files in the ISE project, mergethem with rest of your FPGA design files(followed by synthesis and place and route),run additional timing simulations if neces-sary, and perform hardware verification.The MIG software tool generates a batchfile with the appropriate synthesis, map,and place and route options to help youoptimally generate the final bit file.

Hardware Verification and Development BoardsHardware verification of reference designsis an important final step to ensure a robustand reliable solution.

Xilinx has fully verified the implementa-tion of a DDR2 SDRAM memory inter-face for Spartan-3A FPGAs in hardware.We implemented a DDR2 SDRAM designusing a low-cost Spartan-3A starter kitboard. Our design uses the on-board 16-

bit-wide DDR2 SDRAM memory deviceand the XC3S700A-FG484 FPGA. Thereference design uses only a small portionof the Spartan-3A device’s availableresources: 13% of IOBs, 9% of logic slices,16% of BUFG MUXs, and one of the eightDCMs. This implementation leaves plentyof resources for other functions.

Xilinx has verified memory interfacedesigns for various Spartan-3 GenerationFPGAs. Table 1 shows hardware-verifiedmemory interfaces for each of the Spartan-3Generation FPGA development boards.

ConclusionYou can speed up your memory interfaceand controller designs with low-costSpartan-3 Generation FPGAs, the MemoryInterface Generator (MIG) tool, and Xilinxdevelopment boards.

For more information and details onmemory interface solutions, visitwww.xilinx.com/memory.

Xilinx FPGA Spartan-3 FPGAs Spartan-3E FPGAs Spartan-3A FPGAs

Development Board SL361 Starter Kit 3E Starter Kit 3A

Memory Interfaces Supported DDR DDR DDR2

Table 1 – Low-cost development boards for memory interfaces

M E M O R Y I N T E R F A C E S

Page 32: Xcell - Xilinx · get your system running fast. Whether you are an expert or just starting out, only Xilinx delivers this complete solution to simplify high-speed serial design. Lowest-power,

FREE on-line tutorialswith Demos On Demand

A series of compelling, highly technical product

demonstrations, presented by Xilinx experts, is now

available on-line. These comprehensive videos provide

excellent, step-by-step tutorials and quick refreshers

on a wide array of key topics. The videos are segmented

into short chapters to respect your time and make for

easy viewing.

Ready for viewing, anytime you are

Offering live demonstrations of powerful tools, the

videos enable you to achieve complex design require-

ments and save time. A complete on-line archive is

easily accessible at your fingertips. Also, a free DVD

containing all the video demos is available at

www.xilinx.com/dod. Order yours today!

©2007 Xilinx, Inc. All rights reserved. XILINX, the Xilinx logo, and other designated brands included herein are trademarks of Xilinx, Inc. All other trademarks are the property of their respective owners.

Page 33: Xcell - Xilinx · get your system running fast. Whether you are an expert or just starting out, only Xilinx delivers this complete solution to simplify high-speed serial design. Lowest-power,

by Paul ChuDirector of Product Marketing Accusys, Inc. [email protected]

Jason LinHardware Design ManagerAccusys, Inc. [email protected]

Cliff TsaiSenior Field Application EngineerXilinx Taiwan Pte. [email protected]

RAID (redundant array of independentdisks) technology has been widely adoptedin systems with hard disks for protectingdata, ensuring system availability, andimproving system performance. Accusyshas delivered many forms of RAID prod-ucts, including the low-cost, high-perform-ance ATA hardware RAID 5 solution.

To build next-generation eSATA RAIDproducts supporting the 3-Gbps SATA2standard, Accusys used a Xilinx®

Virtex™-4 FX FPGA to implement theeSATA host interface controller as well as theRAID engine. In this article, we’ll explain thebenefits of a 3-Gbps eSATA RAID solutionand the role of the Virtex-4 FX FPGA.

Second Quarter 2007 Xcell Journal 33

Building a 3-Gbps eSATA/SATAHardware RAID 5 SolutionBuilding a 3-Gbps eSATA/SATAHardware RAID 5 SolutionXilinx FPGAs help RAID architect Accusys deliver innovative RAID storage.

Page 34: Xcell - Xilinx · get your system running fast. Whether you are an expert or just starting out, only Xilinx delivers this complete solution to simplify high-speed serial design. Lowest-power,

Introduction to RAIDDifferent types of RAID algorithms offerdifferent performance, reliability, andcapacity. RAID 0 can improve I/O per-formance by concurrently distributing dataover multiple disk drives, which also offersbigger space than the capacity of a singledisk drive. However, if any of the diskdrives in the RAID fails, the data will belost. To protect the data, extra informationassociated with that data must be stored incase the original data has to be regenerated.

RAID 1 stores data to two disk drivessimultaneously, so data is retained whenone disk drive fails. But it is very costlybecause double storage capacity is required.RAID 5 offers the most cost-effective solu-tion because only the capacity of one diskdrive is needed for protecting data on anynumber of other disk drives. It also deliverssuperior performance because data is distrib-uted to multiple disk drives, like RAID 0. Asa result, RAID 5 is the most popularRAID level adopted in most of the storagesystems deployed.

Advantages of Hardware RAIDRAID processing is performed either bysoftware running on the host computer(software RAID) or by a dedicated RAIDprocessor (hardware RAID). However, formost mission-critical systems or heavy-dutyapplications, a hardware RAID solution ispreferred because it can deliver better per-formance with optimized hardware thansoftware RAID, which consumes consider-able resources on the host computer.Hardware RAID also provides superior reli-ability because it can regenerate the data of adisk drive in a shorter time; thus, the RAIDcan be recovered faster to a normal state.Therefore, mainstream RAID products areall implemented based on hardware RAID.

High-Speed and Low-Cost eSATA One of the key features of a RAID solutionis its host interface, by which the host com-puter accesses the storage presented by theRAID solution. There are two commonlyused host interfaces: PCI-X/PCIe for RAIDcards and SCSI family connections likeFibre Channel, or serial attached SCSI forRAID systems. Despite its high perform-

capacity wasted is too great and there is noperformance benefit.

Unlike other eSATA RAID solutions onthe market, the Accusys ACS-76000 RAIDsystem on chip offers hardware RAID 5.However, its performance is limited by the1.5-Gbps SATA1 speed. As high-definitionmedia content and other applicationsbecome popular, a higher performancesolution will be necessary.

Implementation OverviewTo achieve higher performance, the firstrequirement is to upgrade the storageinterface from SATA1 to SATA2, so youcan achieve 3 Gbps and support NCQ.The second requirement is to have a high-performance and standard bus for con-necting to a processor. We chose thePCI-X bus because it is commonly used inhigh-performance embedded systems.Finally, to support hardware RAID 5 pro-cessing, an exclusive OR (XOR) engine isneeded to calculate the parity and regener-ate data. Figure 1 is a block diagram of theSATA2-SATA2 RAID controller board.

We selected the Xilinx Virtex-4 FXFPGA because it provides the multi-gigabitserial I/O for implementing the 3-GbpsSATA host interface and can support theSATA device-mode controller logic forconnecting to the digital parallel bus inter-face of the MGT hard core. We also imple-

ance, using RAID solutions with these twointerfaces is too troublesome for users with-out IT expertise because of the complicatedinstallation procedure. On the other hand,the eSATA interface, which makes it easy toattach external storage devices, is gainingpopularity in desktop computers.

The eSATA features a high-speed serialconnection running at 3 Gbps and sup-ports native command queuing (NCQ) foroptimizing multi-tasking applications ormulti-stream video processing. In addition,there is no need to install extra drivers oradd-on cards for accessing eSATA devices,so its cost is lower and it is easier to use. Asa result, RAID storage devices using eSATAas the host interface are good solutions interms of performance and ease of use.

Prior ArtMost existing eSATA RAID products on themarket do not support hardware RAID 5,which largely restricts their applicationfields. The RAID implementation of theseproducts are derived from SATA port mul-tipliers, offering only simple packet dis-patching and supporting only basic RAIDlevels like RAID 0 or RAID 1. RAID 0 canserve high-performance applications likevideo editing, but users are bothered by theroutine backup and unexpected data lossescaused by bad sectors and drive crashes.RAID 1 provides data protection, but the

PBI Bus

DD

R I

/F

PCI-X Bus

RS-232

Flash

DDR SDRAM

DIMM

Accusys PCI-X to

SATA II Device

Controller

To Host SATA I/F

To Disk SATA I/F

PCI-X to SATA II

Host Controller

Microprocessor

I2C GPIO

Figure 1 – Block diagram of SATA2-SATA2 RAID controller

34 Xcell Journal Second Quarter 2007

Page 35: Xcell - Xilinx · get your system running fast. Whether you are an expert or just starting out, only Xilinx delivers this complete solution to simplify high-speed serial design. Lowest-power,

Second Quarter 2007 Xcell Journal 35

mented a 64-bit/133-MHz PCI-X con-troller and an XOR controller on theFPGA. A PCI-X-based external processorcan parse commands from the SATA hostand return status. In addition, the XORcontroller can directly read the sourcememory data on the PCI-X bus and calcu-late the XOR data for the RAID 5 encod-ing. All of these tasks are done without theprocessor’s intervention. Please refer toFigure 2 for a block diagram of the FPGA.

Implementing the 3-Gbps eSATAThree different layers are defined in theSATA controller: the transport layer, linklayer, and physical layer (PHY):

1. Transport layer. The SATA commandexecution and data transfer occurs byexchanging frame information struc-tures (FISs), which contain ATA regis-ters or disk data. The transport layerconstructs FISs from the PCI-X busfor transmission, decomposes receivedFISs, and passes them to the PCI-Xbus. In short, the transport layer is theinterface between the link layer andthe higher application layer, which inour design is the PCI-X bus.

2. Link layer. The link layer is responsi-ble for packet framing, 8b/10b encod-ing and decoding, and generating and

checking the CRC codes. The linklayer also handles flow control, buffer-ing data and exchanging primitives asneeded to accommodate burst trans-fers. The Virtex-4 FX device’s multi-gigabit transceiver (MGT) embeds the8b/10b encoder and decoder, as wellas a CRC-32 generation circuit, whichprovides tremendous convenience forbuilding the SATA link layer.

3. Physical layer. The Virtex-4 FX FPGA’s6.5-Gbps RocketIO™ transceiver pro-vides the basic building block of theSATA physical layer, responsible for de-serialization of received data from thehost PHY and for serialization of out-bound 10b encoded data from the linklayer. It also supports out-of-band(OOB) signals for initializing the con-nection between the SATA host anddevice. To support external 3-GbpsSATA2, Gen2m electrical specificationsare required in the implementation ofthe SATA physical layer.

Implementing the PCI-X BusThe PCI-X controller includes both themaster and target function of a 64-bit/133-MHz PCI-X device. It isequipped with a buffer RAM for storingdata from the SATA transport layer and

from the PCI-X bus for processing thesplit transaction flow. In addition, anembedded PCI-X arbiter is implementedto provide more flexible control for thePCI-X master. The DMA controller playsan important role in this design because itcan provide an efficient bi-directional datatransfer between FISs (from the SATAtransport layer) and the PCI-X bus. As aresult, the performance of bulky datatransfer significantly improves.

Implementing the XOR ControllerFor applications like video editing orrecording, efficient XOR calculation ismandatory to meet the demanding require-ments for low I/O latency and highthroughput. The XOR controller isdesigned with as many as 256 data sourcesand 128 chained commands. Each of thesecan calculate 64-KB parity data. With theXOR controller implemented on theFPGA, the processor and the memorybandwidth are freed from performing thelengthy and routine XOR calculation on256 sources with 128 chained commandsat 64 KB per entry.

ConclusionThe 3-Gbps eSATA hardware RAID offershigh performance, low cost, and ease ofuse. As more computers and embeddedsystems are shipped with eSATA ports, theeSATA RAID product represents animportant external RAID solution for cus-tomers requiring both high performanceand high reliability. Leveraging theadvanced features of Virtex-4 FX FPGAssuch as its digital clock managers and easy-to-use building blocks like dual-port blockRAM, we developed the key componentfor the 3-Gbps eSATA RAID systemquickly and efficiently.

With our success delivering productsbased on Xilinx Virtex-4 FX FPGAs,Accusys is positioned to deliver eSATARAID products for high-end applicationslike video editing and workgroup-sharedstorage, as well as entry-level applicationslike DVR backup or home storage. Formore information about RAID and storagetechnologies, visit the Accusys website atwww.accusys.com.tw.

MGT

To Host SATA I/F

To PCI-X I/F

PCI-X Unit

XOR

Controller

Virtex-4 FX20 FPGA

DMA

Controller

Register

R/W

SATADevice Controller

PCI-XTarget

PCI-XMaster

PCI-XArbiter

OOBDetection/

Transmission

PrimitiveFIS Control

Unit

Figure 2 – Block diagram of 3-Gbps eSATA RAID engine

Page 36: Xcell - Xilinx · get your system running fast. Whether you are an expert or just starting out, only Xilinx delivers this complete solution to simplify high-speed serial design. Lowest-power,

by Chris Dick Xilinx Chief DSP Architect Xilinx, Inc. [email protected]

Fred HarrisProfessorSan Diego State [email protected]

Miroslav PajicEngineerSignum [email protected]

Dragan VuleticEngineerSignum [email protected]

Most real-world communication systemshave a mix of processing elements. For exam-ple, application programs, human/machineinterface management, and higher network-ing protocol stack processing are best imple-mented on a general-purpose processor.

But for high-rate, algorithmically com-plex data processing – often with hard real-time deadlines – hardware resources likeFPGAs are a better match. The interfacebetween the two depends on the circum-stance; the FPGA can be a pre-processor,coprocessor, post-processor, or some com-bination thereof. The trick is to get theseheterogeneous systems to interoperate inan elegant fashion.

36 Xcell Journal Second Quarter 2007

Implementing a Real-TimeBeamformer on an FPGA PlatformImplementing a Real-TimeBeamformer on an FPGA PlatformWe designed a flexible QRD-based beamforming engine using Xilinx System Generator.We designed a flexible QRD-based beamforming engine using Xilinx System Generator.

Page 37: Xcell - Xilinx · get your system running fast. Whether you are an expert or just starting out, only Xilinx delivers this complete solution to simplify high-speed serial design. Lowest-power,

In this article, we’ll describe the devel-opment of a flexible, optimized, adaptivebeamforming engine that you can easilycontrol through software. The DSP-inten-sive tasks run on the FPGA, while the com-mand and control run on an externalprocessor. The beamforming engine is acompact QR decomposition (QRD-basedcircuit) with a novel construction. Theinterface between the engine and the hostprocessor is implemented by the sharedmemory abstraction in the Xilinx® SystemGenerator design flow.

MVDR BeamformerAdaptive beamforming is the application ofadaptive filters to spatial signal processing.Time series collected from uniformlyspaced array elements are weighted andsummed to form a signal component froma selected direction of arrival while sup-pressing signal components from otherdirections of arrival (Figure 1). When thedirections of arrival of the undesired signalcomponents are unknown or vary withtime, the filter weights must be adaptivelyadjusted to steer nulls to their directions.The adaptation process is performed sub-ject to a constraint that the steering vectorhas unity gain in the signal direction. Thesteady state weights of such a beamformerform the minimum variance distortionlessresponse (MVDR) from the array elements.

For reasons of numerical robustness andcomputational complexity, a commonmethod for computing the required weightvector without directly inverting the corre-lation matrix is based on QR decomposi-tion; this is the approach adopted here. Fordetails of the procedure, consult “AdaptiveFilter Theory” by Simon Haykin.

The QRD Matrix Inversion ProcessThe QRD process is formed by a sequenceof two operators: the unitary rotations thatconvert complex input data to real data andassociated angle-and-element combinersthat individually annihilate the selected ele-ments of the input data set. The QRDprocess is most compactly represented inFigure 2’s signal flow diagram. This repre-sentation is the systolic array realization ofthe QRD least-squares solution processor.

Second Quarter 2007 Xcell Journal 37

wM-1(n)xM-1(n)

w0(n)

w1(n)

w2(n)s(t,θ)Σ

x0(n)

Array Elements

Direction

of Arrival

Steering VectorAdaptive

Algorithm

x1(n)

x2(n)

0,0 0,1 0,2 0,3

x(2)

x(1)

x(0)

x(1)

x(0)

0

x(0)

0

0

d(2)

d(1)

d(0)

Δ Δ

Δ

0

1

2

1,1

xc

s

c

s

c

s

xin

x

xin

xii

xii

zi

pi

pi

wi^

wi =^

^wk^

^

wk

xin

xout

xout cxin - sλ1/2 x

1,2

x

1,3

2,2 2,3

Linear Array

Triangular

Array

Triangular Array

Processing

Node Definitions

Linear Array Processing Node Definitions

xinif = 0 then

c 1

s 0

x λ1/2 x

Otherwise

λ1/2 x

(i)

zi(i)

zi(k)

= zi + x* ik wk(k)

zi(k-1)

zi(k-1)

λx2 + |xin|2

cx'

x'

sx'

x sxin + cλ1/2 x

x x'

Figure 2 – Systolic array implementation of QRD matrix inversion for a 3 x 3 array

Figure 1 – Adaptive beamformer structure

Page 38: Xcell - Xilinx · get your system running fast. Whether you are an expert or just starting out, only Xilinx delivers this complete solution to simplify high-speed serial design. Lowest-power,

The array contains three types of pro-cessing cells: boundary cells, internal cells,and output cells. The boundary cells per-form the “vectoring” operation on complexinput samples to nullify their imaginaryparts and form rotation angles used byinternal cells. The internal cells performGivens rotations of the input values by theangles passed from the boundary cells toannihilate the non-upper-triangular entriesof the transformed data matrix. The outputcells in the linear array process the elementsof the upper triangular array to perform therequired back substitution that producesthe beamformer weights.

FPGA Implementation of QRDOur goal was to produce a compact QRDFPGA implementation. The design com-prises a single boundary, internal and backsubstitution cells. The systolic array inFigure 2 is folded onto this set of process-ing resources. The boundary cell is requiredto compute two angles. The first angle

Φ = arctan(ℑ(xin)/ℜ(xin))

transforms the complex input samples pre-sented to the boundary-cell input port intoreal-valued data. The transformation thatforces the imaginary component of xin to 0must be applied to all elements in the samerow associated with the boundary cell; thisoperation is one of the tasks performed bythe internal cells.

Now that the data in the leading posi-tion of two adjacent rows are real-valued, asecond angle is formed as

Θ = arctan(xine -jΦ/x)

which is used to annihilate a term of theinput data set in an ordered manner thateventually produces the upper-right trian-gular matrix R. The arithmetic employed inthe boundary cells could be realized inhardware by literally implementing theequations indicated in Figure 2. This wouldrequire hardware support for performingsquare roots and divisions. Although thesecircuits are commonly implemented inFPGA hardware, we sought alternativemethods for computing the required anglesthat had a lower resource cost over directand obvious implementations.

One well-known and relatively simplemethod for computing angles is the vector-ing mode of the Coordinate RotationDigital Computer (CORDIC) algorithm.The CORDIC algorithm is an iterativeprocedure capable of computing a rich setof mathematical functions. The elementaloperations required in the CORDIC algo-rithm are addition, subtraction, bit-shift,and table lookup. All of these functions areefficiently supported by FPGA architecturessuch as the Virtex™ series of devices fromXilinx, so the vectoring mode of the algo-rithm is a good candidate for the founda-tion of QRD processor boundary cells. As

shown in Figure 3, two CORDIC enginesare in use in the boundary cell: one for com-puting φ and the other for computing θ.

The CORDIC algorithm is iterative innature, with each iteration refining theangle estimate by approximately 1 bit ofprecision. For a process employing an N-iteration CORDIC process, a new output isgenerated every N clock cycles and a new setof operands presented every N clock cycles.To increase the throughput of the boundarycell, we employed a fully parallel, orunrolled, architecture for the CORDIC (notshown here). After the initial start-up laten-cy of the circuit is absorbed, the initiation

38 Xcell Journal Second Quarter 2007

( )

Figure 4 – Systolic array internal cell architecture employing three MAC-based Givens rotation engines

Figure 3 – Boundary-cell architecture based on two vector-mode CORDIC processing engines

Page 39: Xcell - Xilinx · get your system running fast. Whether you are an expert or just starting out, only Xilinx delivers this complete solution to simplify high-speed serial design. Lowest-power,

Second Quarter 2007 Xcell Journal 39

and completion rate of the cell is one newinput/output per clock cycle.

Each data element xin entering an inter-nal cell (Figure 4) in row m must be rotat-ed by the angle φ computed by theboundary cell for the mth row:

One option that has been commonlyused for the rotation task in QRD proces-sors is the rotation mode of the CORDICalgorithm. An alternative is to simplyimplement the rotation in the obviousmanner using multiply accumulate (MAC)functional units; this is the approachadopted in our implementation. The targetFPGA technology for the design is theVirtex-4 FPGA. These devices have a vast

array of embedded MAC units referred toas DSP48 slices.

The DSP48 slice supports a rich set ofopcodes – capable of being updated on aper-clock-cycle basis – that define the arith-metic operation computed by the tile dur-ing a given clock cycle. The fourmultiplications implied in the precedingequation are folded onto a pair of DSP48slices, with each DSP48 slice computingone of the two output terms ℜ(υ) andℑ(υ). Two clock cycles are required to com-pute the two output terms. Each DSP issupplied with a unique opcode for eachclock period. Consider computing the termℜ(υ). During the first clock period, theproduct cos(Φ) x ℜ(xin) is computed andstored in the DSP48 product or p register.

During the second clock cycle, sin(Φ) x ℑ(xin) is formed and subtracted

from the value in the p register to generatethe final output term. A similar sequence ofcomputations is performed to produceℑ(υ). Using the DSP48 embedded blocksrather than a CORDIC-based approach forthe internal cell reduces the latency of thisphase of the computation and also mini-mizes the amount of FPGA logic fabric(look-up tables [LUTs] and registers) requiredfor the implementation. Table 1 provides abreakdown of the area for the major func-tional units in the QRD implementation,along with the total area of the design.

The cos(Φ), sin(Φ), cos(Θ), and sin(Θ)terms required by the internal cells arecomputed using a simple LUT that mapsthe angles Φ and Θ computed by the vec-toring units in the boundary cell to theircorresponding sine and cosine. Linearinterpolation is applied to the output sam-ples of the LUT to increase the accuracy ofthe mapping from angle to amplitude,while keeping the LUT itself constrained toa single block RAM.

The row and column dimensions of theinput array for the QRD processor can bedynamically adjusted at runtime by writingthe new dimensions to control registersthat are part of the FPGA control plane.

Table 2 provides timing information forseveral configurations of the input data set.

Design FlowOur QRD implementation uses theXilinx System Generator for DSP model-based design flow. In addition to provid-ing a natural development environmentfor developing FPGA signal-processingimplementations, System Generator has arich set of features that support the devel-opment of heterogeneous applicationscomprising not just the FPGA elementbut a processor. The processor could bethe embedded PowerPC™ 405 hard IPblock, the MicroBlaze™ soft-processorcore, or a processor external to the FPGA.

The beamformer developed for thisproject was partitioned between the hostPC and the FPGA platform. In our imple-mentation the host application running onthe PC might be considered more of an ele-ment of the beamformer verificationprocess (test bench), but the host applica-

Functional Unit LUTs FFs DSP48 Slices Block RAM Slices

Boundary Cell 2,145 2,057 3 1 1,266

Inner Cell 216 329 6 0 176

Back Substitution 2,862 3,286 4 1 1,932

QRD Total 5,411 5,916 13 6 3,530

M N Cycles for Cycles for Total Cycles Time (μs) for Triangularization Back Substitution 250-MHz Clock

3 3 792 147 939 3.76

8 3 2,112 147 2,259 9.04

5 5 2,540 255 2,795 11.18

9 5 4,572 255 4,827 19.31

7 7 5,656 371 6,027 24.11

10 7 8,080 371 8,451 33.80

9 9 10,476 495 10,971 43.88

11 9 12,804 495 13,299 53.20

10 10 13,630 560 14,190 56.76

( )( )

( ) ( )( ) ( )

( )( )⎥⎦

⎤⎢⎣

ℜ⎥⎦

⎤⎢⎣

⎡ −=⎥

⎤⎢⎣

in

in

x

x

θφ

θφυ

υ

sincos

sincos

Table 2 – Execution time for the triangularization and back substitution phases of the FPGA QRD implementation for an M x N matrix.

Table 1 – FPGA resource utilization for folded QRD and back substitution array

Page 40: Xcell - Xilinx · get your system running fast. Whether you are an expert or just starting out, only Xilinx delivers this complete solution to simplify high-speed serial design. Lowest-power,

tion could be as arbitrary and complex asrequired by the task at hand.

Our beamformer host application is aMATLAB script (m-code) that simulatesthe sensor array for the beamforming net-work. The script simulates a dynamic targetand generates the samples of the far-fieldradiation pattern for the moving target. Thesamples of the electric field at each sensorare generated in MATLAB and forwardedto the FPGA QRD processor. A new esti-mate of the beamformer weight vector isproduced and returned to the MATLABenvironment for further processing.

In this case, the additional processinginvolves plotting the polar radiation pat-tern for the updated complex valuedweight vector. Note that the host applica-tion does not necessarily have to be associ-ated with the MATLAB environment; theapplication could be a program written inC, for example.

An interesting element of the beam-former application is the management ofthe interface between the host application,running on a PC in this case, and the QRDprocess executing on the FPGA platform.System Generator provides a suite of sharedmemory library objects (ROM, RAM,FIFO) that abstracts virtually all of thedetails of the processor/FPGA interface

and enables the host software and FPGAhardware to be somewhat insulated fromeach other (Figure 5).

Each new update of the beamformer isessentially a three-step procedure:

1. New input samples from each antennaelement, as generated by the MATLABhost application, are forwarded to theQRD engine in situ on the FPGA.

2. The QRD process is triggered.

3. The new weight vector is returnedfrom the FPGA to the host.

The shared memory library modulesand associated application programmerinterface transform the transfer of databetween the FPGA and the host PC intosimple assignment statements based onname/space references in MATLAB (or C).For example, the new weight vector w, res-ident in the MATLAB workspace, is updat-ed with the new beamformer coefficients,FPGAWeights, as computed by the FPGAQRD process using the simple assignmentw = FPGAWeights. (FPGAWeights is thename assigned to a shared-memory bufferin the System Generator description of theQRD engine.)

The management of this type of hostprocessor/FPGA interaction by the System

Generator framework makes the develop-ment of heterogeneous applicationsstraightforward, rapid, less error-prone, andenables an FPGA accelerator engine (likethe QRD module in this case) to be easilyported between different hardware plat-forms without needing to modify theFPGA source code – the System Generatormodel itself.

The interface abstraction supportstransactions between the host applicationand the System Generator source model, aswell as the host application and the finaldesign running on the FPGA platform.This latter element significantly contributesto the validation process of the softwareand hardware (FPGA) dimensions of thesystem, as both components can bebrought online rapidly using the sharedmemory abstraction.

ConclusionIn this article, we’ve described the FPGAimplementation of a flexible QRD proces-sor that enables the run-time definition ofthe input matrix dimensions. The designemploys a mixture of CORDIC-based pro-cessing (array boundary cell) and MAC-based (array internal cell) arithmetic that iswell matched to the computationalresources of an FPGA like the XilinxVirtex-4 family.

All of the boundary- and internal-cellprocessing were projected onto a singleboundary-cell functional unit and internal-cell functional unit; however, it should benoted that the abundant resources of FPGAplatforms support the realization of a fullyparallel systolic array, should the through-put requirements of the target applicationdemand extremely high performance.

The System Generator programmingenvironment enables the rapid develop-ment of heterogeneous systems (processorsand FPGAs) while insulating programmersfrom the frequently complex and error-prone programming associated with hard-ware/software partitions.

This work was performed by the Xilinx AdvancedSystems Technology Group (ASTG), the R&Dorganization within the Xilinx DSP Division,together with our partner organizations SignumConcepts and San Diego State University.

40 Xcell Journal Second Quarter 2007

Matlab or C Application

Matlab or C API

Input

Memory

Buffer

“foo”

System

Generator

design flow

insulates

the host

program from

the details of the

FPGA platform

FPGA

Processing

Kernel

Shared

Memory

Object

Output

Memory

Buffer

“bar”

API is auto-

generated by

System

Generator

Figure 5 – Hardware/software abstraction enabled by the System Generator shared memorylibrary elements. The host application performs transactions with the FPGA storage elements

using simple name/space references – in this case to the memories named “foo” and “bar.”

Page 41: Xcell - Xilinx · get your system running fast. Whether you are an expert or just starting out, only Xilinx delivers this complete solution to simplify high-speed serial design. Lowest-power,

©2006 Mentor Graphics Corporation. All Rights Reserved. Mentor Graphics, Accelerated Technology, Nucleus is a registeredtrademarks of Mentor Graphics Corporation. All other trademarks and registered trademarks are property of their respective owners.

Accelerated Technology, A Mentor Graphics Division [email protected] • www.AcceleratedTechnology.com

With a rich debugging environment and a small, auto-

configurable RTOS, Accelerated Technology provides the

ultimate embedded solution for MicroBlaze development.

Our scalable Nucleus RTOS and the Eclipse-based EDGE

development tools are optimized for MicroBlaze and

PowerPC processors, so multi-core development is easier

and faster than ever before. Combined with world-

renowned customer support, our comprehensive product

line is exactly what you need to develop your FPGA-based

device and get it to market before the competition!

For more information on the Nucleus complete solution, go towww.acceleratedtechnology.com/xilinx

The Ultimate Embedded Solutionfor Xilinx MicroBlaze™

Page 42: Xcell - Xilinx · get your system running fast. Whether you are an expert or just starting out, only Xilinx delivers this complete solution to simplify high-speed serial design. Lowest-power,

by Sunita Jain Associate Design Engineer Xilinx India Technology Services Pvt. Ltd. (Xilinx Hyderabad, XHD) [email protected]

Guru PrasannaTechnical LeadXilinx India Technology Services Pvt. Ltd. (Xilinx Hyderabad, XHD)[email protected]

Data corruption is the principal problemassociated with data transmission andstorage. Whenever data is transmitted overchannels, there is always a finite probabilitythat some errors will occur.

It is essential that the receiving modulecan differentiate between an error-freemessage and an erroneous one. There arevarious methods to detect errors; mosterror-checking methods do so by intro-ducing redundant bits exclusively for thispurpose. Commonly used methods forerror detection in data communicationinclude parity codes, Hamming codes,and cyclic redundancy check (CRC); ofthese, CRC is the most widely used.

42 Xcell Journal Second Quarter 2007

Using CRC Hard Blocks in Virtex-5 FPGAsCRC blocks provided as hard macros speed up the process of error detection.

Page 43: Xcell - Xilinx · get your system running fast. Whether you are an expert or just starting out, only Xilinx delivers this complete solution to simplify high-speed serial design. Lowest-power,

CRC is computed on a given set of databits and appended at the end of the dataframe before transmission or storage.When the frame is received or retrieved, itsvalidity is verified by recalculating the CRCfor the contents of the frame to ensure thatthe data is error-free.

In this article, we’ll take a quick look atthe theory behind CRC calculation and itshardware implementation using linear feed-back shift registers. Then we’ll direct ourattention to the CRC hard block present inXilinx® Virtex™-5 LXT/SXT devices.

TheoryAddition and subtraction operations areperformed using modulo 2 arithmetic; thatis, they are the same as the exclusive OR(XOR) operation. Adding two numbers inpolynomial arithmetic is the same asadding numbers in ordinary binary arith-metic, except there is no carry.

For example, the binary message stream11001011 is represented as x7+x6+x3+x+1.The transmission and reception pointsagree on a fixed polynomial called a gener-ator polynomial; this is the key parameterof a CRC calculation.

The data is interpreted as the coeffi-cients of a polynomial, which are dividedby a given generator polynomial. Theremainder of this division is the CRC.Given an m-bit message sequence and agenerator polynomial of degree r, the trans-mitter creates an n-bit (n = m+r) sequencecalled the frame check sequence (FCS) sothat the resultant (m+r)-bit frame is divisi-ble by a predetermined sequence.

The transmitter appends r 0-bits to them-bit message and divides the resulting poly-nomial of degree m+r-1 by the generatorpolynomial. This produces a remainder poly-nomial of degree (r-1) or less. The remainderpolynomial has r coefficients, which formthe checksum. The quotient is discarded.The data transmitted is the original m-bitmessage followed by the r-bit checksum.

At the receiver, you can follow one oftwo standard approaches to assess the valid-ity of the received data:

• Compute the checksum again for thefirst m-bits received and compare

der calculated (over m+r bits) at the receivernon-0. The remainder, which is obtained atthe receiver in such cases, is a fixed valueknown as the residue value of a polynomial.

A little mathematics helps illustrate thisidea more clearly.

Assume that the % symbol denotes amodulo operation in the following repre-sentation:

For the case where checksum is append-ed without inversion:

(Mxr – R) xr % G = 0

where the receiver again performs the sameoperation of shifting as the transmitter.

Now consider the case where the check-sum is inverted and appended to the mes-sage stream at the transmitter:

(Mxr – Rc) xr % G

where Rc denotes complemented checksum.This can also be written as:

(Mxr – R + (xr-1 + ... + x + 1)) xr % G

The complement of a bit is the same asXOR with 1. Here, the + sign representsaddition in modulo 2 arithmetic (also notethat addition and subtraction are the samein modulo 2 arithmetic).

In which case, the remainder is thesame as:

(xr-1 + ... + x + 1) xr % G

which works out to be a constant for agiven generator polynomial.

The most commonly used CRC-32generator polynomial is

G(x) = x32+x26+x23+x22+x16+x12+x11+x10

+x8+x7+x5+x4+x2+x+1

which is 04C11DB7 in hexadecimal.The constant residue value correspon-

ding to CRC-32 is C704DD7B in hexa-decimal. For a given generator polynomialG, the residue value remains a constant forany data pattern provided at the input.

against the received checksum (the lastr-bits received)

• Calculate the checksum for all of the(m+r) received bits and compareagainst a remainder of 0

To see how the second method results ina remainder of 0, let’s use the followingconvention:

M = polynomial representation of the message

R = polynomial representation of theremainder computed at the transmitter

G = generator polynomial

Q = quotient obtained by dividing M by G

The transmitted data corresponds to thepolynomial Mxr – R. The variable xr signi-fies an r-bit shift of the message to accom-modate the checksum.

We know that:

Mxr = QG + R

Appending the checksum R to themessage at the transmitter is equivalent tosubtracting the remainder from the mes-sage. The transmitted data then becomesMxr – R = QG, which is clearly a multipleof G. This is how we obtain a remainderof 0 in the second case.

However, this procedure is insensitive tothe number of leading and trailing 0-bits inthe data transmitted. In other words, if amessage has trailing 0-bits inserted ordeleted, the remainder will still remain 0and the error will go undetected, whichmanifests that the same bit sequence willnot be reproduced back. Let’s explain a tac-tical method to overcome this drawback.

Residue ApproachIn practice, the checksum is complementedbefore appending. This makes the remain-

Second Quarter 2007 Xcell Journal 43

CRC is based on polynomial codes defined over a field with two elements. Polynomial codes treat

bit streams as polynomial representations with coefficient values of either 0 or 1.

Page 44: Xcell - Xilinx · get your system running fast. Whether you are an expert or just starting out, only Xilinx delivers this complete solution to simplify high-speed serial design. Lowest-power,

Hardware ImplementationThe computation of a CRC checksum is apolynomial division process. Implementingit in hardware makes use of a shift register(also called the CRC register). The lengthof the shift register is the same as the degreeof the generator polynomial.

The procedure for CRC calculation is to:

1. Initialize the CRC register.

2. Get the message bit and continue aslong as message bits exist. If thehigher order bit in the CRC registeris 1, shift left one position and XORthe result with G. Otherwise, shiftleft one position.

When all of these steps are completedfor a given message, the CRC register holdsthe remainder.

These steps can be implemented with acircuit known as the linear feedback shiftregister (LFSR). Figure 1 shows an LFSRimplementation for calculating CRC usingthe CRC32 polynomial. Note that theplacement of XOR gates depends on thecoefficient of the corresponding termsbeing 1 in the generator polynomial. Eachof the numbered blocks in the figure repre-sents a memory element (flip-flop).

CRC Block The hardware implementation of CRCmakes use of a simple LFSR. Although sucha circuit is simple to implement, it takes n-clock cycles to calculate CRC values for an

n-bit data stream. This latency is intolerablein high-speed data networking applicationswhere data frames must be processed athigher speeds. The implementation of CRCgeneration and checking on a parallelstream of data become desirable in suchhigh-speed networking applications.

The CRC block implemented inVirtex-5 LXT/SXT devices helps designersby speeding up checksum calculations.

The CRC hard block in Virtex-5LXT/SXT devices is based on the CRC32polynomial. Virtex-5 FPGAs containCRC32 and CRC64 hard blocks, whichprovide a latency of one clock cycle forCRC generation for 4- and 8-byte datainput. The interface is easy and simple touse. The hard block functions as a CRCcalculator on a given message stream,along with some CRC-specific parameters

as input. The functionality for CRCcomparison is beyond the scope of thehard block and should be built into theFPGA fabric.

Each CRC hard block in the FPGAcomputes a 32-bit checksum asynchro-nously. Figure 2 is a block-level diagram

depicting the hard block’s architecture.The CRC hard block provides an output,which is bit-inverted and byte-reversed.

Figure 3 provides an applicationoverview of the CRC hard block. CRC iscalculated and appended at the end for agiven data packet at the transmitter. Atthe receiver, CRC is re-calculated over theentire received packet along with theappended CRC from the transmitter.

The validity of the received packet isestablished by the residue method. In thiscase, for the CRC32 polynomial, the

44 Xcell Journal Second Quarter 2007

31

Message

Input

30 29 28 27 26

18 17 16 15 14 13 9 812 11 10

7 6 45 3 1 02

25 24 23 21 20 1922

Register

Bank

XOR Logic

for

CRC Calculation

Byte

Rotation &

Bit

Conversion

MUX

CRCIN

CRCDATAVALID

CRCDATAWIDTH

CRCOUT

Figure 1 – LFSR implementation for the CRC32 polynomial

Figure 2 – CRC hard block implementation

Page 45: Xcell - Xilinx · get your system running fast. Whether you are an expert or just starting out, only Xilinx delivers this complete solution to simplify high-speed serial design. Lowest-power,

Second Quarter 2007 Xcell Journal 45

residue works out to be 1CDF4421 inhexadecimal, as it is a bit-inverted andbyte-reversed value for C704DDB7. Theconcept of byte rotation and bit inversionis shown in Figure 4.

Figure 5 illustrates a waveform for nor-mal CRC operation.

We have also come up with aLogiCORE™ CRC wizard, which provides

a LocalLink wrapper for the CRC hardmacro present in Virtex-5 devices. The corealso provides an example design demonstrat-ing the use of the CRC hard block.Additionally, the core provides variousoptions such as pipelining, complementing,and transposing.

For more information, visit www.xilinx.com/crcwizard.

ConclusionThe presence of CRC blocks in XilinxFPGAs makes the inclusion of error-detec-tion mechanisms in various designs easierand effortless for designers. You can usethe CRC Wizard IP, available in COREGenerator™ software, to incorporateerror-detection in various protocols likeAurora and PCI Express.

For more information on the embeddedCRC blocks in Virtex-5 FPGAs, see theVirtex-5 RocketIO™ GTP TransceiverUser Guide (UG196) at www.xilinx.com/bvdocs/userguides/ug196.pdf.

The theory on CRC presented in this article is asummary of the CRC theory published in varioustechnical journals and textbooks. Please contactthe authors for applicable references and URLs.

Data Packet

FPGA

CRC

FPGA

CRC

Transmitter Receiver

Data Packet CRC

Data Packet CRC

Data Packet

Residue

Check

1C DF 44 21

E3 20 BB DE

Bit Inversion

Byte Rotation

C7 04 DD 7B

CRCCLK

CRCIN Donít Care Donít Care Next Frame

3íb011

32íhFFFFFFFF

CRC (Prev Frame)

f(CRC (Prev Frame))

CRC_INIT

f(CRC_INIT)

CRC_INIT

f(CRC_INIT)

CRC (Prev,D0)

f(CRC(Prev,D0))

CRC (Prev,D1)

f(CRC(Prev,D1))

CRC (Prev,D2)

f(CRC(Prev,D2))

CRC (Prev,D3)

32íh04C11DB7

3íb011

D1 D2

CRCOUT

CRCINTREG

CRC_INIT

CRC_POLY

CRCRESET

CRCDATAVALID

CRCDATAWIDTH

D0D0

f(CRC(Pref(CRC(Prevv,D3)),D3))

33ííbb000000

D3D3

Last Word of the Frame -Only CRCIN[31:24] Valid

Notes:1. Prev = CRCINTREG value from the previous cycle.2. f(x) = bit inverted and byte reversed x.

First Four Bytes of the Frame

CRC result for the last valid data of the frame is the CRC result for the frame.

Figure 3 – Application overview of CRC block Figure 4 – Byte rotation and bit inversion depiction

Figure 5 – Waveform for CRC operation

Page 46: Xcell - Xilinx · get your system running fast. Whether you are an expert or just starting out, only Xilinx delivers this complete solution to simplify high-speed serial design. Lowest-power,

by Angela SuttonSr. Marketing ManagerSynplicity, Inc. [email protected]

In May 2006, Synplicity and Xilinx formedthe Ultra-High Capacity Task Force todevelop improved methodologies for high-capacity designs. The first deliverable fromthis task force is a set of new incrementalmethodologies – SmartGuide™ technologyand Partitions technology. These method-ologies are available with Xilinx® ISE™software version 9.1i and Synplicity’sSynplify Pro v8.8/v8.8.1 and beyond. Thetwo are often jointly referred to as“SmartCompile™” technology.

These methodologies are particularlyuseful for high-capacity FPGAs where thelarge size of the design results in long itera-tion times or for stages of the design cyclewhen you want to make small incrementalchanges to localized portions withoutupsetting other pieces of the design thatalready work. Table 1 summarizes the newmethodologies; in this article, I’ll providedetails about how to use them.

46 Xcell Journal Second Quarter 2007

Implementing IncrementalChanges EfficientlyImplementing IncrementalChanges EfficientlySynplify Pro and Xilinx ISE software offer new SmartCompile methodologies.Synplify Pro and Xilinx ISE software offer new SmartCompile methodologies.

Page 47: Xcell - Xilinx · get your system running fast. Whether you are an expert or just starting out, only Xilinx delivers this complete solution to simplify high-speed serial design. Lowest-power,

SmartGuideSmartGuide minimizes changes in a design’splace and route (PAR) implementation com-pared with previous implementations used asa reference. Typically, you would synthesizethe design to get an output netlist, which ISEsoftware places and routes. You would thenmake changes to the RTL, rerun synthesis,and rerun ISE incremental place and routewith SmartGuide enabled.

If you elect to use SmartGuide, ISEsoftware place and route attempts to dothe minimal changes based on whatchanged in the netlist, while still main-taining a legal netlist and meeting timing.This saves significant PAR runtime. Inthis methodology, there is no need tomanually create partitions, nor are con-straint or synthesis changes involved whenusing this capability (Figure 1).

How Does SmartGuide Work?ISE software runs fully automatic incre-mental place and route on an entire designbased on a netlist “name match” compari-son (the current netlist output bySynplicity’s Synplify Pro tool versus the ini-

Xilinx results with Synplify Pro v8.8software and ISE software version 9.1ishowed that, for small RTL changes wherethe change was not on the critical path,place and route runtimes were halved andin many cases as much as 70% shorter.More than half of the test designs preserved97% of placement and routing, leading tovery consistent designs.

SmartGuide is best used when you needto make small changes to your RTL andthose changes are not on a critical path. Insuch cases, SmartGuide can cut PAR timein half on average compared with runningplace and route again on the entire design.

Examples of a small changes includechanging things that result in less than10% of the netlist changing – a change inthe value of an RTL constant, changing alogical OR to an AND, changing the logiccondition on a “if ” statement, or minorchanges to a state machine. Also, this flowis recommended for roughly 10-15 consec-utive incremental runs, after which youmay want to rerun place and route on theentire design.

Partition Partition is an incremental block-baseddesign methodology – it allows place androute to run incrementally on blocks thatare being modified or tuned while preserv-ing other blocks that have not changed.This preservation can occur all the waydown to the routing level. This approachresults in shorter runtimes.

Blocks, also referred to as “partitions,”are defined at the RTL level before runningsynthesis. The block hierarchy is specified

tial netlist). The Synplify Pro software’sability to generate consistent and repeat-able netlists has been key for this to workwell. Synplify Pro software v8.8 employsmany new techniques to maximize netlistconsistency. These include:

• Repeatability of instance names fromone synthesis run to the next

• Output netlist changes that are local-ized to the very specific paths wherethe RTL changed from one synthesisrun to the next

Second Quarter 2007 Xcell Journal 47

Incremental Methodology When to Use How it Works Advantage

SmartGuide

Partition

Small incremental RTLchanges to non-criticalpaths

When you make a smallchange to your RTL and re-run synthesis. ISE softwaredetects netlist changes andruns incremental PAR basedon those netlist changesalone and timing goals.

Design preservation and consistency from one run to the next

Saves PAR runtime (up to 70%)

Easy to use – no need tocreate partitions or block-based constraints

Useful for block-based orteam-designed methodolo-gies. Allows you to get morestable results from one runto the next by locking downindividual blocks.

Fully integrated block-baseddesign methodology, synthe-sis through place and route.Blocks (“partitions”)defined at RTL level inSynplify Pro. Block defini-tions and subsequent blockchanges automatically com-municated between Synplifyand ISE software.

Teams can work on and tunespecific RTL blocks in parallel

Blocks that already work canbe preserved

Figure 1 – SmartGuide can halve your place and route runtime. Small changes to RTL/constraintsresult in small changes to the physical layout, especially when the changes were not on the critical path.

Table 1 – New Synplicity/Xilinx incremental methodologies

Page 48: Xcell - Xilinx · get your system running fast. Whether you are an expert or just starting out, only Xilinx delivers this complete solution to simplify high-speed serial design. Lowest-power,

as Synplify Pro compile points. Each time ablock is re-synthesized it is date-stampedand updated by Synplify Pro in the outputEDIF netlist and communicated to ISEsoftware. ISE software automaticallydetects which blocks have changed by read-ing the EDIF time stamp.

To run the partition methodology (illus-trated in Figure 2):

1. Set compile points. Define blocks/sub-blocks (partitions) as compile points inSynplify Pro. For example:

#

# Define Compile points in Synthesis .sdc file.

# You need to separately define time budgets

#

define_compile_point {v:work.or1200_cpu} -type {locked, partition} -cpfile {}

define_compile_point{v:work.or1200_dc_top}-type {locked, partition} -cpfile {}

You can specify your compile points inthe GUI using the SCOPE editor orthe Synplify Pro command line.

2. Implement the design (original PARlayout) and run synthesis. At the endof synthesis, Synplify Pro writes anextra property into the EDIF thatincludes the time stamp for when thecompile point (block/partition) waslast modified. For example:

(instance or1200_dc_top (viewRef verilog(cellRef or1200_dc_top))

(property PARTITION (string “1169730682”))

)

In subsequent synthesis runs, the oldtime stamp will be inserted in theEDIF for blocks/partitions that did notchange. A new time stamp is insertedin blocks/partitions that did change.

You can then run ISE place and routeon the output from Synplify Pro soft-ware. ISE software automatically reads

the Synplify Pro compile points as par-titions. ISE software also notes andstores the EDIF date stamp for eachblock/partition.

3. Make changes. Tune individual blocks(for example, RTL or constraints) inSynplify Pro and resynthesize, thenrerun ISE place and route. ISE softwareand Synplify Pro have been integratedso that ISE software automaticallydetermines which blocks/subblockschanged using partition date-stampinformation. ISE runs place and routeon the changed partitions and leavesother placed/routed blocks untouched(as long as timing can still be met).

You can control whether ISE soft-ware leaves placement alone on theunchanged blocks or both placementand routing using the partition prop-erty setting. Each partition has a cus-tomizable level of preservation withfour values: “synthesis” (preserve thenetlist), “placement” (preserve thenetlist and placement), “routing”(preserve netlist, placement, androuting), and “inherit” (use in sub-blocks only when the preservationlevel of the sub-block must be thesame as that of its parent block). Thedefault setting is “routing,” for whichthe synthesis netlist, placement, androuting will be exactly preserved

from one run to the next (providedthat the RTL within the compilepoint block is unchanged).

In Synplify Pro software, individual blockresults can be timing-driven. To get the bestquality of results within a block, it is usuallybetter to create timing constraints for eachcompile point individually.

If you change a top-level constraint inSynplify Pro software, only the impacted par-titions are marked as changed (and thereforeresynthesized).

When Synplify Pro’s partition methodolo-gy is used, you can create multiple instancesof a module.

ConclusionThe latest generation of 65-nm FPGAs hasenabled complex system-on-chip designs tobe implemented on multimillion-gate-capacity FPGAs. As FPGA designs grow incapacity, it becomes increasingly importantto deliver flows that are fast and flows thatconverge. Users are looking for better“divide-and-conquer” approaches to quick-ly tune their designs and assimilate smalldesign changes.

Synplicity and Xilinx formed a jointUltra-High Capacity Task Force to tacklethis very challenge. SmartGuide andPartition are the first of several new solu-tions that Synplicity and Xilinx will bedelivering to provide users with more effi-cient design methodologies.

48 Xcell Journal Second Quarter 2007

ISE software and Synplify Pro have been integrated so that ISEsoftware automatically determines which blocks/subblocks

changed using partition date-stamp information.

Figure 2 – The Synplify Pro/ISE Partition is a fully integrated (RTL through physical layout) block-based design methodology.

Page 49: Xcell - Xilinx · get your system running fast. Whether you are an expert or just starting out, only Xilinx delivers this complete solution to simplify high-speed serial design. Lowest-power,

www.agilent.com/find/logic-offer

© Agilent Technologies, Inc. 2006 Windows is a U.S. registered trademark of Microsoft Corporation

Logic analyzers up to 50% offLogic analyzers up to 50% offA special limited-time offer from Agilent.

Now you can see inside your FPGA designs in a way that will save weeks of development time.

The FPGA dynamic probe, when combined with an AgilentWindows®-based logic analyzer, allows you to access differentgroups of signals inside your FPGA for debug—without requiringdesign changes. You’ll increase visibility into internal FPGA activityby gaining access up to 128 internal signals with each debug pin.

Our 16800 Series logic analyzers offer you unprecedented price-performance in a portable family, with up to 204 channels, 32 M memory depth and a pattern generator available.

And now for a limited time, you can receive up to 50% off ournewest 16901A modular logic analyzer mainframe when youpurchase eligible measurement modules. Offer valid February 1,2007 through August 15, 2007. Reference promotion 5.564.

• Increased visibility with FPGA dynamic probe • Customized protocol analysis with

Agilent's exclusive packet viewer software• Low-cost embedded PCI Express

packet analysis• Pricing starts at $9,450

Agilent portable and modularlogic analyzers

Page 50: Xcell - Xilinx · get your system running fast. Whether you are an expert or just starting out, only Xilinx delivers this complete solution to simplify high-speed serial design. Lowest-power,

by Jameel Hussein Applications Engineer Xilinx, [email protected]

Frank TothSenior Marketing Manager, EasyPathXilinx, [email protected]

The flexible and fully integrated FPGAconfigurations from Xilinx take advantageof FPGA capabilities like remote updatingand reconfiguration. Available configura-tion silicon options (see Figure 1) includePlatform Flash, byte peripheral interface(BPI) supporting parallel NOR flash, andserial peripheral interface (SPI flash).

Each of these configuration optionshave different speed and complexity trade-offs and can offer you the flexibility of shar-ing configuration memory with othersystem-level tasks, thus greatly reducingtotal cost. For example, some applicationsuse a microprocessor that interfaces withBPI flash on board. In these cases, you canconfigure the FPGA from the same BPIflash that the microprocessor uses to savecost and board space.

50 Xcell Journal Second Quarter 2007

Picking the Right PROM for Your SystemPlatform Flash PROMs deliver at every stage in the product lifecycle.

Page 51: Xcell - Xilinx · get your system running fast. Whether you are an expert or just starting out, only Xilinx delivers this complete solution to simplify high-speed serial design. Lowest-power,

Other applications are more cost-sensi-tive; you may have to make certain sacri-fices to keep the project within yourbudget. If you can afford a slower configu-ration time, SPI PROMs can provide thelowest cost solution in many cases.

Because Xilinx gives you the flexibilityto use multiple memory sources for FPGAconfiguration, you will need to considerseveral factors at the start of the design: theconfiguration speed required to meet sys-tem specifications (for example, PCIExpress); ease of use and design (such as asimple power-up configuration); and mem-ory, with value-added features like multipledesign revisioning (for product customiza-tion) and safe-update capabilities (for flaw-less remote configuration). Other featuressuch as compression can also save costs.

Before the introduction of the Xilinx®

Spartan™-3E, Spartan-3A, and Virtex™-5families, designers who configured XilinxFPGAs with commodity flash would use athree-chip configuration solution: a com-modity flash PROM, a CPLD, and of coursethe FPGA. The newer FPGAs eliminate theneed for a CPLD (or other controller) byproviding a direct interface for leading com-modity flash devices, thus reducing the chipcount to one memory device.

Safe-Update Solutions with Platform Flash PROMsIn-system safe-update solutions, alsoreferred to as remote-update solutions, arebecoming more and more popular; in manycases they are essential to a product’s success.There are several variations of a safe-updatesolution, but the basic idea is to update theFPGA bitstream in-system, which involvesupdating the image inside the PROM andreconfiguring the FPGA from that updated

back” logic already built into the device;only the FPGA and PROM are necessary.This fallback logic is among the key partsto a safe-update solution, which comprises:

• A file-generation flow for updatedimage: iMPACT + XAPP972

• An embedded in-system programmer:XAPP058 or XAPP424

• Non-volatile configuration memory: a Platform Flash XCF00P PROM

• An FPGA with a fail-safe configurationfeature: Virtex-5 fallback feature

The first step is creating a file that con-tains the information to update the PROMwith the new bitstream for the FPGA; youcan do this with iMPACT software tools.The file created will contain the new bit-stream as well as the instructions to inter-face with the PROM. For more details, seeXilinx application note XAPP972, “UsingSerial Vector Format (SVF) Files to Updatea Platform Flash PROM Design RevisionIn-System,” at www.xilinx.com/bvdocs/appnotes/xapp972.pdf.

The next step is to run or play this file(Figure 3); you can do this with anembedded in-system programmer avail-able in Xilinx application note XAPP424,

image. The “safe” part of a safe-update solu-tion comes from the ability to recover froma failed attempt at updating the FPGA,which is only possible if the FPGA alwayshas a known-good or golden bitstreamstored in the PROM, from which it canalways configure in case of an error. Withcompeting FPGAs, these solutions requireda third device, typically a CPLD, to controlthe update process and help the FPGArecover in the event of an error.

For example, where there is a need toreconfigure multiple devices (see Figure 2)from a base station, it is important that theupdate be flawless. If there is any corrup-tion of the new configuration data, there isa fail-safe fallback position.

Whereas competing solutions required athird device, Virtex-5 FPGAs have “fall-

Second Quarter 2007 Xcell Journal 51

FPGA

SCK

MOSI

CCLK

DIN/D[7:0]

DONE

INIT_B

MISO

CS

SPI

Flash

SPI Serial

Flash

BPI Parallel

NOR Flash

Platform Flash

PROM

PlatformFlashPROM

BPIFlash

FPGA FPGA

CS

WE

OE

DATA[15/7:0]

ADDR[25:0]

Central Controller

((( )))

)))

)))

((((

((((

FPGA Platform

Flash

PROM

Figure 1 – Available configuration silicon options in Spartan-3E/3A/3AN and Virtex-5 devices

Figure 2 – Reconfiguring multiple systems from a remote base station

Page 52: Xcell - Xilinx · get your system running fast. Whether you are an expert or just starting out, only Xilinx delivers this complete solution to simplify high-speed serial design. Lowest-power,

“Embedded JTAG ACE Player,”(www.x i l i n x . c om /bvdo c s / a ppno t e s /xapp424.pdf ) or XAPP058, “Xilinx In-System Programming Using an EmbeddedMicrocontroller,” (www.xilinx.com/bvdocs/appnotes/xapp058.pdf ). XAPP424 is thelatest in-system programmer available fromXilinx and is capable of significant per-formance gains over previous solutions inmany cases. The programmer uses JTAG tointerface with the Platform Flash PROM,which is the next key step in the process.

The Platform Flash XCF00P PROM(Figure 3) is capable of storing as many asfour different designs or revisions, but manysafe-update solutions only use two: theknown-good or golden bitstream and theupdated or enhanced bitstream. The key tomaintaining safety is to only erase and re-program the enhanced bitstream and leavethe golden bitstream completely unmodi-fied from original programming. XAPP972describes a method to ensure safety, which ismade possible with Platform Flash PROMs.

The last step is the fallback logic insideVirtex-5 FPGAs. Once the Platform FlashPROM has been reprogrammed with theupdated bitstream, the FPGA must bereconfigured. The danger is that if there wasan error reprogramming the PROM, theFPGA will not be able to configure fromthe updated bitstream. Fallback logic helpsto eliminate this danger. If the FPGA is trig-gered to reconfigure from the updated bit-stream and configuration fails, the FPGAwill automatically attempt to reconfigurefrom the golden bitstream. The golden bit-stream is the known-good bitstream thatwas unmodified during the update process;the FPGA should regain functionality withthe golden design. The functionality of thegolden bitstream is at your discretion.

This safe-update solution is a two-chipsolution made possible with Platform FlashPROMs and Virtex-5 FPGAs. Theresources to build such a solution are avail-able on www.xilinx.com and can signifi-cantly reduce time to market, as well as addvalue and flexibility to the project.

Configuration Speed Another advantage of using Platform FlashPROMs is the configuration speed. For

most applications, configuration speed isnot an issue; however, for several projects,the configuration speed is the differencebetween success and failure.

For example, applications using PCIExpress require the FPGA to configure inless than 100 ms (Figure 4). For some SPIand BPI NOR flash devices, it can be dif-ficult to configure larger FPGAs in thatamount of time. The FPGA provides theconfiguration clock (CCLK) when config-uring from a third-party flash device. TheCCLK is generated by internal circuitryand on some FPGAs can have 50% varia-tion, which means that if the CCLK is setto 60 MHz, the output clock could rangefrom 30 to 90 MHz. You must set theCCLK to ensure that the maximum clockrate on the PROM is not exceeded.

Finding the correct CCLK speed is easywith a few simple formulas. The FPGACCLK + 50% should be less than the maxi-mum clock rating on the PROM. The sameFPGA CCLK – 50% is the worst-caseFPGA CCLK. The worst-case FPGA CCLKis critical when determining the maximumconfiguration time, which is roughly calcu-lated by dividing the number of configura-tion bits for a device by the number of bitsper second being delivered to the FPGA.

Platform Flash has the ability to runfrom an external stable clock source.When the FPGA is in slave mode, it toocan run from a high-speed stable clocksource to deliver the fastest and most con-sistent configuration times. In most casesconfiguration times are less than 100 ms,even for mid-sized Virtex-5 devices.(Numbers are approximate; actual num-bers may vary across different devices.Refer to the appropriate data sheets formore information at www.xilinx.com/xlnx/xweb/xil_publications_index.jsp.)

ConclusionDesigners must make trade-offs in speed,complexity, functionality, and cost duringconfiguration. Existing on-board BPI andSPI devices can be put to dual-mode usefor both configuration and system-levelmemory. Platform Flash PROMs provideunique money-saving features, includingthe ability to store and select multiple bit-streams. Virtex-5 and Spartan-3E/3Adevices provide you with integral supportfor SPI and BPI flash, in addition toPlatform Flash.

For more information, please visitwww.xilinx.com/products/silicon_solutions/proms/pfp/index.htm.

52 Xcell Journal Second Quarter 2007

System

JTAG

ConfigurationData

RevisionControl

Virtex-5FPGA

In-SystemProgrammer

ConfigurationFallback

Logic

Update File XCF00PPROM

Revision 0:Known-Good

Bitstream

Revision 1:UpdatedBitstream

(XAPP058 or XAPP424)

NewBitstream

Platform Flash

PROM

Processor

High-SpeedConfiguration inless than 100 ms

FPGA

PCIe or

Co-Processor

Figure 3 – Safe-update system overview with Virtex-5 FPGAs and Platform Flash XCF00P PROM

Figure 4 – High-speed configuration with Platform Flash PROMs in less than 100 ms for all Spartan-3E/3A/3AN and Virtex-5 devices up to LX85 and LX85T

Page 53: Xcell - Xilinx · get your system running fast. Whether you are an expert or just starting out, only Xilinx delivers this complete solution to simplify high-speed serial design. Lowest-power,

by Ed Trexel Senior Applications Engineer Impulse Accelerated Technologies, [email protected]

Dov Stamler Hardware Engineer Vigilant Technology [email protected]

Bilateral filtering is used in many videoapplications to smooth images while pre-serving edge detail. One such applicationuses a bilateral filter to reduce randomnoise during video pre-processing and toimprove video compression efficiency in areal-time video environment.

Vigilant Technology provides intelligentIP surveillance and security solutions.Vigilant currently supports tens of thou-sands of cameras in airports, governmentoffices, financial institutions, correctionalfacilities, casinos, and town centers.

In this article, we’ll describe how oneportion of an advanced closed-circuit tele-vision (CCTV) system – the bilateral filter– has progressed from an initial softwaremodel to being implemented in actualhardware using C-to-FPGA tools. Startingwith a verified C-language model, wedeveloped a hardware implementationthrough the use of C-to-hardware tools anditerative design methods. To allow forquick prototyping and algorithm experi-mentation, we used common and familiarC-language tools for debugging and soft-ware/hardware design analysis.

Second Quarter 2007 Xcell Journal 53

Accelerating Bilateral Filtering in Xilinx FPGAsC-to-FPGA methods speed the development and optimization of complex filters.

Page 54: Xcell - Xilinx · get your system running fast. Whether you are an expert or just starting out, only Xilinx delivers this complete solution to simplify high-speed serial design. Lowest-power,

Iterative hardware-level testing was animportant part of this conversion. Weaccomplished this testing with an FPGA-embedded software test bench. For thispurpose, the embedded processor featuresof the Xilinx® Virtex™-4 FX device wereemployed for direct, hardware-in-the-loop embedded testing of the image fil-tering algorithm.

Bilateral FilteringThe bilateral filter is representative of manyother critical image algorithms. It works likethis: for each pixel, a new value is calculatedusing those pixels immediately around thepixel of interest, using the equation:

where each coefficient is calculated by:

New coefficient coeffn = ORIGCOEFF × KERNEL

The pixel being recalculated and thosearound it form a 3 x 3 matrix referred to asthe “input vector.” The input vector is pre-sented to the filter with all of the pixels inparallel forming a 9-byte-wide data path.From each input vector, the filter calculatesthe new value for the center pixel, referredto as the “output vector,” which is 1 bytewide. Calculation of the output pixel isdefined by a proprietary kernel algorithmdeveloped at Vigilant Technology, and isthe heart of the bilateral filter.

Figure 1 summarizes the input and out-put vectors.

Starting from a C-Language ModelVigilant Technology wrote the originalimplementation of the bilateral noise filterin C, using floating-point math. As a first-pass optimization, we converted all mathe-matical operations to unsigned fixed point.We accomplished this conversion in partby using C-language macros included withthe Impulse C tools.

As a second level of optimization, wereorganized the C code to minimize thenumber of loops and simplify data han-dling, using buffered streams for all filterI/O. We defined this I/O using stream-

∑ ×

n n

n nn

coeff

coefffNew pixel =

functionality of the application. Throughoutdevelopment, we performed real-world vali-dation using actual video clips representingvarious CCTV scenarios.

Figure 2 illustrates how we implementedand tested the bilateral noise reduction filterusing the Impulse C streaming interfacesand a Xilinx ML410 development board.

On the software side of the application(in this case on the PowerPC™ 405processor used for hardware-level testing),Impulse C functions open and close datastreams, read or write data on the streams,and, if desired, send status messages or pollfor results. In the case of the Virtex-4 FXdevice, stream reads and writes can be spec-

ified as operations that take advantage ofthe auxiliary peripheral unit (APU) inter-face, which provides a high-performancesoftware-to-hardware serial interface capa-ble of transferring as many as 128 bits ofdata in a single transaction.

ing-related interface functions providedwithin Impulse C.

In certain cases, we replaced array refer-ences in the C code with individual variablesin order to avoid data-timing dependenciesand produce parallel calculations. C-codedependencies included summations thattook a linear approach of summing as coef-ficients were calculated, as well as a multi-tiered approach. During this manualoptimization, we used macros to keep the Ccode relatively clean and portable.

As the code was optimized and modifiedfor hardware generation, we used a standardC debugger (part of the Borland C++Builder tools) to step through both the orig-inal and modified C code, allowing for vali-dation of the math and the overall

1 2 3

4 5 6

7 8 9

1 2 3

4 6

7 8 9

5

1 2 3

4 5 6

7 8 9

1

2

3

4

5

6

7

8

9

Bilateral Filter

= Pixel to be passed

through unprocessed

Input Vector

First Pixel

Input Vector

Last Pixel

Input Vector

Output

VectorOutput Pixels as read by Reader

One Line (col=0 ..719)

Processed

5

=

5 5 5 5 5 5 5 5 5

col0

col2

col3

col4

col5

col716

col717

col718

col719

Reg4

Vector In

Reg5

Reg6

Reg1

Reg2

Reg3

Reg7

Reg8

Reg9

Row

0R

ow 1

Row

2

in_stream_row0

RegOut

Vector Out

[7:0]

[23:0]

[23:0] [7:0] [7:0]

[23:0]

[23:0]

[23:0]

[23:0]

[7:0]

[7:0]

[7:0]

[7:0]

[7:0]

[7:0]

[7:0]

[7:0]

in_stream_row1 out_streambileateral_filter_proc()

in_stream_row2

Figure 2 – Block diagram showing streams and processes

Figure 1 – Bilateral filter processing flow

54 Xcell Journal Second Quarter 2007

Page 55: Xcell - Xilinx · get your system running fast. Whether you are an expert or just starting out, only Xilinx delivers this complete solution to simplify high-speed serial design. Lowest-power,

Second Quarter 2007 Xcell Journal 55

Generating and Testing the FPGA HardwareWhen generating hardware, the Impulse Ccompiler extracts low-level parallelismfrom otherwise sequential C-languagestatements. Structures such as loops areevaluated at a high level and either unrolledor pipelined to increase throughput. Thesecompiler optimizations save applicationdevelopers a great deal of time.

The Impulse C compiler generateshardware in either VHDL or Verilog. Thishardware is synthesized using XilinxISE™ software. On the processor side, thecompiler generates run-time libraries readyfor use on the embedded PowerPC orMicroBlaze™ processor.

Iterative, hardware-level unit testing wasan important part of this project. Theembedded PowerPC allowed us to exercisethe filter at speed, using the processor as atest generator. Using this approach allowedus to validate the correctness of the data, aswell as to directly measure the filter per-formance and throughput.

In fact, an entirely C-language testingmethod using the embedded PowerPCproved to be far more practical than HDLsimulation given the much faster, hard-ware-speed operation of tests involvingvideo data. Xilinx Platform Studio (XPS)tools proved invaluable during thisprocess. XPS includes a board supportpackage for the Xilinx ML410 board,allowing the embedded PowerPC to beset up as a test generator and communi-cate with the hardware filter through theVirtex-4 FX APU interface. The Impulsetools include XPS hardware and softwareexport features that greatly simplified thecreation of a custom, APU-connectedperipheral from the generated bilateralfilter hardware.

ISE tools provided the primary meas-urements of size and clock speed perform-ance, while the ML410 board provided thefinal determination of whether the gener-ated filter worked properly in hardware.HDL simulation was used at various

points in the development process to vali-date the functionality of the generatedHDL code, but more importantly toobserve cycle-level operation and findopportunities for C code optimization.

HDL simulation was of limited value forheavily pipelined sections of the generatedhardware, but proved quite useful whenanalyzing sequential blocks andstages of generated HDL code.This HDL-level analysis couldbe related directly back to the Csource code through the use ofthe Impulse interactive opti-mizer, which provides visualiza-tions of C-code parallelism andhelps to match the generatedhardware code to correspon-ding sections of the original C-language source.

Iterative OptimizationAfter we developed an initial, non-optimalhardware prototype, we began a process ofiterative optimization to meet system tim-ing and size constraints. Based on previousexperience with hand-optimized HDLimplementations of similar filters, availableresources, and real-time video require-ments, Vigilant specified a budget of 1,000FPGA slices, along with a target video datarate of 60 frames per second.

An initial restructuring of the algorithmplaced the critical loops in two parallelindividual processes. This method of opti-mization at first appeared promising, witha relatively high clock rate of 143 MHz,but was limited by a noncontiguouspipeline. This characteristic caused the sys-tem performance to be limited by thelatency of the slowest pipelined process.Even with the two processes running inparallel in a system-level pipeline, the bestrate was substantially slower than therequired pixel rate.

After additional analysis, we made vari-ous modifications to the C code, includingthe use of a single pipelined process for the

main filtering loop. The result of this seriesof optimizations was a filter that met theslice-count budget and achieved a single-frame rate of just over 15 ms (66 frames persecond) using an FPGA clock rate of just 27MHz, which is the input clock rate of a D1video stream received in BT.601 format.These results are summarized in Table 1.

The widely differing performance resultsseen in this example highlight the impor-tance of using appropriate C programmingmethods for FPGAs. Although C-to-FPGAtools can dramatically reduce the amount ofdevelopment time required to convert com-plex algorithms to hardware, these toolsmust be applied carefully. You may need touse iterative optimization methods toachieve results comparable to hand-craftedHDL. Fortunately, these techniques are wellwithin the experience and skills of embed-ded software developers.

ConclusionSoftware-to-hardware design tools can playan important role in the fast prototypingand development of complex FPGA algo-rithms. However, some level of hand-opti-mization will be required to obtainperformance comparable to hand-craftedHDL. Iterative design methods combinedwith interactive design and optimizationtools are an efficient way to manage thecomplexity of large embedded and high-performance computing applications.

For additional information about C-to-FPGA programming and optimization,visit www.ImpulseC.com/xilinx.

Filter Version Occupied Slices Frame Rate

Parallel Multi-Process 3,592 12.11 Hz

Pipelined Multi-Process 1,570 10.10 Hz

Pipelined Single Process 930 65.31 Hz

Table 1 – Iterative C-language optimizations resulted in incremental improvements

The embedded PowerPC allowed us to exercise the filter at-speed, using the processor as a test generator.

Page 56: Xcell - Xilinx · get your system running fast. Whether you are an expert or just starting out, only Xilinx delivers this complete solution to simplify high-speed serial design. Lowest-power,

by Bruno Stuber Professor University of Applied Sciences Northwestern Switzerland, School of Engineering, Institute of Automation [email protected]

Dino ZardetDevelopment EngineerUniversity of Applied Sciences Northwestern Switzerland,School of Engineering, Institute of [email protected]

A fast Fourier transform (FFT) is the ulti-mate way to accomplish spectral analysis indigital signal processing. Many real-timeapplications require seamless operationwith continuous data throughput, a high-speed data rate, excellent spectral resolu-tion, and a large dynamic range. Today,many standard solutions are offered to per-form real-time FFTs, either with DSPs orFPGAs. However, to go beyond establishedlimits, a thorough analysis of the systemarchitecture is necessary, as is a designapproach that focuses on timing.

FPGAs offer the potential for a hugeamount of parallelism. For an FFT applica-tion, the number of multipliers and RAMblocks determines the overall performance,expressed in terms of data throughput andthe number of spectral lines or bins.

56 Xcell Journal Second Quarter 2007

Ultra-High-Speed Spectral Analysis in Xilinx FPGAsA 32k-point FFT with 2-GSPS data throughput in a single FPGA establishes a new level of performance.

Page 57: Xcell - Xilinx · get your system running fast. Whether you are an expert or just starting out, only Xilinx delivers this complete solution to simplify high-speed serial design. Lowest-power,

In this article, we will discuss the imple-mentation of a 32k-point FFT (32,768points) with power spectrum accumulation.The 8-bit input data flows continuously at arate of 2 gigasamples per second (GSPS),therefore providing an analog bandwidth of1 GHz. The spectrometer is implemented ina single Xilinx® Virtex™-II Pro XC2VP70FPGA operating at an overall clock speedof 125 MHz.

We developed the unit for spectroscopyapplications in millimeter and terahertzradio astronomy. However, many otherapplications are possible, including low-fre-quency radio astronomy, atmosphericphysics, physical chemistry, radio surveil-lance, and spectral measurements.

High-Speed FFT Architecture in an FPGATo achieve the enormous throughput of 2GSPS, a strictly synchronous and pipelinedarchitecture is necessary along the signal pro-cessing path. A “pipeline FFT” is best suitedfor a high-speed parallel implementation.

In typical FFT signal flow graphs, anumber of r input data, where r is a powerof two, are processed within the first but-terfly stage. The butterfly outputs r inter-mediate results; r is the radix of the FFTimplementation. The butterfly involvessome additions and (r-1) complex multipli-cations with twiddle factors. This process isvisualized in Figure 1.

M stages are needed for an N-pointFFT, M = logr(N). All stages (with theexception of the last one) perform twiddlemultiplication and data shuffling, whichare necessary to implement the well-known “butterfly” signal flow graph of theFFT algorithm.

Therefore, a radix-r FFT pipeline for-wards r data streams with the pipeline’sclock frequency, fc. The data throughput isthus r x fc.

The hardware resources involved in theFFT processing are adders, multipliers,memory for the twiddle factors and datashuffling, and a number of multiplexers.We recommend using dedicated multipli-ers, preferably in conjunction with theassigned internal registers. Twiddles arestored in Xilinx block RAM instantiated asROM. You can also use block RAM for

and weeks. To extract the chosen signal,the accumulation process is carried out“on source,” with the antenna directed tothe object, and “off source,” with theantenna deflected. The resulting spectraare then subtracted.

The frequency range observed is spreadfrom tens of kilohertz to approximately 2terahertz. The bandwidths of the spectrom-eters should therefore be as large as possiblewhile the frequency resolution is reason-ably in the range of 100 kHz. Currently,broadband spectrometers are mainly basedon acousto-optical devices with band-widths as high as 3 GHz. They have, how-ever, a resolution of about 1 MHz and posesome problems in dynamic range, stability,and cost.

New generations of FPGAs and high-speed digital-to-analog converters enableFFT-based solutions that are competitivewith or even superior to acousto-opticaldevices, with the advantages of an all-digital,reconfigurable, and programmable system.

Radio spectroscopy based on FFT typi-cally comprises three processes:

• Digitizing the receiver’s intermediate signal

• FFT computation

• Accumulation (averaging) of the powerspectra; that is, the squared magnitudeof complex FFT values

most of the data shuffling stages, employ-ing the dual-port capability, whereas somesmaller memory blocks are more efficientlyrealized with distributed RAM.

A closer analysis of the pipeline FFTshows that there are (N-r) complex twid-

dle factors needed and a total number ofN complex data memory for the datashuffling. The number of multipliers is 4x (M-1) x (r-1) for a radix-2 or radix-4FFT pipeline.

A crucial factor with respect to thethroughput is fc. The dedicated signal pro-cessing elements, multipliers, and RAMblocks in Virtex-II FPGAs have a propaga-tion delay of as much as 5 ns. Togetherwith the signal path delays – assuming thatsuch a complex circuit is spread over a largearea – an overall delay of approximately 8ns is feasible, resulting in an overall systemclock of 125 MHz.

Several pipelines may operate in parallelwith the aid of an extended input, bufferingand scheduling to further enhance overalldata throughput.

Applying the FFT in Radio AstronomySpectrometers in millimeter astronomyare used to identify and explore atomicand molecular lines of cosmic sources.Because the signals are very weak, the sig-nal-to-noise ratio may be less than -60 dB;the power spectra will be accumulatedover a period of milliseconds up to hours

Second Quarter 2007 Xcell Journal 57

Adder

Stage

FFT

Input

Data

fc

Twiddle

ROM

Data

Shuffling

(RAM,

MUX)

Butterfly

FFT Pipeline (Radix 4)

Stage 1

FFT Pipeline

Last Stage

Next

Stages

fc

Adder

Stage

FFT

Output

Data

Figure 1 – Pipeline FFT structure

Page 58: Xcell - Xilinx · get your system running fast. Whether you are an expert or just starting out, only Xilinx delivers this complete solution to simplify high-speed serial design. Lowest-power,

We attempted to establish a new levelof performance with an FFT-based spec-trometer – one with an analog bandwidthof 1 GHz subdivided in 16,384 channels.

For more information, measurementresults, and the features of the spectrome-ter system, visit www.astro.phys.ethz.ch/instrument/argos/argos_nf.html.

Designing the Spectrometer UnitA detailed block diagram of our spectrom-eter unit is shown in Figure 2. The analoginput signal is fed to a pre-amplifier andanalog-to-digital (A/D) converter, whichdigitizes the signal at a rate of 2 GSPS. Thedata are transferred to the FPGA through a32 x 8-bit-wide bus. Next, the FFT frames(32k data samples) are multiplied with aprogrammable window function andbuffered. The complex spectral values arethen computed with the aid of two parallel,radix-4 32k-point FFT pipelines. The dataare transported continuously in 2 x 4streams of complex numbers with the sys-tem clock fc = 125 MHz. A 32k FFT frameis processed every 16.384 µs.

At the input, the streams have a widthof 2 x 9 bits. They are expanded to 2 x 18bits later within the pipelines, thus main-taining an adequate level of precision andmatching the FPGA’s RAM and multiplierstructures.

After computing the power spectrum,the resulting values are accumulated in 36-bit-wide registers. The number of accumu-lation cycles is programmable. Hence, asingle shot is possible as well.

When an accumulation cycle is com-pleted, the data are flushed to an outputbuffer. The data is transferred throughinterrupt or polling over the PCI interfaceto the host computer.

A number of control and status registersare accessible through PCI, allowing forflexible programming of the whole unit.

FPGA UtilizationThe XC2VP70 comprises 328 block RAMand as many multipliers. In the first designstep, we analyzed the need for processingresources. With the guidelines presentedhere, you can easily determine the numberof multipliers.

A rough estimate of memory is also pos-sible, taking into account all buffer stagesat the pipeline’s input and output, the data-shuffling stages within the pipelines, andthe twiddle and window ROM. However,determining the exact number of RAMblocks requires a detailed stage-by-stageanalysis, taking into account the wordwidths involved. A crucial factor is thegranularity of RAM blocks with respect tomemory size, number of ports, and config-uration possibilities.

Detailed analysis shows that for ourspectrometer application, RAM blocks arethe most limiting resource.Table 1 summarizes deviceutilization. All functionalunits, as shown in Figure 2,are included.

Managing TimingSuch a complex and time-critical application requiresa detailed inspection ofevery signal path withrespect to propagation delay. Throughoutthe design process, we allocated multipliersand RAM blocks directly in VHDL. Allarithmetic elements, RAM blocks, andlogic stages are registered consecutively.

We developed the FFT unit in VHDLusing the Mentor Graphics tool chain andXilinx ISE™ software for place and route.

Taking precautions very early in the designstage, we did not apply any location con-straints for synthesis and place and route.

Using an XC2VP70 Virtex-II Pro devicewith speed grade -6, we obtained a mini-mum clock period of 7.495 ns. Thus, aclock speed of even 133 MHz is feasible.

Flexible Hardware PlatformFigure 3 shows the hardware platform, anAcqiris AC240 2-GSPS analyzerCompactPCI card. Two interleaved 8-bitA/D converters, with a pre-amplifierstage, provide a configurable sampling

rate as high as 2 GSPS. The signal pro-cessing core comprises a Virtex-II ProFPGA. Communication to a host com-puter is established through theCompactPCI bus.

A firmware development kit (FDK) isincluded with the AC240. With this con-cept, you have all of the interfaces avail-

58 Xcell Journal Second Quarter 2007

Number of Occupied Slices 32,379 out of 33,088 98%

Number of Slice Flip-Flops 41,549 out of 66,176 63%

Number of Block RAMs 309 out of 328 94%

Number of MULT18X18s 196 out of 328 59%

Analyzer Board AC240

Preamp

&

ADC

8-bit

2 GSPS

RX

Signal

Control

&

Status

fs = 2 GHz

Power

Specturm

FFT Pipeline 1

32k Point

Radix 4

FFT Pipeline 2

32k Point

Radix 4

Twiddles

Power

Specturm

Input

Buffer

&

Window

Multipli-

cation

Accu-

mulation

&

Output

Buffer

Host

PC

Control & Status Signals

PCI

Inter-

face

Virtex-II Pro 70 FPGA

Fc = 125 MHz

Figure 2 – Spectrometer unit block diagram

Table 1 – FPGA device utilization

Page 59: Xcell - Xilinx · get your system running fast. Whether you are an expert or just starting out, only Xilinx delivers this complete solution to simplify high-speed serial design. Lowest-power,

Second Quarter 2007 Xcell Journal 59

able in a uniform description at theVHDL level. Data input streams, I/Ostreams through PCI, and front-panelcontrol and status signals are accessible inthe VHDL environment. A number ofregisters are accessible through PCI withdriver software. A number of trigger andinterrupt mechanisms are also available.

To implement our application, we hadto connect the FFT unit to the FDK’sinterfaces. We found embedding applica-tions in the FDK environment both practi-cal and convenient.

For more information about the hard-ware, visit www.acqiris.com.

ConclusionWith our implementation, we established anew level of performance in the field of FFT-based spectrometers. This is possible throughan architecture that is highly adapted to thehigh-speed requirements and signal process-ing resources available in FPGAs. BlockRAM and multipliers are the key elements.

The spectrometer unit has been testedwith complete success in the laboratory andin radio astronomy applications. Taking intoaccount the latest developments in FPGAtechnology in conjunction with faster A/Dconverters (10 bits at 4 GHz), we may beable to further increase performance.

Figure 3 – AC240 analyzer board with FPGA

Page 60: Xcell - Xilinx · get your system running fast. Whether you are an expert or just starting out, only Xilinx delivers this complete solution to simplify high-speed serial design. Lowest-power,

by Greg BrownSr. Manager, IC Marketing, Processing Solutions GroupXilinx, [email protected]

The search for the best DSP solution for agiven application usually leads designersinto a maze of price, performance, andpower consumption trade-offs, oftenrequiring a compromise of one or more toaccommodate the others. With the intro-duction of Spartan™-3A DSP, the latestaddition to the Xilinx® XtremeDSP™portfolio and the first Spartan FPGA to beDSP optimized, Xilinx has broken throughthe maze to deliver the most efficient com-bination of these three critical values for ahost of applications, including wireless basestations, mobile defense communicationssystems, surveillance, automotive, video,and medical imaging technologies.

Spartan-3A DSP delivers 32 or moreGMAC/s (32 billion multiply accumulateoperations per second), up to 2,200 Mbpsmemory bandwidth, and size-reducedpackaging. This represents a compellingprice/performance breakthrough that hitsthe mark for applications such as digitalfront-end (DFE) and baseband solutions ina single-channel picocell wireless base sta-tion; military mobile software-definedradios (SDRs); ultrasound systems; driverassistance/media systems; high-definitionvideo; and Smart IP cameras.

Moreover, with as much as 53,712 logiccells, 2,268 Kb of block RAM, 373 Kb ofdistributed RAM, 519 I/O pins,DeviceDNA for security, and newly devel-oped hibernate/suspend power-manage-

ment features, Spartan-3A DSP offersenough integration capacity to driveprice/performance/power ratios evenlower. Add to this the inherent benefitsafforded by FPGA-based DSP solutions –lower risk through design flexibility andfaster time to market – and the value of theSpartan-DSP platform becomes increas-ingly apparent (see Table 1).

DSP-Optimized Spartan FPGAs At the heart of the Spartan-3A DSP is amodified version of the XtremeDSPDSP48 slice – the DSP48A. Initially intro-duced with the release of Virtex™-4FPGAs, the DSP48 slice has anApplication Specific Modular Block(ASMBL™) architecture that powers theDSP functions in Virtex DSP devices.

These XtremeDSP slices enable designersto implement solutions to complex chal-lenges, such as hundreds of IF-to-basebanddown-conversion channels, 128x chip-rateprocessing for 3G spread spectrum systems,and high-definition H.264 and MPEG-4encode/decode algorithms in a cost- andpower-efficient manner.

The DSP48 slices support many inde-pendent functions, including multiplier,multiplier accumulator (MAC), multiplierfollowed by an adder, three-input adder,barrel shifter, wide bus multiplexers, mag-nitude comparator, or wide counter. Thearchitecture also supports connecting mul-tiple DSP48 slices to form wide math func-tions, DSP filters, and complex arithmeticfunctions without the use of general FPGAfabric, thereby reducing power consumption

Spartan-DSP Takes Aim atAffordable DSP Performance

60 Xcell Journal Second Quarter 2007

The latest addition to the XtremeDSP platform portfolio takes price/performance to a new level.

XC3SD1800A XC3SD3400A XC4VSX25 XC4VSX35 XC4VSX55 XC5VSX35T XC5VSX50T XC5VSX95T

21 32 64 96 256 106 158 352

1,736 2,196 4,608 6,912 11,520 3,326 5,227 9,662

250 250 500 500 500 550 550 550

84 126 128 192 512 192 288 640

19 x 19 19 x 19 27 x 27 27 x 27 27 x 27 27 x 27 27 x 27 27 x 27

260 373 160 240 384 520 780 1,520

1,512 2,268 2,304 3,456 5,760 3,024 4,752 8,784

37,440 53, 712 23,040 34,560 55,296 34,816 52,224 94,208

Max DSP Performance(GMAC/s)

Max Memory Bandwidth (Mbps)

Max DSP Frequency (MHz)

XtremeDSP DSP48 Slices

Min Footprint (mm)

Distributed RAM (Kb)

Block RAM (Kb)

Logic Cells

Spartan-3A DSP

Spartan-DSP Virtex-DSP

Virtex-4 SX FPGA Virtex-5 SXT FPGA

Table 1 – The Spartan-DSP platform fills an important performance slot – the <40 GMAC/s range – in the XtremeDSP platform portfolio.

Page 61: Xcell - Xilinx · get your system running fast. Whether you are an expert or just starting out, only Xilinx delivers this complete solution to simplify high-speed serial design. Lowest-power,

while delivering very high performance andefficient silicon utilization.

The Spartan-DSP DSP48A slice is a sim-plified and cost-reduced version of theVirtex-4 DSP48 slice. To reduce cost, therounding modes, 17-bit shifters, and 3-inputadder modes were removed from theDSP48A slice; you can implement thesefunctions in the FPGA fabric if necessary.Two additional enhancements in theDSP48A slice are a fly-independent C-portand a pre-adder. The independent C-portprovides increased flexibility in implement-ing DSP algorithms. The pre-adder increasesdensity for common DSP filters and FFTs.Specifically, the pre-adder can be used toreduce the number of DSP48A slices that arerequired by 50% for sym-metric FIR filters and by25% for FFT algorithms,respectively. In theSpartan-3A DSP plat-form, the optimizedDSP48A slice achieves250 MHz operation inthe slowest speed grade.

Application ImpactTypical of the price/per-formance/power efficien-cy made possible bySpartan-3A DSP devices,a single XC3SD1800Adevice can replace two$25 DSP processors in aSmart IP camera application, absorbing theentire video pipeline section in the process.Besides the immediate $25 cost reduction,this enables you to place the remaining con-trol functions in a smaller, less expensiveDSP processor, thereby dropping the cost ofmaterials by another $10.

For high-volume consumer applicationssuch as the Smart IP camera or high-defini-tion video, this represents tremendous valueto the manufacturer that goes directly to thebottom line. Add to this the savings inpower, footprint, and bill of materials(BOM), and Spartan-DSP can have a directpositive impact on profitability, reliability,and product migration.

Similar studies in multi-stream videoservers have shown that a design employ-

requirements while meeting size, cost, andpower budgets.

DSP engineers are more focused on thecreation and refinement of DSP algo-rithms. Typically unfamiliar with detailedhardware design, they rely on tools toabstract away the details of hardware designso that they can focus on higher leveldesign exploration and validation.

Hardware engineers usually work inVHDL or Verilog to extract the maxi-mum performance from their designs.They often require the capability to worksimultaneously with higher level func-tional blocks and their own RTL-leveldesign from within the same design envi-ronment, running test benches to validate

function and performance.Thus, an essential factor in

the success of the XtremeDSPInitiative is the degree to whichthe design tools accommodateall three profiles. Since thelaunch of the initiative,XtremeDSP tools such asSystemGenerator and AccelDSPhave evolved to provide systemmodeling, algorithmic develop-ment and exploration, automat-ic generation of test benches,design verification and debug-ging, and HDL generation andsimulation. Whether you preferto work with VHDL, Verilog,C/C++, MATLAB, Simulink,

HDL, or any combination of these,XtremeDSP tools provide fast and efficientaccess to the full power of the FPGA.

Conclusion In the DSP marketplace, it is not alwaysthe fastest, the least expensive, or the mostpower-efficient processor that wins: it isthe platform that provides the best fit inevery category. For a large and growingnumber of applications, Spartan-3A DSPprovides the most efficient combinationof price, performance, and power con-sumption, backed by a robust set of tools,IP, and support infrastructure. If your nextDSP design needs extreme efficiency, youmay need the latest XtremeDSP solutionfrom Xilinx.

ing six $25 DSP processors can be reducedto three $25 Spartan-3A DSP devices, lit-erally cutting device costs in half. Hereagain, the positive impact on power, foot-print, and BOM is highly appealing.

In some cases, such as SDR for mobiledefense communications, the Spartan-3ADSP can serve as a reconfigurable co-proces-sor to a discrete DSP, providing both theaforementioned price/performance/powereconomies as well as eliminating the needfor duplicate circuitry to support multiplewaveforms.

Clearly, in every application for whichSpartan-DSP is an appropriate platform, theresult is greater efficiency in performance,power, and price.

XtremeDSP SolutionsThe XtremeDSP Initiative, launched inNovember 2000, gave light to a corporatecommitment shared by Xilinx and its partnersto make all of the available FPGA-based DSPhorsepower and flexibility accessible to threedistinct designer profiles: the system designer,the DSP engineer, and the FPGA/hardwareengineer. Each profile represents a unique setof responsibilities (and preferences) that dic-tate the requirements for their particulardesign environment.

System designers must quickly determinehow best to partition the various functions ofa system-level design across the available pro-cessing resources. Their concerns focus on theselection of processing resources that meetproduct performance and throughput

Second Quarter 2007 Xcell Journal 61

BCOUT

D:A:B comcat

PCOUT

D

CE

D REG

Q

D

CE

B1 REG

Q

D

CE

M REG

Q

D

CE

P REG

QD

CE

A REG

Q

D

CE

B0 REG

Q

D

CE

C REG

Q

0

1

D

B

A

C

BCIN PCIN

P

Z

X

18

18

18

18

48

2-Deep

36

48

48

48

48

SUBPA

SUBA

OpMode[3:2]

OpMode[1:0]

CIN

Figure 1 – The pre-adder saves nine slices for every Spartan-DSP DSP48A slice used in a FIR filter design.

Page 62: Xcell - Xilinx · get your system running fast. Whether you are an expert or just starting out, only Xilinx delivers this complete solution to simplify high-speed serial design. Lowest-power,
Page 63: Xcell - Xilinx · get your system running fast. Whether you are an expert or just starting out, only Xilinx delivers this complete solution to simplify high-speed serial design. Lowest-power,

Second Quarter 2007 Xcell Journal 63

High-Volume Design Application NotesThese FPGA and CPLD application notes can help you design for high-volume markets.

R E F E R E N C E

Spartan-3 Generation FPGAs

Audio, Video, and Image Processing

XAPP485 – 1:7 Deserialization in Spartan-3E FPGAs at Speeds Up to 666 Mbps www.xilinx.com/bvdocs/appnotes/xapp485.pdf

XAPP486 – 7:1 Serialization in Spartan-3E FPGAs at Speeds Up to 666 Mbps www.xilinx.com/bvdocs/appnotes/xapp486.pdf

XAPP491 – Inverting LVDS Signals for Efficient PCB Layout in Spartan-3 Generation FPGAs www.xilinx.com/bvdocs/appnotes/xapp491.pdf

XAPP930 – Color-Space Converter: RGB to YCrCb www.xilinx.com/bvdocs/appnotes/xapp930.pdf

Digital Signal Processing

XAPP634 – Analog Devices TigerSHARC Link www.xilinx.com/bvdocs/appnotes/xapp634.pdf

XAPP753 – Interfacing Xilinx FPGAs to TI DSP Platforms Using the EMIF www.xilinx.com/bvdocs/appnotes/xapp753.pdf

Communication and Networking

XAPP551 – Viterbi Decoder Block Decoding – Trellis Termination and Tail Biting www.xilinx.com/bvdocs/appnotes/xapp551.pdf

XAPP569 – Digital Up and Down Converters for the CDMA2000 and UMTS Base Stations www.xilinx.com/bvdocs/appnotes/xapp569.pdf

XAPP942 – Reference System: OPB Ethernet MAC www.xilinx.com/bvdocs/appnotes/xapp942.pdf

Memory Interface and Storage Elements

XAPP229 – Wider Block Memories www.xilinx.com/bvdocs/appnotes/xapp229.pdf

XAPP291 – Self-Addressing FIFO www.xilinx.com/bvdocs/appnotes/xapp291.pdf

XAPP482 – MicroBlaze Platform Flash/PROM Boot Loader and User Data Storage www.xilinx.com/bvdocs/appnotes/xapp482.pdf

CoolRunner-II CPLDs

Handheld Consumer Electronics

XAPP394 – Interfacing to Mobile SDRAM with CoolRunner-II CPLDs www.xilinx.com/bvdocs/appnotes/xapp394.pdf

XAPP905 – Using CoolRunner-II with OMAP, Xscale, i.MX & Other Chipsets www.xilinx.com/bvdocs/appnotes/xapp905.pdf

XAPP914 – Connecting Intel PXA27x Processors to Hard-Disk Drives with a CoolRunner-II CPLD www.xilinx.com/bvdocs/appnotes/xapp914.pdf

Industrial, Scientific, and Medical

XAPP384 – Interfacing to DDR SDRAM with CoolRunner-II CPLDs www.xilinx.com/bvdocs/appnotes/xapp384.pdf

XAPP385 – CoolRunner-II CPLD I2C Bus Controller Implementation www.xilinx.com/bvdocs/appnotes/xapp385.pdf

XAPP386 – CoolRunner-II Serial Peripheral Interface Master www.xilinx.com/bvdocs/appnotes/xapp386.pdf

XAPP387 – PicoBlaze 8-bit Microcontroller for CPLD Devices www.xilinx.com/bvdocs/appnotes/xapp387.pdf

XAPP940 – Using Xilinx CPLDs as Motor Controllers www.xilinx.com/bvdocs/appnotes/xapp940.pdf

Page 64: Xcell - Xilinx · get your system running fast. Whether you are an expert or just starting out, only Xilinx delivers this complete solution to simplify high-speed serial design. Lowest-power,

by Wil Florentino Starter Kits Marketing ManagerXilinx, [email protected]

Many development systems claim to offera “quick start” toward feature evaluationand a simple out-of-the box experience.Getting the device set up sometimes takesmore energy than running your first pro-gram. Unfortunately, getting up and run-ning could mean loading CD files, settingjumpers, downloading code, or hookingup banks of headers. You may need third-party tools or external reference designs.You may even have to go to your localelectronics store to get cables and a powersupply – all to get to “step one.”

To keep things a lot simpler, Xilinx hasintroduced an “Out-of-the-Box Guarantee”starting with the Spartan™ family of starterkits. When you open the box, you will haveall of the key hardware components thatwill quickly highlight some features of theFPGA device as soon as you turn it on. Forexample, the newly released Spartan-3AStarter Kit contains a development boardpre-programmed with demo software, aglobal power supply, a download cable, anda resource CD that contains exampledesigns as well as documentation.

When you open the box, all you need todo is plug in the power supply, switch it on,and connect the board to a display; a pro-gram already built into the board will runautomatically and allow you to quicklysample the device features. For example, it

shows how the multi-boot feature works byseamlessly toggling between configurationsamples, or the fractal display generator’scomplex polynomial mathematical capabil-ities. There is also a demo that reads thechip’s DeviceDNA – a unique production-

stamped identifier that gives you an addition-al level of intellectual property security.

In addition, these demonstrations do notjust run out of the box. They also are availablein source code so that you are free to use andreconfigure the samples as you wish.

As customers link their technology expertiseto FPGA devices, it is best to equip them withhardware and software tools for easy develop-ment and provide them a glimpse into devicefunctionality when they start up a system. TheSpartan-3A Starter Kit is just one example ofhow Xilinx provides customers a true “quickstart” by providing predictable and com-plete components for you to see the device’scapabilities out-of-the-box.

64 Xcell Journal Second Quarter 2007

A Quicker Quick Start Spartan-3 kits now come with an Out-of-the-Box Guarantee.

Spartan-3A Highlights:• Power management modes

• Enhanced multiboot configuration loads multiple designs from a single PROM

• DeviceDNA identifier is unique to each FPGA

• Spartan-3A Starter Kit speeds development

R E F E R E N C E

Page 65: Xcell - Xilinx · get your system running fast. Whether you are an expert or just starting out, only Xilinx delivers this complete solution to simplify high-speed serial design. Lowest-power,

Choose from a variety of flexible trainingoptions, including instructor-led classes(both in person and online) and recordede-learning that allows you to learn at yourown pace.

For tuition and registration informationas well as class schedules, please contact anauthorized training provider at www.xilinx.com/support/training/atp.htm. For acomplete listing of all Xilinx® trainingopportunities, visit www.xilinx.com/support/education-home.htm.

FPGA Design Courses• Fundamentals of FPGA Design –

This course covers ISE™ software 9.1ifeatures, such as the architecture wiz-ard and floorplan editor. Other topicsinclude design planning, implementa-tion options, and global timing con-straints. For more emphasis onimproving overall design performance,

take the follow-up course, Designingfor Performance, which builds on thebasic principles covered in this course.

On the Web at www.xilinx.com/support/training/abstracts/fundamentals.htm

• Design Techniques for Lower Cost –This course will appeal to engineerswho have an interest in developinglow-cost products, particularly inhigh-volume markets. The course andexercises cover several different designtechniques, which will be exciting andchallenging for any digital designerregardless of the final application.

On the Web at www.xilinx.com/support/training/abstracts/low-cost.htm

CPLD Design Courses• Fundamentals of CPLD Design –

This comprehensive course provides

you with an introduction to design-ing with Xilinx CPLDs by using ISEsoftware tools. You will learn thebasics of ISE software flow and howto interpret CPLD reports for opti-mum performance designs.

On the Web at www.xilinx.com/support/training/abstracts/cpld.htm

• Designing for Performance forCPLDs – This intermediate-levelcourse provides a comprehensiveoverview on CPLD software flow. By applying the techniques presented in this course, you will be able to enhance the performance of your designs and make the best possible use of Xilinx CPLDarchitectures.

On the Web at www.xilinx.com/support/training/abstracts/dfp_cpld.htm

Second Quarter 2007 Xcell Journal 65

Educational OpportunitiesXilinx delivers training that helps you develop low-cost, efficient programmable logic designs.

R E F E R E N C E

Page 66: Xcell - Xilinx · get your system running fast. Whether you are an expert or just starting out, only Xilinx delivers this complete solution to simplify high-speed serial design. Lowest-power,

Here are 6 of the new, faster, bigger, Virtex-5 FPGAs on a 12 Million ASIC Gate Boardthat offers unmatched performance to ASIC Prototypers, IP Designers, and FPGADevelopers. The V5 65nm process, with 6 input LUT and advanced interconnect,enables 30% faster clock speeds in your application. The Dini DN9000k10PCI capturesthis performance on an easy to use board with these handy features:

•33/66 MHz PCI bus or stand-alone operation•6 - DDR2 SODIMM Sockets•7 - Global Clock Networks•3 - 400pin FCI-MEG Connectors for daughter cards• Easy configuration via CompactFlash, USB or PCI

All necessary operating software, including referencedesigns and Synplicity Certify™ models to simplifypartitioning, is supplied with the board. If yourneed is speed — visit The Dini Group web site www.dinigroup.com for complete details on the fastest FPGA board ever.

30% faster than lastyear’s model…

1010 Pearl Street, Suite 6La Jolla, CA 92037 (858) 454-3419 [email protected]

Page 67: Xcell - Xilinx · get your system running fast. Whether you are an expert or just starting out, only Xilinx delivers this complete solution to simplify high-speed serial design. Lowest-power,
Page 68: Xcell - Xilinx · get your system running fast. Whether you are an expert or just starting out, only Xilinx delivers this complete solution to simplify high-speed serial design. Lowest-power,

IntegrationFor security and performance,get the best of both worlds.

www.xilinx.com/spartan

The Ultimate Low-Cost Applications Platform

©2007 Xilinx, Inc. All rights reserved. XILINX, the Xilinx logo, and other designated brands included herein are trademarks of Xilinx, Inc. All other trademarks are the property of their respective owners.

Introducing Spartan-3AN, a non-volatile FPGA platform forhighest system integration.

With all the features and performance of SRAM-based FPGAs,the new Spartan-3AN devices also provide the board space savingsand worry-free configurability of non-volatile FPGAs. Each FPGAhas up to 11Mb user-Flash, a Device DNA serial number and aFactory Flash ID, enabling customizable security solutions to deterreverse engineering, cloning and overbuilding.

Driving the next wave of high-volume applicationsThe Spartan-3 Generation of FPGAs gives you all the choice youneed to solve any design challenge:

• Spartan-3AN platform – Non-volatile

• Spartan-3A platform – I/O optimized

• Spartan-3E platform – Logic optimized

• Spartan-3 platform – Optimized for high density and pin count

Visit www.xilinx.com/spartan today, and find out how theSpartan-3 Generation of FPGAs gives you the best of all worlds.

Nearest CompetitorSpartan-3AN

1000X MoreSystem Flexibility!

11Mb

8Kb

User Flash

PN 2023