Virtex-5 Special Edition - Xilinx · 2018-08-02 · W Welcome to this special edition of Xcell Journal, featuring a broad array of articles on Xilinx® Virtex™-5 FPGAs. In this

ISSUE 59, FOURTH QUARTER 2006XCELL JOURNAL

XILINX, INC.

R

Issue 59Fourth Quarter 2006

T H E A U T H O R I T A T I V E J O U R N A L F O R P R O G R A M M A B L E L O G I C U S E R S

Xcell journalXcell journalT H E A U T H O R I T A T I V E J O U R N A L F O R P R O G R A M M A B L E L O G I C U S E R S

INSIDE

Achieve Higher Performancewith Virtex-5 FPGAs

HDL Coding and DesignPractices for Improving Virtex-5 Utilization,Performance, and Power

A Multi-Gigabit Transceiverfor the Masses

Introducing the Virtex-5 PCI Express Endpoint Block

Meeting Memory InterfaceDesign Challenges withVirtex-5 FPGAs

INSIDE

Achieve Higher Performancewith Virtex-5 FPGAs

HDL Coding and DesignPractices for Improving Virtex-5 Utilization,Performance, and Power

A Multi-Gigabit Transceiverfor the Masses

Introducing the Virtex-5 PCI Express Endpoint Block

Meeting Memory InterfaceDesign Challenges withVirtex-5 FPGAs

www.xilinx.com/xcell/

Virtex-5SpecialEdition

Virtex-5SpecialEdition

SILICA I The Engineers of Distr ibution

www.si l ica.com

© Avnet, Inc. 2006. All rights reserved. AVNET is a registered trademark of Avnet, Inc.

Silica designs, manufactures, sells and supports a wide

variety of hardware evaluation, development and reference

design kits for developers looking to get a quick start on

a new project.

With a focus on embedded processing, communications

and networking applications, this growing set of modular

hardware kits allows users to evaluate, experiment,

benchmark, prototype, test and even deploy complete

designs for field trial.

By providing a stable hardware platform that enhances system

development, design kits from Silica help original equipment

manufacturers (OEMs) bring differentiated products to market

quickly and in the most cost-efficient way possible.

For a complete listing of available boards, visit

www.silica.com

Support Across The Board.™

Design Kits Fuel Feature-Rich Applications

Build your own system bymixing and matching:

Processors

FPGAs

Memory

Networking

Audio

Video

Mass storage

Bus interface

High-speed serial interface

Available add-ons:

Software

Firmware

Drivers

Third-party development tools

•

•

•

•

•

•

•

•

•

•

•

•

•

The Programmable Logic CompanySM

www.xilinx.com/virtex5

Achieve highest system speed and better design margin withthe world’s first 65nm FPGAs.

Virtex™-5 FPGAs feature ExpressFabric™ technology on 65nm triple-oxide

process. This new fabric offers the industry’s first LUT with six independent

inputs for fewer logic levels, and advanced diagonal interconnect to enable

the shortest, fastest routing. Now you can achieve 30% higher

performance, while reducing dynamic power by 35% and area by 45%

compared to previous generations.

Design systems faster than ever before

Shipping now, Virtex-5 LX is the first of four platforms optimized for

logic, DSP, processing, and serial. The LX platform offers 330,000 logic

cells and 1,200 user I/Os, plus hardened 550 MHz IP blocks. Build deeper

FIFOs with 36 Kbit block RAMs. Achieve 1.25 Gbps on all I/Os without

restrictions, and make reliable memory interfacing easier with enhanced

ChipSync™ technology. Solve SI challenges and simplify PCB layout with our

sparse chevron packaging. And enable greater DSP precision and dynamic

range with 550 MHz, 25x18 MACs.

Visit www.xilinx.com/virtex5, view the TechOnline webcast, and give

your next design the ultimate in performance.

The Ultimate System Integration Platform

Ultimate Performance…Pe

rfo

rman

ce

LogicFabric

Performance

On-chipRAM

550 MHz

DSP32-Tap Filter

550 MHz

I/O LVDSBandwidth750 Gbps

I/O MemoryBandwidth384 Gbps

Virtex-5 FPGAs Virtex-4 FPGAs Nearest Competitor

Numbers show comparision with nearest competitorBased on competitor’s published datasheet numbers

1.4 x

1.3 x 1.6 x

2.4 x

4.4 x

Industry’s fastest 90nm FPGA benchmark

©2006 Xilinx, Inc. All rights reserved. XILINX, the Xilinx logo, and other designated brands included herein are trademarks of Xilinx, Inc. All other trademarks are the property of their respective owners.

WWelcome to this special edition of Xcell Journal, featuring a broad array of articles on Xilinx®

Virtex™-5 FPGAs. In this issue you’ll find executive and industry viewpoints; articles onengineering solutions, design challenges, tools, customer successes, and vertical markets; anda technical reference section covering application notes, boards, and IP.

As exciting as this is, I’d also like to let you know about a couple of announcements from Xcell Publications.

Xcell Publications Honored with APEX 2006 Award of ExcellenceXcell Publications was recently awarded the APEX 2006 Award of Excellence in two categories –magazine and journal design and layout and custom-published magazines and journals – for twoof its flagship Xcell Publications, Xcell Journal and I/O Magazine.

APEX 2006 – the 18th Annual Awards for Publication Excellence – is an international compe-tition that recognizes outstanding publications, including newsletters, magazines, annualreports, brochures, and websites. According to APEX judges, this year’s competition was excep-

tionally intense, with nearly 5,000 entries. Awards were granted based onexcellence in graphic design, quality of editorial content, and the success ofthe entry in conveying the message and achieving overall communicationseffectiveness.

“We’re honored that Xcell magazines have been selected for excellence in publishing among such a stellar list of companies by the APEX panel of

judges,” said Sandeep Vij, vice president of worldwide marketing at Xilinx. “Over the past 18years, our custom publications have served as a foundational tool, delivering ‘how-to’ informationto a growing base of engineers using Xilinx programmable chips to design a wide variety of electronicsystems, ranging from the Mars Rover to high-volume consumer handsets, flat-panel displays andautomotive infotainment systems. Being ranked among the industry’s best underscores the valueand quality of our company’s portfolio of custom magazines.”

Xilinx joins a prestigious list of award-winning companies from a variety of industries in theAPEX competition for custom-published magazines and journals, including Blue Cross BlueShield, CMP Media/Digital Connect, DaimlerChrysler, IBM Journal of Research andDevelopment, Mac Publishing, National Football League, National Foundation for Advancementin the Arts, Penton Custom Media, and Time Inc. Strategic Communications.

New Digital Editions AvailableWe now offer digital editions of our magazines. Now you can subscribe for free to the new Xcell Journal Digital, requiring no software downloads and visible on any standard Internet browser.This updated publishing technology lets you browse, search, make notes, e-mail authors, and clickthrough to advertisers’ websites.

To receive Xcell Journal Digital, you have to subscribe. In additionto Xcell Journal, we also now offer digital subscriptions of all ofour magazines. Please visit our website at www.xilinx.com/xcelland click on “Subscriber Services.”

I hope you enjoy reading this issue.

L E T T E R F R O M T H E P U B L I S H E R

Xilinx, Inc.2100 Logic DriveSan Jose, CA 95124-3400Phone: 408-559-7778FAX: 408-879-4780www.xilinx.com/xcell/

© 2006 Xilinx, Inc. All rights reserved. XILINX, the Xilinx Logo, and other designated brands includedherein are trademarks of Xilinx, Inc. PowerPC is atrademark of IBM, Inc. All other trademarks are theproperty of their respective owners.

The articles, information, and other materials includedin this issue are provided solely for the convenience ofour readers. Xilinx makes no warranties, express,implied, statutory, or otherwise, and accepts no liabilitywith respect to any such articles, information, or othermaterials or their use, and any use thereof is solely atthe risk of the user. Any person or entity using suchinformation in any way releases and waives any claim itmight have against Xilinx for any loss, damage, orexpense caused thereby.

Forrest CouchPublisher

PUBLISHER Forrest [email protected]

EDITOR Charmaine Cooper Hussain

ART DIRECTOR Scott Blair

DESIGN/PRODUCTION Teie, Gelwicks & Associates1-800-493-5551

ADVERTISING SALES Dan Teie1-800-493-5551

TECHNICAL COORDINATOR Greg Lara

INTERNATIONAL Dickson Seow, Asia [email protected]

Andrea Barnard, Europe/Middle East/[email protected]

Yumi Homura, [email protected]

SUBSCRIPTIONS All Inquirieswww.xcellpublications.com

REPRINT ORDERS 1-800-493-5551

Xcell journal

www.xilinx.com/xcell/

O N T H E C O V E R

1919P E R F O R M A N C E

4242S E R I A L C O N N E C T I V I T Y

1616P E R F O R M A N C E

Achieve Higher Performance with Virtex-5 FPGAsNew architectural elements can help you attain higher system-level performance.

4545S E R I A L C O N N E C T I V I T Y

Introducing the Virtex-5 PCI Express Endpoint BlockWith PCI Express quickly becoming the standard high-bandwidth interconnect, the Virtex-5 LXT PCIe Endpoint block enables a configurable single-chip solution.

7373M E M O R Y I N T E R F A C E S

Meeting Memory Interface Design Challenges with Virtex-5 FPGAsVirtex-5 devices support the latest generation of high-speed memory interfaces.

HDL Coding and Design Practices for Improving Virtex-5 Utilization, Performance, and PowerThese tips and techniques can lead to better Virtex-5 designs.

A Multi-Gigabit Transceiver for the MassesThe Virtex-5 GTP transceiver brings versatility, ease of use, power efficiency, and cost-effectiveness to high-volume mainstream applications.

Introducing the Virtex-5 FPGA FamilyThe first 65-nm advanced FPGAs raise the bar in performance, power efficiency, capacity, and value.

Viewpoint

88

F O U R T H Q U A R T E R 2 0 0 6 , I S S U E 5 9 Xcell journalXcell journalVIEWPOINTSIntroducing the Virtex-5 FPGA Family ....................................................................................8Serial Everywhere – The Triple-Play Challenge .....................................................................12Virtex-5 Serial Connectivity Solutions .................................................................................13FPGAs for Serial Interconnections .......................................................................................14

PERFORMANCEAchieve Higher Performance with Virtex-5 FPGAs .................................................................16HDL Coding and Design Practices for Improving Virtex-5 Utilization, Performance, and Power......19Getting the Best Results from Virtex-5 FPGAs ......................................................................23Maximizing Design Performance for Virtex-5 FPGAs ..............................................................28Clock Management in Virtex-5 Devices ...............................................................................31

POWERReduce Power with Virtex-5 FPGAs ....................................................................................33Applying Compact Thermal Models .....................................................................................38

SERIAL CONNECTIVITYA Multi-Gigabit Transceiver for the Masses ............................................................................42Introducing the Virtex-5 PCI Express Endpoint Block ..............................................................45PCI Express Markets, Trends, and Applications ......................................................................49Designing with Virtex-5 Embedded Tri-Mode Ethernet MACs ....................................................54Asynchronous Sample-Rate Conversion Between AES Audio Streams........................................57Implementing Integrated Video Connectivity Solutions with Virtex-5 LXT Devices .......................61Enhancing System Management and Diagnostics with the Virtex-5 System Monitor ...................64Real-Time Debugging for Virtex-5 FPGAs ..............................................................................68

MEMORY INTERFACESMemories are Made of This ... ...........................................................................................70Meeting Memory Interface Design Challenges with Virtex-5 FPGAs ..........................................73Implementing Memory Controllers Using the Memory Interface Generator Tool..........................76Micron Memory Interface ..................................................................................................79Designing Virtex-5 DDR2 Memory Interfaces for Signal Integrity..............................................83

SOURCE-SYNCHRONOUS INTERFACESImprove System Reliability with SPI-4.2 LogiCORE Solutions and Virtex-5 FPGAs .......................87

VERTICAL MARKET SOLUTIONSUsing Virtex-5 FPGAs in COTS Board-Level Products ...............................................................90Tackling Serial Backplane Interface Design Challenges ...........................................................97Enabling Multi-Port 1 Gbps and 10 Gbps TCP/iSCSI Protocol Offload Solutions........................101Implementing Encryption Algorithms with the Virtex-5 LXT Platform.......................................103

GENERALVirtex-5 Configuration Options Offer Designers a Choice .......................................................107Introducing Virtex-5 EasyPath FPGAs .................................................................................108

REFERENCEConnectivity Solutions .....................................................................................................110Intellectual Property Offerings ..........................................................................................111Virtex-5 Boards and Kits ..................................................................................................112

by Steve DouglassVice President Product Development,Advanced Product DivisionXilinx, [email protected]

Welcome to theVirtex™-5 issue of

Xcell Journal. The Xilinx® Virtex-5 familyis not only the industry’s first 65-nmFPGA – it also offers some of the mostadvanced architecture and highest per-formance in the world. Continuing ourhistory of developing groundbreakingtechnology, we listened to leading designengineers in various markets and built onkey characteristics that made our Virtex-4FPGA family a tremendous success:

• Higher performance

• Higher logic density

• Lower power consumption

• More advanced features

The fundamental value propositions ofFPGAs include faster time to market, ver-satility, support for evolving standards, riskmitigation, field upgradability, and lowersystem costs. Our FPGAs accommodateyour demands for continued improve-ments in performance, capacity, powerconsumption, and cost.

8 Xcell Journal Fourth Quarter 2006

Introducing the Virtex-5 FPGA FamilyIntroducing the Virtex-5 FPGA FamilyThe first 65-nm advanced FPGAs raise the bar in performance, power efficiency, capacity, and value.The first 65-nm advanced FPGAs raise the bar in performance, power efficiency, capacity, and value.

V I E W P O I N T

The Virtex-5 family combines the inher-ent advantage of state-of-the-art 65-nmprocess technology with an innovative designthat is based on a deeper understanding of theapplications our products serve. In this arti-cle, I’ll provide an overview of the new fea-tures in Virtex-5 devices, explain theunderlying technology, and offer a glimpse ofthe design decisions that led to our world-leading FPGA architecture.

Process Technology and Architectural InnovationsVirtex-5 FPGAs are built on 65-nm triple-oxide technology using our Advanced SiliconModular Block (ASMBL™) architecture andproviding additional levels of system integra-tion. This new family offers an advanced plat-form that meets the growing need forprogrammable systems with higher perform-ance, higher density, lower power consump-tion, and lower overall system cost.

It might be easy to deliver on one or twoof these items, but our challenge was todeliver all of them at the same time.

We successfully met those challengesthrough a combination of advanced ICprocess development and innovative archi-tecture and circuit design. Introduced in theVirtex-4 family, our proven ASMBL chiplayout architecture allows us to provide theoptimal mix of required device resources(logic, memory, arithmetic, I/O, and IP),thus creating the ideal combination for fournew platforms:

• The LX platform, optimized for high-performance logic

• The LXT platform, optimized for high-performance logic with low-power serial I/O

• The SXT platform, optimized for high-performance arithmetic- and mem-ory-intensive DSP with low-power serial I/O

• The FXT platform, optimized forembedded processing and very high-speed serial I/O

Compared to our Virtex-4 family,Virtex-5 devices offer 30 percent higheraverage speed and 65 percent higher capac-ity in the largest device. Dynamic powerconsumption is reduced by 35 percent and

(notably, the interconnect). A six-inputLUT (6-LUT) with four times more bitsthus increases the CLB area by only 15% –but packs, on average, 40% more logic intoeach LUT. This higher logic density oftenreduces the number of cascaded LUTs andcan improve the critical path delay, asshown in Figure 2.

chip area is 45 percent smaller, resulting ina lower cost per function.

Higher Performance and DensityExpressFabric™ technology implementslogic and local interconnect routing. Itincorporates look-up tables (LUTs) with sixindependent inputs, plus a new diagonalinterconnect structure, as illustrated inFigure 1. ExpressFabric technology imple-ments combinatorial logic in fewer LUTlevels and uses fewer concatenated connec-tions to neighboring building blocks, ascompared to the Virtex-4 architecture. Thisreduces datapath delays and thus increasesdesign performance.

Advanced 6-LUT Logic StructureFor many years, four-input LUTs were theindustry standard. However, at 65 nm, theregular structure of the LUT can be shrunkeven more than the remaining circuitry

Fourth Quarter 2006 Xcell Journal 9

6-LUT

6-LUT

6-LUT

6-LUT

Virtex-5

Slice

Direct

CLB

Virtex-5

Interconnect

Capability

Virtex-4

Interconnect

Capability

1 Hop

2 Hops

3 Hops

Making the RightTrade-Off

Number of LUT Inputs

Architectural Evaluation of Typical Designs

2 3 4 5 6 7 8

Critical P

ath

Dela

y

Used D

ie A

rea

Figure 1 – Virtex-5 ExpressFabric technology

Figure 2 – Optimal performance/area trade-off

V I E W P O I N T

We took a suite of customer designs andimplemented them using ISE™ 8.1i soft-ware. For each design, we compared thenumber of LUTs used with Virtex-4 andVirtex-5 device implementations and cor-related this information with the perform-ance increase in megahertz. The scatterplotgraph in Figure 3 shows the percentage ofperformance improvement on the X axisand percentage area savings in terms ofLUT count reduction on the Y axis. Thenew 6-LUT ExpressFabric technology pro-vides a win-win solution in both perform-ance gain and resource savings.

Unlike competing FPGAs, Virtex-5FPGAs provide real 6-LUTs that you can useas logic or as distributed memories, where aLUT can be a 64-bit distributed RAM (evendual- or quad-ported) or a 32-bit program-mable shift register. Each LUT can have twooutputs, thus implementing two logic func-tions of five variables, storing 32 x 2 RAMbits, or acting as a 16 x 2-bit shift register.

New Diagonally Symmetric InterconnectA new diagonally symmetric intercon-nect pattern enhances performance byreaching more places in fewer routinghops. A comparison between Virtex-5and Virtex-4 FPGA interconnect pat-terns (with each box representing a CLB)is illustrated in Figure 1. The color codesshow that with the Virtex-5 FPGA, thepattern is more symmetric, with moreCLBs reached in fewer hops. The sym-metry thus achieves better results fromplace and route software tools.

These features are transparent to Virtex-5FPGA users and are automatically exercisedby ISE software, resulting in easier routabili-ty and higher overall performance.

Lowest Power Advanced FPGA SolutionThe Virtex-5 device family uses ouradvanced 65-nm, triple-oxide, 11-layercopper CMOS process technology.“Triple oxide” refers to the number of dif-ferent transistor gate-oxide thicknessesused. The I/O transistors must be 3.3Vtolerant and use relatively thick oxide, butthe very fast transistors used for logic andother core functions use very thin oxide.

Unfortunately, very thin oxide andvery low threshold voltage unavoidablycause high leakage current. There are,however, many transistors in an FPGAthat need not be very fast (notably theconfiguration storage cells). Starting withthe Virtex-4 family, Xilinx pioneered athird, intermediate gate thickness forthose transistors. This triple-oxideapproach allows us to fine-tune the per-formance and power in the device circuit-ry; it enables Virtex-5 devices to deliverindustry-leading performance while dra-matically lowering leakage current andthus static power consumption.

Additionally, the new 6-LUT logicstructure combines more logic per LUT,uses fewer local interconnect nodes, andfewer high capacitance nodes betweenlogic functions, reducing the levels oflogic and thus the path delay. The newsymmetric routing also uses more direct

connects between adjacent logic, againlowering routing capacitance.

VCCINT, the core supply voltage, isnow 1.0V. All of these factors contributeto a reduction in overall dynamic powerconsumption. With the success of theVirtex-4 family, we know that many engi-neers view performance and power con-sumption as two equally importantconstraints in their system designs; there-fore, we need to offer both high perform-ance and low power.

We completely reengineered theVirtex-5 logic fabric to fully take advan-tage of the 65-nm triple-oxide CMOSprocess, resulting in the highest perform-ance fabric ever, with system clock rates inexcess of 550 MHz. At the same time,static power is comparable to that of the90-nm Virtex-4 devices, while dynamicpower has been reduced by at least 35%.Just like its predecessor, the Virtex-5 fam-ily again provides the lowest power solu-tion of any advanced FPGA family.

Advanced Features for System IntegrationIn the Virtex-5 family, we have added aphase-locked loop (PLL) to each clockmanagement tile (CMT), which now con-tains two digital clock managers (DCMs)and one PLL. The CMT thus offers thebest of both worlds: the robust versatilityand precise incremental phase shift capa-bility of a digital clock manager combinedwith the jitter reduction from the analogPLL. The largest device in the family hassix CMTs capable of generating andmanipulating 550-MHz clocks, support-ing the performance of Virtex-5 logic andblock functions.

Synchronous dual-ported block RAMis an important function. The size of eachblock RAM has been increased to 36 Kb,but you can also use it as two independent18-Kb block RAMs. The data bus width isprogrammable from 1 bit to 36 bits. Insimple dual port mode (one port write,one port read) the data bus width can beas high as 72 bits, effectively doubling thedata bandwidth. You can turn off unused18-Kb blocks to save power.

The block RAM has integrated FIFOcontrol logic, simplifying the design of


-60

-40

-20

020 60 60

Performance Increase (%)

Most designs show

more than 20%

improvement in

both size and

performance.

LU

T C

ount R

eduction (

%)

Faster

Sm

alle

r

Figure 3 – Virtex-5 FPGA versus Virtex-4 FPGA design suite benchmarks

V I E W P O I N T


asynchronous (or synchronous) FIFOsrunning as fast as 550 MHz without con-suming any logic resources.

The 72-bit-wide block RAM nowincludes 64-bit error checking and correc-tion (ECC) control logic. Like the inte-grated FIFO support, the integrated ECCimproves memory performance and elim-inates the cost associated with traditionalfabric-based solutions. You can also usethe dedicated ECC logic to augmentexternal memory interfaces.

Interfacing to external devices andespecially external memory such as DDR,DDR2, QDR II, and RLDRAM II is dra-matically enhanced and simplified by ournew ChipSync™ technology. A memorydevelopment system (ML561) based onour LX50T devices contains fully func-tional and hardware-proven referencedesigns for all of today’s most popularmemory technologies.

In the DSP domain, we are now provid-ing 25 x 18-bit multipliers, mainly formore efficient floating-point designs. TheseDSP48E slices can be directly cascaded forhigher performance in digital filtering orvideo broadcast applications. Direct cas-cading also saves power – as much as 40%compared to competing solutions.

Virtex-5 SelectIO™ technology contin-ues to lead the industry. Every pin supportsvirtually every I/O standard in use todayand offers up to 1.25 Gbps LVDS and 800Mbps single-ended I/O performance.

Beyond the IDELAY option, whichoffers programmable input delay in stepsof 75 ps, the new ODELAY option nowoffers the same fine granularity at theFPGA output. Either of these functions isindividually programmable on everydevice pin.

The IODELAY function is an impor-tant feature to enhance reliable transmitand receive of high-speed source-synchro-nous data and clocks. The intended appli-cation includes compensation forboard-level skews, bit alignment in a bus,and alignment between data and clock sig-nals. This enables LVDS I/Os to achievespeeds as fast as 1.25 Gbps per pin pair.

Virtex-5 LXT, SXT, and FXT devicesalso offer embedded serial transceivers –

as many as 24 in the largest LXT device.In designing our fourth-generationRocketIO™ technology of high-speedserial transceivers, we invested significantengineering effort to lower power con-sumption. At the top speed of 3.2 Gbps,the LXT RocketIO transceiver consumestypically less than 100 mW, making it thelowest power transceiver in any FPGAproduct (see Figure 4).

Each Virtex-5 LXT RocketIO trans-ceiver is programmable and can imple-ment a myriad of speed and serialstandards. Link-layer IP is available for

such standards as Ethernet, HD/SDI,Serial RapidIO, FibreChannel, andAurora. Finally, we anticipated the popu-larity of PCI Express (PCIe) endpointapplications and integrated the completePCIe endpoint protocol in hard logic. TheVirtex-5 LXT PCIe Endpoint block isfully compliant to PCIe standard specifi-cation version 1.1 and can support x1, x2,x4, and x8 lane implementations. Theintegrated hard IP saves logic resourcesand improves performance for increasing-ly popular PCIe applications. For an x4PCIe lane implementation, the Virtex-5PCIe subsystem block saves as many as

8,500 LUTs compared to implementationwith soft IP.

Virtex-5 devices offer more and small-er I/O banks. The outer I/O banks (asmany as eight banks in the largest device)also are arranged to provide a PCB rout-ing advantage that in some cases mightsave board layers.

To ensure the best simultaneouslyswitching output (SSO) performance andprovide the best signal integrity (SI) solu-tion in the FPGA industry, all Virtex-5devices use Xilinx sparse chevron technol-ogy pinout assignments. This ensures that

each I/O pin is closely surrounded bypower and ground pins, thus minimizingcurrent loop inductance and improving SI.

Conclusion I hope that you have enjoyed readingabout Virtex-5 devices and the factors thatdrove their design. At Xilinx, we havetruly enjoyed the excitement in the systemengineering community about this newarchitecture. We look forward to seeingyour next-generation systems benefit fromthe Virtex-5 enhanced performance andfunctionality, taking your complex designsto the next level.

DriverParallel

toSerial

PMA PLLDivider

Equlizerand

RX-OOBCDR

Serialto

Parallel

PMA PLLDivider

Polarity PhaseAdjust

FIFO andOver-

Sampling

PRBSGenerator

8B/10B

TX PIPE Control

Over-Sampler

Polarity

PRBSChecker

CommaDetect

andAlign

8B/10B

Loss of Sync

ElasticBuffer

RX Status Control

RX Pipe Control

FPGAFabric

TX-PMA

RX-PMA

TX-PCS

RX-PCS

Pre-Emphasis

From PMA PLL

TX

RX

Figure 4 – RocketIO GTP transceiver

V I E W P O I N T

The electronics industry is pressed to its lim-its as it strives to develop solutions to feedthe insatiable appetites of the consumer andenterprise markets for voice, video, andcomputer data communications on a singlenetwork. To the global broadcast andtelecommunications industries, the triple-play opportunity is at once a potentiallyinexhaustible source of revenue and a con-stant source of frustration. Despite theimmeasurable reward for successfully deliv-ering triple-play services to the masses, sub-stantial obstacles continue to impede access.

Perhaps the most central of these obsta-cles is the inadequacy of legacy infrastruc-ture equipment to support the massiveincreases in bandwidth. Evolving fromvoice-only, the legacy infrastructure is acomplex web of overlaid networks that rep-resents both a financial and technologicalburden to service providers. In short, it isneither technologically feasible to delivertriple-play services with existing equipmentnor economically practical to replace itwith a completely new network. Moreover,legacy customers will not tolerate any inter-

ruption of existing services, nor will theypay extra for poor service quality.

Motivated by the promise of substantialrewards to those that enable this massivebusiness food chain, the electronics indus-try is marshaling every possible resource tofind solutions at all levels to the triple-playchallenge. It is no surprise that the semi-conductor industry endeavors to keep pacewith system manufacturers.

Xilinx Serial I/O Solutions: Crossing the ChasmThe evolution of serial I/O solutions inXilinx® FPGAs is the result of our high-speed serial initiative, which we announcedin 2002. The aim of the initiative was (andis) to accelerate the industry’s move fromparallel to high-speed serial I/O by deliver-ing a new generation of connectivity solu-tions for system designs that meetbandwidth requirements from 3.125 Gbpsto 10 Gbps and beyond.

We began by adding up to twenty-four3.125 Gbps serial transceivers in ourVirtex™-II Pro family, accompanied by IPsoft cores for numerous serial connectivitystandards, reference designs, hardwaredevelopment platforms, design software,characterization data, and an in-depthdesign support program.

The Virtex-4 FX family followed suitin 2005 with a similar complement ofbroad-range transceivers, this time deliv-ering 622 Mbps to 6.5 Gbps perform-ance, as well as an equally robust set of IP

and design support software, hardware,and services.

In each case, one of the key objectives inthe introduction strategy of these products –with their attending high-speed serial I/Osolution packages – was to reach the earlyadopters and innovators within the FPGAcustomer base with a viable alternative tocustom ASIC and ASSP serial I/O solutions.

Having successfully proven the viabilityof FPGA-based serial I/O solutions withthese previous product families, thereremained a single yet extremely importantevolutionary step. To cross the chasm intothe mainstream FPGA customer base andtruly create equivalency between Xilinx serialI/O solutions and custom solutions requiredthe delivery of fully verified, fully integrated,hard IP-based, turnkey serial I/O solutions.

With our newest 65-nm Virtex-5 LXTplatform, we believe that we have indeedcrossed the chasm. By offering the indus-try’s first FPGA to deliver hard-coded PCIExpress Endpoint and tri-mode Ethernetmedia access controller (MAC) blocks,Virtex-5 LXT devices are addressing thebandwidth, power, and cost challenges fac-ing equipment vendors working to enablethe emerging triple-play services market.The Virtex-5 LXT platform is optimized toenable FPGA designers across a wide rangeof applications to benefit from serial con-nectivity by delivering a comprehensive,fully compliant protocol solution with thegreatest ease of use.


Serial Everywhere – The Triple-Play ChallengeSerial Everywhere – The Triple-Play ChallengeXilinx is helping to empower the next innovation in the triple-play race.Xilinx is helping to empower the next innovation in the triple-play race.

Viewfrom the top

by Wim RoelandtsCEO and Chairman of the BoardXilinx, Inc.

by Sandeep VigVice President, Worldwide MarketingXilinx, [email protected]

Although “tripleplay” may be one ofthe hottest buzz-

words and growth drivers in the semicon-ductor industry, it is insightful tounderstand the evolution of the technologythat was required to realize triple play, theforces behind its explosive growth, chal-lenges that will occur along the way, andthe critical role of Xilinx® Virtex™-5products in the development and deploy-ment of triple-play products and services.

Central to the Virtex-5 platform’s valueis the recent emergence of two serial I/Ostandards: Gigabit Ethernet (GbE) andPCI Express (PCIe). In the last three years,these two interfaces have become the de-facto connectivity standards for networkand computing applications; according toElectronic Trend Publications, GbE andPCIe will account for 80% of all port ship-ments in 2009.

Disruptive TechnologyIP is clearly the preferred protocol in the net-work market as telecom vendors and serviceproviders transition to an all-IP-based infra-structure supporting Voice over IP, Videoover IP, and Data over IP (also known astriple play). Designing carrier-grade to end-user products that support triple play is verychallenging, as these products must achievehigh levels of performance, manage qualityof service (QoS), and be power-efficient and

flexible enough to adapt to the seeminglyendless evolution of standards and protocols.

In the computing infrastructure mar-ket, PCIe has become the predominanthost interface for networking, graphics,and backplane connectivity because of itsquantum leap in performance, scalability,and pin-count efficiency over the legacyPCI bus. Designing products that spannetwork and compute infrastructures likethose in triple-play markets requires systemarchitects and engineers to be well-versedin these new domains, introducing newrisks. To this end, Xilinx embarked on aproject two years ago to mitigate designrisk by introducing a new generation ofPlatform FPGAs that substantially increaseperformance, functionality, and devicedensity while reducing cost per gate.

Next-Generation FPGAsLeveraging our core competence as the pre-mier FPGA vendor and working with ourworld-class customers and partners, Xilinxdeveloped the Virtex-5 FPGA architecture.With the introduction of the LXT family,Virtex-5 devices now feature integratedmulti-GbE and PCIe connectivity technol-ogy ideally suited to designs for the triple-play market.

This LXT family is equipped to supporthigh-speed serial connectivity, with fea-tures that include:

• Built-in GbE MAC – each Virtex-5LXT device features four hard-coreGbE MACs for multi-port Ethernetconnectivity

• Built-in PCIe block – an integratedstandards-compliant PCIe Endpoint

block supporting one to eight lanesprovides as much as 32 Gbps of full-duplex host I/O for extreme performance applications

These features reduce the engineeringeffort spent on resource utilization, trou-bleshooting connectivity issues, minimiz-ing power consumption, and optimizingperformance, thus giving our customersunconstrained Virtex-5 FPGA resources indesigning infrastructure and end-userproducts for delivering voice, video, anddata over IP.

As a programmable platform, theVirtex-5 family positions our customersand partners to enable value-added triple-play technologies such as:

• QoS – customer-specific traffic management solutions enabling tieredservices that can change with marketconditions

• Digital rights management – enablinghardware-based, adaptive, end-to-enddata security for the wide diversity ofstandards inherent to these markets

ConclusionIn the very dynamic consumer industrywhere time to market with flexible services isthe name of the game, companies are stilltrying to figure out the right mix of productsand services to generate substantial revenue.The Virtex-5 LXT family integrates world-class programmable logic architecture withembedded serial connectivity, providing theperformance, density, and connectivityrequired for delivering voice, video, and datain the emerging triple-play market.


Virtex-5 Serial Connectivity Solutions Virtex-5 Serial Connectivity Solutions Enabling unconstrained product development for the triple-play market. Enabling unconstrained product development for the triple-play market.

V I E W P O I N T

by Steve BerryPresident, Electronic Trend Publicationssaberry@electronictrendpubs.comwww.electronictrendpubs.com

For most of the last 15years, networking the world

for voice, video, and data has been the keydriver of the electronics industry. Thisworldwide network required that the com-munications industry connect and convergewith the computer processing industry. Thatconvergence has primarily settled onEthernet for the communications side andPCI for the computer side.

Since its inception, Ethernet has been aserial interface. It has been repeatedly scaledup in bandwidth. Today, 1 Gbps connectionsare ubiquitous, 10 Gbps connections arebecoming more common, and 100 Gbpsconnections have been proven in the labora-tory. Ethernet has vanquished all challengersin the LAN market and is rapidly conqueringthe MAN and WAN markets.

PCI started out as a parallel interface, and as such ran out of bandwidth when con-nection requirements exceeded 1 Gbps.Industry groups such as the InfiniBandTrade Association and the RapidIO TradeAssociation introduced new connections toreplace PCI. But PCI is much more than thephysical connection between system ele-ments. PCI represents an enormous globalinvestment in software that is not readilyreplaceable. Only PCI Express has met thechallenge of true compatibility with PCI.PCI Express bandwidth will be scaled uprepeatedly over the coming years to supportthe industry’s needs.

As a result of the nearly 10-year effort totransition the industry from parallel to seri-al interconnections, serial interconnections

will soon become dominant. Table 1 illus-trates the change from parallel to serial.In 2006, serial interconnections willmove into the majority. By 2009, serialwill represent more than 80 percent of allinterconnections.

Although standard semiconductor prod-ucts will supply the serial interconnectionneeds of high-volume markets, FPGAs areincreasingly important for a wide variety oftasks. There are some key reasons. First,before low-cost standards products are avail-able, FPGAs will provide a mechanism toget to market faster. Second, FPGAs enablesystem integration with customer algo-rithms and standards-based serial interfaces.Third, the ability to easily make multi-stan-dard serial connections to FPGAs will dra-matically simplify product design.

Thus, the new Xilinx® Virtex™-5 LXTplatform – with its built-in PCI ExpressEndpoint blocks, tri-mode EthernetMACs, and low-power RocketIO™ trans-ceivers – precisely fits the requirements oftoday’s FPGA market by giving designers asolution that not only saves time, but alsoreduces power consumption and conservesFPGA logic resources.

RapidIO and AuroraAlthough PCI Express and Ethernet will bethe overwhelming leaders in the number ofserial ports deployed by the industry, a hostof other serial interfaces have carved nichesfor themselves. The Virtex-5 LXT platform

also supports nearly all available serial inter-faces. Two of these interfaces – RapidIO andAurora – are emerging as most important tousers of FPGAs.

RapidIO is becoming a favorite forhigh-end, low-volume DSP applications. Anumber of implementations in this arenause FPGAs (rather than merchant silicon)to implement DSP functions as well asRapidIO interface and switching func-tions. This should continue to be the casein the future.

Similarly, the Aurora protocol has quietlygained a substantial following in certainhigh-end embedded markets. AlthoughXilinx created Aurora, it is an open protocol,free of charge, that designers can implementin any silicon device. Aurora is a scalable,lightweight, link-layer protocol that is usedto move data across point-to-point seriallinks. Aurora enables simple, high-speedconnections between fixed points either on asingle board or across multiple boards. Asmany applications in the board-level embed-ded market use fixed links between variouspoints in the system, there is no need for acomplex message-passing protocol.

ConclusionWith its hard-coded PCI Express Endpointand Ethernet blocks, I anticipate that manywill use the Virtex-5 LXT platform to bridgebetween PCI Express or Ethernet and numer-ous other interfaces. The Virtex-5 LXT plat-form is ideally suited for this task.


FPGAs for Serial InterconnectionsFPGAs for Serial InterconnectionsResearch by Electronic Trend Publications points to a key role for FPGAs in serial interconnections.Research by Electronic Trend Publications points to a key role for FPGAs in serial interconnections.

Serial vs. Parallel Ports 2004 2005 2006 2007 2008 2009Parallel 75.5% 56.3% 34.8% 25.5% 20.4% 15.9%Serial 24.5% 43.7% 65.2% 74.5% 79.6% 84.1%

Figure 1 – Serial interfaces are rapidly replacing parallel.

V I E W P O I N T

HighVELOCITY

L E A R N I N G

Nu Horizons Electronics Corp. is proud to present our newest education and training program - XpressTrack - which offers engineers the opportunity to participate in technical seminars conducted around the country by experts

focused on the latest technologies from Xilinx. This program provides higher velocity learning to help minimize start-up time to quickly begin your

design process utilizing the latest development tools, software and products from both Nu Horizons and Xilinx.

Visit our website and let us know where you reside and what you are interested in learning about and we’ll develop a curriculum just for you.

For a complete list of course offerings, or to register for a seminar near you, please visit:

www.nuhorizons.com/xpresstrack

Featured Seminars -

Power PC Embedded System Design This seminar provides the embedded systems developers with the necessary skills to develop a PPC System on a Programmable Chip system utilizing the Virtex 4 FPGA. Utilizing the Embedded Development Kit (EDK) the embedded systems developers will create a full system based on the Nu Horizons XC4FX12 evaluation board, labs provide hands on experience with the development, verifi cation, debugging, and simulation of an embedded system. Prerequisites: • Experience in C programming • Some HDL modeling experience • Basic microprocessor experience and understanding of PowerPC™ processor systems • A basic understanding of FPGA devices and the tools used to program them

Implementing New Features of Virtex-5This 3-hour seminar will introduce to you the fi rst 65-nm family of Platform FPGAs, the Virtex-5 LX from Xilinx. The Virtex-5 family of FPGAs is the 2nd generation of devices based on ASMBL architecture. Learn how the new features in Virtex-5 can increase logic performance by 30%, reduce area by 45%, and decrease dynamic power by 35% when compared to the 90 nm Virtex-4 family.

Course Outline • Virtex 4 versus Virtex 5 comparison • V5’s new PLL and Use with DCMs – Lab 1 – Introduction to the PLL/Architecture Wizard • Improved Features in V5 – Lab 2 – Leveraging Improved Features

DSP Imaging Seminar Course Outline • Interpreting images as 2D signals • Understanding the spectral content of images • The concept of correlation between target and scene • Image edge enhancement and its applicability to correlation • Tradeoffs between correlation calculation methods • Practical application of correlation to target tracking in video • Overview of the Xilinx Video Starter Kit (VSK) • Use of the VSK to perform video target tracking in real time MicroBlaze Seminar This 3-hour workshop will introduce you to MicroBlaze™: The Low-Cost and Confi gurable 32-Bit Soft Processor Solution from Xilinx. It will also introduce Xilinx Embedded Development Kit (EDK). As part of this class you will learn how to build a complete customized MicroBlaze soft processor system including user defi ned peripherals. You will also be introduced to the “Create and Import Peripheral Wizard” and guide you through process of creating a custom peripheral in the EDK environment and using it in a processor system.

Course Outline • Overview of MicroBlaze • Overview of the Embedded Development Kit (EDK) • Lab 1: Build and Optimize a MicroBlaze Soft Processor System in Minutes • Lab 2: Custom Hardware Interface Utilizing the MicroBlaze IPIF Interface

Fundamentals of FPGAsCourse Outline • Basic FPGA Architecture • Xilinx Tool Flow – Lab 1: Xilinx Tool Flow Demo • Reading Reports • Architecture Wizard and PACE – Lab 2: Architecture Wizard and PACE Demo • Global Timing Constraints – Lab 3: Global Timing Constraints • Implementation Options – Lab 4: Implementation Options • Synchronous Design Techniques • Summary

by Adrian CosoroabaMarketing ManagerXilinx, [email protected]

In FPGA system design, maximizing per-formance requires a balanced mix of per-formance-efficient components – logicfabric, on-chip memory, DSP, and I/Obandwidth. In this article, I’ll explain howyou can benefit from Xilinx® Virtex™-5FPGA building blocks, particularly thenew ExpressFabric™ technology, in yourquest for higher system-level performance.I will explore key features of theExpressFabric architecture with examplesthat quantify the anticipated performanceimprovements for logic and arithmeticfunctions. Benchmarks based on actualcustomer designs will show that Virtex-5ExpressFabric technology performs onaverage 30% better than previous-genera-tion Virtex-4 FPGAs.

With the new logic fabric (in whichyou can implement functions such ascounters, adders, and RAM/ROM stor-age) and available hard IP blocks, memo-ry, and DSP (optimized to operate atclock rates as fast as 550 MHz), theVirtex-5 FPGA is clearly the platform ofchoice for high-performance designs.

ExpressFabric PerformanceSince the first FPGA was introduced in themid 1980s, the logic fabric for mostFPGAs has been based on the same funda-mental four-input look-up table (LUT)

architecture. The Virtex-5 family is thefirst FPGA platform to offer a true six-input LUT (6-LUT) fabric with fully inde-pendent (not shared) inputs (Figure 1).Moving to a 6-LUT fabric architectureprovides the 65-nm Virtex-5 FPGA familywith the most effective trade-off betweencritical path delay – the determining factorfor logic fabric performance – and die size.

With process technology advance-ments, interconnect timing delay canaccount for more than 50% of the criticalpath delay. Xilinx has developed a newinterconnect pattern for Virtex-5 FPGAsto enhance performance by reaching moreplaces in fewer hops. The new patternincreases the number of logic connectionsachievable within two and three hops.Moreover, a more regular routing patternmakes it easier for Xilinx ISE™ softwareto find the most optimal routes. All of theinterconnect features are transparent toFPGA designers, but will translate to high-er overall performance and easier designroutability. Essentially, the Virtex-5 pat-tern provides fast, predictable routingbased on distance.

The combination of the new 6-LUTstructure and special functions like carrychains, dedicated multiplexers, and flip-flops (along with the unique methods bywhich these elements are connected) cre-ates unsurpassed performance and effi-ciency for implementing logic andarithmetic functions.

One example that clearly shows thebenefits of the ExpressFabric technology is

a multiplexer (MUX). Implementing a 4:1MUX requires two four-input LUTs and aMUXF block in the Virtex-4 architecture.The same 4:1 MUX can now be imple-mented in a Virtex-5 device with a singleLUT. Similarly, an 8:1 MUX requires fourLUTs and three MUXF blocks in a Virtex-4FPGA, while the new Virtex-5 architecturerequires only two 6-LUTs. The result isbetter performance and better logic utiliza-tion, as shown in Figure 2.

As in previous Xilinx FPGA families,the Virtex-5 Slice L (logic slice) can imple-ment logic functions, registers, and arith-metic functions using the dedicated carrychain. The slightly more complex Slice M(memory slice) adds the capabilities ofimplementing distributed RAM and shiftregisters within the LUT (SRL).

Among the various improvements pro-vided by the ExpressFabric architecture, thenew carry chain structure delivers substan-tially higher performance when used toimplement arithmetic operations. Its effecton critical path delay is readily seen for sev-eral examples listed in Table 1.

Distributed memory functions such asLUT RAM or ROM also benefit in severalways from the larger LUT structure. Thenew aspect ratio allows a much denserpacking of small memory functions leadingto significant performance benefits, asdepicted in Table 2.

The performance increases provided bythe improved logic fabric with its 6-LUTarchitecture and interconnect structure aresubstantial, but this is only the beginning.


Achieve Higher Performance with Virtex-5 FPGAs Achieve Higher Performance with Virtex-5 FPGAs New architectural elements can help you attain higher system-level performance.New architectural elements can help you attain higher system-level performance.

P E R F O R M A N C E

Most applications require more on-chipRAM than what LUT-based RAM can pro-vide. With the enhanced Virtex-5 blockRAM, you can achieve higher on-chipmemory performance.

Block RAM PerformanceWith the move to 65 nm, the Virtex-5 blockRAM inherited a 10% increase in clockingspeed to 550 MHz. However, to achieve thedesired performance for most applicationstoday, block RAMs need to be more thanjust faster. They need to be larger.

The Virtex-5 block RAM has doubledin size to 36 Kb. This larger block size(comprising two 18-Kb memories) willsupport 72-bit data words in simple dual-port mode, thereby doubling block RAMbandwidth. Moreover, the Virtex-5 FPGAprovides dedicated connections to enableyou to cascade two adjacent 36-Kb blockRAMs together in the block RAM column,thereby implementing a 72-Kb memoryrunning at the maximum 550-MHz rate.

The availability of ever-larger FPGAshas accelerated the trend toward integratingmore subsystems into a single device, mak-ing more common the necessity of interfac-ing multiple clock domains. Virtex-5devices accommodate this by providingintegrated logic to simplify the implemen-tation of flexible and efficient FIFOs.

Through this combination of enhance-ments, the Virtex-5 block RAM deliversmore on-chip memory, easier to buildFIFOs, and higher bandwidth.

DSP PerformanceThe growing acceptance of FPGAs as aviable solution for high-performance DSPapplications is well deserved. Whether as aco-processor or a stand-alone solution for

6-LUT

6-LUT

6-LUT

6-LUT

Virtex-5

Slice

ExpressFabric

Carry

Chain

6-LUT

6-LUT

6-LUT

6-LUT

CLB

Slice

6-LUT

6-LUT

6-LUT

6-LUT

Slice

CLB

CLB

CLB

CLB

CLB

I0I1

I2I3

I4I5

I6I7

A0

A1

A2

I0I1I2I3

I4I5I6I7

A0A1

A2

Virtex-4 Virtex-5

6

L

U

T

6

L

U

T

4

L

U

T

4

L

U

T

4

L

U

T

4

L

U

T

2

1.33 ns

Virtex-4

Logic Levels

Path Delay

8-to-1 MUX

100%

23%

Improvement

1

1.08 ns

Virtex-5

Function Virtex-4 FPGA Virtex-5 FPGA ImprovementPath Delay Path DelayAdder 64-bit 3.5 ns 2.4 ns 46%Ternary Adder 64-bit 4.3 ns 3.0 ns 40%Barrel Shifter 32-bit 3.8 ns 2.8 ns 37%Magnitude Comp. 48-bit 2.4 ns 1.8 ns 34%

Function Virtex-4 Virtex-5 Improvement

LUT RAM 64 x 1 Logic Levels 2 1 100 %Path Delay 1.76 ns 1.26 ns 40 %

LUT ROM 128 x 12 Logic Levels 3 1 200 %Path Delay 1.84 ns 1.20 ns 53 %


Figure 1 – Virtex-5 configurable logic blocks (CLBs) comprise two slices. Each slice uses four independent 6-LUTs that provide the benefits of fewer logic levels.

Figure 2 – 8:1 multiplexer implemented with Virtex-5 FPGAs versus Virtex-4 FPGAs

Table 1 – Arithmetic functions implemented with Virtex-5 FPGAs versus Virtex-4 FPGAs

Table 2 – LUT-based RAM/ROM implementations with Virtex-5 FPGAs versus Virtex-4 FPGAs


more demanding applications, FPGAs con-tinue to provide the best combination ofperformance, power, and cost.

To keep pace with the seemingly insa-tiable demand for more DSP performance,Xilinx is leading with Virtex-5 DSP capa-bilities in terms of both clock rate and pre-cision – the clock rate has increased to 550MHz and the precision has improved from18 x 18 bits to 25 x 18 bits.

Xilinx also optimized theVirtex-5 DSP48 slice foradder-chain implementations,a powerful capability thatenables the creation of veryefficient high-performance fil-ters. Dedicated routingresources on the inputs andoutputs of each DSP48 slicepermit any number of slices tobe chained together within acolumn. This dedicated rout-ing ensures that every DSP48slice in the chain will run atfull speed without consumingany of the fabric routing orlogic resources, as otherFPGAs require. Taken togeth-er, these improvements reduceby half the number ofresources needed to imple-ment common high-precisionfunctions. For example, for a35 x 25-bit multiply, fourDSP48 slices are needed withthe Virtex-4 FPGA. With thewider DSP block available inthe Virtex-5 FPGA, half asmany slices are used to imple-ment this multiply function.

I/O Bandwidth PerformanceAs performance benchmarks go, thespeed with which an FPGA can processdata is relevant only in context with thedevice’s I/O bandwidth, which is thespeed with which large amounts of datacan be moved on and off the device.When using external memory buffers, theinterface must be at least two times fasterthan the data-processing rate because datamust be both written out of and readback into the FPGA.

Virtex-5 FPGAs improve on Virtex-4bandwidth by increasing both the data rateper pin and the number of available I/Oswith larger packages. For example, for popu-lar memory interfaces like DDR2 SDRAM,the bandwidth has increased per pin from534 Mbps to 667 Mbps; the number of dataI/Os, when considering SSO requirements,has increased from 432 to 576.

Customer Design BenchmarksTo further evaluate the performanceimprovements provided by the Virtex-5FPGA logic fabric, we implemented a set ofcustomer designs using Xilinx ISE software.

These designs were all written inVHDL or Verilog. We implemented somespecific design units like memories andFIFOs using direct instantiation of librarycomponents or synthesis inference, butmany were implemented using EDIF

blocks generated by CORE Generator™software (a part of ISE software).

For these benchmarks, we performedsynthesis in a timing-driven fashion withSynplicity’s Synplify Pro, using tight, realis-tic constraints to effectively measure per-formance. This was done to ensure that allspecial optimizations and logic replicationswere employed.

Implementation in ISEsoftware was accomplishedwith the place and routeeffort set to high. Clockswere tightened iteratively by5% increments until thedesign failed to meet designconstraints.

The result was an averageperformance gain of 30%over designs implemented inVirtex-4 FPGAs, as shown inFigure 3.

Those designs thatimproved the most have largecones of logic; the critical pathimplements a large, oftencomplex logic equation. Forexample, ASIC prototypingdesigns will typically have veryfew registers for a largeamount of logic in their criti-cal path. These types ofdesigns exhibit a significantimprovement with Virtex-5ExpressFabric technology.

Those designs exhibiting amore moderate improvementeither have less levels of logicor provide little opportunityfor the use of hard IP blocksor carry-chain structure toimprove performance.

Figure 4 summarizes by category the per-formance improvements of Virtex-5 FPGAsover previous-generation Virtex-4 FPGAs.

ConclusionWith its new ExpressFabric technology andtight coupling to other high-performancehard-IP blocks and I/Os, the Virtex-5FPGA family represents a significant per-formance boost compared to previous-gen-eration architectures.

0

10

20

30

40

50

60

30% Average Advantage Designs with

many levels

of logic

Use of hard

IP blocks

Designs

with fewer

levels of

logic

Designs

with fewer

levels of

logic

Designs with

many levels

of logic.

Use of hard

IP blocks

Virtex-5 vs. Virtex-4 FPGA

Performance Advantage (%)

Virtex-5 FPGA Virtex-4 FPGA

Logic Fabric

Performance

I/O Memory

Bandwidth

385 Gbps

I/O LVDS

Bandwidth

750 Gbps

DSP

32-Tap Filter

550 MHz

On-chip RAM

550 MHz

Perf

orm

an

ce

1.3 X

1.1 X1.1 X

1.7 X1.6 X


Figure 3 – Comparison based on a suite of 74 customer designs using ISE 8.2i software

Figure 4 – Virtex-5 FPGA performance improvements


by Brian PhilofskyStaff Software Technical Marketing ManagerXilinx, [email protected]

FPGAs have been very flexible in accom-modating any HDL coding or design stylefor digital logic; Xilinx® Virtex™-5devices are no exception. Although Virtex-5 FPGAs can accommodate many differ-ent types of designs written in manydifferent methods, certain recommendedconstructs and manners can achieveimproved optimization in terms of area,performance, and power.

Know Your Target Architecture and Synthesis ToolBefore beginning any project, youshould understand the device architec-ture you are targeting. For Virtex-5FPGAs, I recommend reading theVirtex-5 Users Guide (http://direct.xilinx.com/bvdocs/userguides/ug190.pdf)before starting your first line of code.Once you have a better understandingand vision as to how your code will ulti-mately result in the base hardware, youcan make both large and small designand coding decisions confidently.

For example, if you know of and useBitslip technology within the ISERDES,you could save time, effort, and resources bycapturing input data rather than attemptingto describe and build similar circuitry.

In another example, if you know thestructure and capability of the DSP48E,you can make better choices as to whenand where to place pipeline registers.Dedicated features like the wider multipli-er or post adder can also help you achievebetter area, performance, and power.

Similarly, knowing the capabilities andcurrent limitations of your synthesis tool cannot only help when choosing coding stylesto properly infer primitives but can also giveyou greater insight as to when to instantiatea component or use inference. Review syn-thesis manuals, application notes, or otherrelevant materials before starting so that youknow the recommended coding styles forthe synthesis tool you are using.

You should also update and use the lat-est versions of synthesis and ISE™ toolsbefore beginning a project. Although ini-tial synthesis support for the Virtex-5architecture is strong, many improvementsin optimization and inference support arestill to come with new releases. One easyway to ensure more optimal designs in

terms of area, performance, and power is toinstall the latest version of the software.

Control Signal PolarityThe Virtex-5 architecture can support dif-ferent control signal polarity (clockenables, resets, or sets). However, to havethe most optimal design, I recommendconsistent use of active high control signalsin your design. The Virtex-5 slice controllogic is active high, and when described inthis same manner in the code should neverrequire additional LUT resources for a sim-ple signal inversion.

If the signal comes from an external pinand needs an active low polarity, I suggestinverting the signal in the top-level codeand using a positive polarity in all process-es and sub-modules requiring that signal.This is critical for designs that have severalcores, use bottom-up synthesis techniques,have KEEP_HIERARCHY constraints, oremploy the use of partitions (Figure 2).

Designs that fall into these categoriesare more susceptible to the use of addition-al LUTs per core/netlist/hierarchy/parti-tion for the sole purpose of inverting thesecontrol signals, which not only consumeextra LUT resources but may also havenegative effects on performance and slice


HDL Coding and Design Practices for Improving Virtex-5 Utilization,Performance, and Power

HDL Coding and Design Practices for Improving Virtex-5 Utilization,Performance, and PowerThese tips and techniques can lead to better Virtex-5 designs.These tips and techniques can lead to better Virtex-5 designs.


packing. As a general rule, always code sets,resets, and enables with an active high(logic 1 activates) polarity.

Use of ResetsIt is common practice to use a global asyn-chronous reset in the source HDL code toinitialize the design; however, in manycases this consumes additional resources.Instead, think synchronous and local. Isuggest describing a synchronous set/resetlogic to the portions of the design that doneed periodical resets. For those portions ofthe design that do not, you can initializethe signals defined to be registered in theHDL code at the time they are declared(for example, when defining a reg inVerilog or a signal in VHDL). Thismethodology allows for improved packingdensity, enhances timing analysis and per-formance, and can improve area resources.

In terms of FPGA behavior, without aglobal reset described in the code, a GSR(global set/reset) will occur upon comple-tion of the configuration cycle, initializingall registers to known specified values. Thissame cycle is also simulated in the gate-level simulation netlist, giving the sameknown starting point as in the FPGA.

In terms of RTL simulation, having theregisters initialized in the code allows forproper RTL or behavioral simulation; thissame initialization will be picked up by thesynthesis tool and applied to the imple-mented design. Therefore, for simulationat any stage, a global reset is redundantand unnecessary.

Using a synchronous reset instead of anasynchronous reset also allows for more pre-dictable behavior upon the assertion andrelease of the reset, because the synchronoussignals are automatically analyzed and theirbehavior is more deterministic when alltiming constraints are met. It also allows forthe possibility of greater logic optimizationand performance because it is not global.

When using synchronous control sig-nals, you can move portions of the logicfunction to the synchronous set or reset ofthe flip-flop; this is not possible with asyn-chronous signals. By only describing a resetwhere necessary, the synthesis tool can usealternative resource choices like SRLs (shift

The Virtex-5 device departs from thetraditional four-input LUT in previousFPGA families and has an enhanced six-input LUT (6-LUT), allowing for widerlogic functions between pipeline registerswhile maintaining top performance. Youshould keep this in mind, as logic functionscoded into HDL as optimal code shouldinclude six inputs to the logic functionbetween registers to get the most optimalpipelining and LUT resource management.

In cases where it is not practical or pos-sible to have exactly six inputs in a givenlogic function, the wider input 6-LUTstill allows for good performance by

register LUTs); distributed RAM (LUT-based RAM) memory; or block RAM forthe implementation, which would not beotherwise possible nor optimal. The synthe-sis tool has maximum flexibility to choosethe best resource for the described code.

PipeliningAs with previous FPGA generations, prop-erly pipelining your design is necessary toachieve top performance and improvedpower characteristics. With the introduc-tion of the Virtex-5 architecture, a newlogic structure dictates slightly differentrules regarding when and how to pipeline.

ClockEnable

Top

Old Netlist

Core

Partition

KEEP_HIERARCHY

LUT6

CE

Flip-Flop

LUT6

CE

Flip-Flop

LUT6

CE

Flip-Flop

LUT6

CE

Flip-Flop

LUT6

CE

ClockEnable

Top

Old Netlist

Core

Partition

KEEP_HIERARCHY

CE

Flip-Flop

CE

Flip-Flop

CE

Flip-Flop

CE

Flip-Flop

LUT6

CE

Figure 2 – How clock enable polarity affects LUT utilization in a design




.ALMOST_EMPTY_OFFSET(12'h080), // Sets the almost empty threshold .DATA_WIDTH(18), // Sets data width to 4, 9 or 18 .DO_REG(1), // Enable output register (0 or 1) // Must be 1 if EN_SYN = "FALSE .EN_SYN("TRUE"), // Specifies FIFO as Asynchronous ("FALSE") // or Synchronous ("TRUE") .FIRST_WORD_FALL_THROUGH("FALSE") // Sets the FIFO FWFT to "TRUE" or "FALSE ) FIFO18_inst ( .ALMOSTEMPTY(), // 1-bit almost empty output flag .ALMOSTFULL(), // 1-bit almost full output flag .DO(DATA_OUT[15:0]), // 16-bit data output .DOP(DATA_OUT[17:16]), // 2-bit parity data output .EMPTY(), // 1-bit empty output flag .FULL(FULL), // 1-bit full output flag .RDCOUNT(), // 12-bit read count output .RDERR(read_error), // 1-bit read error output .WRCOUNT(), // 12-bit write count output .WRERR(write_error), // 1-bit write error .DI(data_in_reg2), // 16-bit data input .DIP(parity[1:0]), // 2-bit parity input .RDCLK(CLK), // 1-bit read clock input .RDEN(READ_DATA), // 1-bit read enable input .RST(RST), // 1-bit reset input .WRCLK(CLK), // 1-bit write clock input .WREN(data_store_delay[2]) // 1-bit write enable input );

// End of FIFO18_inst instantiation

endmodule

`timescale 1ns / 1ps//////////////////////////////////////////////////////////////////////////////////// Company: Xilinx// Engineer: Brian Philofsky//// Create Date: 07:42:58 08/12/2006// Design Name: good_design// Module Name: good_code// Project Name: HDL Coding and Design Practices for Improving Virtex 5// Utilization, Performance and Power// Target Devices: Virtex 5// Tool versions: ISE 8.2i// Description: This is example code employing some good coding practices// when targeting a Virtex 5 device.//// Revision 0.01 - File Created////////////////////////////////////////////////////////////////////////////////////

module good_code #( parameter data_width = 16, parity_width = 2)

( input [data_width-1:0] DATA_IN, input DATA_STORE,

input CLK, RST, input READ_DATA,

output [data_width+parity_width-1:0] DATA_OUT, output reg RW_ERROR = 1'b0, output DATA_VALID, FULL);

// Always initialize registers to known values reg [data_width-1:0] data_in_reg = {data_width{1'b0}}; reg [data_width-1:0] data_in_reg2 = {data_width{1'b0}}; reg [2:0] data_store_delay = 3'b000; reg [2:0] data_valid_delay = 3'b000; reg [parity_width-1:0] parity = {parity_width{1'b0}};

wire read_error, write_error;

// Use resets only where necessary and make them synchronous // Make resets and clock enables active high always @(posedge CLK) if (RST) data_in_reg <= {data_width{1'b0}}; else if (DATA_STORE) data_in_reg <= DATA_IN;

// Do not use resets where not necessary // In this case an SRL can be used due to the fact no reset is described. always @(posedge CLK) begin data_store_delay <= {data_store_delay[1:0], DATA_STORE}; data_in_reg2 <= data_in_reg; data_valid_delay <= {data_valid_delay[1:0], READ_DATA}; RW_ERROR <= read_error | write_error; parity[1] <= ^data_in_reg[15:8]; parity[0] <= ^data_in_reg[7:0]; end

// In general, RAMs should be inferred however in this case, a FIFO is needed // and synthesis can not yet infer the dedicated Virtex 5 FIFO.

// FIFO18: 16k+2k Parity Synchronous/Asynchronous BlockRAM FIFO // Virtex-5 // Xilinx HDL Language Template, version 8.2.2i

FIFO18 #( .ALMOST_FULL_OFFSET(12'h080), // Sets almost full threshold

RDCOUNT => open, -- 12-bit read count output RDERR => read_error, -- 1-bit read error output WRCOUNT => open, -- 12-bit write count output WRERR => write_error, -- 1-bit write error DI => data_in_reg2, -- 16-bit data input DIP => parity, -- 2-bit parity input RDCLK => CLK, -- 1-bit read clock input RDEN => READ_DATA, -- 1-bit read enable input RST => RST, -- 1-bit reset input WRCLK => CLK, -- 1-bit write clock input WREN => data_store_delay(2) -- 1-bit write enable input );

-- End of FIFO18_inst instantiation

end XILINX;

RW_ERROR : out std_logic := '0'; DATA_VALID, FULL : out std_logic);end good_code2;

architecture XILINX of good_code2 is

-- Always initialize registers to known values signal data_in_reg: std_logic_vector(data_width-1 downto 0) := (others => '0'); signal data_in_reg2: std_logic_vector(data_width-1 downto 0) := (others => '0'); signal data_store_delay: std_logic_vector(2 downto 0) := "000"; signal data_valid_delay: std_logic_vector(2 downto 0) := "000"; signal parity: std_logic_vector(parity_width-1 downto 0) := (others => '0');

signal read_error, write_error: std_logic;

begin

-- Use resets only where necessary and make them synchronous -- Make resets and clock enables active high process (CLK) begin if (CLK'event and CLK='1') then if RST='1' then data_in_reg <= (others => '0'); elsif (DATA_STORE='1') then data_in_reg <= DATA_IN; end if; end if; end process;

-- Do not use resets where not necessary -- In this case an SRL can be used due to the fact no reset is described. process (CLK) begin if (CLK'event and CLK='1') then data_store_delay <= (data_store_delay(1 downto 0) & DATA_STORE); data_in_reg2 <= data_in_reg; data_valid_delay <= (data_valid_delay(1 downto 0) & READ_DATA); RW_ERROR <= read_error OR write_error; parity(1) <= (data_in_reg(15) XOR data_in_reg(14) XOR data_in_reg(13) XOR data_in_reg(12) XOR data_in_reg(11) XOR data_in_reg(10) XOR data_in_reg(9) XOR data_in_reg(8)); parity(0) <= (data_in_reg(7) XOR data_in_reg(6) XOR data_in_reg(5) XOR data_in_reg(4) XOR data_in_reg(3) XOR data_in_reg(2) XOR data_in_reg(1) XOR data_in_reg(0)); end if; end process;

-- In general, RAMs should be inferred however in this case, a FIFO is needed -- and synthesis can not yet infer the dedicated Virtex 5 FIFO.

-- FIFO18: 16k+2k Parity Synchronous/Asynchronous BlockRAM FIFO BlockRAM Memory -- Virtex-5 -- Xilinx HDL Language Template version 8.2.2i

FIFO18_inst : FIFO18 generic map ( ALMOST_FULL_OFFSET => X"080", -- Sets almost full threshold ALMOST_EMPTY_OFFSET => X"080", -- Sets the almost empty threshold DATA_WIDTH => 18, -- Sets data width to 4, 9, 18, or 36 DO_REG => 1, -- Enable output register (0 or 1) -- Must be 1 if the EN_SYN = FALSE EN_SYN => TRUE, -- Specified FIFO as Asynchronous (FALSE) or -- Synchronous (TRUE) FIRST_WORD_FALL_THROUGH => FALSE) -- Sets the FIFO FWFT to TRUE or FALSE port map ( ALMOSTEMPTY => open, -- 1-bit almost empty output flag ALMOSTFULL => open, -- 1-bit almost full output flag DO => DATA_OUT(15 downto 0), -- 32-bit data output DOP => DATA_OUT(17 downto 16), -- 2-bit parity data output EMPTY => open, -- 1-bit empty output flag FULL => FULL, -- 1-bit full output flag

------------------------------------------------------------------------------------ Company: Xilinx-- Engineer: Brian Philofsky---- Create Date: 07:42:58 08/12/2006-- Design Name: good_design-- Module Name: good_code2-- Project Name: HDL Coding Practices for Improving Virtex 5 Utilization,-- Performance and Power-- Target Devices: Virtex 5-- Tool versions: ISE 8.2i-- Description: This is an example code employing some good coding practices-- when targeting a Virtex 5 device.---- Revision 0.01 - File Created------------------------------------------------------------------------------------

library IEEE;use IEEE.std_logic_1164.all;use IEEE.std_logic_arith.all;

library UNISIM;use UNISIM.Vcomponents.all;

entity good_code2 isgeneric ( data_width : integer := 16; parity_width : integer := 2);port ( DATA_IN : in std_logic_vector(data_width-1 downto 0); DATA_STORE: in std_logic;

CLK, RST: in std_logic; READ_DATA: in std_logic;

DATA_OUT : out std_logic_vector(data_width+parity_width-1 downto 0);

Figure 3 – Sound FPGA coding styles

Verilog Coding Example VHDL Coding Example



reducing the number of logic levels, thusrequiring fewer pipeline stages to achievethe same as or better performance thanprevious FPGA architectures.

A good goal is to aim for less than 10inputs to a given logic function betweenI/Os, registers, or synchronous blocks (likeblock RAM or DSP48Es), which generallywould represent two logic levels. When youneed a significantly higher number ofinputs for the design path to meet latencyor other requirements, you can attempt toreduce the fan-in to that logic function(when possible) if high performance or lowpower are your design objectives.

Coding MemoriesAmong other innovations within theVirtex-5 architecture, Xilinx has enhancedboth block RAM and distributed RAMmemories with greater capacity and capabil-ity. You must make different decisions earlyin the design process and while coding toget the most from these valuable resources.

General guidelines call for inferringRAMs when possible for easier codechanges, faster simulation, and moreportable code. However, even when behav-iorally describing the RAM, you shouldkeep some important things in mind. Thefirst and most obvious thought is RAMcapacity. In terms of block RAMs, the basememory block increased in Virtex-5 devicesto 36 Kb of memory storage space. You canconfigure this block to the wider but shal-lower 512 x 72 configuration, the deepersingle-bit width 32 Kb x 1, or several con-figurations in between. It is also possible tocascade two 36-Kb RAMs to form a 64-Kbx 1 configuration or break up the 36-KbRAMs into two separate 18-Kb RAMs capa-ble of 512 x 36 to 16-Kb x 1 configurations.

Distributed RAM have benefited fromthe larger LUT structure and can now effi-ciently accommodate 64-bit depths withoutany area or performance penalties. This isthe most optimal size for this type of RAMin the Virtex-5 device, although other sizescan be accommodated. The base RAM sizesare important to remember during memoryselection and coding to most efficiently usethe limited RAM resources in the deviceand achieve the best performance.

Both block RAM and distributed RAMmemories also have additional capabilitiesthat require different coding and design con-siderations. For performance, perhaps themost important is the proper use of outputregisters. For block RAMs, this meansenabling the output registers to the blockRAM whenever possible. By enabling theoutput registers, a reduced clock-to-out isrealized from the RAM, thus improving tim-ing for the data leaving the RAM. However,an extra clock cycle of latency is added dur-ing reads, for which you must account.

Similarly, when using distributed RAM,the output of the RAM can be asynchro-nous; however, coding it synchronously willallow the use of the register within the slice,providing better timing characteristics andreducing the chance of the RAM being partof the timing bottleneck.

There are more advanced features of theblock RAMs, such as FIFO and ECC(error correction circuitry) capabilities.The distributed RAM also has new capa-bilities such as a quad-port configuration.In some cases, these features cannot berealized by inference within synthesis andinstantiation is necessary. If you need suchfunctionality, I suggest instantiating theRAMs either by generating cores withinXilinx CORE Generator™ software or byinstantiating the base primitive. Takingadvantage of these advanced features cansave RAM and logic resources as well asimprove area, performance, and power.

Some General GuidelinesA few other general recommendations donot fall into any specific categories but canresult in better coding and design choices.First, you should make wise choices in termsof your design hierarchy right from thestart. Your choice of hierarchy can haveeffects on the synthesis and implementationtools’ ability to optimize the logic paths.

In general, do not allow timing paths tocross multiple boundaries of hierarchy. Thisnot only limits the tool’s ability to optimizelogic but may also limit your options fordesign implementation and design debug-ging. For instance, you may not be able touse partitions or KEEP_HIERARCHY oncertain hierarchies with this practice.

For designs in which some or most of thecode was created for an architecture otherthan Virtex-5 FPGAs, I suggest that youreview the code to ensure that it is well suit-ed for implementation into the new archi-tecture. A few minutes of time spent herecan save several hours later if you identifyand correct suboptimal code.

If your design contains cores or pre-compiled netlists (EDIF or NGC files)from a previous architecture, you shouldregenerate those targeting Virtex-5devices. Unless regenerated, netlists opti-mized for a previous architecture are morelikely than not far less optimal when tar-geting Virtex-5 architectures.

One last suggestion is to use the HDLlanguage templates within the ISE tools.They not only help with accelerating thegeneration of VHDL or Verilog code, butalso provide assistance in creating moreoptimal code for FPGAs. They also cutdown on the possibility of creating syntaxor other simple but common mistakes thatcan hold up the testing and verifying ofHDL code.

Figure 3 shows both Verilog and VHDLcode following the guidelines discussed here.

ConclusionCoding styles are very individual; howev-er, following these suggestions makes itmore likely that you will achieve a moreoptimal result. These guidelines do notrepresent absolutely everything you needto know to achieve the best Virtex-5design possible, but I have provided somecommon strategies that can help in achiev-ing more optimal designs.

Almost any set of valid HDL code likelywill result in a functioning design, but fol-lowing a few simple guidelines can help interms of improved density, performance,and power, and many times may reduce theamount of time it takes to ultimately com-plete a design.

For more information, see the Synthesisand Simulation Design Guide athttp://toolbox.xilinx.com/docsan/xilinx82/books/docs/sim/sim.pdf or White Paper 231,“HDL Coding Practices to Accelerate DesignPerformance,” at http://direct.xilinx.com/bvdocs/whitepapers/wp231.pdf.


by John GallagherSr. Director Outbound MarketingSynplicity, [email protected]

The revolutionary capabilities of Xilinx®

Virtex™-5 devices can be fully realizedonly if the EDA technologies to unlockthose capabilities are available when thedevice is. To achieve this, the FPGA archi-tecture and corresponding EDA designtools must be developed simultaneously.The scale of EDA development over previ-ous generations is similar to the differencebetween Virtex-5 devices and their previ-ous generations.

As the first FPGAs created at the 65-nmtechnology node, the Virtex-5 family ofdomain-optimized devices provides asmuch as 65% more logic cells and 25%more I/O pins than the Virtex-4 family. Atthe same time, devices in the Virtex-5 fam-ily provide 30% higher performance, 35%lower dynamic power dissipation, and con-sume 45% less silicon real estate whencompared to their Virtex-4 counterparts.


Getting the Best Results from Virtex-5 FPGAsGetting the Best Results from Virtex-5 FPGAsSynplicity applies new algorithms and heuristics for optimal Virtex-5 support.Synplicity applies new algorithms and heuristics for optimal Virtex-5 support.


Clearly, the Virtex-5 architecture deliv-ers revolutionary capabilities. To develop asynthesis flow that leverages these new fea-tures, Xilinx gave Synplicity early access tothe Virtex-5 architecture. By the time thefirst Virtex-5 devices were introduced, wehad been working side-by-side with Xilinxengineers for more than a year to enhanceour Synplify Pro synthesis engine. Thisinvolved making substantial algorithmicchanges to Synplify Pro software to maxi-mize the performance and logic density ofVirtex-5-based designs. Because of ourpartnership, we have the tools and method-ologies in place to enable you to rapidlydeploy these devices.

In this article, I will describe some of theways in which we enhanced the SynplifyPro FPGA synthesis engine to take fulladvantage of the capabilities offered by theVirtex-5 family.

New Algorithms to Synthesize 6-LUTsIncreases in the complexity of the FPGAfabric demanded corresponding increasesin the sophistication of the synthesis algo-rithms. If you applied the same algorithmsfor a fabric based on 4-input look-up tables(4-LUTs) to a fabric with 6-LUTs, forexample, synthesis run times would dra-matically increase. To take full advantage ofthe specialized architectural features in theVirtex-5 family, Synplicity had to eitherfine-tune or in some cases completely re-craft the underlying synthesis algorithms.

To reduce the levels of logic needed tomap functions like wide data paths andDSP, the ExpressFabric™ technology inthe Virtex-5 family features LUTs with sixindependent inputs (Figure 1). This signif-icantly reduces the number of logic levelsand LUT area required to implement widefunctions. Within the synthesis process,you can use each of these logical elementsas a true 6-LUT or as two 5-LUTs thatshare their five inputs.

Synplicity equipped Synplify Pro soft-ware (which features a unique direct-map-ping capability) with a variety ofsophisticated heuristic algorithms that aretailored to minimize the number of cuts,handle huge capacities, and address thesecomplex mapping and timing scenarios.

Timing Estimation with Diagonally Symmetric InterconnectAnother ExpressFabric feature is a radicallynew form of diagonally symmetric inter-connect that reaches more locations withfewer hops (Figure 2). This diagonally sym-metric interconnect pattern was designedto improve speed and predictability. Thecombination of ExpressFabric 6-LUTs anda diagonally symmetric interconnect pat-tern results in an average increase of logicperformance of 30% over Virtex-4 devices,which equals two speed grades.

The diagonally symmetric interconnectpattern also delivers significantly highercomplexity in timing analysis; previousphysical synthesis algorithms were basedon architectures with “Manhattan rout-ing” or 90-degree routes. To handle the

However, the combinatorial explosionassociated with mapping to 6-LUTs cancause memory utilization and run-timeproblems if not handled correctly. If youapplied the algorithms for a fabric based on4-LUTs to a fabric with 6-LUTs withoutsignificant modification, synthesis runtimes would be orders of magnitude longer(if they completed at all). In addition,when attempting to find an optimal map-ping, traditional algorithms run the risk ofbecoming trapped in local minima.

Unlike timing-driven engines, most con-ventional synthesis engines simply attemptto reduce the number of logic levels. This isproblematic for LUT architectures in whichdifferent input-to-output paths have asym-metric delays. Consider the five shared inputpins versus the sixth unique pin in Figure 1;these pins will have very different delays.

The fact that you can use Virtex-5 LUTsin a 2 x 5-input configuration furtherincreases the complexity of the mappingoperations. Synthesis tool vendors mustconduct a lot of research and developmentto use structures that share inputs but rep-resent different functions.

5-LUT

5-LUT

6-LUT

Figure 1 – Virtex-5 six-input LUTs

To reduce the levels of logic needed to map functions like wide data paths and DSP, the ExpressFabric technology in the

Virtex-5 family features LUTs with six independent inputs.




combination of both 90-degree and diag-onal routes, Synplicity custom-designednew algorithms and delay models. Asopposed to simple wire-load models, weengineered the Synplify Pro tool toemploy sophisticated netlist-based routingestimation (coupled with known routingvalues where applicable). In the case of fastcarry chains, for example, routing delaysare well known and can be directly“plugged in.” Similarly, in the case where acell, driver, load, and specific route areknown, an accurate routing delay associat-ed with this path can be plugged in to therouting and timing algorithms.

Synthesizing Fast High-Capacity RAM BlocksThe new block RAM structures (withpipeline) in the Virtex-5 family haveincreased to 32 Kb in size – twice the sizeof those found in Virtex-4 components.In addition to offering a simple dual-portmode that can double the RAM’s band-width, these blocks also contain addi-tional hard IP in the form of FIFO logicand new 64-bit error checking and cor-rection (ECC) logic (Figure 3).Implementing this logic as hard IP freesup other resources and minimizesdynamic power consumption.

As with all hard IP blocks in Virtex-5devices, these block RAMs have been

tuned for 550-MHz operation to providehigher on-chip memory bandwidth. The18-Kb block RAMs are constructed fromtwo physical 9-Kb memories, which areautomatically controlled to save power byenabling only one of the 9-Kb sub-blocksfor any given read or write operation inmost configurations.

For our part, the Synplify Pro synthesissoftware can perform automatic memoryinferencing, including single-port and dual-port implementations, single and multipleclocking schemes, and automatic retiming.Regarding the latter point, Virtex-5 blockRAMs are inherently synchronous; howev-er, the design’s RTL could describe thememory and registers in such a way as to betechnically asynchronous. In such a case,

the Synplify Pro tool has the ability to“push” the registers into the RAM.

Similarly, the software will recognizepotential conflicts and automatically gener-ate appropriate conflict-resolution logic. Incases such as dual-port RAMs, for example,in which the result of writing two words tothe same address may be undefined, theSynplify Pro tool automatically inserts theappropriate logic to resolve the issue suchthat the memory works in exactly the sameway as the RTL will simulate.

Furthermore, Synplify Pro software canautomatically analyze the memorydescribed in the design and recognizepotential issues in mapping it to preferredmemory resources. If you require blockRAM, for example, but have used morethan what is available on the physicaldevice, the software will automatically movesome of the memory into select RAM.

Optimal Use of Faster, Wider DSP BlocksThe Virtex-5 hard DSP slice – called theDSP48E – features a 25 x 18-bit multiplier(versus the 18 x 18-bit multiplier employedin Virtex-4 FPGAs). This increase can leadto fewer cascaded stages, thereby resultingin higher overall performance and utiliza-tion (Figure 4).

Tuned for 550-MHz operation, you canconfigure these high-precision, high-per-formance, highly flexible slices for DSP,arithmetic, and logic functions and cascadethem for adder-chain architectures. TheDSP48E slice has 40% lower power con-sumption compared to equivalent func-tions in Virtex-4 FPGAs (1.38 mW/100MHz at a 38% toggle rate).

The sophistication of these DSP slicesmeans that it is unlikely that a data pathdefined in RTL will exactly match the opti-mal DSP implementation structure. Forexample, rather than implementing a func-tion such as “(a + b) + (c + d)” by adding “a”and “b,” adding “c” and “d,” and then addingthe results generated by these operations, itmay be more efficient to cascade the DSPslices along the lines of “(((a + b) + c) +d).”

We equipped Synplify Pro software withextremely sophisticated mapping algo-rithms that perform a lot of data path mas-saging, creating data path structures that

Direct1 Hop2 Hops3 Hops

(a) Virtex-4: TraditionalInterconnect Pattern

(b) Virtex-5: Diagonally SymmetricInterconnect Pattern

18-kb RAM

18-kb RAM

FIFO Logic

EC

C a

ndIn

terc

onne

ct L

ogic

Figure 2 – Diagonally symmetric interconnect routing

Figure 3 – The Virtex-5 family features asmuch as 10 Mb of 550-MHz block RAM.



map efficiently onto – and take full advan-tage of – the DSP48E slices featured inVirtex-5 SXT devices.

Note that the DSP48E slice contains alarge number of registers (not shown inFigure 4 for simplicity). The Synplify Protool can use advanced pipelining and retim-ing techniques to take full advantage ofthese embedded register elements. Anotherconsideration is that these internal registerelements support only synchronous resets.Thus, if you employ an asynchronous resetin your code, the Synplify Pro tool auto-matically inserts the appropriate glue logicto restore the required functionality.

Synthesis Considerations with High-Speed I/Os The number and type of available I/O in aspecific device plays a critical role in designimplementation, particularly when the appli-cation of the FPGA is chip- or system-levelverification. With Synplicity tools likeCertify ASIC RTL prototyping and theSynplify Pro synthesis solution, we optimizedthe powerful I/O capabilities in Virtex-5devices to take into account both signal inte-gration as well as automatic I/O assignments.

Virtex-5 FPGAs offer as many as 1,200general-purpose input/output (GPIO) pinsthat you can use to implement industry-

standard and custom protocols. TheSelectIO™ technology behind these pinsprovides 1.25 Gbps differential I/O and800 Mbps single-ended I/O.

Second-generation ChipSync™ source-synchronous technology allows program-mable delays to be individually applied toeach input and each output. Furthermore,the unique power and ground pin patternof the Virtex-5 second-generation sparsechevron packaging technology simplifiescircuit board layout while minimizing sig-nal integrity and crosstalk effects.Combined, these I/O technologies ensurereliable operation for high-bandwidthinterfaces such as DDR2 and QDR II.

We engineered Synplify Pro synthesissoftware to automatically handle differen-tial signals (and bi-directional I/O signals).For example, if you apply an attribute thatidentifies the port as being a low voltagedifferential signal (LVDS) output port, theSynplify Pro tool will automatically insertthe appropriate LVDS primitive with oneinput and two outputs.

Next StepsAs devices move deeper into the submicrondomain, the symbiosis between FPGA ven-dors and EDA vendors increases. Working

with engineers at Xilinx for almost a yearto enhance our Synplify Pro synthesisengine has yielded tremendous benefits,including having the best synthesis tech-nology available immediately when thedevice was brought to market, testedagainst real designs.

Future FPGA platforms will provideeven greater densities and capabilities, fur-ther expanding the reach of advancedFPGA architectures across a wide range ofapplication domains. These new platformswill require ever-more-sophisticated designflows and synthesis solutions. For this rea-son, Synplicity and Xilinx formed anUltra-High-Capacity Timing Closure TaskForce. As part of this endeavor, engineer-ing teams from both of our companies willcollaborate to define and implement newdesign flows that maximize design produc-tivity and quality of results for ultra-high-density designs implemented usingnext-generation FPGAs.

The task force will initially focus ondramatically improving overall quality ofresults and run times, and ensure the sta-bility of results when small changes aremade to designs. Ultimately, the goal ofthe task force is for designers to realize thebenefits of near push-button results forultra-high-density designs, completingmultiple design iterations per day.

ConclusionThe Virtex-5 family from Xilinx has animplementation flow engineered by Xilinxand Synplicity to achieve the best possibleresults. These new FPGAs boast a widerange of new architectural features, such as6-LUTs and a diagonally symmetric inter-connect fabric.

The matching software solution fromSynplicity features new forms of timingestimation for diagonal interconnect, newsynthesis algorithms to deal with combi-natorial explosion, specialized RAMinferencing and I/O handling, andnumerous improvements that enable sta-ble results with minor design changes.Taken together, Virtex-5 devices andSynplify Pro software bring systemdesigners new capabilities that you candesign with today.

18

25

43

48

48

Sign-extended to48 bits by the MUXs

3-input+. -, AND, OR,XOR, NOT, etc.

DSP48E

X

Figure 4 – Virtex-5 DSP slices with 25 x 18 multipliers


Triple-OxideUltimate Power Optimization . . .

Reduce power without compromising performance.

Virtex™-5 FPGAs give you unbeatable power savings with the highestperformance. The unique combination of 65nm process, second-generation Triple-Oxide technology, ExpressFabric™ architecture, andpower-optimized hard IP extends the 1 to 5 Watt power advantagedelivered by previous-generation Virtex FPGAs. Achieve higher reliability and a smaller form factor. Save cost on power supplies,heat sinks, and fans. All this, plus the industry’s highest performance.No other FPGA vendor comes close.

Meet performance targets within your power budgetOur Triple-Oxide technology optimizes multiple oxide thicknesses to control leakage and keep static power on par with 90nm VirtexFPGAs while maximizing performance. New 65nm ExpressFabricarchitecture with real 6-input LUTs and diagonally symmetric routingreduces dynamic power by at least 35%. With power-optimized hardIP and automated, block-based power control, you can save evenmore. With Virtex-5 FPGAs, you can meet your most aggressive performance and power targets. No compromises.

Visit www.xilinx.com/virtex5/power, view the Virtex-5 power webcast, download the XPower Estimator tool, and read the power analysiswhite paper.


www.xilinx.com/virtex5/power

The Ultimate System Integration PlatformVirtex-5 LX is the first of four platforms optimized for

Logic, DSP, processing, and serial connectivity.

Performance

Note: Under worst-case operating conditions (85°C)

Power vs Performance

Tota

l Pow

er

Virtex-4 FPGA

Virtex-5 FPGA

Availa

bleCompeti

ngFP

GAPower Budget

Performance limitedby power budget

Max. performancewithin power budget

by Michelle FernandezSoftware Technical Marketing EngineerXilinx, [email protected]

As FPGAs push the performance envelope,maximizing design performance requiresknowledge of the device architecture anddesign software. The 65-nm Xilinx®

Virtex™-5 FPGA family delivers theindustry’s highest performance, with newExpressFabric™ technology, diagonallysymmetric routing, enhanced on-chipmemory, DSP slices, and high-speed I/O.To maximize system performance, youshould use proper design techniques suchas defining timing constraints and selectingoptions in synthesis and implementationthat work best for your design. In this arti-cle, I’ll describe how to achieve faster tim-ing in the fewest design iterations.

Understanding the ArchitectureWhen evaluating a new FPGA architec-ture like the Virtex-5 family, it is impor-tant to study the user guide and data sheetto understand the hardware features.

The Virtex-5 FPGA family is basedon a new ExpressFabric architecturethat delivers higher speeds, a new 6-input LUT structure that reduces logiclevels, and diagonally symmetric rout-ing that minimizes delays. Each CLBcontains two slices that have four 6-input LUTs and four registers config-urable in many ways. For maximumslice packing, it is imperative that youunderstand the slice interconnectivityand any shared resources.

Virtex-5 FPGAs contain hard IP suchas embedded memory (block RAM) andmath functions (DSP48E slices) tuned to550 MHz. Here are some design consid-

erations if any of these hard-IP blocksshow up as part of your critical paths:

• Check to see if your design is makingthe most of the block’s features andthat the synthesis tool is inferring thefeatures as expected from your RTL.

• When using the embedded blockRAM memory or the DSP48E slices, itis important to use their dedicatedpipeline registers when possible toreduce setup and clock-to-out timing.

• Another consideration is the mix ofblock RAMs or DSP48E slices in thedesign, and the trade-off between usingdedicated blocks or implementing thesame function in slices to allow forplacement flexibility.

The choice of clocking resources canalso affect a design’s performance. Virtex-5


Maximizing Design Performance for Virtex-5 FPGAs

Maximizing Design Performance for Virtex-5 FPGAsISE software gives you the tools to achieve the timing goals of a Virtex-5 design.ISE software gives you the tools to achieve the timing goals of a Virtex-5 design.


FPGAs have I/O, regional, and globalclocking resources. These devices aredivided into multiple clock regions, whichat most can contain 4 regional clocks and10 global clocks. During design planning,it is important to analyze how many clockregions you plan to use as well as specificclocks within a clock region. Placing yourI/Os so that their interface logic does notrequire all of the clock resources in aclock region gives ISE™ software greaterplacement flexibility.

Define Timing RequirementsSynthesis and ISE implementation tools aredriven by the performance goals that youspecify with timing constraints for internalclock domains, I/O paths, multi-cycle paths,and false paths (see Figure 1). Defining real-istic timing constraints will prevent excessivereplication and longer run times.

In your synthesis report, check for anyreplicated registers and confirm that thetiming constraints that apply to the orig-

• Explore the synthesis tool settings. (SeeFigure 2 for Synplicity and Figure 3 forXilinx Synthesis Technology [XST] sug-gested tool settings.) There are also avariety of attributes that can affect syn-thesis optimizations. These attributesare an easy way to affect synthesis without having to re-code (see Table 1).

Certain tool settings, such as retimingin Synplify Pro and register balancing inXST, can impact area. If your design isaffected by high fan-out nets and you wantthe synthesis tool to reduce that fanout, usefan-out attributes specifically on that netversus globally reducing the fan-out limit.Avoid maintaining hierarchy if criticalpaths cross over the hierarchical bound-aries. Before implementation, review thewarnings in your synthesis report.

inal register also cover the replicated reg-isters for implementation. When writingtiming constraints, group the maximumnumber of paths with the same timingrequirement before generating a specificconstraint to minimize implementationrun times and memory usage.

Driving Synthesis Here are some design considerations forgetting optimal results from synthesis tools:

• Use proper coding techniques toensure that the inference of your RTLby synthesis takes advantage of thearchitectural features.

• Add any lower level netlists to yoursynthesis project to better optimizeHDL that interfaces to those netlists.

• If critical paths in your implementa-tion are not seen as critical in synthesis,try Synplify Pro’s “-route” constraint toforce synthesis to focus on that path.


Figure 1 – Proper timing constraints

Figure 2 – Recommended Synplify Pro settings

Figure 3 – Recommended ISE synthesis (XST) settings


Choosing Implementation OptionsHaving obtained an acceptable timingestimate from the synthesis tool, you canuse the implementation tools to deter-mine the true performance of the design.The ISE default mode is the performanceevaluation mode, which enables you toget high-performance results out of yourimplementation tools without having tospecify timing goals.

The next step is to run timing-drivenmapping (MAP) and place and route(PAR). Timing-driven MAP performsclosed-loop packing and timing-drivenplacement, while PAR performs the rout-ing of the design. Both MAP and PARshould run with their effort levels set tohigh to achieve optimal results.

Physical synthesis options in imple-mentation can re-optimize and pack logicbased on knowledge of the critical pathsof a design, leading to better placementand routing. The physical synthesisoptions are implemented during theMAP process and include global netlistoptimization, localized logic optimiza-tion, retiming, register duplication, andequivalent register removal. Details oneach of these options can be found in theXilinx White Paper, “Physical Synthesisand Optimization with ISE 8.1i,” availableat www.xilinx.com/bvdocs/whitepapers/wp230.pdf.

Xplorer Utility Xplorer is a tool that helps to determinethe set of implementation options thatresult in the best performance for a design.Xplorer has two modes: timing closure andbest performance. The timing closuremode evaluates your timing constraintsand tries different sets of implementationoptions to achieve those goals. In best per-formance mode, you can give the tool aclock domain to focus on; the tool will tryto achieve the best frequency for the clock.This is helpful when benchmarking adesign’s maximum performance.

Evaluating Your Critical PathsBy understanding the characteristics of yourcritical path, you can make better decisionsabout what to do for your next design itera-

tion. A datapath comprises both logic andinterconnect delay. Individual componentdelays that make up logic delay are fixed. Youcan reduce logic delay by reducing the num-ber of logic levels or by redefining the struc-ture of the logic.

In comparison, interconnect delay ismuch more variable and is dependent on theplacement of the logic. Before running yourdesign through PAR, a timing analysis afterMAP is recommended. Although this timingreport will only have estimates for your rout-ing delays, it can give you an idea of the crit-ical paths the implementation tools areworking on. If the critical paths have a highnumber of logic levels, you may want towork on improving the logic levels versusrunning it through PAR.

If your design has an excessive amountof logic levels:

1. Try the physical synthesis options in MAP.

2. Go back to synthesis and verify that critical paths reported in imple-mentation match what is reported in synthesis.

3. Review the synthesis inference ofyour HDL code.

If there are few logic levels but certaindatapaths are not meeting timing:

1. Evaluate fan-out on routes with long delay.

2. If the critical path contains hard-IPblocks such as block RAMs orDSP48E slices, verify that the designtakes full advantage of the embeddedregisters. Also understand when tomake the trade-off between usingthese hard blocks or using slice logic.

3. Analyze clock skew.

4. If the logic appears to be placed farapart, floorplanning of critical blocksmay be required. Only floorplanwhere necessary.

5. If area groups were created for adesign with a previous version of soft-ware or before many design changes,consider removing those area groups.

6. Consider placing hard-IP blocks suchas block RAMs for DSP48E slices.

ConclusionVirtex-5 FPGAs are optimized for high-per-formance designs, while ISE software hasthe capabilities you need to quickly achievedesign closure, improve productivity, andefficiently verify your designs. Xilinx pro-vides a comprehensive suite of softwaretools (powered by ISE Fmax technology)that improves design performance.However, the more that you can do up-front with good coding styles, defining tim-ing constraints, and resource planning, theeasier it will be for downstream tools toachieve your timing requirements.

XST Synplify Pro

Fan-out Control max_fanout syn_maxfan

Directs Inference of RAMs to Block RAMs or SelectRAM ram_style syn_ramstyle

Directs Usage of DSP48 Slice use_dsp48 syn_multstyle/syn_dspstyle

Directs Usage of SRL16 shreg_extract syn_srlstyle

Controls % of Block RAMs Utilized n/a syn_allowed_resources

Preservation of Register Instances During Optimizations Keep syn_preserve

Preservation of Wires Keep syn_keep

Preservation of Black Boxes with Unused Outputs Keep syn_noprune


* You can find XST documentation at http://toolbox.xilinx.com/docsan/xilinx82/books/docs/xst/xst.pdf. Synplify Pro documentation is located in the tool help documentation.

Table 1 – Helpful synthesis attributes*


by Ralf KruegerSr. Staff Applications EngineerXilinx, [email protected]

As FPGAs grow in size, quality on-chipclock distribution becomes increasinglyimportant. Clock skew and clock delayimpact device performance; managingclock skew and clock delay with conven-tional clock trees becomes more difficultin large devices.

Traditionally, you would deploy solu-tions such as a Xilinx® Virtex™-4 digitalclock management (DCM) or mixed-signalphase-locked loop (PLL) to achieve clocktree deskew and frequency synthesis, amongother functions. Yet each solution has itsadvantages and disadvantages.

In Virtex-5 devices, for the first time inan FPGA, both digital DCMs and analogPLLs are implemented side by side in aclock management tile (CMT). You cannow select the clock management solutionbest suited for your particular applications.

Each Virtex-5 device has as many as sixCMTs. A CMT contains two DCMs andone PLL. You can use either of the twoDCMs or the PLL as a stand-alone mod-ule, or they can interact with each other. Ifused as a stand-alone module, the applica-tion requirements typically dictate whichclock management solution to use. TheDCM, for example, supports a fine phaseshift, a dynamic phase shift, and a multi-

ply/divide feature that does not dependon any maximum VCO frequency.However, the PLL filters input clock jit-ter, support a wide range of output fre-quencies with higher frequencies, andconsume less power.

The DCM and PLL are also designedto interact with each other. The PLL canhelp clean up input or output clocks to theDCM. Dedicated resources within eachCMT make the connections and still guar-antee a proper deskew of the FPGA clocks.The CMTs are located in the center col-umn of the Virtex-5 architecture. Thisenables well-matched clock routes to andfrom every DCM or PLL for enhancedsymmetry (see Figure 1).

DCMVirtex-5 DCMs provide a zero propaga-tion delay buffer, clock division and mul-tiplication capabilities, fixed anddynamic fine phase shift, and multiplephases of the input clock. Along withfully differential global clock trees andlow skew between output signals, theapplication’s various clocks are distrib-uted efficiently throughout the device.Each DCM can drive as many as 9 of the32 global clock routing networks withinthe device.

The global clock distribution networkminimizes skews caused by loading differ-ences. By monitoring a sample of theDCM output clock, the DLL compensates

for the delay on the routing network,effectively eliminating the delay from theexternal input port to the individual clockloads within the device.

In addition to providing zero delaywith respect to a user source clock, theDCM provides multiple phases of thesource clock. The DLL can also act as aclock doubler or divide the user sourceclock by as much as 16. The DCM canalso act as a clock mirror. By driving theDCM output off-chip and then back inagain, the DCM can deskew a board-levelclock between multiple devices.

Another submodule provides the abili-ty to phase shift the DCM’s output clockin small increments (1/256th of the peri-od). The versatile digital phase shift (DPS)operates in four different modes for maxi-mum flexibility: fixed, variable-positive,variable-center, and direct. The DCM’sdigital frequency synthesis (DFS) moduleprovides two outputs, CLKFX andCLKFX180, which are derived from theinput clock by frequency multiplicationand division. You provide valid multiply(M) and divide (D) values, which the DFSimplements through a frequency calcula-tor. For example, if you provide an Mvalue of 19 and a D value of 8, they wouldyield a 2.375 source-clock multiplier.

PLLThe CMT’s PLL is a mixed signal blockdesigned to support clock network deskew,


Clock Management in Virtex-5 DevicesClock Management in Virtex-5 DevicesVirtex-5 FPGAs give designers fresh choices.Virtex-5 FPGAs give designers fresh choices.


frequency synthesis, and jitter reduction. ThePLL block diagram in Figure 2 provides ageneral overview of the various components.

Input multiplexers (MUXs) are used toselect the reference and feedback clocksfrom the global clock pins, global clocktrees, or one of the DCMs. Each clockinput has a programmable counter. Thispre-scales the reference clock and allows awide range of frequency synthesis.

The phase frequency detector (PFD)compares both phase and frequency ofthe input clock and the feedback clock. Asignal is generated that is proportional tothe phase and frequency error betweenthe two clocks, which is then used todrive the charge pump and loop filter togenerate a reference voltage to the VCO.An up or down signal from the PFDdetermines if the VCO should operate ata higher or lower frequency.

After the PFD determines that the inputand feedback clocks are phase- and frequen-cy-aligned, a lock signal is raised, indicatingthat the PLL output clocks are valid. TheVCO continues to compensate for any vari-ations in voltage or temperature. The Mcounter in the feedback path controls thefeedback clock and multiplies the VCO fre-quency to the desired target frequency. TheVCO output clock drives six output coun-ters. Each can be independently pro-grammed to generate a variety of frequenciesfor the application design.

Additionally, clock switchover, phaseshifting, various duty cycles, and bandwidthcontrol are also supported. You can dynam-ically select one of two input clocks before orduring operation. In many cases, alternatephases of a clock are required. The VCOprovides eight phase-shifted clocks at 45degrees each. The higher the VCO frequen-cy, the smaller the phase-shift resolution ofthe clocks coming out of the 0 counter.

You can individually program eachoutput counter to provide a separatelyphase-shifted clock. The PLL can alsogenerate non-50/50 discrete duty cyclesin each output counter. The resolutionand possible output duty cycles dependon the divide value. The higher the out-put divide value, the higher the resolutionsetting of the output duty cycle.

plify and improve system-level designsinvolving high fan-out and high-perform-ance clocks. Virtex-5 devices have powerfulfrequency synthesis, phase-shifting, andclock deskew capabilities never offeredbefore in an FPGA. Along with compre-hensive software support, you can achievelarger, faster, and more complex designsthan in any previous-generation FPGA.

ConclusionVirtex-5 FPGAs give digital designers achoice of either digital or analog clockmanagement. Depending on your particu-lar application, either module – or a com-bination of both modules – provides youwith choices that you never had before.

Together with an abundance of clocktree resources, Virtex-5 devices greatly sim-


DCM1

DCM2

PLLclkout_pll<5:0>

From Global Clock Input Pins

From Global Clock Buffers

To Global Clock Buffers



GeneralRouting

ClockSwitchCircuit

CLKIN1

CLKIN2

D

Lock Detect

Lock Monitor

PFD

CP

LF

VC0

M

CLKFB

LOCK

O0

O1

O2

O3

O4

O5

Figure 1 – Virtex-5 CMT block diagram

Figure 2 – PLL block diagram


by Derek CurdSenior Staff Applications Engineer, Advanced Products DivisionXilinx, [email protected]

With the introduction of the Virtex™-5family, Xilinx is once again leading thecharge to deliver new technologies andcapabilities to FPGA consumers. The moveto 65-nm FPGAs promises to deliver bene-fits traditionally associated with smallerprocess geometries: lower cost, higher per-formance, and greater logic capacity. Andalthough these benefits present excitingopportunities for advanced system design-ers, the 65-nm process node brings with itnew challenges.

Power consumption, for instance,becomes increasingly important whenselecting an FPGA for your application.More than likely, your next-generationdesign will require you to integrate more fea-tures and higher performance within a simi-lar (or perhaps even smaller) power budget.

In this article, I’ll explore the benefits ofreduced power consumption. I’ll also illus-trate the many process and architecturalinnovations implemented in Virtex-5 devicesto offer you the lowest possible power solu-tion without compromising performance.


Reduce Power with Virtex-5 FPGAsReduce Power with Virtex-5 FPGAsThe world’s first 65-nm FPGAs offer the lowest power without compromising performance.The world’s first 65-nm FPGAs offer the lowest power without compromising performance.

P O W E R

Benefits of Reducing PowerImplementing a lower power FPGA designoffers advantages beyond simply adhering tothe device’s thermal operating requirements.Although meeting component specificationsis obviously critical for performance andreliability, how you achieve this has a signif-icant impact on system cost and complexity.

First, lowering FPGA power consump-tion allows you to use less expensive powersupplies, which have fewer componentsand consume less PCB area. The imple-

mentation cost for a high-performancepower system is typically between $.50 and$1 per Watt. Lower power FPGA opera-tion, therefore, contributes directly to low-ering overall system cost.

Second, because power consumption isdirectly related to heat dissipation, lowerpower operation allows you to use simpler,less expensive thermal management solu-tions. In many cases, designs will not needheat sinks, or they will need smaller, lessexpensive heat sinks.

Finally, because lower power operationmeans fewer components and lower devicetemperatures, overall system reliabilityimproves. A decrease of 10° C in deviceoperating temperature can translate to a 2xincrease in component life, which clearly

tremendous tool to fight leakage. In olderFPGAs, two gate-oxide thicknesses wereused: a thin one for the high-performance,lower operating voltage transistors in theFPGA core, and a thicker one for the larg-er, high-voltage-tolerant transistors in theI/O blocks. Simply put, “triple oxide”refers to the addition of a third, medium-thickness gate oxide (or “midox”) transis-tor that has much lower leakage than thethin-oxide core transistor.

The “midox” transistor is used exten-sively in the core of the device fornon-performance-critical circuits(like configuration memory) orcircuits that do not require fastswitching times in response to achanging gate voltage (like rout-ing pass gates). The thin-oxide,highest leakage transistors arereserved only for the portions ofthe speed path that require veryfast switching times. The netresult is that total device leakageis dramatically reduced, whilestill offering a substantial per-formance improvement over pre-vious-generation FPGAs.

The triple-oxide processallowed Virtex-4 devices toreduce static power consumptionby an average of more than 70%relative to competing 90-nmFPGAs. The results were so suc-cessful that the Virtex-5 family

again makes extensive use of this technol-ogy to reduce leakage at the 65-nmprocess node.

Figure 1 illustrates how the triple-oxideprocess enables 65-nm Virtex-5 devices toachieve comparable static power to similar-ly sized 90-nm Virtex-4 devices underworst-case (high-temperature) operatingconditions, despite industry predictionsthat 65-nm devices would see a dramaticrise in static power consumption. Thus, theVirtex-5 family retains a substantial staticpower advantage over competing high-per-formance FPGAs.

Dynamic PowerDynamic Power consumption presentsother challenges for 65-nm FPGAs. The

illustrates the importance of controllingpower and temperature for systems withhigh reliability requirements.

Power: Challenges and SolutionsTotal power in an FPGA (or any semi-conductor device) is the sum of two com-ponents: static power and dynamicpower. Static power results primarilyfrom transistor leakage current, the smallcurrent that “leaks” from either source-to-drain or through the gate oxide of the

transistor even when it is logically “off.”Dynamic power is the power consumedduring switching events in the core orI/O of the device and is therefore fre-quency-dependent.

Static PowerAs you shrink transistor size (for example,move from 90-nm to 65-nm devices), leak-age current tends to increase. The shorterchannel lengths and thinner gate oxidesgenerally used at the new process nodemake it easier for current to leak, eitheracross the channel region or through thegate oxide of the transistor.

In the 90-nm Virtex-4 family, Xilinxintroduced “triple-oxide” process technol-ogy, which gave Xilinx® circuit designers a

2500

2000

1500

Triple Oxide = Three Gate Oxide Thicknesses

1000

500

0

XC

4VLX

15

XC

4VLX

25

XC

4VLX

30

XC

4VLX

40

XC

4VLX

50

XC

4VLX

60

XC

4VLX

80

XC

4VLX

100

XC

4VLX

160

XC

4VLX

200

XC

5VLX

85

XC

5VLX

110

XC

5VLX

220

XC

5VLX

330

Virtex-4 LX Devices

Virtex-5 LX Devices

Po

wer

(m

W)

Device Density

Source

Gate

Drain

Figure 1 – Static power comparison at 85° C


P O W E R


equation governing dynamic power is:

dynamic power = CV2f

where C is the capacitance of the nodeswitching, V is the supply voltage, and f is theswitching frequency. The 65-nm processnode enables FPGAs that have significantlygreater logic capacity and higher performancethan older devices. In other words, morenodes are switching at higher frequencies. Allelse being equal, this tends to increasedynamic power.

However, there is good news with respectto dynamic power at 65 nm. The core FPGAsupply voltage (V) and node capacitance (C)generally reduce with each new process node,providing substantial dynamic power savingsover previous-generation FPGAs.

In Virtex-5 devices, the core supply voltage(VCCINT) decreases from the 1.2V used inVirtex-4 devices to 1.0V. Node capacitancetends to decrease because of smaller parasiticcapacitances (associated with the smaller tran-sistors) and shorter, less capacitive intercon-nects between logic. Additionally, Virtex-5devices use a reduced-K dielectric materialbetween metal interconnect layers to mini-mize routing capacitance.

The estimated reduction in average nodecapacitance for Virtex-5 devices is 15% com-pared to Virtex-4 devices. Taken together withthe voltage reduction benefit, this translates toa 35-40% reduction – at least – in coredynamic power for Virtex-5 devices.

Although the “process shrink” to 65 nmprovides an inherent 35-40% dynamic powerreduction, architectural innovations inVirtex-5 devices offer additional power sav-ings for every design. Most of the node capac-itance that contributes to dynamic power isattributed to the routing or interconnectbetween logic functions. The new Virtex-5architecture fundamentally reduces routingcapacitance in two ways:

• Virtex-5 configurable logic blocks(CLBs) are based on a six-input look-uptable (6-LUT) logic architecture, asopposed to the 4-LUT architecture usedin older devices. This means that morelogic is implemented within each LUT,translating to fewer levels of logic andthus a reduced need for higher capaci-tance routing between logic functions.

• The Virtex-5 routing architecture nowincludes diagonally symmetric routes,meaning that every CLB now has adirect “one hop” connection to all ofits neighbors, including diagonalneighbors. When a connection isrequired between logic functions, it isnow more likely that this connection isa less-capacitive “one hop” connection,whereas previous routing architecturesmay have required two or more hopsfor the same connectivity.

Together, the 6-LUT architecture andimproved routing pattern reduce coredynamic power by lowering average nodecapacitance beyond the level achievedpurely from 65-nm process scaling. Figure 2shows the core dynamic power measure-ments from a benchmark design in whicha Virtex-5 device and a Virtex-4 device areeach filled with 1,024 8-bit counters.These actual silicon measurements illus-trate that the combined process and archi-tectural benefits to dynamic powerreduction can exceed 50%.

Hard IP BlocksVirtex-5 devices contain more hard IPblocks (circuitry dedicated to commonlyused functions) than any other FPGA inthe industry. FPGA designs that utilizethese blocks see additional power savings in

comparison to implementing these func-tions in general-purpose FPGA logic.

Unlike the FPGA fabric, these dedicat-ed blocks contain only the transistors nec-essary to implement the requiredfunction. And there are no programmableinterconnects, so routing capacitance is assmall as possible. Fewer transistors andlower node capacitance benefit both stat-ic and dynamic power consumption. Thenet result is that these dedicated blockscan perform the same function in as little

as one-tenth the power of an equivalentimplementation using the general-pur-pose FPGA fabric.

In addition to adding new types of dedi-cated blocks, many blocks that existed inVirtex-4 devices have been redesigned inVirtex-5 devices to add features, improveperformance, and reduce power. For exam-ple, the 18-Kb block RAM memories in theVirtex-4 family have been sized up to 36-Kbblock RAMs in Virtex-5 devices; each ofthese block RAMs can be broken into twoindependent 18-Kb memories for backwardcompatibility to Virtex-4 designs.

Interestingly, from a power perspective,each of the 18-Kb sub-blocks is constructedfrom two 9-Kb physical memory arrays. Forthe majority of block RAM configurations,any given read or write request to the blockRAM only needs to access one of the 9-Kb

Virtex-5 FPGAs

Virtex-4 FPGAs

0

100

200

300

400

500

600

700

800

0 50 100 150 200 250

Po

wer

(m

W)

Frequency (MHz)

-55%

Figure 2 – Dynamic power comparison of counter benchmark design

P O W E R


physical memories at a time. The other 9-Kb memory can therefore be effectively“powered down” while it is not beingaccessed. This reduces power consumptionby nearly an additional 50% beyond thosereductions resulting from the 65-nmprocess migration. This “ping-pong”accessing of the 9-Kb blocks is inherent tothe new block RAM architecture, meaningthat no user or software control is requiredto take advantage of this capability. Itoccurs dynamically and automatically, pro-

viding dramatic power reductions for alldesigns that use block RAM without com-promising block performance.

The dedicated DSP elements in Virtex-5devices have also received a significantdesign overhaul to incorporate more func-tionality at higher performance and lowerpower consumption. In a slice-versus-slicecomparison, the new Virtex-5 DSP slicehas roughly 40% lower dynamic powerconsumption relative to the Virtex-4 DSPslice. This is mostly attributable to thevoltage and capacitance scaling factors ofthe 65-nm process discussed earlier.

However, because the new Virtex-5DSP slice has greater functionality andwider interfaces, many DSP operationsexperience even greater dynamic powerreduction by taking advantage of these

additional capabilities. In many cases, youcan achieve dynamic power reductions ashigh as 75% when utilizing the full capabil-ity of the new DSP slice. If you are notdesigning a DSP application, keep in mindthat you can use the DSP slices for manystandard logic functions (counter, adder,barrel shifter) at a substantial power savingscompared to implementing the same func-tion in standard FPGA logic.

As a final example of redesigned dedicatedblocks, the LXT platform of the Virtex-5

family includes integrated multi-gigabit serialtransceivers, running at rates as fast as 3.125Gbps. These “SERDES” blocks are imple-mented with an emphasis on reducing powerconsumption. Each full-duplex transceiver ina Virtex-5 LXT device consumes less than100 milliwatts of total power at 3.125 Gbps,representing roughly a 75% reduction rela-tive to Virtex-4 serial transceivers.

ConclusionXilinx has a long history of innovationdating back to the invention of the firstFPGA more than 20 years ago. So it is nosurprise that Xilinx was the first FPGAcompany to make reducing power a toppriority in deep sub-micron technologies.As with the Virtex-4 family, Virtex-5devices employ a number of process and

architectural innovations aimed at offer-ing the lowest possible power consump-tion, while still enabling performanceincreases of 30% or more.

As Figure 3 illustrates, with static powerlevels comparable to Virtex-4 devices, theVirtex-5 family provides a clear advantagerelative to competing FPGAs. As the onlyavailable 65-nm FPGA, Virtex-5 devicesalso offer a minimum of 35-40% coredynamic power reduction over other high-performance FPGAs on the market.Architectural innovations such as the new6-LUT and diagonally symmetric routingare likely to enable actual core dynamicpower savings up to 50% or more. Andtaking advantage of the unprecedentedlevel of dedicated blocks lowers power con-sumption even further.

To find out more about how you canharness the low power of Virtex-5 devices,visit www.xilinx.com/power.

7

6

5

4

3

2

1

CompetingDevice

Virtex-4 FPGAs Virtex-5 FPGAs

*Temp = 85° C

90-nm Static Power

90-nm Core Dynamic Power

65-nm Static Power

65-nm Core Dynamic Power

Po

wer

(W

atts

)

-10%

-70%

-35 to -40%Minimum

90 nm 65 nm

Figure 3 – Power comparison between available FPGAs for a typical design

Xilinx Power Estimator (XPE)Introduced in January 2006, the Xilinx® PowerEstimator (XPE) spreadsheet-based power toolsupports the Virtex™-4 and now Virtex-5 andSpartan™-3E FPGA families. XPE was designedto replace the Web Power tool as the premierpre-design power estimation tool for all newXilinx FPGA families. The key advantages ofXPE over previous power estimation tools arean improved user interface, improved accuracy,and better presentation of important data.

XPE’s summary page displays a completesummary of power usage, first by resourcetype and then by voltage supply. You can usethe navigation buttons on the summary pageto access more detailed information. XPE auto-matically displays several graphs to completethe power usage picture.

Since the initial release, Xilinx has intro-duced newer versions of XPE that include manyadditional features and improvements in accu-racy. These versions, plus those supportingVirtex-5 and Spartan-3E devices, are availableat www.xilinx.com/power.

– Kevin BixlerPower Tools Product Marketing EngineerXilinx, Inc.

P O W E R

by Abu EghanPrincipal EngineerXilinx, [email protected]

Although Xilinx has made substantialprogress at the silicon level to reduce staticand dynamic power in FPGAs, each suc-cessive family takes advantage of reducedfeature sizes to increase transistor densityand performance – thus resulting in ther-mal concerns for the top end of the family.You should not underestimate the impor-tance of power consumption mitigationfor these devices.

Designers cannot afford inaccurate tem-perature predictions when power consump-tion is high and thermal budget margins arelow. Flip-chip packages used in high-per-formance FPGAs have multiple heat-flowpaths and are thermally efficient. Using thebasic “one-resistor” figure of merit thermalresistance – Theta-ja (Θja) – in estimatingtemperature does not do justice to the ther-mal efficiency of the packages.

Thus, a need exists for an alternate andmore accurate approach to obtain Tj predic-tions on these components in an end-user’senvironment. This is where the boundarycondition-independent compact thermalmodel (BCI-CTM) becomes useful. You canconveniently use these models to make fasterand more accurate Tj predictions.

In this article, I’ll discuss better ways topredict temperature for these faster anddenser FPGA components in a system envi-ronment. I’ll also introduce the availabilityof and support for compact thermal modelsfor Virtex™-4 and Virtex-5 devices as oneway to help system designers and compo-nent selectors estimate temperatures in thepre-design and implementation phases.


Applying Compact Thermal Models

Xilinx gives you one more tool in the FPGA thermal management toolbox.

P O W E R

Motivation for Better Predictive ModelsIn a specific system implementation, theactual component Tj may be different fromthe arithmetic predictions using the pub-lished Θja. The prediction depends on theenvironment and the prevailing conditionsin the system. The following equation gov-erns the relationship:

Θja = Tj – Ta_______P

Or, stated in Tj prediction form:

Tj = Ta + P * Θja

where

Θja is the thermal resistance between thedevice junction and ambient

Tj = junction temperature of the device

Ta = ambient temperature

P = package power dissipation

Although you can easily determine Tj, Ta,and P, representing the thermal resistance in anapplication is not easy, particularly for pack-ages with multiple thermal paths. The singleparameter Θja is strongly influenced by theapplication environment and therefore doesnot represent a suitable thermal resistance.

Theta-ja – The Misunderstood Model Theta-ja has become the base thermal param-eter most engineers gravitate toward whenestimating component Tj with known Ta.But for a more demanding, higher wattagecomponent on a large multilayer systemboard – particularly with other componentsaround it – this approach often leads to anerroneous prediction of Tj.

In a design with loose margins in the ther-mal budget, the simple prediction using pub-lished Θja data may not be an issue. Indeed, itwill likely lead to a system running at a lowerthan predicted Tj, because most commonboard types are more efficient than the largeststandardized thermal board. Increasingly, withhigher wattage components where margins aretight, “conservative” data may be the differ-ence between selection and rejection of thecomponent in a specific program.

The key point here is that Θja was notmeant to be used in these types of predic-tions. JEDEC, the semiconductor engineer-

is 10.8° C per Watt. Although the Tj pre-diction expression will suggest a 43.2° Cabove ambient for 4W dissipation, actualdetailed simulation shows a much lowernumber – and thus suggests a lower effec-tive Θja – of close to 5° C per Watt.

Table 1 shows the corresponding Tj forthe same component dissipating 4W onvarious FR4 board sizes and layer counts.This illustrates the power of the environ-ment or boundary conditions on the effec-tive Θja, and the type of Tj prediction

discrepancy that can result. Note that while in general the

effective Θja tends to be lower onlarger board environments, it canalso trend higher and under-pre-dict Tj on small cards in confinedplaces like PDAs or cell phones.The same rationale is at play –Θja is not boundary condition-independent. A component withΘja = 22° C per Watt on aJEDEC board can easily exhibit a30° C per Watt-effective Θja on a30 mm x 30 mm card.

Some application engineershave suggested that becausemost high-performance devicesuse denser and larger PC boards,

component suppliers should provide Θjausing a larger “JEDEC/network board” – aboard that may be closer to network appli-cation boards. This seems like a good argu-ment and should be advocated at the nextJEDEC forum. However, regardless of theboard used for data gathering, the predic-tion will be wrong for some applications.Additional JEDEC boards and standard-

ing standardization body of the ElectronicIndustries Alliance, explains inEIA/JESD51-2 that “the intent of Theta-jameasurements is solely for a thermal per-formance comparison of one package toanother in a standardized environment.This methodology is not meant to and willnot predict the performance of a package inan application-specific environment.”

A typical implementation of a one-cubic-foot Θja still-air standardized envi-ronment is depicted in Figure 1. This is

clearly not a typical system environment.Ideally, you should use these numbers tocompare package efficiency, reserving anyserious Tj prediction for other tools usingmodels that are more relevant.

To illustrate the pitfalls and potentialdiscrepancies in Tj predictions, let’s look ata Virtex-5 flip-chip component –XC5VLX50T- FF1136. The published Θja


Xilinx 35 x 35mm Board Size FF1136-5VLX50T* 4" x 4" Board 10" x 10" Board 20" x 20" Board

4 68.2° C 64.3° C –

8 63.0° C 50.9° C 48.3° C

12 60.4° C 47.0° C 45.7° C

16 59.1° C 46.6° C 44.9° C

24 – 45.3° C 44.0° C

Layer Count of Mounted Board**

Figure 1 – The Analysis Tech implementation of Theta-ja standardized environment

Table 1 – Tj matrix for FF1136-XC5VLX50T on various boards

* Single component considered at 25° C ambient**All layers have 1oz Cu with 80% coverage except outer layers that have 2 oz with 20% coverage.

P O W E R

ized enclosures will only lead to more fla-vors of Θja, further confusing the issue.There ought to be a better way.

What Should an Engineer Do?Engineers should view Θja with cautionwhen predicting Tj in specific environ-ments. Xilinx will continue to publish Θjaand other thermal resistance data becausethose are the prevailing standards. Theyhave their uses and should be deployedwith their limitations in mind.

To address these limitations and tomake more accurate Tj predictions in a sys-tem environment, a more refined model ofthe package is needed. Recognizing thisneed, Xilinx now supports compact ther-mal model data for high-performanceFPGA devices.

What is a Compact Thermal Model?A compact thermal model is a behavioralmodel that seeks to accurately predict thetemperature of the package at selected

nodes: junction, case, top, bot-tom, and balls, for example. Itcannot predict the temperatureat any other part of the packagethat is not predefined. It can beviewed as a reduced nodeabstraction of the response of acomponent to various boundaryconditions. It is also more com-putationally efficient than thecorresponding detailed model.These models are supplied foruse in compatible computation-al fluid dynamics (CFD) toolsfor thermal simulations in placeof detail models.

Xilinx offers two model types forFPGA products:

1. Two-resistor (2-R) compact modelscomprising the familiar Theta-jc andTheta-jb for the package. There is nogeometrical information. Although 2-R models are useful and give betterpredictions than traditional Θja esti-mations, they are not as accurate asDelphi models.

2. A Delphi compact model comprisingseveral thermal resistors that connecta junction node (representing thedie) to several surface nodes. Thermallinks are also allowed between thesurface nodes. Figure 2 shows thetopology for a flip-chip BGA Delphicompact model. The matrix of resis-tors has been optimized through aDelphi optimization algorithm sothat they can be used in various envi-ronments without compromising pre-diction accuracy.

Table 2 depicts a typical Delphi half-matrix model for flip chip. The resistancedata is usually saved along with the nodedefinitions and package extents to completethe model.

JEDEC has proposed a neutral file for-mat in XML for CTM distribution. Xilinxplans to support the format when CFDtools adopt and support it. In the interim,Xilinx is offering the CTM files in twoCFD tool formats, Flotherm and Icepak,selected from a pre-introductory survey ofXilinx customers. These tools cover themajority of those end-users who answeredthe survey. If you do not use one of thesetools, you can request ASCII data for man-ual or script-based entry into your tool.

Application ExamplesFigure 3 shows a typical flow for a CTMapplication. Normally, the componentCTM data is stored in a library; as the user,you will bring in the CTM data as a libraryitem. You then specify the board attributesand boundary conditions of your assembly,adding other items like component powerand heat contributions from other compo-nents for the Tj prediction.


DELPHI BCI-CTM Topologyfor FCBGA Two Resistor Model

TI

Junction Junction

BI BO

SIDE

TO

RJC

RJB

Schematic Overview CTM Implementation Concept

ENVIRONMENTST-Ambient AirFlow

Heatsinks Heat PipesSpace etc.

CTMComponent

Library2R – Ok

BOARDDEFINITION

Extent, Layer Details, Cut-Outs

CTM TOOL(Thermal Solver)

Input Power – PdFor Components

More ThanOne

Component

Tj –PREDICTION(Other Predefined

Component Temps)

Figure 2 – CTM topologies

Figure 3 – CTM application schematics

P O W E R

Let’s examine the benefits of CTM pre-diction in the cases below.

Case Study #1: Battleboard Temperature EstimationsThe “battleboard” is a 24-layer 20 x 16-inch board that Xilinx uses to assess signalintegrity issues on Virtex components. Inthis case, lab measurements showed that aVirtex-4 component case temperature waswell below what you would expect from theΘja prediction. A discrepancy of about 20°C was apparent. The Icepak CFD modelusing CTM inputs with radiation “on”showed a more realistic Tj – very compara-ble to those measured in the lab. Table 3summarizes the observations (note thatreported temperature is T-case).

Case Study #2: Small Board with Multiple Components A 3.75 x 2-inch board can illustrate thesmall board size effect and the influence ofadjacent components on the Tj predictionof the component of interest. I have usedthe BCI-CTM approach with the aggressorcomponents in active and inactive states toshow the impact on Tj prediction.

The high-density interconnect (HDI)board used in this case is smaller thanthe JEDEC standard 2S2P board. TheXilinx XC3S1000-FT256 componentdeployed on this board has a Θja of19.7° C per Watt. With a 20° C ambi-ent, and without any board input, youcan predict a Tj of 39.7° C (20 + Pd *Θja [100LFM]).

The BCI-CTM model with 100 lin-ear-feet-per-minute (LFM) airflowshows a Tj of 55° C with Ta = 20° C for

the single component on the board. Thepredicted Tj is already worse than theJEDEC prediction. This case is in linewith what you would see with smallercomponents that use smaller, thinnerboards in consumer products – PDAs,MP3 players, or GPS systems.

Figure 4 shows the board temperaturecontour with four chip-size package(CSP) components doing 0.25W each.The 8-mm-square CSP components used2-R data (published jb and jc) in themodel. Figure 4 shows the componenttemperatures. The XC3S1000-FT256component yielded a Tj of 65° C – a fur-ther 10° C rise over the single componentcase. Both single and multi-componentruns took less than four minutes on aconduction-based tool using the DelphiCTM for the Xilinx component. Thesepredictions are clearly different fromwhat the basic Θja parameter predicted.

ConclusionRelying on the basic Θja metric to pre-dict junction temperature for high-per-formance devices in a system isinadequate and can cause errors thatcould lead you to preclude a perfectlygood component in your system. Toaddress this shortcoming, Xilinx pro-vides compact thermal models to assistin predicting Tj – in stand-alone calcula-tions as well as system deployments.

You can use these models in CFDtools to make Tj predictions that takeyour environment and board conditionsinto consideration. Although you canaccomplish the same predictions with adetailed package model, note that theseCTMs offer reduced node benefits offaster solutions that are also computa-tionally efficient.

You can download Virtex-4 and Virtex-5 CTM data at www.xilinx.com/xlnx/xil_sw_updates_home.jsp. Future high-performance FPGA products will havethe Delphi models available for down-load as part of thermal collateral. Xilinxwill also support legacy products such asVirtex-II and Virtex-II Pro devices, olderSpartan™ FPGAs, and CPLD productson a by-request basis.

Delphi Compact Top Inner Bottom Inner Top Outer Bottom OuterModel (C/Watt) (TI) (BI) (TO) (BO) Side

Junction 0.22 1.25 1.05 14.97 --

Top Inner (TI) 16.14 4.45 -- --

Bottom Inner (BI) -- 9.47 11.1

Top Outer (TO) 14.55 3.18

Bottom Outer (BO) 4.72

Hand Calc Reported ICEPAK CFD solution < 4 min

Tj -- JA = 10.6° C/watt Battleboard Tc Tj; 2-Resistor Model Tj: Delphi Model (1-R model Ta = 25° C) Ta = 25 - 27° C Ta = 25° C as Published

(Ta = 25° C)

JB = 2.6 andJC = 0.19° C/watt

67.4° C 46.4° C 48.8° C 43.9° C

Table 2 – Delphi CTM resistors for FF676-XC5VLX50

Table 3 – “Battleboard” temperature predictions

Figure 4 – Contours of static temperature


P O W E R

by Gang SunSenior Product Marketing Manager, High-Speed Serial I/OXilinx, [email protected]

The incessant demand for ever-increasingbandwidth has led designers away fromparallel buses and low-speed transceiverstoward serial transceiver-based interfaces.High-speed signals solve many designchallenges; they offer new levels of band-width and lower overall system cost andpower consumption.

These successes have led engineers tobelieve that the industry can continue tolower overall cost and power simply byincreasing transceiver speed indefinitely.However, going beyond 3 Gbps can insome cases lead to fundamentally differentengineering challenges that make it hard-er to lower overall system cost and powerconsumption. The explanation is simple;maintaining signal integrity becomesincreasingly difficult at ultra-high speeds,

and the extra overhead can sometimesoutweigh the benefits associated withincreased data rates.

Transceivers in TransitionFigure 1 shows the frequency loss andcrosstalk associated with a legacy back-plane channel. At 1.6 GHz, the loss is rea-sonably manageable, making transceiverimplementation at or below 3.2 Gbps rel-atively cost-effective and power-efficient.

However, at 3 GHz, the loss becomessignificant. Consequently, the implemen-tation of a 6 Gbps backplane transceiverrequires different feature sets. You will like-ly need advanced techniques such as deci-sion feedback equalization (DFE) tomaintain signal integrity, and theseadvanced capabilities require a different setof optimized features.

This explains why a 3 Gbps transceivertypically consumes less than 100 mW perchannel, whereas a DFE-enabled 6 Gbpstransceiver consumes at least twice as

much power. For applications requiringthese advanced features, this extra powerconsumption is a worthwhile trade-off.But it becomes advantageous to offer botha low-power 3.2 Gbps transceiver and ahigh-performance transceiver for cutting-edge applications – in essence offering thebest tool for the job.

At 5 GHz, the signal-to-noise ratio(SNR) becomes negative. In that case,you would have to redesign the entirebackplane with more expensive materialsand more sophisticated manufacturingtechnologies to enable 10 Gbps transmis-sion. Consequently, achieving a 10 Gbpsserial transmission over a backplaneincurs a higher cost in terms of die areaand power consumption.

The preceding example clearly shows thattransceivers running at or below 3.2 Gbpsare at a sweet spot; they are more cost-effec-tive and power-efficient than both parallelinterfaces and ultra-high-speed transceivers(running at 6 Gbps and 10 Gbps) for a large


A Multi-Gigabit Transceiverfor the MassesA Multi-Gigabit Transceiverfor the Masses The Virtex-5 GTP transceiver brings versatility,

ease of use, power efficiency, and cost-effectivenessto high-volume mainstream applications.

The Virtex-5 GTP transceiver brings versatility, ease of use, power efficiency, and cost-effectivenessto high-volume mainstream applications.

S E R I A L C O N N E C T I V I T Y

majority of interconnect applications. Thisphenomenon has led to two diverging trendsin the transceiver market:

1. Bandwidth-hungry applications (suchas a backplane interconnect for ter-abit routers) have needs for 6 Gbpsand 10 Gbps transceivers. Theseapplications continue to push theperformance envelope while tradingoff cost and power.

2. High-volume applications are wellserved by transceivers running at orbelow 3.2 Gbps.

The Virtex-5 RocketIO GTP TransceiverXilinx clearly recognizes the differentrequirements of the high-performancemarket segment, noting that the high-vol-ume market segments have in some casesconflicting requirements. The vast major-ity of serial protocols run at or below 3.2Gbps; examples include PCI ExpressGeneration 1, Gigabit Ethernet, XAUI,SATA I and II, Serial RapidIO, CPRI,OBSI, and HD-SDI. Many emergingprotocols such as JEDEC’s data converterinterface and VESA’s DisplayPort also runat these relatively slow data rates. In reali-ty, these established and emerging proto-cols represent more than 90% of currenttransceiver applications. Therefore, trans-ceivers running at or below 3.2 Gbps are“transceivers for the masses.”

Xilinx has taken a truly innovative stepand developed two different transceiversfor its Virtex™-5 FPGA family. The firsttransceiver, the Virtex-5 RocketIO™GTP transceiver, is designed for high-vol-ume applications and covers data ratesfrom 100 Mbps to 3.2 Gbps. Targeting themajority of system designers, the GTPtransceiver is versatile, easy to use, power-efficient, and cost-effective.

The GTP transceiver is versatile becauseit is designed to support not only 8B/10B-based protocols such as the PCI ExpressWrapper but also scrambling-based proto-cols such as SONET. (Table 1 is a completelist of applications supported by GTPtransceivers.) Consequently, the spectrumof applications that can be supported bythe GTP transceiver is limitless. In addi-

FPGA CAD tools. The Xilinx® Virtex-5RocketIO GTP transceiver wizard offers anintuitive GUI interface that allows you toselect the GTP, clocking option, FPGA fab-ric interface, protocol stack, and encod-ing/decoding mechanism. After you havecompleted your selections, the tool generatesa GTP wrapper with the necessary features.

tion, validation and characterization of theGTP transceiver occurs in application-spe-cific settings to ensure standards compli-ance. The combination of these design andcharacterization approaches ensures theuniversal appeal of the GTP transceiver.

The GTP transceiver is easy to usebecause it enjoys the support of the best


Figure 1 – Channel S-parameter and crosstalk

Figure 2 – ChipScope IBERT console


The Xilinx ChipScope™ Analyzeroffers self-testing capabilities for the GTPtransceiver by leveraging the integrated bit-error-rate tester (IBERT) feature built intothe transceiver. The ChipScope IBERTconsole is shown in Figure 2.

Among the advanced features offered bythe ChipScope Analyzer are channel per-formance measurement capabilities, auto-mated eye scan capability for finding thebest Tx and Rx settings, and transceiverand link status reporting. This comprehen-sive set of tool offerings greatly simplifiesdesign and manufacturing efforts based onthe GTP transceiver and is a key enabler fora large variety of applications.

As PCB boards become increasinglycrowded, transceiver power consumptionbecomes a critical issue. Therefore, powerefficiency was one of the top designobjectives for the GTP transceiver.Average power consumption per GTPtransceiver is substantially below 100 mW.In some cases, per-transceiver power con-sumption is as low as 60 mW. The uni-versal appeal of low-power requirementsfurther enhances the competitiveness ofthe GTP transceiver for power-sensitiveapplications.

As high-volume applications start to useembedded transceivers, cost has alsobecome an important consideration.Consequently, Xilinx offers certain solu-tions in hard logic rather than in look-uptables (LUTs). For example, a hard-codedPCI Express protocol stack includes aphysical layer based on the GTP transceiv-er, a link layer, and a transaction layer. Thisapproach significantly lowers overallsolution costs, and the increased cost-effectiveness makes GTP transceiver-based solutions even more attractive tohigh-volume/high-margin applications.

ConclusionTransceivers at 3.2 Gbps or below are at asweet spot; the vast majority of transceiver-based applications fall into this data-raterange. With its versatility, ease of use,power efficiency, and cost-effectiveness, theVirtex-5 GTP transceiver from Xilinx isideally positioned in this market, a truemulti-gigabit transceiver for the masses.


Market StandardSpeed

Key Features(Bits per Second)

Telecom OC-3/SDH STM-1 155 Mbps

OC-12/SDH STM-4 622 Mbps

OC-48/SDH STM-16 2.488 Gbps

OBSAI (Issue 1.0) 768 Mbps

1.536 Gbps

3.072 Gbps

CPRI (Version 2.0) 614 Mbps

1.228 Gbps

2.457 Gbps

SFI-5 2.448 -3.125 Gbps

Datacom 1G Ethernet (802.3z D5.0) 1.25 Gbps

XAUI (802.3ae D5.0) 3.125 Gbps

10G Base CX-4 3.125 Gbps (x4)

Computing / PCI Express Specification 2.5 GbpsCommunication (Rev 1.1)

Serial Rapid IO 3.125 Gbps

InfiniBand 2.5 Gbps

Storage Fibre Channel (Rev4.0) 1.0625 Gbps

2.125 Gbps

SATA (Rev1.0a) 1.5 Gbps

3.0 Gbps

SAS (Rev5) 1.5 Gbps

3.0 Gbps

Video SDI 143 Mbps

176 Mbps

DVB-ASI 270 Mbps

• FIFOs can be Bypassed forSynchronous Operation

• Synchronous Clocking(Bypass FIFOs)

• Loss of Signal (LOS)

• Tx Receive Detect

• Loss of Signal (LOS)/Idlestate detect

• Low Power States and OOBBeacon

• Ground ReferencedTermination

• Supports All Data Ratesfrom 1.25-3.125G

• Rate Negotiation, Allows Tx and Rx to Operate atDifferent Speeds

• Rate Negotiation forGen1/Gen2

• LOS and Out-of-Band Signaling Beacon

• Internal AC Coupling Caps canbe Bypassed for VideoStandards

• 2.97G is the New HD-SDIStandard in Development

Table 1 – Applications supported by GTP


by Doug KernStaff System Design EngineerXilinx, [email protected]

Currently dominating the desktop PCmotherboard and graphics markets, thePCI Express (PCIe) interconnect is poisedto supplant PCI and PCI-X as the domi-nant high-bandwidth interconnect for theserver, enterprise, mobile, workstation, net-working, communications, industrial con-trol, and medical equipment markets.

With more than 58 form factors, includ-ing Express Card, Advanced TCA,Compact PCI Express, Com Express, and acable spec, the PCIe protocol is becomingubiquitous. The PCI Special Interest Group(PCI-SIG) maintains the PCIe specification(along with the PCI and PCI-X specifica-tions) and holds compliance workshops.

The PCIe subsystem is a point-to-pointinterface that replaces and overcomes thelimitations of bus-based PCI and PCI-Xstandards. PCIe Generation 1 (Gen1)offers 2.5 Gbps speed with low-voltage dif-ferential signaling (LVDS), embedded

8B/10B encoding, dual-simplex signaling,and message-based serial protocol.

With plans in place to increase band-width to 5 Gbps in Generation 2 and 10 Gbps in Generation 3, the PCIe bus isexpected to be the dominant high-band-width interconnect for several years tocome. (For more information on the PCIespecification or compliance information,visit www.pcisig.com.)

With scalable lane widths from x1 tox32 lanes and advanced features such astraffic classes, virtual channels, hot-plug,and power management, the Xilinx PCIeblock provides support for a wide range ofapplications, from a simple upgrade fromPCI to an x1 PCIe endpoint device toadvanced high-bandwidth x8 PCIe com-munications endpoint devices.

Figure 1 shows the topology of a PCIesystem. The CPU is connected to a rootdevice and is responsible for configuringand enumerating all plug-and-play PCIExpress endpoint devices in a system.Because the PCIe system is point-to-point,switch devices are necessary to grow thenumber of devices or endpoints in a system.

A switch has one upward facing port andnumerous downward facing ports. Thesedownward facing ports connect to the work-ing devices or endpoints of a system.

Although only one root exists in a sys-tem, there are one or more endpointdevices. For example, a standard PCmotherboard provides three to sevenexpansion PCIe slots. With the integratedPCI Express Endpoint block, Xilinx®

Virtex™-5 LXT FPGAs allow you to rap-idly develop and deploy high value-addedPCIe endpoint devices. The numerousvalue-added endpoint designs are the tar-get applications for the FPGA-based con-figurable Virtex-5 LXT PCI ExpressEndpoint block.

The Virtex-5 LXT PCIe Endpoint BlockThe Virtex-5 LXT PCIe Endpoint block(see Figure 2) implements the physicallayer (PHY), data link layer (DLL), trans-action layer (TL), and configuration layersof a PCIe endpoint device. The imple-mentation of a small reset circuit andclock generation blocks require you to usethe FPGA fabric.


Introducing the Virtex-5 PCI Express Endpoint BlockIntroducing the Virtex-5 PCI Express Endpoint BlockWith PCI Express quickly becoming the standard high-bandwidth interconnect, the Virtex-5 LXT PCIe Endpoint block enables a configurable single-chip solution.With PCI Express quickly becoming the standard high-bandwidth interconnect, the Virtex-5 LXT PCIe Endpoint block enables a configurable single-chip solution.


The PCI Express Endpoint block capa-bilities include:

• Compliance with the PCI Express basespecification, revision 1.1

• Choice of PCI Express Endpoint blockor legacy PCI Express Endpoint blockimplementation

• x8, x4, x2, or x1 lane width

• Easy-to-use user interface similar to thefamiliar Xilinx LocalLink interface

• Integration of RocketIO™ GTP transceivers

• Spread-spectrum clocking support

• Low power operation

• Power management support

• Ability to use on-chip block RAMs forbuffering

• Fully buffered transmit and receive

• Management interface to access PCIeconfiguration space and internal con-figuration

• Support for full range of maximumpayload size (128 to 4,096 bytes)

• Capable of as many as two virtualchannels (VCs)

• VC arbitration: round robin, weightedround robin, or strict priority

• 6 x 32-bit or 3 x 64-bit base-addressregisters (BARs) or a combination of32-bit and 64-bit BARs

• BARs configurable for memory or I/O

• Memory BAR checking/filtering

• Non-memory transaction layer packet(TLP) ID checking/filtering

• Implements one PCI Express function

• Signals to the programmable fabric forstatistics and monitoring

• Full documentation and referenceexample design

Virtex-5 GTP transceivers interface tothe serial differential electrical signals of thePCIe specification. The PCIe block com-pletes the physical logic that provides lane

virtual channels, great flexibility for packetarbitration is available.

High-Level IntregrationThe Virtex-5 PCI Express Endpoint blockallows you to implement a single endpointdevice with one FPGA while leaving almostall of the FPGA programmable fabric avail-

deskew. The DLL is responsible for dataintegrity and implements a user-config-urable-sized retry buffer to retransmit pack-ets that are received incorrectly withoutre-requests from the applications software.The TL provides Tx and Rx buffers andorders the packets to be transmitted. Withcapability for eight traffic classes and two

PCI ExpressGraphics: 16x

CPU

Root Complex

Switch Switch Can Be Open orClosed System

Virtex-5 PCIe Endpoint Block Applications

Memory

Switch

x2 EndPoint

x1 EndPoint

x8 EndPoint

PCI

PCIBridge

LegacyEnd Point

Transaction

Layer

PL Lane

Configuration and Capabilities ModulePCIe

Block

GT

P T

ran

sce

ive

r(s)

User

Applic

ation

Transaction

Layer

Interface

Management

Interface Hot Plug and Power

Management

Interface

Clock and

Reset

Interface

Configuration

and Status

Interface

Block

RAM

(Tx)

Block RAM

Interface Transceiver

Interface

Clock and

Reset BlockMiscellaneous Logic (Optional)

Block

RAM

(Tx)

Block

RAM

(Tx)

Data Link

Layer

Physical

Layer

PL Lane

PL Lane

PL Lane

PL Lane

PL Lane

PL Lane

PL Lane

Figure 1 – PCI Express topology

Figure 2 – Xilinx Virtex-5 LXT PCI Express Endpoint block




able for value-added end-point application designfunctionality. The combina-tion of the PCIe block, GTPtransceivers, and the blockRAM incorporate the major-ity of the logic required for alow-power, high-bandwidth,configurable PCIe endpointport. The GTP transceiverssupport Gen1 2.5 Gbps seri-al rates while being electri-cally compliant to the PCIespecification. Some of thenew transceiver featuresinclude power managementsupport such as beacon andelectrical idle detect and thespread-spectrum referenceclock required in PC systemmotherboards.

The block RAM provides ascalable, user-configurableretry memory along with Txand Rx FIFO for any packetsize supporting one or two vir-tual channels. With complexconfiguration options of thePCIe block, GTP transceivers,and block RAMs, along withclock and reset logic, softwareautomation provides quickand accurate configurationand interconnect of these functions.

Using the PCI Express Endpoint BlockThe Xilinx PCIe LogiCORE™ solutiondelivers wrappers through the CORE Generator™ software GUI flow ofthe ISE™ tool, which makes it easy to usethe PCIe block and still provides full flexibil-ity of the configurable features and capabili-ties of the high-bandwidth PCI Endpointblock. The configuration capabilities of thePCIe block are abstracted into several self-checking and instant-feedback menus thatwalk you through the configuration of keydesign elements (Figure 3).

The LogiCORE wrappers connect thePCIe block with the GTP transceiversand block RAMs. They also create andconnect the clock and reset logic blocksto the PCIe block. You can customize the

clock and reset RTL blocks to addressadvanced system requirements. In addi-tion to connecting the various hardwareresources to the PCIe endpoint block, theadvanced wrapper provides value-addedfeatures such as the user-friendly interfacesimilar to Xilinx LocalLink, the memoryBAR, and non-memory TLP ID checkingand filtering.

Ready to Provide High BandwidthThe launch of the Virtex-5 PCIe block inthe Virtex-5 LXT device not only providesthe block and PCIe wrapper support, butalso includes extensive system design aids.Memory endpoint and programmedinput/output (PIO) reference designs areincluded in the deliverables. These designsserve as a training aid, quickly bringing up asimple user application to test in hardware.

In addition, the successfulPCI-SIG compliance achievedwith this design greatly speedsup the path to compliance forthe users of the PCIe block inVirtex-5 devices. Xilinx providesa suite of hardware referenceboards such as the ML523 char-acterization board, ML505embedded design referenceboard, and the ML555 x8 high-data-bandwidth PCIe board toenable you to build and testPCIe systems (Figure 4).

Compliance testing requires acomplete design with a demon-strable function, hardware board,software device driver, and appli-cation software to demonstrateinteroperability with PC mother-board systems. The memory end-point and PIO reference designsprovide all of the above. Sampledevice drivers for Windows XP,Windows Server 2003, WindowsVista, or Linux are available byrequest. As the reference designfunction emulates a memoryaperture, a simple test is provid-ed to demonstrate the designoperation.

This Virtex-5 LXT device andXilinx reference hardware boards

are listed on the PCI-SIG integrators list. ForPCI-SIG-complaint Xilinx solutions, visitwww.pcisig.com/developers/compliance_program/integrators_list/pcie.

Conclusion The Virtex-5 LXT PCI Express Endpointblock, combined with the GTP trans-ceivers and block RAMs, provide anextremely high level of integration for youto efficiently and quickly build high-per-formance, fully compliant PCIe systems ina single device. Xilinx developed and deliv-ered the Virtex-5 PCI Express solutionwith a focus on end-user requirements forease of use, legacy design migration, pow-erful yet flexible features and capabilities,system-level compliance, and cost.

For more information, visit www.xilinx.com/virtex5.

Figure 4 – ML523, ML555, and ML505 Virtex-5 LXT PCIe hardware reference boards

Figure 3 – Xilinx CORE Generator GUI


Here are 6 of the new, faster, bigger, Virtex-5 FPGAs on a 12 Million ASIC Gate Boardthat offers unmatched performance to ASIC Prototypers, IP Designers, and FPGADevelopers. The V5 65nm process, with 6 input LUT and advanced interconnect,enables 30% faster clock speeds in your application. The Dini DN9000k10PCI capturesthis performance on an easy to use board with these handy features:

•33/66 MHz PCI bus or stand-alone operation•6 - DDR2 SODIMM Sockets•7 - Global Clock Networks•3 - 400pin FCI-MEG Connectors for daughter cards• Easy configuration via CompactFlash, USB or PCI

All necessary operating software, including referencedesigns and Synplicity Certify™ models to simplifypartitioning, is supplied with the board. If yourneed is speed — visit The Dini Group web site www.dinigroup.com for complete details on the fastest FPGA board ever.

30% faster than lastyear’s model…

1010 Pearl Street, Suite 6La Jolla, CA 92037 (858) 454-3419 [email protected]

by Navneet RaoTechnical Marketing Manager, Horizontal Platform SolutionsXilinx, [email protected]

End users are adopting multimedia-enableddevices rapidly; you need look no furtherthan iPod video or YouTube-like blog sites.As users consume this type of rich data, theneed for efficient storage and higher con-nectivity speeds becomes critical.

Today, the megahertz debate has beenreplaced with the gigabit-per-seconddebate, as the focus shifts from processingspeed to high-speed interconnect. A host ofserial standards have come into play. Thekey market requirements governing thesestandards are:

• Scalable performance

• An extensible feature set to adapt to various use models (chip-to-chip,backplanes, cable)

• Interconnects suitable for multiplemarket segments and applications

• Implementation of cost-effective solutions in mainstream high-volumetechnology

One of the key serial standards to emergeis PCI Express (PCIe), a third-generationI/O interconnect introduced in 2002 toprovide a scalable path from PCI and PCI-X (see Table 1). PCIe has become the stan-dard interconnect of the PC industry and israpidly gaining momentum in other appli-cations as well (Figure 1). It promises scala-bility, an extensible feature set, multiplemarket suitability, and cost-effectiveness.


PCI Express Markets, Trends, and ApplicationsPCI Express Markets, Trends, and ApplicationsThe Virtex-5 LXT device’s built-in PCI Express solution enables significant power and area savings.The Virtex-5 LXT device’s built-in PCI Express solution enables significant power and area savings.


Key highlights of PCIe include:

• A high-speed serial standard offeringbidirectional communication at 2.5 Gbps line rates per lane

• Layered packet-based architecture,enabling modular design

• Bandwidth enhancement (as much as80 GB) through easier scalability – 1, 2, 4, 8, 16, and 32 lanes

• Advanced features like reliability,power management, and hot plug

• Support for next-generation three-dimensional/multimedia trafficthrough virtual channels, traffic classes,and quality of service (QoS)

• Ease of use through new form factorsand innovative designs, enablingapplications targeted to multiple market segments

• Software preservation by supportinglegacy PCI architecture and infrastructure

Tremendous acceptance, design wins, andstrong customer feedback have propelled ourunderstanding of the inherent benefits ofPCI Express in our customers’ applications.To create solutions for solving tomorrow’sproblems today and keep pace with rapidlychanging times, Xilinx has introduced ahard PCI Express Endpoint block in itsVirtex™-5 LXT devices (Figure 2).

The salient features of the PCIeEndpoint block are:

• Full-featured and compliant to PCIebase specification v1.1

– Highly configurable PCIe end-point solution

• Passed compliance/interoperabilitytests at PCI plug-fest(www.pcisig.com/developers/compliance_program/integrators_list/pcie)

• Supports 1-, 2-, 4-, or 8-lane implementations

• Meets all key requirements

– Electrical signaling

– Protocol (CRC, automatic retry)

– QoS

– Hot-pluggable

PCI Specification Bus Width Transfer Rate Lane Width Line Rate Max Data Bandwidth

PCI 1.0 32 bits 33 MHz 133 Mbps (half-duplex)

PCI 2.x 64 bits 33-66 MHz 266-533 Mbps (half-duplex)

PCI-X 1.x 64 bits 133 MHz Up to 1 Gbps (half-duplex)

PCI-X 2.0 64 bits 266-533 MHz Up to 4 Gbps (half-duplex)

PCI Express 1.x 1 lane 2.5 GHz Up to 500 Mbps

2 lane 2.5 GHz Up to 1 Gbps





PCI Express 2.0* 1-32 lanes 5 GHz Up to 32 Gbps

* The PCI Express 2.0 specification is “still under construction.”

Systems

2004

300

250

200

150

100

50

0

Mill

ion

s

2005 2006 2007 2008 2009

Systems = PCs + Servers + Workstations + Embedded Systems

Hard Core with GTPTransceivers

Tx

Rx

Table 1 – PCI/PCI-X/PCIe specification and bandwidths

Figure 1 – PCI Express momentum

Figure 2 – PCIe Endpoint block in the Virtex-5 LXT FPGA




• Uses Xilinx® RocketIO™ GTP transceiver blocks

– PCI Express electrical support

– 100-MHz direct reference clock

• Saves resources

– Integrated in all Virtex-5 LXTdevices

– Adjacent to GTP transceivers

• Ease of design

– Shortens design cycles

– Simplified, intuitive design flow

• Low cost and low power

• Packet buffering with configurableblock RAM

– Rx buffer

– Tx buffer

– Retry buffer

• Simple transaction layer interface foreasy integration

• Signals available to fabric for statisticsand monitoring

– credit status, max payload size,error signals

• As many as two virtual channels forQoS

– Round robin, weighted roundrobin, or strict priority

Designing with the Virtex-5 LXT PCIe BlockPCI Express has gained considerablemomentum, with broad acceptance in thePC industry. Engineers designing withVirtex-5 LXT FPGA-based PCIe end-points can also lead the proliferation ofPCI Express in new markets by leveragingthese advantages:

• Faster time to market. Existing ASSPsdo not support PCIe today; FPGAsenable bridging between proprietaryparallel interfaces and PCIe. In addi-tion, evolving add-ons to the PCIExpress standard discourageASIC/ASSP starts until a broad marketbase has been created. A case in pointis the recent “Geneseo” architectureannouncement by Intel and IBM at

the Intel Developers Forum inSeptember 2006. Xilinx also endorsesthis initiative, which extends the PCIearchitecture to enable emerging appli-cation accelerators.

• Low power and reduced area.Applications that need higher perform-ance but are designed to smaller formfactors can use the Virtex-5 LXT solu-tion (Figure 3). The PCIe Endpointblock allows you to be able to choose asmaller device and still achieve signifi-cant power and cost savings.

• Bridging legacy and disparate stan-dards to PCIe. The movement of lega-cy applications to new form factorsoptimized for PCI Express requiresbridging functions between legacystandards and PCI Express. The newVirtex-5 LXT platform offers the cus-tomization and logic resources toenable this transition as well as bridg-ing to other serial standards.

• Scalable solution. The PCI Expressprotocol is here to stay, but the proto-col itself and use models are in rapidly

6.22 34,600

25,100

3.09

NearestCompetitor

(90 nm)

Virtex-5 LXTFPGAs(64 nm)

XC5VLX30T versus 2SGX60D. Target Frequency = 200 MHz. Worst-case process 25K LUTs, 17K flip-flops, 1 Mb on-chip RAM, 64 DSP blocks, and 128 2.5V I/Os. Based on Xilinx tool v8.2 and competitor tool v6.01.

NearestCompetitor

(90 nm)

Virtex-5 LXTFPGAs(64 nm)

User Logic

Power Consumption and Area Required to Implement a Typical Design Including 8-Lane PCIe Endpoint

Pow

er (W

atts

)

Are

a (L

UTs

)

PCIe

User Logic

PCIeStatic Power

Xilinx PCI HistoryXilinx has been at the forefront of the PCI/PCI-X/PCIe technology. Significant achievements include:

1996 – Industry’s first PCI core for FPGAs

1999 – Industry’s first 64-bit, 66-MHz PCI solution

2000 – Industry’s first 64-bit, 133-MHz PCI-X solution

2003 – Industry’s first PCIe solution

2005 – Industry’s first PCIe PIPE solution – Xilinx + NXP Semiconductors

2006 – Industry’s first FPGA Express Card solution – Xilinx + NXPSemiconductors

2006 – Industry’s first FPGA with built-in PCI Express Endpoint block

Figure 3 – Power and area savings of a high-performance Virtex-5 LXT PCIe solution


evolving phases. Designing withVirtex-5 LXT PCIe Endpoint blocksenables you to scale from 1- to 4- to8-lane link-widths in the sameVirtex-5 family. This allows you tofuture-proof the system and theequipment. In addition, becausePCIe is inherently compatible withlegacy PCI and PCI-X architectures,scaling and designing Virtex-5 LXTFPGA-based PCIe solutions will pre-serve software investments andextend infrastructure life.

• Form factors supported. Virtex-5LXT RocketIO GTP transceiversoffer significant power advantagesover competing FPGA/ASSP solu-tions. This enables designers to con-sider Virtex-5 FPGAs in new markets.You can use the inherent advantagesof the 65-nm FPGA to support mul-tiple form factors by using scalablelogic density for different solutions.For example, a desktop solution in anadd-in card form factor can be scaledto a lower power Express Card formfactor using similar/identical FPGAresources. Conversely, a desktop add-in card form factor PCIe solution in aVirtex-5 LXT FPGA can be easilyscaled up to support the transition tohigh-performance form factor solu-tions like ATCA, uTCA, and serverI/O module.

Applications Form Factors Link Width Data Prominent Feature Required(Typical) Bandwidth

Enterprise HBAs, Server x1 250 Mbps High ReliabilityI/O Module x4 1 Gbps Scalability

x8 2 GbpsError RecoveryReduced Board SpaceReduced Power Budgets

Desktop Add-In Card x1 250 Mbps Legacy Support to Existing

x4 1 Gbps (PCI) Software

x8 2 GbpsReduced Board Space

x16 4 GbpsReduced Power BudgetsEcosystem ExistenceHigh Availability

Mobile Express Card, x1 250 Mbps Reduced Power BudgetsMini-Card High Reliability

Power Management Capability

Communications HBAs, ATCA, x4 1 Gbps High AvailabilityServer I/O Module x8 2 Gbps High Performance

InteroperabilityReliability

Embedded Integrated Endpoints, x1 250 Mbps Low CostPlatforms Custom Cards, Reliability

Mini-Card High AvailabilityEase of Use and Integration

PCI Express

Endpoint

PCI Express-PCI

Bridge

PCIExpress

PCIExpressPCI

ExpressPCIExpress

PCI Express

PCI Express

PCI Express

SwitchPCI/PCI-X

CPU

MemoryRoot

Complex

Legacy

Endpoint

Legacy

Endpoint

PCI Express

Endpoint

PCI Express

Endpoint

PCI ExpressFabric TopologyThe PCI ExpressFabric™ topology, referred to as a hierarchy, comprises a root complex (RC), multiple endpoints (I/O devices), a switch, and a PCI Express/PCI bridge, all interconnected through PCI Express links.

An RC denotes the root of an I/O hierarchy that connects theCPU/memory subsystem to the I/O. A root complex may support one or more PCI Express ports, for example, Intel chipset(s).

A switch is defined as a logical assembly of multiple virtual PCI-to-PCI bridge devices, which forward transactions using PCI bridgemechanisms, namely address-based routing such as the IDT PCIExpress switch.

Endpoint refers to a type of device that can be the requester or completer of a PCI Express transaction, either on its own behalf or on behalf of a distinct non-PCI Express device, for example, a PCI Express-attached graphics controller. Figure 4 – PCI Express topology

Table 2 – Virtex-5 LXT PCIe Endpoint applications




Virtex-5 LXT FPGAs with built-inPCIe Endpoint blocks can easily bedesigned in all form factor applications, asshown in Table 2.

Figures 5 and 6 outline applications usingVirtex-5 LXT PCIe Endpoint block capa-bilities to aggregate multiple-source trafficand bridge protocols to PCI Express.

ConclusionThe Virtex-5 LXT platform, with built-inPCIe Endpoint blocks and RocketIO GTPtransceivers, offers great value by providinga full-featured, fully compliant PCIe solu-tion. Say goodbye to IP licensing and helloto lower power and fewer utilized logicresources. You can achieve significant costsavings by targeting smaller FPGA deviceswith 50% of the power of soft-IP alterna-tives. Built-in hard blocks deliver guaran-teed functionality and ease of use byreducing design time.

Thus, the Virtex-5 LXT platform offersunique built-in PCIe capabilities in a fast,low-power, 65-nm FPGA, launching a newera of efficient PCIe system development.

CPU

EmbeddedGraphicsController

EthernetController

HDD

PCI / PCI-XEndpoints

GbE x1 PCIe

x8 PCIe

x8 PCIe

DDR2

FSB

x4 PCIe

x8 PCIe

x8 PCIe

x16 PCIe

x4 PCIe

SATA

Xilinx Virtex-5 LXT FPGAs

PCI Express Link

PCI Express Endpoint Block + GTP Transceiver

Tri-Mode Ethernet MAC Block + GTP Transceiver

10/100

Memory ControllerHub (MCH)

PCIe Root Complex

PCIe Switch

I/O Controller Hub(IOCH)

CPU

Dual-ChannelDDR2 Memory

Add-InExpansion Slot(s)

PCIe-PCIBridge

PCIe - InfiniBandBridge

InfiniBand

HD VideoStreams

AudioStreams

InfiniBandSwitched

Fabric

HD Video Source

HD Audio Source

FPGA-BasedDSP Acceleration

System

Card

Switch

Card

SONET Line

CardGbE Line

Card

Root ComplexCPU

Dual-ChannelMemory

DDR2 QDR

Memory

x4 PCIeBackplaneLinks

SONETLinks


Memory

PCIeSwitch

Memory DSP

Optics SFI toSPI

Bridge


Optics

Optics

Tri-Mode Ethernet MAC PCIe Link PCIe Endpoint

EthernetLinks

ControlFPGA TOE + TM

NPUFabric

InterfaceController

GbE to SPIBridge

NPU

x4 PCIeBackplaneLinksFabric

InterfaceController

LegacyPCI EP

Figure 5 – PCIe in a communication system

Figure 6 – PCIe in a high-end desktop/server system


by Nick McKaySenior Design Engineer Xilinx, [email protected]

Soma PotluriSenior Design ManagerXilinx, [email protected]

Stuart Nisbet Senior Design ManagerXilinx, [email protected]

Ethernet is the dominant wired connectivitystandard. The Xilinx® Virtex™-5 Ethernetmedia access controller (Ethernet MAC)block provides dedicated Ethernet functional-ity, which together with Virtex-5 RocketIO™GTP transceivers and SelectIO™ technologyenables you to connect to a wide variety ofnetwork devices. The Ethernet MAC block isintegrated into the FPGA as a hard block inVirtex-5 devices.

The Ethernet MAC is available in theXilinx design environment as a library prim-itive, named TEMAC. The primitive con-tains a pair of 10/100/1000 Mbps EthernetMACs. Each Virtex-5 LXT device containsfour Ethernet MAC blocks; thus, a Virtex-5LXT design can incorporate two TEMACprimitives. Using standard Xilinx products,you can create a range of customized packetprocessing and network end-point products.Xilinx has also provided an overclocking

mode to enable backplane connectivity atspeeds as fast as 2,000 Mbps.

Xilinx developed the Virtex-5 EthernetMAC from the Virtex-4 FX EthernetMAC, making improvements in the areasof global clock usage, serial interface flexi-bility, and software control complexity.

In this article, we’ll review the featureset of Ethernet MAC blocks in Virtex-5devices. We’ll also describe the differencesbetween Virtex-5 and Virtex-4 FXEthernet MACs, illustrate some potentialapplications, and describe how to usestandard Xilinx tools to integrate anEthernet MAC into your design.

Supported InterfacesThe Virtex-5 Ethernet MAC is fullycompliant to the IEEE802.3 specifica-tion. Figure 1 shows a block diagram ofthe Ethernet MAC.

Physical InterfacesYou can independently configure thephysical interface of each Ethernet MACto operate as one of five differentEthernet interfaces.

The Media Independent Interface(MII), Gigabit Media IndependentInterface (GMII), and Reduced GMII(RGMII) are parallel interfaces. These aretypically connected to an external physicallayer (PHY) chip to provide BASE-Tfunctionality at 10/100/1000 Mbps.Half-duplex operation is supported at

10/100 Mbps; full-duplex operation issupported at all speeds.

Serial GMII (SGMII) and 1000 BASE-Xare serial interfaces that use the physical cod-ing sublayer (PCS) and physical mediumattachment (PMA) sections of the EthernetMAC. These interface to the Virtex-5RocketIO GTP serial transceivers. SGMII,as with the parallel interfaces, provides10/100/1000 Mbps full-duplex BASE-Tfunctionality. The serial interface signifi-cantly reduces the number of pins requiredto connect to the external PHY chip.

When the Ethernet MAC is configuredin 1000 BASE-X mode, the PCS/PMAblock, along with the RocketIO transceiv-er, provides all of the functionality requiredto connect directly to a gigabit interfaceconverter (GBIC) or small form-factorpluggable (SFP) optical transceiver. Thisremoves the need for an external PHY chipfor 1000 BASE-X network applications.

Control InterfacesThe host interface provides access to theconfiguration registers of the EthernetMAC block. Examples of configurationoptions include jumbo frame enable, pauseand unicast address settings, and framecheck sequence generation.

The host interface is accessible througheither a generic host bus or a device controlregister (DCR) bus (when connecting to aprocessor). In addition, each EthernetMAC has an optional management data


Designing with Virtex-5 Embedded Tri-Mode Ethernet MACsDesigning with Virtex-5 Embedded Tri-Mode Ethernet MACsYou can implement flexible Ethernet systems using the Virtex-5 10/100/1000 Ethernet MAC.You can implement flexible Ethernet systems using the Virtex-5 10/100/1000 Ethernet MAC.


I/O (MDIO) interface. This allows accessto the management registers of an externalPHY and to the physical interface manage-ment registers within the PCS/PMA sec-tion of the Ethernet MAC.

Client InterfaceFrames are passed to the Ethernet MACacross the transmitter client interface. Thetransmitter pads the incoming data when itis less than the minimum Ethernet framelength and maintains the minimum inter-frame gap between frames; however, youcan increase the size of the gap. You canalso configure the transmitter to add aframe-check sequence to the frame. A sep-arate flow control interface allows you togenerate pause frames. In half-duplexmode, the transmitter signals collisions andrequests retransmissions for valid collisions.

The receiver interface verifies incomingframes and signals frame errors. Good andbad frame signals are provided. You canalso configure the Ethernet MAC to pauseand restart frame transmission upon thedetection of valid pause frames.

The data on the client interface is 8 or 16bits wide. The 8-bit interface is used for stan-

DCR Bus AddressingThe Virtex-5 DCR interface now featuresan individual base address for each of theEthernet MACs. This makes the sharedDCR bus interface transparent to softwaredrivers. The software no longer needs toknow the bit locations for individualEthernet MACs; the hardware automatical-ly multiplexes in the correct bits dependingon the base addresses.

Serial Interface ChangesXilinx made several changes to the operationof the serial interfaces. Auto-negotiation isnow more flexible with the inclusion of aprogrammable link timer. You can alter thetiming of the auto-negotiation process andreduce simulation time.

A newly added unidirectional mode per-forms the unidirectional enable functionfrom the IEEE802.3ah-2004 specification.When enabled, the Ethernet MAC transmitsregardless of whether valid input is presentat the receiver.

Finally, loopback can now take place inthe Ethernet MAC as well as in the trans-ceiver. This enables the transmission of idlesto the link partner while in loopback, ensur-ing that the link remains active.

Virtex-5 Ethernet MAC Use ModelsThe versatility of the Virtex-5 EthernetMAC enables its use in a wide variety ofapplications. For example, you can:

• Attach the Ethernet MAC to a processor running a protocol stack innetwork processing or remote moni-toring systems, as shown in Figure 2.

• Interface the Ethernet MAC to a packetprocessing system implemented in theFPGA, such as a checksum offload engineor remote direct memory access design.

• Connect multiple Ethernet MACs todedicated packet FIFOs and externalmemory for packet storage, bridging, orswitching applications.

Tools and IP SupportXilinx provides support for the EthernetMAC through CORE Generator software,LogiCORE™ IP, and reference designs.

dard Ethernet applications, giving a 1,000Mbps data rate with a 125 MHz clock.Using the 16-bit mode, you can increase thedata rate to 2,000 Mbps without anyincrease in clock speed at the client interface.

Each Ethernet MAC outputs statisticsvectors containing information about theEthernet frames seen on its transmit andreceive datapaths. An external statisticsmodule is freely available in XilinxCORE Generator™ software. The statis-tics module accumulates all of the Tx and Rxdatapath statistics of each Ethernet MAC.

New Features in the Virtex-5 Ethernet MACIn Virtex-4 FPGAs, implementing just thedatapath consumes as many as four globalclock buffers: one each for the Tx and Rxclient interface logic, and one each for theTx and Rx physical interface logic. ForVirtex-5 FPGAs, Xilinx added a clock-enable feature. You can use the clocksderived for the physical interface for all ofyour client logic. The internally generatedclock enable provides a way to maintain thecorrect data throughput on each of theinterfaces. This reduces the number of nec-essary clock buffers by 50%.


EMAC1

Host Interface

EMAC0

PCS/PMA1

PCS/PMA0

MDIO1

MDIO0

GMII, MII, or RGMIIto External PHY viaSelectIO Interface

PCS/PMA SGMIIto RocketIO Transceiver

MDIO to External PHY

MDIO to External PHY

PCS/PMA orSGMIIto RocketIO Transceiver

GMII, MII, or RGMIIto External PHY viaSelectIO Interface

EMAC0 Client Interface

Generic Host Bus

EMAC1 Client Interface

Clocking1

Clocking0

DCRBridge

DCR Bus

FPGA Fabric

Rx Stats MUX0

Tx Stats MUX0

Rx Stats MUX1

Tx Stats MUX1

Statistics Block

Statistics Block

Figure 1 – Block diagram of the Virtex-5 Ethernet MAC


Virtex-5 Ethernet MAC WrappersFigure 3 shows a block diagram of theHDL wrappers available from the XilinxCORE Generator tool.

The Ethernet MAC is a complex com-ponent with 162 ports and 79 parameters.Wrapper files enable you to easily set theparameters and interface only to thoseports required for your application. Theyalso offer benefits in simplifying the use ofclocking and physical I/O resources.

The different levels of hierarchy enableyou to extract the correct wrapper foryour application.

• Ethernet MAC Wrapper. In the lowestlevel, a single or dual Ethernet MAC isinstantiated and its attributes are set to your preferred selection in theCORE Generator GUI. All of theunused input ports are tied to groundand the output ports are left open.

• Block Level Wrapper. In the next levelof hierarchy, the physical interfacesand the required clock resources areinstantiated. This includes theRocketIO GTP transceivers for theserial interfaces. Clocking is also opti-mized for your configuration, and youcan clock the output to your design.

• LocalLink Level Wrapper. In thislevel, FIFOs are added to the clienttransmitter and receiver interfaces.The FIFOs handle the dropping ofbad frames on reception and retrans-mission of frames in half-duplexmode. LocalLink is used as the back-end interface.

• Example Design Wrapper. The toplevel features a demonstration designwhere the received data is looped backand sent to the transmitter. You candownload this design to a board andstimulate the receiver from a networkdevice to demonstrate the operation of the Ethernet MAC in hardware.Testbenches that stimulate receiverinput and monitor the transmitter out-put of the design are also included inthe CORE Generator software.

LogiCORE IP and Reference DesignsMost of the existing Virtex-4 Ethernet MACdocumentation is reusable with the Virtex-5Ethernet MAC. For example, a version of the“Ethernet Cores Hardware DemonstrationPlatform” (XAPP443, www.xilinx.com/bvdocs/appnotes/xapp443.pdf ) will be avail-able for the Virtex-5 Ethernet MAC.LogiCORE IP, such as Ethernet statistics,already supports the new architecture.

ConclusionThe Virtex-5 Ethernet MAC provides a cost-effective solution for a wide rangeof network interfaces, enabling you toconnect to BASE-X and BASE-T net-works at 10/100/1000 Mbps. Xilinx soft-ware tools and IP also allow you to takeadvantage of the improved feature set ofthe Ethernet MAC.

For more information, visit theVirtex-5 links on the Xilinx website,www.xilinx.com/virtex5/.


MasterAttachment

FPGA Fabric

SlaveAttachment

DMA ReadPacket FIFO

WritePacket FIFO Register,

SRAM, and Interrupt Interfaces

Virtex-5Ethernet MAC

Rx

HostInterface

Tx Sel

ectIO

Inte

rfac

e or

Roc

ketIO

Tra

nsce

iver

Ext

erna

l PH

Y

Client Receive

Client Transmit

Interface to Processor

Address

Swap

Module

Tx Client

FIFO

10 M/100 M/1 G

Ethernet FIFO

Lo

ca

lLin

k I

nte

rfa

ce

Host

Interface

DedicatedEthernet MAC

Address

Swap

Module

Tx Client

FIFO

10 M/100 M/1 G

Ethernet FIFO

Lo

ca

l L

ink I

nte

rfa

ce

Physical I/F

Rx Client

FIFO

Rx Client

FIFO

Physical I/F

EMAC1

EMAC0

FPGA

Fabric

Clock

Circuitry

Physical

InterfaceClient

Interface

LocalLink Level WrapperBlock Level Wrapper

DedicatedEthernet MAC

Wrapper

Example Design

(GMII/MII,

RGMII,

or

RocketIO

Transceiver)

(GMII/MII,

RGMII,

or

RocketIO

Transceiver)

Figure 2 – MAC connected to a processor on the Virtex-5 FPGA

Figure 3 – Block diagram of the Virtex-5 Ethernet MAC wrappers


by Gregg C. Hawkes Principal Engineer, Advanced Products Division Xilinx, Inc. [email protected]

Reed TidwellSenior Staff Applications Engineer, Advanced Products DivisionXilinx, [email protected]

John F. SnowSenior Staff Applications Engineer, Advanced Products DivisionXilinx, [email protected]

The diversified uses and ever-changinginnovations for digital video and audio con-tinue to drive the fast-paced proliferation ofequipment for audio, video, and broadcast(AVB). Today’s AVB equipment demandsbetter image quality, higher resolutions,higher bandwidths, more audio/videochannels, and the combining of previous-ly separate but related functions such asHD-SDI, audio multiplex, audio demul-tiplex, and asynchronous sample-rateconversion (ASRC).


Asynchronous Sample-Rate Conversion Between AES Audio Streams

Asynchronous Sample-Rate Conversion Between AES Audio StreamsXilinx Virtex-5 FPGAs provide the perfect platform for implementing AES digital audio sample-rate conversion.Xilinx Virtex-5 FPGAs provide the perfect platform for implementing AES digital audio sample-rate conversion.


Xilinx® FPGAs have kept pace with cus-tomer integration needs by incorporating sil-icon features, facilitating the absorption ofless integrated, complex, and expensiveASSP chips. One such ASSP chip function,ASRC, can be integrated into Xilinx FPGAsby leveraging diffused silicon features knownas DSP48E slices and block RAMs to buildsophisticated filter functions.

Free Xilinx application notes and refer-ence designs have also kept pace with ourcustomers’ need to integrate sophisticatedalgorithms. The ASRC reference design cor-rectly handles synchronous sample-rate con-version and the far more complex ASRCcalled for in most audio/video applications.

Simpler “synchronous-only” methods,offered by many ASSP chips and FPGA IPsuppliers, can be smaller in terms of utiliza-tion per audio channel; however, whenapplied incorrectly to asynchronous applica-tions, these methods have one or both of thefollowing artifacts:

• The input-to-output latency changesbecause of accumulating delay

• Artifacts are produced in the audio,such as skipping samples or repeatingsamples

Both of these cases represent undesirabledistortions.

Understanding Sample-Rate ConversionBefore diving into the theory of digital sam-ple-rate conversion, you should look at thebasic types of problems audio/video engi-neers are trying to solve. A few applicationsexist where you could use a fixed-rate syn-chronous conversion, such as a 48-KHzinput converted to a 44.1-KHz output usingthe same clock source, or an output clockderived from the input clock. However, morelikely is the asynchronous case, where inputand output clocks are completely independ-ent, such as two boards communicatingaudio between them. The different clockoscillators can be the same nominal frequen-cy but several parts-per-million different.

The Xilinx ASRC reference design forthe asynchronous case of independentinput and output clocks provides twoimportant and difficult design functions:

adjusts on the fly, maintaining high per-formance with no special attention neededfor input and output clocks. You can verifyall of this with the IP running the XilinxML571 Serial Digital Video demonstrationboard shown in Figure 1.

Best of all, the broad functionality andhigh-performance ASRC IP is free.

Sample-Rate Conversion TheoryFigure 2 illustrates conceptually the gen-eral case of up or down conversion. Theconversion ratio can vary continuously byrational numbers with fractional values.The diagram shows the up conversionprocess (creating many more samples andtime positions to choose from) followedby down conversion (judiciously choos-

• Automatically and accurately moni-toring the input-to-output ratio andsample-rate changes

• Adapting the filter function (filtercoefficients) on the fly to provide thehighest performance

Supporting ASRC for digital audio withan FPGA means that you can now signifi-cantly save costs for every SDI interface inyour system – and in many systems, thereare many channels.

The Xilinx ASRC IP is very high per-formance, with a worst-case input-to-out-put signal-to-noise ratio of -125 dB. It alsosupports conversion for multiple audio-input frequencies to multiple audio-outputfrequencies. The rate conversion algorithm

Input Samples at fs in

Output Samples at fs out = fs in * L/M

Low PassAnti-Imaging /

Anti-Aliasing Filterg(n)

Upsampler

L

Downsampler

M

Figure 2 – Classic conceptual data path for sample-rate conversion

Figure 1 – ML571 board and frame synchronization demonstration board with an ASRC to match the output digital audio rate to the output digital video rate.




ing the samples in the output datastreamthat most closely match the positions of thedesired samples). The anti-imaging/anti-aliasing filter in the center of the data pathensures that the spectral content is less thanhalf the Nyquist rate of both the input andoutput sampling frequencies.

Figures 3 and 4 show that for every out-put sample location or output phase, a dif-ferent set of sub-filter coefficients is required

because the inputs are in different locationsrelative to that output phase. The sub-filter,having a set of coefficients that align withthe input sample positions, is formed byinterpolating the prototype filter coeffi-cients. When this sub-filter is convolvedwith the corresponding input samples, theoutput sample of interest is produced. Thisprocess repeats, with new sub-filter coeffi-cients interpolated for each output sample.

ASRC Example Implemented on the ML571The simple function known as frame syn-chronization of video provides a great demon-stration of where you might use ASRC. Videocan be stored in a frame buffer at some rateand removed at a fractionally different rate.This process can be useful if two pieces ofvideo equipment are not “genlocked” togeth-er and operate at different pixel rates.

The result is the occasional need to add ordrop a frame of video data. Your eye proba-bly would not notice an added or droppedframe of video on your TV screen, but thehuman ear is very good at detecting discrep-ancies in added or dropped audio. The solu-tion is to remove the audio from the startingvideo stream and reinsert it in the resultingvideo stream at a fractionally different rate,matching the output audio rate to the newoutput video rate. The Xilinx ASRC refer-ence design is perfect for this task.

As an example, let’s connect two boardswith SDI video running at slightly differentfrequencies because of the different clockoscillators on each board. The receiving boarddemultiplexes the embedded AES digitalaudio from the video stream and sends it tothe ASRC. The difference in clock frequencybetween the two boards causes the framebuffer synchronization logic to add or dropvideo frames. The ASRC adjusts the de-embedded audio to match the output videostream clock rate, where it can then be re-embedded into the output SDI video stream.The difference in clock frequencies betweenthe two boards causes the frame buffer syn-chronization logic to add or drop videoframes. The ASRC adjusts the de-embeddedaudio to match the output video stream clockrate, where it can then be re-embedded intothe output SDI video stream.

For more information about framebuffer synchronization and asynchronoussample-rate conversion techniques, seeXAPP514, “Audio/Video ConnectivitySolutions for the Broadcast Industry,” atwww.xilinx.com/bvdocs/appnotes/xapp514.pdf.

Input Samples

Output Sample Times

Original Samples

Output Samples

Interpolated Samples

See Inset

Inset

Output Samples

Time

X

Prototype Filter

Output Sample Position

Input Sample Positions

Resulting Sub-Filter

Figure 4 – Prototype filter centered at output sample position

Figure 3 – Output sample position relative to the original sample position dictates which interpolated samples to use.

The ASRC adjusts the de-embedded audio to match the output video stream clock rate, where it can then

be re-embedded into the output SDI video stream.



Block Diagram and Specification HighlightsThe simple diagram in Figure 5 illustratestwo key design elements required inASRC. The first element is determiningthe changes between input sample rate andrequired output sample rate, labeled “ratiocontrol.” The second element within the“re-sampler” is a set of prototype filtersthat are modified depending on the statis-tics reported by the ratio control.

The ASRC reference design convertsstereo audio from one sample frequencyto another. The input and output samplefrequencies can be an arbitrary fractionof one another or the same frequency,but based on different clocks. The out-put is a band-limited version of the inputre-sampled to match the output sampletiming. The reference design has the following features:

• Fully asynchronous operation

• Expandable to multiple channels

• A -125 dB THD+N worst case with -130 dB THD+N typical

• A 24-bit audio word width in andout, with 31-bit internal math preci-sion and round away from zero

• Automatic input-to-output sampleratio monitoring with continuous filter modification

• Continuous rational/fractional ratio,up conversion, 8:1

• Continuous rational/fractional ratio,down conversion, 1:7.5

• Continuous input-to-output rate mon-itoring with adaptive filtering

• Input/output rates 8 KHz-192 KHz,continuous

• Low deterministic latency

The reference design has an interpolatedcoefficient FIR filter coded with Virtex™-5DSP48E slices as the primary math ele-ment and block RAM for input samplebuffers and prototype storage.

ConclusionThe need to maintain different input-to-out-put audio rates for varying numbers of digi-tal audio channels and support new AVBfunctions is a tremendous challenge. Throwin varying protocols, memory management,different sized payloads, and a variety of dif-ferent system interfaces, and it is easy to seehow these designs require high-performance,cost-effective flexibility that ASSPs andASICs cannot offer. These challenges openup opportunities for Virtex-5 devices becausethese devices can enable equipment vendorsto provide solutions to the ever-evolvingAVB equipment landscape.

Input Sample

Clock

Radio Control

Re-SamplerAsynchronous

Sample-Rate

Converter

Ratio

Detection Filter Phase Interpolation

Coefficient

Memory

FIR FilterOutput

Samples

Input

Sample

Storage

Output Sample

Clock

Input Samples

Figure 5 – Xilinx ASRC reference design top-level block diagram

GetonTarget

Is your marketingmessage reachingthe right people?

Hit your target audience by advertising your product or service

in Xcell Journal. You’ll reach more than 30,000 engineers, designers,

and engineering managers worldwide.

We offer very attractive advertising rates to meet any budget!

Call today: (800) 493-5551 or e-mail us at

[email protected]


by Gregg C. Hawkes Principal Engineer, Advanced Products Division Xilinx, [email protected]

Reed TidwellSenior Staff Applications Engineer, Advanced Products DivisionXilinx, [email protected]

John F. SnowSenior Staff Applications Engineer, Advanced Products DivisionXilinx, [email protected]

At Xilinx, we understand the challengesthat broadcast system designers are fac-ing. The number of emerging new stan-dards for video connectivity createsdifficult design challenges and schedulesfor broadcast products.

With the ever-changing video connec-tivity landscape prevalent throughout thebroadcast chain, our goal is to offer help inthe form of free reference designs, formingdrop-in building blocks that can solvemany system-level video connectivityissues. By providing you with cost-effectiveand highly integrated solutions comparedto ASSP chips, Xilinx hopes to get you tomarket faster, lower costs, and differentiateyour product from the competition.

Our video connectivity IP and referencedesign book, “Audio/Video ConnectivitySolutions for the Broadcast Industry”(www.x i l i nx . c om/bvdo c s / appno t e s /xapp514.pdf), includes chapters about SDI,HD-SDI, DVB-ASI, AES embeddedaudio, and audio-asynchronous sample rateconversion. Each chapter describes a specif-ic video connectivity topic and links to freereference designs in Verilog and VHDL,providing implementation examples.

Integrating the encoders and decodersfor these standards into the FPGA is simplewith the clear, concise reference materialfound within the chapters of XAPP514.The reference design code, offered in bothVerilog and VHDL, is clearly documentedand illustrated, as shown in Figure 1.

We also offer a suite of validation plat-forms that can quickly and easily testyour video processing algorithms or veri-fy connectivity performance. For exam-ple, you can use our new Xilinx®

Virtex™-5 ML571 Serial Digital Video(SDV) board (www.cook-tech.com) todemonstrate or develop video connectivi-ty with Virtex-5 FPGAs. Figure 2 shows ablock diagram; Figure 3 is a photographof the ML571 board. Many of the freereference designs linked to XAPP514’schapters were verified on the ML571platform using broadcast industry-stan-dard test equipment.


Implementing Integrated Video Connectivity Solutions with Virtex-5 LXT Devices

Implementing Integrated Video Connectivity Solutions with Virtex-5 LXT DevicesXilinx Virtex-5 FPGAs provide the perfect platform for integrating broadcast video solutions inside a single chip.Xilinx Virtex-5 FPGAs provide the perfect platform for integrating broadcast video solutions inside a single chip.


“The ML571 board is yet anotherexample of how Xilinx provides cus-tomers with detailed design assistance forreal broadcast industry issues,” said AndyDeBaets, senior director, systems andapplication engineering at Xilinx. “This

Talk to your Xilinx sales channel aboutseeing the demonstrations or obtaining oneof these boards so that you can test yournew algorithms long before your propri-etary board is produced. We hope you findthis article and the audio/video connectivitybook valuable, but it represents just a smallsample of the information available aboutdesigning with Xilinx programmable logicdevices. To access the latest information on these subjects and more, visit www.xilinx.com/esp/broadcast.

Virtex-5 Features Support Broadcast DesignsThe Virtex-5 feature set supports manyaspects of broadcast solutions by providinghigh performance, flexibility, and scalabilitywith unique, cost-optimized family mem-bers built on the following features:

• High-density, high-speed, reprogram-mable ExpressFabric™ technology

• 550-MHz, 36-Kb, dual-port blockRAM/FIFO

• 550-MHz, 25 x 18 DSP48E slice

• 550-MHz clock management tile (CMT)

• SelectIO™ technology

• Reduced power consumption

• Sparse chevron package

These features are described throughoutthe articles in this issue of Xcell Journal,with detailed descriptions of the featuresand performance at www.xilinx.com/products/silicon_solutions/fpgas/virtex/virtex5/index.htm.

Overview of the Xilinx ML571The new serial digital video (SDV) boardfor demonstrating and testing high-band-width video communications channelsbased on Xilinx Virtex-5 platform FPGAsshows you how to easily implement high-speed serial interfaces to popular industrystandards like HD-SDI.

Standards and Functionality SupportedThe diffused silicon integration of high-per-formance and low-power multi-gigabit serialI/O, tri-mode Ethernet MACs, PowerPC™processor, and PCI Express Endpoint block

board demonstrates how engineers caneasily implement advanced video net-working protocols while greatly increas-ing system integration, reducing systemcosts, lowering power, and shorteningdesign schedules.”

SDIDriver

SDIBitstream

SDI VideoEncoder

SDI VideoDecoder

SDIReceiver

VideoStandardDetect & Flywheel

VideoStandardDetect & Flywheel

SDI ANC & EDHProcessor

SDI ANC & EDHProcessor

Digital VideoTest PatternGenerator

DigitalVideo

AncillaryData

Data

Data

Clock

AncillaryData

DigitalVideo

Select IO LDVS

GTP Transceiver

SD-SDI or ASI Out

HD/SD-SDI or ASI Out

HD/SD-SDI or ASI Out

Video Sync Input

Video Sync Output

Audio VCXO

Digital Audio Out

(2 Stereo Pairs - 2 BNCs)

2 RJ-45 Connectors

4 Pairs

+12V Power

SD-SDI or ASI In Multi-Rate

Equalizer

GTP Transceiver

GTP Transceiver

GTP Transceiver

Virtex-5

FPGA

GTP Transceiver

GTP Transceiver


Equalizer


Equalizer

148.35-MHz

Low Jitter XO

148.5-MHz

XO

Clock

Module

148.35- / 74.1758-MHz

VCXO

AES/EBUAudio In

64-MBDDR

125- / 200-MHz XO

5X & 10X

PLL

2 GLCKs

GLCK

GLCK

4 LDVS Pairs

32 Select IO

Interface

GLCK

GLCK

27-MHz

VCXO

135-MHz & 270-MHz VCXO

XGI

Daughtercard Connector

Sync

Separator

DAC

GLCK

27.576-MHz

VCXO

33-MHz XO

GLCK

LVDS

SystemACE

Interface

JTAG

RS-232Rx/Tx

DB9

JTAGHeader

CompactFlash

Digital Audio In

(2 Stereo Pairs - 2 BNCs)

133- / 166-MHz XO

AES/EBUAudio Out

10/100/1000Ethernet

ML410 PersonalityModule Connectors

DAC

DAC

16 Pairs

Switches

LEDs

Figure 2 – Xilinx ML571 SDV video connectivity board block diagram

Figure 1 – Example block diagram of free modular Verilog and VHDL reference designs




into Virtex-5 platforms has enabled thesupport of many more networking stan-dards than previously possible.

The ML571 board now supports:

• Virtex-5 XC5VLX50T-FF1136 FPGAs(LX110T offered in a pin-compatiblepackage)

• Two RocketIO™ GTP HD/SD-SDIreceivers and two RocketIO GTPtransmitters. The transmitters haveGennum tri-mode, 3 Gbps-capablecable drivers and the receivers haveGennum tri-mode, 3 Gbps-capablereceiver equalizers. The standards sup-ported are:

• 3 Gbps HD-SDI (SMPTE424M),2.97 Gbps

• HD-SDI dual link (SMPTE372M)1.485 Gbps, 1.4835 Gbps

• HD-SDI (SMPTE292M) 1.485 Gbps, 1.4835 Gbps

• SD-SDI (SMPTE 259M), 270 Mbps

• DVB-ASI (CENELEC EN 50083-9Annex B), 270 Mbps

• A SelectIO video input and video out-put providing differential LVDS I/O.This demonstrates the ability of theVirtex-5 SelectIO interface to transmitand receive video bitstreams supportingthe following video standards:

• SD-SDI (SMPTE 259M), 270 Mbps

• DVB-ASI, 270 Mbps

• Select IO technology, LVDS, AES3digital audio (AES3id) I/O. Two BNCinput connectors provide two stereopairs of AES3id digital audio in. TwoBNC output connectors provide twostereo pairs of AES3id digital audioout. These inputs meet the SMPTE276M 75-Ohm unbalanced AES3audio input electrical specifications.

• SDI AES digital audio, embed and de-embed (SMPTE272M-2004)

• AES digital audio, high-performance,asynchronous sample-rate conversion(ASRC)

• DVB-ASI to/from Ethernet for videoover IP

• Frame synchronization using externalDDR DRAM

• Sync separator and genlock capability.A sync separator can accept a varietyof video sync sources including bi-level and tri-level video sync (HDand SD). The separated sync signalsfrom the sync separator go to theFPGA, where they can be used tobuild genlock PLLs using any of the VCXO clock sources available to the FPGA.

• An XGI-compatible expansion connector set is provided to allowvideo I/O daughtercards

• Two 10/100/1000 Ethernet interfaces

• Debug RS-232 serial port

• Configuration six-pin JTAG header for connection to a Xilinx downloadcable

• Xilinx System ACE™ configurationcontroller with a CompactFlash Type II socket

ConclusionThe need to support new AVB designsand assist you, our customers, withimplementations in Xilinx FPGAs is atremendous challenge. However, atXilinx, we pride ourselves in striving tokeep up with the demand for excellence.With varying protocols and a variety ofdifferent system interfaces, it is easy to seehow these designs require high-perform-ance, cost-effective flexibility that ASSPsand ASICs cannot offer. These challengesopen up opportunities for Virtex-5devices, for these devices can enable youto provide solutions to the ever-evolvingAVB equipment landscape.

The ML571 board is designed and soldby Cook Technologies. The CookTechnologies part number for the ML571 isCTXIL406. There are many clock and con-nectivity option daughtercards that alsoplug into the ML571 SDV board. For moreinformation, e-mail [email protected] orvisit their website at www.cook-tech.com.

Figure 3 – Xilinx ML571 SDV video connectivity board


by Anthony CollinsStaff Product Marketing EngineerXilinx, [email protected]

The telecommunications industry demandshigh availability; when you pick up the tele-phone, you expect to hear a dial tone. Asbroadband providers start to compete forvoice and video (with the deployment of so-called “triple-play” services), customersexpect the same high availability.

High availability is only possible bybuilding redundancy into the hardwarethat makes up the system. However, toeffectively manage this redundancy, thesystem must be able to monitor its ownoperating conditions and switch to back-up hardware in the event of a failurebefore the customer notices any down-time. Close monitoring of the physicalenvironment allows for preemptive actionin the event of a failing component. Thisinvolves monitoring the physical environ-ment inside the chassis, using various sen-sors to record such variables astemperature, supply voltages, humidity,and cooling performance.

FPGAs are important building blocks inhigh-availability infrastructure. Therefore,the on-chip environment of the FPGA andits immediate surroundings within the sys-tem should be carefully monitored. TheXilinx® Virtex™-5 System Monitor facili-tates easier monitoring of the FPGA and itsexternal environment.


Enhancing System Management and Diagnostics with the Virtex-5 System Monitor

Enhancing System Management and Diagnostics with the Virtex-5 System MonitorYou can use the Virtex-5 System Monitor to greatly increase environmental monitoring coverage of your FPGA design.You can use the Virtex-5 System Monitor to greatly increase environmental monitoring coverage of your FPGA design.


Virtex-5 System Monitor The Virtex-5 System Monitor allows youto easily access information about theFPGA on-chip (die) temperature andpower supply conditions. The system mon-itor also provides access to external sensorinformation through external analog inputchannels (monitoring as many as 17 exter-nal sensors). Access to this informationinvolves little or no design effort, depend-ing on the required functionality.Common functionality like alarms, auto-matic channel sequencer, and data averag-ing are available within the SystemMonitor block, enabling you to develop asolution easily.

Figure 1 shows a block diagram of theVirtex-5 System Monitor. The system mon-itor is built around a 10-bit, 200 kilosample-per-second analog-to-digital converter(ADC). The analog input range of the ADCis 0V-1V. At a resolution of 10 bits, theADC can resolve an input voltage to anaccuracy of approximately 1 mV.

As shown in Figure 1, both the on-chipsensors and external analog input channelsare connected to the ADC input using ana-log multiplexers. Therefore, the outputvoltages from various sensors must besequentially converted to a digital word bythe ADC. These measurement results arewritten to status registers, where they areeasily read using the FPGA fabric, or exter-nally through the FPGA and PC boardJTAG infrastructure. The System Monitorcontrol registers can be written or read

Checking the CheckerUsing the Virtex-5 System Monitor to pro-vide accurate and reliable environmentalinformation requires reliability checks onthe measurement data and system monitoroperation. The System Monitor has a num-ber of features that help to confirm reliableoperation. Built-in auto-calibration of theADC and sensors correct any drift in theanalog measurement system because of theoperating environment. Self-check featuresalso allow the system host to monitor theoperation of the System Monitor.

Leveraging System Monitor JTAG AccessA novel feature of the Virtex-5 SystemMonitor is the ability to access the full func-tionality of the block using the JTAG TAP.By enabling analog testing and access toanalog information, you can obtain greatervalue and efficiencies using the existingJTAG infrastructure in the system. Thisaccess is available before configuration ofthe FPGA for use as part of a PC board testscheme in production, or during normaloperation to facilitate a debugging effort.

To facilitate off-chip measurements suchas supply voltages and currents on the PCboard, you can use special JTAG commandsto enable external analog inputs beforeFPGA configuration. Even after FPGA con-figuration, the System Monitor does notrequire an explicit instantiation in yourdesign, thereby allowing full access to its fea-tures for debugging work through the JTAGTAP, even at a late stage in the design process.To ensure the availability of the SystemMonitor, the only requirement is that thecorrect PC board support must be in place.This involves the connection of an external2.5V reference IC as described in the SystemMonitor User Guide (www.xilinx.com/bvdocs/userguides/ug192.pdf).

Figure 2 illustrates a typical diagnosticapplication where the physical operatingenvironment of the FPGA is monitoredduring normal operation. In the exampleillustrated in Figure 2, the System Monitoris used to look at the voltage (IR) drop inthe power distribution system (PDS) dur-ing a period of heavy current demand start-ing at time t0. The temperature of theFPGA is also monitored during this period

using the same interfaces. The controlregisters configure the System Monitoroperation (for example, selecting sensorchannels for measurement, programalarm limits, and sensor averaging). TheSystem Monitor is fully functional short-ly after power-up and does not requirethe FPGA to be configured for correctoperation. By default, only the on-chipsensors are monitored after power-up;however, you can also enable externalanalog inputs. Measurement informationcan only be accessed through the JTAG testaccess port (TAP) before configuration.

User AlarmsOne of the useful built-in features of theSystem Monitor is its ability to generatealarm signals for the on-chip sensors. As adesigner, you can specify the threshold lim-its for these alarm signals. The SystemMonitor can autonomously monitor thesensors and alert the system only when analarm condition is detected.

The System Monitor also contains a fac-tory-set alarm condition called over tem-perature (OT). If you enable this feature,the System Monitor can request a full chippower-down if a die temperature greaterthan 125° C is detected. Chip power-up isinitiated after the die has cooled to a levelthat you specify. The System Monitor con-tinues to operate and monitor the on-chipsensors during chip power-down.

The OT functionality is disabled bydefault and must be explicitly enabled.


External

Sensor

Inputs

On-ChipTemperature

Sensor

On-ChipSupply Monitoring

MUX10 Bits

200 kSPSADC

JTAG TAPController

FPGA Interconnect

DynamicReconfiguration

Port (DRP)

Control Logic

Register File Interface

Status Registers

00h01h02h

03h

3Eh3Fh

Control Registers

MUX

VCCINT

VCCAUX

VREFP

VREFN

oC

40h41h42h

43h

7Eh7Fh

Figure 1 – Virtex-5 System Monitor


of high activity. Potential issues with thepower supply or PC board design can bequickly identified during development.The JTAG access also provides an easy wayto confirm that adequate cooling is in placefor a particular design. The ChipScope™Pro Analyzer provides an easy way to accessthe System Monitor; however, access caneasily be incorporated into other JTAG testand programming environments.

System Integration In addition to convenient accessthrough the JTAG TAP, full access to theSystem Monitor control and status reg-isters is also provided through the FPGAfabric. These registers can be configuredand read at any time from the fabric.Dual access to the System Monitor reg-isters by the JTAG TAP controller andfabric interface is permitted, and an

arbitration scheme is available to managepossible contention.

You can also define the contents ofthese registers when the System Monitor isinstantiated in a design and initialized dur-ing FPGA configuration. Thus, the SystemMonitor can be configured to start up in auser-defined mode of operation post-con-figuration. The fabric interface is known asthe dynamic reconfiguration port (DRP).The DRP is a parallel 16-bit synchronousdata port (similar to block RAM).

For more advanced applications wheregreater control over the System Monitoris required, the DRP allows the SystemMonitor to be easily mapped into theperipheral address space of a hard or softmicroprocessor. Figure 3 illustrates a typ-ical system management applicationwhere the MicroBlaze™ processor isrunning a protocol-like intelligent plat-form management interface (IPMI) andcommunicating with the system hostover management channels like Ethernetor even a simple UART/modem.

The System Monitor also provides animportant microprocessor peripheral inthe form of a general-purpose ADC. Thisis the first time analog peripherals likethose commonly found in microcon-trollers have been integrated into anFPGA. Full control over the ADC opera-tion is supported. The ADC offers a num-ber of sampling modes and can supportunipolar, bipolar, and full-differential ana-log input schemes.

ConclusionThe Virtex-5 System Monitor delivers agreatly simplified solution for common on-chip and external environmental monitoringneeds. Minimal development and designeffort are required to access the functionali-ty. By interfacing the System Monitor to theJTAG TAP controller, JTAG functionalityhas been extended into new applicationareas, thus enabling new test capabilities.

We would like to hear your commentsand feedback regarding any topicstouched on in this short article; in partic-ular, how our development team can bet-ter support your system monitoring andtest requirements.


10 Bits

200 kSPS

ADC

On-Chip

Peripheral Bus (OPB)

EMACPHY

Modem JTAG

TAP

Analog

Input

LAN

UART

External

Sensors

Intermediate Power Bus

POL

2.5V

VCCINT

Diagnostic SW

TCLK

TMS

TDO

TDI

VCCAUX

POL

1.0V

FPGA PhysicalEnvironment Monitoredvia JTAG TAP

VCCINT

1.01V

1.00V

0.99Vto time

VCCAUX

2.55V

2.50V

2.45Vto time

Temperature60

oC

50oC

40oC

to time

Figure 3 – System Monitor (or ADC) as a microprocessor peripheral

Figure 2 – You can access System Monitor measurements through the JTAG TAP.


©2006 Mentor Graphics Corporation. All Rights Reserved. Mentor Graphics, Accelerated Technology, Nucleus is a registeredtrademarks of Mentor Graphics Corporation. All other trademarks and registered trademarks are property of their respective owners.

Accelerated Technology, A Mentor Graphics Division [email protected] • www.AcceleratedTechnology.com

With a rich debugging environment and a small, auto-

configurable RTOS, Accelerated Technology provides the

ultimate embedded solution for MicroBlaze development.

Our scalable Nucleus RTOS and the Eclipse-based EDGE

development tools are optimized for MicroBlaze and

PowerPC processors, so multi-core development is easier

and faster than ever before. Combined with world-

renowned customer support, our comprehensive product

line is exactly what you need to develop your FPGA-based

device and get it to market before the competition!

For more information on the Nucleus complete solution, go towww.acceleratedtechnology.com/xilinx

The Ultimate Embedded Solutionfor Xilinx MicroBlaze™

by Lee HansenDesign Methodologies Sr. Marketing Manager –Horizontal Platform SolutionsXilinx, [email protected]

Xilinx® Virtex™-5 devices set a new bench-mark in FPGA functionality, with as muchas 12 times the logic capacity, 112 timesmore memory, 2 times the bandwidth, and2.5 times the performance of the leadingFPGA devices of just 8 years ago. Additionaldedicated hardware functionality likeDCM-based clock management tiles,embedded hard processors, high-speedMGTs, and DSP48E slices extend platformfunctionality to a broad spectrum of endapplications. This extreme functionalityplaces a huge demand on the design cycleand in particular the verification cycle,which tends to be the most time-consumingand time-critical phase of the design flow.The Xilinx ChipScope™ Pro software andanalyzer deliver advanced real-time debug-ging functionality to complex Virtex-5-based designs, moving you through the ver-ification cycle faster than ever before.

New FunctionalityThe functionality of the ChipScope ProAnalyzer version 8.2 has been enhancedwith Virtex-5 performance in mind. AllChipScope Pro-optimized software debug-ging cores work with Virtex-5 devices when

using ChipScope Pro 8.2 Service Pack 2 orlater versions. The debugging cores delivernew enhanced performance, supportinghigher clock speeds as fast as 500 MHz. Youcan analyze signals with greater speed andagility through advanced features like widerdata capture of up to 1,024 bits, deeper datacapture of up to 128K storage samples, andhigher density slice packing of trigger matchunit and capture control logic.

The resource estimator introduced withChipScope Pro version 8.1 lets you seehow much memory and device space thedebugging cores will take up on the chip,useful for project planning.

Another breakthrough feature isremote debugging, first introduced in ver-sion 7.1 of ChipScope Pro software.Remote debugging lets you run theChipScope Pro Analyzer and capture sys-tem through a server/client Internet con-nection. Your board can be runningremotely in the lab while you debug froman office on the other side of the buildingor the other side of the world. You canshare a single board or system in the labwith other engineers on your team orallow helpdesk personnel to debug a prob-lem remotely at a customer site, helping tolower field debugging and repair costs.

Optimized Real-Time DebuggingThe ChipScope Pro system is available as aseparately purchased option to Xilinx

ISE™ logic design software and allows youto debug Virtex-5 devices and other XilinxFPGA-based projects in real time. You canquickly find and analyze design problemswhile the chip is running on the board,interacting with the rest of the system.Then, leveraging FPGA re-programmabili-ty, design changes can be quickly imple-mented and sent back to the device onboard in a matter of minutes or hoursthrough the programming cable. Suchchanges might take days or weeks usingASIC or competing FPGA offerings.

The ChipScope Pro system also linksinternal FPGA debugging to AgilentTechnologies’ bench-top logic analyzersusing the included ChipScope Pro ATC2core. This core synchronizes the ChipScopePro system to Agilent’s FPGA DynamicProbe software, an optionally purchasedplug-in to your Agilent 1680, 1690, or16900 logic analyzer.

This unique partnership between Xilinxand Agilent delivers deeper trace memory,faster clock speeds, and more triggeroptions, all using even fewer pins on theFPGA. The advanced technology containedwithin the ATC2 core and FPGA DynamicProbe is not available in other FPGA orASIC real-time verification solutions.

For more information on the ChipScopePro Analyzer, visit www.xilinx.com/chipscopepro or contact your local sales officefor ordering information.


Real-Time Debugging for Virtex-5 FPGAsReal-Time Debugging for Virtex-5 FPGAsVersion 8.2 of the ChipScope Pro Analyzer delivers verification performance for Xilinx FPGAs.Version 8.2 of the ChipScope Pro Analyzer delivers verification performance for Xilinx FPGAs.


Now you can see inside your FPGA designs in a way that

will save weeks of development time.

The FPGA dynamic probe, when combined with an Agilent

Windows®-based logic analyzer, allows you to access

different groups of signals inside your FPGA for debug—

without requiring design changes. You’ll increase visibility

into internal FPGA activity by gaining access up to 128

internal signals with each debug pin.

Our newest 16800 Series logic analyzers offer you

unprecedented price-performance in a portable family,

with up to 204 channels, 32 M memory depth and a pattern

generator available.

Agilent’s user interface makes our logic analyzers easy

to get up and running. A touch screen or mouse makes

it simple to use, with prices to fit your budget. Optional

soft touch connectorless probing solutions provide

excellent reliability, convenience and the smallest

probing footprint available. Contact Agilent Direct today

to learn more.

www.agilent.com/find/xcell©Agilent Technologies, Inc. 2006 Windows is a U.S. registered trademark of Microsoft Corporation

• Increased visibility with FPGA dynamic probe

• Portable and modular logic analyzers

• Customized protocol analysis with Agilent's exclusive packet viewer software

• Low-cost embedded PCI Express packet analysis

• Agilent 16800 portable logic analyzers, starting at $9450

Get a quick quote and/or FREE CD-ROM with video demos showing how you can reduce your development time.

X-ray vision for your designsX-ray vision for your designsAgilent Logic Analyzers with FPGA dynamic probe

Agilent 16800 Series Portable logic analyzers

by Peter AlfkeDistinguished EngineerXilinx, [email protected]

All FPGA applications use various amountsof memory for data, parameters, andinstructions. To store from a few bits tomultiple megabytes, Xilinx® Virtex™-5devices offer a hierarchy of three differentmemory implementations:

• LUT-based distributed RAM has agranularity of 64 bits

• Block RAM has a granularity of 18 Kb

• External memory can store practicallyunlimited amounts of megabytes withthe help of on-chip memory interfaces

LUT RAMSince the early days of the XC4000, Xilinxhas made look-up tables (LUTs) available asuser RAM. In Virtex-5 devices, the LUT hasgrown to 64 bits and can be used as either64 x 1 or 32 x 2 RAM. LUT RAM (seeFigure 1) offers very fast (sub-nanosecond)access time, tight integration with thelogic fabric, and ultimate design flexibility(see sidebar, “Why 6-LUTs?”).

Multiport OptionThe four LUTs in a Slice M can share acommon write address that does not inter-fere with the read addressing of the otherthree LUTs. Together, these four LUTs canthus implement quad-port memory, withone write port and three independent read


Memories are Made of This...Virtex-5 FPGAs offer a wider range of memories and memory interfaces.

Why 6-LUTs?Xilinx invented the use of four-input LUTs in FPGAs 20 years ago. Exhaustive academic and commercial studies had shown that four inputs (16 stored bits) were theoptimal size for a LUT that implements random logic.

FPGA evolution has led to ever-smaller transistors as well as ever-more routing andother dedicated structures. As a result, the highly optimized LUT became a muchsmaller part of the circuitry. For Virtex-5 devices, we re-evaluated the optimal LUT sizeand found that a four-times-larger six-input LUT (6-LUT) increases the CLB size byonly 15%. Extensive benchmarking then showed that, on average, a 6-LUT packs 40%more logic functionality compared to the traditional four-input LUT. The decision waseasy: spend 15% more area to gain 40% more logic (or, expressed differently, saveroughly 30% in logic area).

The four-times-larger memory capacity of each LUT is an extra, very welcomebonus, as is the ability to make the LUT and the RAM 2 bits wide.

Virtex-5 devices combine four LUTs in a slice. There are two different types of slices:Slice L and Slice M, roughly equal in number on any Virtex-5 device. The LUTs in aSlice L can perform logic and contain a carry chain. The LUTs in a Slice M have thesame functionality but can also be used as distributed memory or shift register logic(SRL32) functionality.

64-BitRAM

DI

ADDR

WE

CLK

QD

DO

2 x 32 BitRAM

DI

ADDR

WE

CLKQD

DO

5

2

6

06

05

or

RAM

64- or2 x 32-Bit

RAM

RAM

RAM

COMMON WRITE

ADDRESS

READADDRESS

C

READADDRESS

B

READADDRESS

A

CECLK

DOC

DOB

DOA

5 or 6

5 or 6

5 or 6

5 or 6

2 or 1

2 or 1

2 or 1

Figure 1 – LUT RAM Figure 2 – Slice as quad-port RAM

M E M O R Y I N T E R F A C E S

ports all accessing the same data. Thenewest MicroBlaze™ processor uses thisfeature to reduce its register file from 384to 44 LUTs. In this kind of application, thenew six-input LUT is six times more effi-cient than a previous-generation four-inputLUT (see Figure 2).

Shift RegisterYou can use any LUT in a Slice M as a seri-al shift register with addressable length. TheLUT is configurable as either a single-bitshift register (a maximum of 32 bits long)or as a 2-bit-wide shift register (a maximumof 16 bits long). Different from earlierSRL16 structures, the Virtex-5 shift registeruses a more traditional and scalable designwith two latches per shift register bit –hence the maximum 32 bits (not 64 bits)per LUT (Figure 3).

Block RAMFor larger RAM structures, Virtex-5 deviceshave tens or hundreds of block RAMs, eachwith a capacity of as much as 36 Kb.

You can structure each block RAMthrough configuration as:

• 72 bits wide, 512 deep

• 36 bits wide, 1K deep





• 1 bit wide, 32K deep

You can also use the two halves of the36-Kb block RAM separately as two 18-Kb

correction (ECC) using Hamming code.The controller is built into each blockRAM. It detects single and double errorsand corrects all single errors.

The ECC controller can also be used tooperate with external memory. In this case,one complete block RAM is necessary forwriting and another for reading. The built-in ECC circuit is a great simplification formemory designers who care about the ulti-mate data integrity (Figure 4).

FIFOFIFOs are usually implemented using dual-ported SRAMs, with one port used for writ-ing and the other for reading. Many Virtexfamily block RAMs are traditionally used asFIFOs. That is why Xilinx chose to equipall Virtex-5 block RAMs with a built-indedicated FIFO controller (Figure 5).

Virtex-5 devices have between 32 and288 block RAMs, and each can be config-ured as a 36- or 18-Kb FIFO.

The controller can use the whole blockRAM as FIFO with the following configu-ration options:






But the controller can also use only halfof the block RAM and leave the other halfto be used as general-purpose block RAM.The FIFO options are then:





block RAM, configured as:






• 1 bit wide, 16K deep

Each block RAM always has two inde-pendent access ports and each port can beindividually configured. This greatly sim-plifies data-width conversion.

Read During WriteEach port supports a data-in (DI) bus anda data-out (DO) bus. When writing thedata on the DI bus into the memory, theDO bus presents either the previous dataat the write address or the new data justbeing written. A third option keeps DOunchanged from its previous state. Thesethree configuration options offer a designflexibility that is often overlooked.

All block RAM operations require aclock, even for reading data. This require-ment is not always desirable, but it isabsolute. Nothing happens without anenabled clock. Whenever the clock isenabled, data and address must meet therequired setup and hold-time specification.Violating this requirement can contami-nate the data content.

ECCA 72-bit-wide block RAM can provide 64-bit-wide data with error detection and


SHIFT IN

SHIFTOUT 31CE

CLK

CLK

A

32-Bit

Shift Register

MUX5

D Q

or

36KBlock RAM

orFIFO

18KBlock RAM

18K Block RAMor FIFO

EC

C E

ncod

e

Block RAM

EC

C D

ecod

e

Block RAM

64-Bit Data

64-Bit Data

64-Bit Data

Error

8-Bit Parity

8-Bit Parity

64-Bit Data

ExternalMemory

Figure 3 – LUT as shift register

Figure 5 – Dual-ported RAM or FIFO

Figure 4 – ECC for external memory


In all cases, the FIFO write and readports have identical width. Unequal widthwould complicate the interpretation offull/empty flags and is therefore not imple-mented in Virtex-5 devices.

Soft FIFO controller cores have beenavailable for many years, but the dedicatedFIFO controller offers three advantages:

• Higher performance, since dedicatedlogic is naturally faster than program-mable logic

• Smaller size and lower power consump-tion, as it uses no fabric resources,CLBs, nor additional interconnects

• Guaranteed functionality and perform-ance without any design effort

Write and read clocks can have arbitraryor undefined phase and frequency relation-ships. But for proper flag operation, bothclocks should be free-running.

The challenging aspect of FIFO designis the reliable generation of status flags(FULL, EMPTY, and ALMOST_FULLor ALMOST_EMPTY) when write andread clock frequencies are unrelated. Thetrailing edges of these flags are inevitablygenerated by the “wrong” clock domainand must be re-synchronized to the prop-er clock domain. For a detailed explana-tion, visit www.sunburst-design.com/papers/CummingsSNUG2002SJ_FIFO2.pdf.(See sidebar, “Verifying the EMPTY FlagSynchronization.”)

The FIFO controller offers two newoptions: first-word fall through (FWFT)and synchronous operations.

After a first entry has been written intoan empty FIFO, the EMPTY output goeslow (inactive), indicating that the read portis allowed to enable its read clock and thuscause the data word to appear at the out-put. This might be described as a “pull”operation. Data appears at the output afterthe next enabled read clock.

In FWFT, the newly written data wordappears automatically at the output simul-taneously with EMPTY going inactive.This might be called a “push” operation.

You can configure the FIFO for eitherof these modes. The difference is visibleonly at the read data output of the very first

word after the FIFO has been empty. Insubsequent operations, no behavioral dif-ference exists between the two modes.

Asynchronous vs. Synchronous OperationThe main purpose of most FIFOs is tobridge between independent clockdomains; most FIFO applications thususe separate and uncorrelated clocks forwriting and reading. Because the trailingedge of each flag must be re-synchronizedto the opposite clock, there is an unavoid-able one-clock-period ambiguity aboutthe delay of the rising flag edge. This canincrease the delay for flags going inactiveand thus can cause a very small perform-ance loss. The operation is less pre-dictable, but only while the FIFO recoversfrom the empty or full condition.

In certain applications, there is onlyone clock domain, and write and readclocks are therefore identical. In that case,you can (optionally) set the mode to “syn-chronous.” This eliminates re-synchro-

nization circuitry, reduces the trailing flagdelay, and completely avoids the delayuncertainty. The performance improve-ment is very small.

External MemoryWhen a design needs multiple megabytesof memory, it is best implemented in exter-nal DRAM devices. High-performanceSDRAM controllers can pose a design chal-lenge, but Xilinx offers several applicationnotes and well-documented cores and eval-uation boards that implement several dif-ferent memory controller designs.

ConclusionVirtex-5 memories and memory interfaceshave come a long way since the first LUTRAM appeared in the XC4000 family 16years ago. Versatile dual-ported blockRAMs with FIFO and ECC options sim-plify system design, while well-documentedinterface designs allow for unlimited expan-sion with external DRAMs.


Verifying the EMPTY Flag Synchronization

We tested the EMPTY synchronization logic exhaustively by writing data into

the FIFO at 200 MHz and reading it out at 500 MHz, which makes it go

EMPTY soon after each write cycle. This exercised the detection logic and re-

synchronized the trailing edge of EMPTY 200 million times a second.

More specifically, we wrote an ascending data sequence at 200 MHz and read

it out at 500 MHz. We wrote the output data directly into a second FIFO at the

same 500 MHz. We then read the second FIFO out at the original 200-MHz rate.

The combined dual FIFO forms a synchronous system but with asynchro-

nous data transfer between the two halves. When we synchronously subtracted

the input data from the output data, the difference was constant, indicating

flawless transfer at the 500-MHz read/write rate and no flag synchronization

problem – even at this high rate.

When two clock frequencies are uncorrelated, each read clock cycle has a dif-

ferent phase relationship with respect to the write clock. During any second, the

active read clock edge steps across the ~5 ns write clock period in ~200 million

different phase orientations, thus creating a timing granularity of 0.025 fem-

toseconds. This resolution is millions of times better than any conventional

deterministic test methodology can possibly achieve.

We ran this test for several weeks, with more than 1014 operations, without

any errors.


by Richard ChiuStaff Applications EngineerXilinx, Inc. [email protected]

When not supporting new interface proto-cols, memory interface designers are con-stantly supporting faster and faster busspeeds for existing interfaces. Today’s source-synchronous double-data-rate (DDR)memory devices, such as DDR2 SDRAM,QDR II SRAM, and RLDRAM II, presentdesigners with challenges at chip and PCBlevels. Higher clock frequencies result in arapidly shrinking data valid window.Signal integrity issues, clock jitter, memo-ry uncertainties, varying silicon delays,PCB trace skew mismatch, and other fac-tors now have a proportionally largerimpact on meeting timing with a smallerdata valid window.

Virtex-5 FPGAs Enhance Memory Interface DesignThe Xilinx® Virtex™-4 FPGA familyintroduced a number of on-chip resources,in particular ChipSync™ technology,

which adds to each I/O block anadjustable delay element (IDELAY) com-pensated over process, voltage, and tem-perature changes as well as enhanced DDRcapture support. These features help meetthe challenges of designing with source-synchronous memory interfaces. WithVirtex-4 memory interface designs, youcan employ calibration algorithms to fac-tor out many of the skews and delays inthe timing path and operate your design athigher frequencies.

The Virtex-5 architecture adds addition-al features that allow you to push the limitsof operating frequency. Enhancements tothe Virtex-5 device integral to memoryinterface design include:

• The addition of ExpressFabric™technology. This architecturalenhancement enables internal logic torun at higher clock frequencies. Thebasic slice look-up table (LUT) hasincreased from a four- to a six-inputLUT (6-LUT), reducing the numberof required logic levels. The technologyalso offers additional routing

resources to provide more directroutes within a slice and between configurable logic blocks (CLBs).

• Reduction of the maximum bank sizefrom 64 I/O (or 80 I/O in selectVirtex-4 part/package combinations) to40 I/O, and an increase in the numberof banks. This leads to a more efficientimplementation of the usual myriad ofI/O voltage levels on the same FPGA.More I/O clocking resources have alsobeen added to each bank.

• The availability of phase-locked loop(PLL) blocks as clocking resources inaddition to digital clock manager(DCM) blocks. PLLs are useful forlow-jitter clock generation and inputclock jitter filtering.

• Enhanced block RAM/FIFOs thathave doubled in size to 36 Kb and support a maximum width of 72 bits.Applications requiring error-correctingcode (ECC) detection and correctioncan now take advantage of ECCencode/decode logic built into each


Meeting Memory Interface DesignChallenges with Virtex-5 FPGAsMeeting Memory Interface DesignChallenges with Virtex-5 FPGAsVirtex-5 devices support the latest generation of high-speed memory interfaces.Virtex-5 devices support the latest generation of high-speed memory interfaces.


block RAM, reducing logic usage andallowing much higher performanceover implementing the same function-ality in general logic.

• Support for digitally controlled imped-ance (DCI) on-chip split-Thevenin ter-mination for bidirectional I/O onlywhen the driver is 3-stated. Similar tothe on-die termination (ODT) featureimplemented in many memory devicefamilies, this support is provided forcertain HSTL and SSTL I/O standardsand can be used to save power whenthe FPGA is writing to memory.

• The incorporation of low-inductancebypass capacitors directly on the pack-age substrate, simplifying PCB layoutby reducing the amount of externalbypassing required.

Virtex-5 Data Interface TechniquesMeeting read and write timing for a high-speed source-synchronous bus demandsthat you keep uncertainties to a minimum.Typically, the capture of read data is themost challenging part of the design.

Write timing for Virtex-5 FPGAs is sup-ported in the same way as in the Virtex-4device. The DCM (or PLL) generates quad-rature phase outputs of the base (“system”)clock. The memory strobe is forwardedusing an output DDR register clocked byan in-phase copy (CLK0) of the systemclock. The write data is clocked by a DCMclock output that is 90 degrees ahead(CLK270) of the system clock. This ensuresthat the strobe is center-aligned to the dataon a write at the outputs of the FPGA.

Both Virtex-4 and Virtex-5 memoryinterface designs support two kinds of readcapture techniques:

• The “direct-clocking” technique delaysthe read data so that it can be directlyregistered using the system clock in theinput DDR flop of an I/O block. Thememory strobe is only used during cali-bration to determine the optimal time todelay the associated data. Figure 1 showsthe direct-clocking read capture path.

• The “strobe-based” technique uses thememory strobe to capture correspon-

Most Virtex-4 designs use the direct-clocking method for read data capture.Beginning with the Virtex-4 SERDESDDR2 design and continuing with thenew generation of Virtex-5 memory inter-face designs, the strobe-based method isbest to meet the tighter timing require-ments at higher clock speeds.

Both techniques involve the use of IDE-LAY elements that are varied during a cali-bration routine. This routine is performedduring system initialization, delaying boththe strobe and data to determine and setthe optimal phase between strobe/data and

ding read data and register it with adelayed version of the strobe distrib-uted through a localized I/O clockbuffer (BUFIO). This data is thensynchronized to the system clockdomain in a second stage of flops.The input serializer/deserializer(ISERDES) feature in the I/O block isused for read capture – the first twolevels of flops in the ISERDES trans-fer the data from the delayed strobe tothe system clock domain. Figure 2shows the read capture path for aVirtex-5 memory interface design.

IDELAYData

FPGA Clock

Read DataRising

Read DataFalling

User Interface FIFOs

Delayed Strobe

BUFIO

IDELAYStrobe

IOB Fabric

ISERDES

CLK OCLK CLKDIV

Q2

Q1

IDELAYData

FPGA Clock

Read DataRising

Read DataFalling

User Interface FIFOs

IDELAYStrobe

IOB Fabric

IDDR

CLK

Q2

Q1

Used forCalibration Only

Figure 1 – Virtex-4 direct-clocking read data capture path

Figure 2 – Virtex-5 strobe-based read data capture path




the system clock to maximize timing mar-gins. Calibration removes any uncertaintycaused by process-related delays, compen-sating for components of the path delaythat are static to any one board. Thesecomponents include PCB trace delays,package delays, and process-related com-ponents of propagation delays (both inthe memory and FPGA), as well assetup/hold times of capture flops in theFPGA I/O blocks. Calibration accountsfor variation in delays that are process-,voltage-, and temperature-dependent atthe system initialization stage – youshould also factor additional operatingtemperature and voltage variations sepa-rately into your interface timing budget.

During calibration, IDELAY for strobeand data are incremented to perform edgedetection by continuously reading backfrom memory and by sampling either aprewritten training pattern or the memo-ry strobe itself until either the leadingedge or both edges of the data valid win-dow are determined. The IDELAY fordata or strobe is then set to provide themaximum timing margin. In the case ofdirect clocking, the optimal delay for thestrobe is used to delay the associated data.

For strobe-based capture, the strobe anddata can have different delay values becausethere are essentially two stages of synchro-nization: one to first capture the data in thestrobe domain and another to transfer thisdata to the system clock domain.

The direct-clocking capture method issimpler in design complexity, and com-pared to the strobe-based capture method,it has fewer pin-out restrictions. However,the strobe-based capture method becomesnecessary at higher clock frequencies. Itstwo-stage approach offers better capturetiming margins for two reasons:

• The DDR portion of the timing isrestricted to the first rank of flops inthe ISERDES. Because the strobe isused to register the data, timing islimited largely by the strobe-to-datavariation; for example, in the case ofDDR2, these are given by thetDQSQ and tQHS parameters of thepart. For direct clocking, you must

consider the data-to-clock variation(for DDR2, this is tAC) because thesystem clock is used to both drive thememory clock and capture read data.This is a larger uncertainty than thestrobe-to-data variation.

• The strobe-to-clock variation isimportant for the second stage ofcapture, when the data is transferredfrom the delayed strobe to the systemclock domain. However, by this timethe data is split into two separate sin-gle-data-rate paths; therefore, aligningthe delayed strobe to the system clockcan take place over a much largertiming window.

The strobe-based capture method ismore pinout-restrictive, as it requires thememory strobes to be placed on clock-capable I/O pins. This can limit the I/Outilization over a given bank. Virtex-5devices have smaller banks and more I/Oclocking resources per bank (for example,the number of BUFIO local clock buffersper bank has increased from two to four),easing this restriction and allowing morestrobes and their accompanying I/O (data,mask) to be placed in each bank.

Other significant differences in Virtex-5memory controllers include:

• Full-speed operation. Both the Virtex-4SERDES design and Virtex-5 designsuse the ISERDES for memory capture.However, Virtex-5 designs do not usethe width expansion feature of theISERDES, and the controller runs atthe same speed as the memory clock.The Virtex-4 ISERDES design runs athalf the memory clock speed but twicethe bus width. Running at the sameclock speed as the memory is made pos-sible by the higher performance of theVirtex-5 fabric. This minimizes read-data latency through the ISERDES – aswell as controller latency – and simpli-fies bank-management logic.

• Bank management. The Virtex-5DDR2 controller employs a least-recently-used (LRU) bank-manage-ment algorithm that keeps banksopen to reduce the overhead associat-

ed with opening and closing banks.In an LRU algorithm, banks are leftopen at the end of accesses. If a newbank needs to open, the controllercloses the bank least recently used. Atany time, as many as four banks canbe left open.

Generating Virtex-5 Memory DesignsYou can generate a custom memory con-troller by using the Memory InterfaceGenerator (MIG) tool. The MIG tool isaccessed through CORE Generator™software and outputs HDL source (Verilogor VHDL) design files, along with accom-panying constraint and build scripts.

The latest version of the MIG tool(1.6) supports DDR2 SDRAM-registeredDIMM and QDR II SRAM componentinterfaces for Virtex-5 devices. The DDR2controller supports operation of bus clockspeeds as fast as 333 MHz (667 Mbps). TheQDR II supports operation of bus clockspeeds as fast as 300 MHz (600 Mbps).

Virtex-5 designs generated by the MIGtool also allow the physical layer interfaceportion of the design to be easily separatedfrom the controller portion. You can thenincorporate your own specific controllerbut retain the memory initialization and high-performance source-synchronouscalibration logic.

ConclusionThe Virtex-5 device family builds on theVirtex-4 FPGA, with additional featuresto ease memory interface design and meetthe challenges of supporting ever-increas-ing bus speeds.

To download the MIG tool and for moreinformation about the implementation anddesign details of Virtex-5 memory controllerreference designs, visit the Xilinx MemoryCorner at www.xilinx.com/memory/.

Virtex-5 memory controllers are alsoavailable as reference designs for down-loading from the Memory Corner:

• XAPP858 (DDR2 SDRAM)

• XAPP853 (QDR II SRAM)

• XAPP852 (RLDRAM II)

• XAPP851 (DDR SDRAM)


by Nagesh GuptaFounder & CEOTaray [email protected]

The Memory Interface Generator (MIG)tool is a comprehensive tool used to sim-plify the design of memory controllers forXilinx® FPGAs. Memories are part of amajority of Xilinx applications. The goal ofthe MIG tool is to simplify memory inter-faces, thus enabling FPGA users to focuson the rest of the system design.

The MIG tool was first introduced in2002 as a memory controller pin selectionutility for Virtex™-II and Virtex-II ProFPGAs. Since then, the MIG tool hasprogressed significantly; it now supportsall Xilinx FPGA devices, includingVirtex-4, Virtex-5, Spartan™-3, andSpartan-3E FPGAs.

The MIG tool dynamically generatesHDL in Verilog or VHDL formats basedon user inputs. Additionally, the MIG toolgenerates .ucf pin constraints, any slice andlogic placement constraints, and any otherconstraints required to create high-perform-ance designs with minimal user changes.

MIG outputs are fully available in non-encrypted formats. This enables you tomodify the designs.


Implementing Memory Controllers Usingthe Memory Interface Generator ToolThe Memory Interface Generator tool simplifies designing memory controllers for Xilinx FPGAs.


The Time-to-Market AdvantageHigh speed memories are complex todesign. Conservatively, you can save morethan six months by using the targeted ref-erence designs provided by the MIG tool.

Fully verified MIG reference designsenable you to focus on other design activ-ities, thus reducing overall time to market.

MIG Controller ArchitectureThe MIG tool produces everythingrequired to fully implement a memorycontroller. MIG controllers are imple-mented in logical layers comprising:

1. The physical layer, or PHY, whichcaptures the read data, transfers it to a convenient clock domain, andstores it. The PHY also transmits the write data and command/controlsignals.

2. The controller generates therequired commands based on userrequests. The controller also imple-ments the state machine for reading,writing, and refreshing the memory.

3. The user interface enables exchangeof data and commands to and fromyour application.

This layered approach allows you tomodify the required portions of thedesign. In Virtex-5 devices, Xilinx hasfurther simplified the layering com-pared to previous designs. For example,some designers want to use their owncontrollers, which is possible by replac-ing the controller that the MIG toolgenerates. This is easily achieved inVirtex-5 designs.

Hardware Verification The designs generated by the MIG toolare thoroughly verified to ensure highquality. These quality checks haveincreased significantly over time as we atTaray learn more from the field.

For a given FPGA family, we verify at least one set of designs in hardware.Hardware verification is usually performed on a Xilinx memory refer-ence board, such as the ML461 orML561 boards.

strobe (CAS) latencies, burst lengths,and data widths, as well as all supportedsynthesis tools.

SimulationsTaray simulates MIG designs usingModelSim from Mentor Graphics. Wesimulate a large number of combina-

Hardware verification starts with apoint test, such as a read/write datamatch, at a particular frequency for agiven memory part. We then perform fre-quency sweeps and ensure that thedesigns work ±10% in the required fre-quency range. We also verify all the possi-ble parameters such as column address

Hardware-Tested Configurations

HDL Verilog and VHDLSynthesis Tools XST and SynplicityBoard and FPGAs ML 461 —> XC4VLX25-FF668-10 and ML 462 —> XC4VLX25-FF668-11Burst Lengths 4 and 8CAS Latencies 3 and 4 Additive Latencies 0, 1, and 2ODT (in Ohms) Verified 0, 75, and 150Depth Verified for Components 1Depth Verified for DIMMs 1, 2, 3, and 4Component Verified MT47H32M16BT-37EDIMM Verified MT18HTF6472G-53E (Registered DIMM)Component Data Width Verified 16DIMM Data Width Verified 72 and 144ECC Verified 72 and 144Frequency Range 100 MHz to 280 MHz for 16-bit component

100 MHz to 250 MHz for 72-bit DIMM (with and without ECC)100 MHz to 250 MHz for 144-bit DIMM (with and without ECC)

Simulation-Tested Configurations

HDL Verilog and VHDLBurst Lengths 4 and 8CAS Latencies 3 and 4 Additive Latencies 0, 1, and 2ODT (in Ohms) Verified 0, 75, and 150Depth Verified 1, 2, 3, and 4 (for both components and DIMMs)Components Verified All supported by the MIG tool (X4, X8, and X16)DIMMs Verified All supported by the MIG tool (registered, unbuffered, and SODIMMs)Component Data Width Verified 8, 16, 24, 32, 40, 48, 56, 64, 72, 128, and 144DIMM Data Width Verified 64, 72, 128, and 144ECC Verified 40, 72, and 144Frequencies Verified 200 MHz and 267 MHz (for both components and DIMMs)Initialization As per both Micron and JEDEC specificationsMulticontroller 1 to 8


Table 1 – Hardware and simulation test summary for Virtex-4 DDR2 SDRAM designs


tions and ensure that every memory listedin the MIG tool is verified with at leastone of the test cases. Table 1 is a summa-ry of the different simulation test casesfor Virtex-4 DDR2 SDRAM designs.Below are some parameters to generatethe test cases:

• All possible data widths

• All of the supported memory compo-nents/DIMMs

• Different values for CAS latencies,burst lengths, and additive latencies,depending on the memory type

• Simulated Verilog and VHDL RTL files

• RTL with and without testbench

• RTL with and without DCM

• Use memory models with different frequencies

Key Features The MIG tool is part of Xilinx ISE™software and is invoked through theCORE Generator™ tool. Figure 1 is a

screen shot of the MIG GUI. The key fea-tures of the MIG tool v1.6 are:

• Virtex-5 FPGAs:

• DDR2 SDRAM, Verilog

• QDR II SRAM, Verilog

• Support for Virtex-4 FPGAs (and thefollowing designs):

• DDR2 SDRAM, Verilog andVHDL, direct clocking

• DDR SDRAM, Verilog andVHDL, direct clocking

• QDR II SRAM, DDR II SRAM,Verilog and VHDL, direct clocking

• RLDRAM II, Verilog and VHDL,direct clocking

• DDR2 SDRAM, Verilog andVHDL, SERDES clocking

• All Virtex-4 designs support bothXST and Synplicity

• Spartan-3 FPGAs:

• DDR SDRAM, Verilog and VHDL

• DDR2 SDRAM, Verilog and VHDL

• Spartan-3E FPGAs:

• DDR SDRAM, Verilog and VHDL

• All Spartan-3 and Spartan-3Edesigns support XST, Synplicity,and Precision Synthesis

• Support for many different memorycomponents and DIMMs

• Pins picked are based on the selectedmemory part and user inputs

• Generates RTL and bit files for Xilinxreference boards containing memories

• Basic I/O design rule check (DRC)engine ensures that signals are allocated correctly

• Verifying a modified MIG .ucf fileensures that MIG pin-out rules are valid

Using the Outputs of the MIG tool The MIG tool generates everythingrequired to create a memory interface:

• The RTL (Verilog or VHDL) design files

• Synthesis scripts

• ISE scripts for build, map, and placeand route

• A .ucf file for pin locations, RLOCs,and any other constraints

After generating the design RTL, you canexecute a batch file to synthesize, map, andplace the design. The MIG tool generatestwo designs – one with a testbench andanother without. The MIG scripts work onthe version with the synthesizable testbench.However, you can integrate your applica-tions to the version without the testbench.

ConclusionThe MIG tool significantly reduces designburden and improves time to market. It hasbeen used successfully by many customers.

For a copy of the Memory InterfaceGenerator or for additional information,visit www.xilinx.com/memory.


Figure 1 – The MIG tool 1.6 GUI


by Chris JohnsonNetworking and Communications Strategic Applications [email protected]

The increased bandwidth requirements ofhigh-speed networking systems are pushingDRAM to perform at SRAM speeds andlatencies. Micron’s reduced-latency DRAM(RLDRAM) II memory addresses this needwith additional densities not available inSRAM memories. Combining this memo-ry technology with the Xilinx® Virtex™-5device provides an excellent high-density,high-speed, low-latency solution for thenetworking and packet buffer applicationsof current and future platforms.

RLDRAM II Memory FeaturesRLDRAM II memory uses an eight-bankarchitecture optimized for high-speed oper-ation and a double-data-rate (DDR) I/Ofor increased bandwidth. The eight-bankarchitecture enables RLDRAM II memorydevices to achieve peak bandwidth bydecreasing the probability of random accessconflicts. Although bank managementremains important with RLDRAM IImemory architectures, one bank is alwaysavailable for use even in the worst case(burst of two at 533-MHz operation).


Micron Memory InterfaceMicron Memory InterfaceRLDRAM offers a complete low-latency memory interface for networking and communication solutions.RLDRAM offers a complete low-latency memory interface for networking and communication solutions.


One of the key features added in theRLDRAM II memory architecture isreduced row cycle latency time. Rowcycle latency (tRC) is the amount of timethat must elapse before a recentlyaccessed bank can be accessed again.Table 1 shows a direct comparisonbetween RLDRAM II memory, DDR2,and DDR at device densities of 576 Mb,512 Mb, and 512 Mb, respectively.

I/O OptionsRLDRAM II memory offers separate I/O(SIO) and common I/O (CIO) options.The SIO devices have separate read andwrite ports to eliminate bus turnaroundcycles and contention. CIO devices havea shared read/write port that requires oneadditional cycle to turn the bus around.RLDRAM II memory CIO architectureis optimized for data streaming, wherethe near-term bus operation is either100% read or 100% write, independentof the long-term balance.

Figure 1 illustrates the performancevariations between the versions at differ-ent read-to-write ratios. The reducedlatency and eight-bank architectureachievable with RLDRAM II memoryallow faster random access from thememory array, increasing sustainable busutilization.

SRAM-Style InterfaceThe command bus protocol used inRLDRAM memory is a bit simpler thanother DRAM devices. RLDRAM IImemory incorporates an SRAM-styleinterface with read and write commands,replacing the activate and precharge com-mands used in DDR2 and other similarcomputer-based memory technologies.This reduction of commands frees up thecommand bus, helping reduce deadcycles during short burst lengths.

error-correcting schemes used to eliminatesoft errors in the memory channel.RLDRAM II memory is the first DRAM-based technology to add the ECC DQ pinsto the devices. RLDRAM II memory isoffered in x9, x18, and x36 configurationsto provide a single-chip ECC solutionwithout adding unwanted components,reducing board layout space.

Manufacturability in large-component-count systems is a major problem, but ithas been ignored in the DRAM industry,largely because the applications that usecommodity DRAM devices do not needcontinuity testing. The module-based busi-ness uses so few components that the extra

pins are not desirable for JTAG support.Also, discarding manufacturing errors isless expensive than repairing them in smallcomponent-based systems.

The target market for RLDRAM mem-ory is an entirely different scenario.Systems based on RLDRAM memory aretypically point-to-point applications,where the DRAM is soldered directly tothe main board alongside countless othercomponents. Placing all of these compo-nents can be a manufacturing challenge. Toaddress this issue, RLDRAM II memory

Additional Features for RLDRAM II MemoryRLDRAM II memory also deviates fromthe refresh requirements of current DRAMtechnologies. Because ordinary DRAMdevices refresh a row in all banks, theyrequire dead clock cycles on the bus after arefresh command. This requires a period ofinactivity on the DQ bus, typically 66 ns.

RLDRAM II memory devices haveincorporated a bank-based refresh schemeto hide the refresh recovery periods requiredby other DRAM technologies. The refreshprocess for RLDRAM memory requires thebank address of the bank that needs to berefreshed, still allowing bus activity during

the refresh process and eliminating the deadtime seen in other DRAM technologies.Bank-dependent refresh helps increase thepercentage of bus utilization, increasing theoverall bandwidth of the system using theVirtex-5 device.

Computer-based memory technologiessuch as DDR2 have relied on modules tosupport error-correcting code (ECC) tech-nologies to remedy soft errors in the transi-tion of data from the processor to theDRAM. Multiple parts are placed on theDIMM, adding extra data lines for the

Latency RLDRAM II DDR2 DDR1 UnitsMemory

tRC 15 55 55 ns

100%

90%

80%

70%

60%

50%

40%

30%

20%

0.01 0.10 1.00 10.00 100.00

10%

0%

Read/Write Ratio

Bus

Util

izat

ion

RLD SIO BL2DDR2 CIO BL4RLD SIO BL4, BL8RLD CIO BL2RLD CIO BL4RLD CIO BL8RLD CIO 16 BurstRLD CIO 32 Burst

Table 1 – Row cycle time DRAM comparison

Figure 1 – Peak bandwidth comparison with different DRAM technologies




incorporates JTAG continuity technology,which helps with manufacturing issues inthe large system-based boards used in thenetworking and communication industry.

The density requirements of today’sapplications have presented a challenge forSRAM systems. The memory cell forSRAM memory devices is approximatelyfive times larger than a DRAM memorycell. Figure 2 shows a comparison of thetwo memory technologies. The DRAMmemory cell is substantially smaller,allowing for significantly higher densitymemory subsystems that still approachSRAM latency speeds.

In the first quarter of 2007, Micron isintroducing a 576-Mb RLDRAM IImemory device compatible with the high-volume 288-Mb RLDRAM memorydevice. The 576-Mb device will beoffered in multiple configurations thatare pin-for-pin compatible, but will haveadditional address pins to accommodatethe higher density. RLDRAM II memoryoffers one of the highest density, lowestlatency DRAM-based solutions availableon the market today.

Additional I/O Interface OptionsThe RLDRAM II memory I/O interfaceprovides other features and options, includ-ing support for both 1.5V and 1.8V I/Olevels and a programmable output imped-ance driver that enables compatibility withboth HSTL and SSTL I/O schemes.

RLDRAM II memory requires an exter-nal one-percent precision resistor (RQ) tiedto VSS in order to calibrate the driver to aknown value and eliminate the processvariation that can be introduced duringmanufacturing. The calibration processrequires the external resistor to operate atfive times the desired driver impedance.The programmable impedance control(PIC) circuit calibrates the output imped-ance to the desired value, eliminating vari-

ation introduced during the manufacturingprocess. Updates are made continuouslyduring device operation without interrupt-ing data transfer, ensuring consistent oper-ation of the RLDRAM memory systemindependent of temperature and voltage.

Micron’s RLDRAM II memory is alsoequipped with on-die termination (ODT)to enable more stable operation at highspeeds without the use of an external ter-

mination resistor. ODT provides simplici-ty and flexibility for high-speed designs bybringing termination resistors on-die, elim-inating some of the on-board termination.

At high-frequency operation, however,it is important you analyze the signal driv-er, receiver, printed circuit board network,and terminations to obtain good signalintegrity and the best possible voltage andtiming margins. Without proper termina-tions, the system can suffer from excessivesignal attenuation, leading to reducedvoltage and timing margins. This, in turn,can lead to marginal designs and causerandom soft errors that are difficult todebug. Micron’s RLDRAM II memory

provides you with simple, effective, andflexible termination options for high-speed memory designs.

ConclusionRLDRAM II memory combines severalperformance-critical features to provideflexibility and simplicity for a wide range ofhigh-speed applications. The speed andlatency requirements for high-speed appli-

cations continue to grow, demanding newand more innovative solutions to meetmarket requirements. As RLDRAM IImemory helps address current marketrequirements, the demand continues forincreased performance. Micron will addressthis demand with future low-latencydevices such as RLDRAM III memory.

Micron continues to innovate and deliv-er solutions to meet the needs of today’sand tomorrow’s markets. The joint effortsof Micron and Xilinx help enable our cus-tomers to quickly deliver next-generationnetworking, video, and imaging systems.

For more information, please contactRay Fontayne at [email protected].

VCC

VCC/2

Wordline Wordline

Digitline Digitline

Digitline

SRAM CELL DRAM CELL

Figure 2 – SRAM cell compared to a DRAM cell

The density requirements of today’s applications have presented a challenge for SRAM systems. The memory cell for SRAM memory

devices is approximately five times larger than a DRAM memory cell.


FREE on-line tutorialswith Demos On Demand

A series of compelling, highly technical product

demonstrations, presented by Xilinx experts, is now

available on-line. These comprehensive videos provide

excellent, step-by-step tutorials and quick refreshers

on a wide array of key topics. The videos are segmented

into short chapters to respect your time and make for

easy viewing.

Ready for viewing, anytime you are

Offering live demonstrations of powerful tools, the

videos enable you to achieve complex design require-

ments and save time. A complete on-line archive is

easily accessible at your fingertips. Also, a free DVD

containing all the video demos is available at

www.xilinx.com/dod. Order yours today!



by David BanasSr. Staff Applications EngineerXilinx, [email protected]

Let’s say that you are about to design yourfirst Xilinx® Virtex™-5 DDR2 memoryinterface. You need quick guidelines forpreferred circuit topologies and a quicksummary of the trade-offs involved whenusing digitally controlled impedance (DCI)on-die termination instead of external ter-mination resistors. In this article, I’ll pro-vide practical design guidelines taken fromprevious real-world design experience, aswell as IBIS simulation results.

Circuit Topologies for Memory InterfacesFigures 1 and 2 show several possibletopologies for DDR2 address/control anddata lines, respectively. On the bidirectionaldata lines, I made the memory chip thedriver and the Virtex-5 device the receiverto make use of the FPGA’s DCI. The topschematic diagram in Figure 1 shows thepreferred and recommended use model,while the other diagrams show variationsoften tried in regular design practice.


Designing Virtex-5 DDR2 MemoryInterfaces for Signal IntegrityDesigning Virtex-5 DDR2 MemoryInterfaces for Signal IntegrityFollow these guidelines to make your next Virtex-5 DDR2 design experience a success.Follow these guidelines to make your next Virtex-5 DDR2 design experience a success.


Figures 3 and 4 show typical receivereye diagrams corresponding to the topolo-gies shown in Figures 1 and 2, respectively.The input switching thresholds of thereceiver are shown as horizontal dashedblue lines for reference. The color of the“probe” arrows in Figures 1 and 2 corre-spond to the colors of the associated tracesin Figures 3 and 4, respectively. I usedMentor Graphics’s HyperLynx software togenerate these eye diagrams with the fol-lowing parameter settings:

• Pseudo-random binary sequence(PRBS) with bit order 7 (a sequencelength of 127)

assume that the “_DCI” versions of theSSTL (stub series terminated logic) driverfamily adjust their output impedance tomatch the DCI calibration resistors andcan therefore be used as matched imped-ance drivers of the transmission line.

But this is not true. The SSTL18_I_DCIoutput driver, for instance, has a fixed out-put impedance of approximately 20Ω, asper the SSTL18 specification. The disas-trous results of this erroneous assumptionare clearly visible in the yellow trace shownin Figure 3. Not only has the eye been dras-tically narrowed, but problematic over-shoot/undershoot has also been introducedat the receiver input.

• Bit interval = 1.5 ns (667 Mbps)

• One sequence repetition

• First 50 bits skipped

• Zero added jitter

When looking at the traces in Figure 3,it should be obvious that of the threetopologies shown, the recommended usemodel gives by far the cleanest eye.

The middle schematic in Figure 1 showsa typical mistake made by novice DCIusers, which is to assume that usingSSTL18_I_DCI drivers eliminates theneed for any external termination compo-nents. Some DCI users often incorrectly

Virtex-5 FPGA

SSTL18_I

Can be either SSTL18_I +External Resistor, orSSTL18_I_DCI.

Must be SSTL18_I +External Resistor

PCB Trace

20.0 Ohms 50.0 Ohms1.000 nsSimple

50.0 Ohms

VpullUp0.9V

MT47H128M4CB_...A0

U(A0) U(B0)

RP(B0)

RS(A0) TL1

Virtex-5 FPGA

SSTL18_I

20.0 Ohms 50.0 Ohms1.000 nsSimple MT47H128M4CB_...

A0

U(A1) U(B1)RS(A1) TL2

Virtex-5 FPGA

SSTL18_I

50.0 Ohms 50.0 Ohms1.000 nsSimple MT47H128M4CB_...

A0

U(A2) U(B2)RS(A2) TL3

50.0 Ohms

VpullUp0.9V

50.0 Ohms

VpullUp0.9V

U(D0)

50.0 Ohms1.000 nsSimple

U(A1) U(D1)

RP(B1) RP(C1)

TL2

Virtex-5 FPGA

SSTL18_II

Virtex-5 FPGA

SSTL18_II

Virtex-5 FPGA

SSTL18_II

20.0 Ohms 50.0 Ohms1.000 nsSimple

MT47H128M4CB_...DQ0

MT47H128M4CB_...DQ0

50.0 Ohms 50.0 Ohms

VpullUp0.9V

50.0 Ohms1.000 nsSimple

U(A0)

VpullUp0.9V

RP(B0) RP(C0)

TL1

MT47H128M4CB_...DQ0

U(A2) U(D2)RS(A2)

20.0 Ohms

RS(C2)

20.0 Ohms

RS(A0)

20.0 Ohms

RS(C0)

TL3

Figure 1 – Typical address/control circuit topology

Figure 2 – Typical data circuit topology

Figure 3 – Typical eye patterns for address/control

Figure 4 – Typical eye patterns for data




Increasing the series termination to 50Ω,as shown in the bottom schematic in Figure1, successfully eliminates overshoot/under-shoot but does nothing to restore the eye toits original width. Therefore, you shouldalways use parallel termination at the end ofall address/control lines.

If an appropriate termination voltagesource is not available, you can form aThevenin-equivalent termination usingtwo resistors connected in series betweenthe VCCO supply and ground, where eachresistor has a value twice that of the desiredimpedance. In this case, simply terminatethe line by connecting its end point to thenet that connects the two resistors. Notethat your circuit consumes more powerwhen terminated in this fashion because ofthe constant load on the VCCO supplyformed by the two resistors.

The data-line eyes in Figure 4 also illus-trate that removing the parallel terminationfrom the ends of the line causes unaccept-able overshoot/undershoot. However, inthis case, removing the series terminationsappears to have improved the eye, makingit slightly wider and providing more “headroom” against noise without introducingovershoot/undershoot. This serves as anexcellent reminder that blindly followingrecommended use models, rules of thumb,or general guidelines is never a good substi-tute for simulating your design.

Keep in mind that before approving anengineering change order (ECO) for theremoval of the series termination resistors,you should reverse the direction of the linewith the Virtex-5 device driving data to thememory chip and check the eye for goodsignal integrity (SI).

Why Use DCI?The benefits of using DCI, as opposed toequivalent external termination, arenumerous, including:

• Better SI at receiver inputs

• Reduced PCB size

• Reduced bill of materials (BOM) parts count

SI at receiver inputs improves whenusing DCI because the termination liescloser to the inputs than when you use anexternal termination resistor.

PCB size and BOM parts count areboth reduced when using DCI because ofthe elimination of external terminationcomponents.

Caveats when using DCI include:

• Impedance variation over process/volt-age/temperature

• Greater power consumption

The termination impedance whenusing DCI is provided by CMOS transis-tors; therefore, the value of that imped-ance can vary along with variations in thefabrication process, supply voltage, andoperating temperature (PVT) of theFPGA. You should always perform sys-tem-level SI simulations twice, using thehigh and low extremes for the value of thetermination impedance to ensure correctsystem operation across all possible com-binations of PVT.

Using DCI for parallel termination atthe end of a transmission line results inhigher power consumption than using anexternal resistor, assuming that an appro-priate termination voltage source is avail-able. In this case, the end of the line can beconnected to the voltage source through aresistor with a value equal to the character-istic impedance of the line, and no load willbe placed across the supply rails.

Conversely, when DCI is used to termi-nate the line, two pass transistors connectthe receiver input to VCCO and ground,respectively. Each transistor is adjusted tohave an effective resistance equal to twicethe characteristic impedance of the line,thus producing a Thevenin-equivalent ter-mination impedance of Z0 to VCCO/2.

A side effect of this termination schemeis that an additional load of 4Z0 appears

across the supply rails, consequentlyincreasing overall system power consump-tion. If the system architectural designspecifications do not provide for a voltagesupply at VCCO/2, then no power penaltyis incurred for using DCI because the ter-mination scheme has to be Thevenin-equivalent in either case.

Table 1 gives the worst-case outputpower dissipation of a single line for threetermination styles: source series, externalparallel (assuming availability of VCCO/2),and internal parallel (DCI).

ConclusionIn this article, I’ve shown how various digres-sions from the Xilinx recommended usemodel for circuit topology of DDR2 memo-ry interfaces affect the eye at the receiver. Ihope you are convinced that system-levelsimulation with an IBIS simulator such asHyperLynx is a necessity when designingDDR2 memory interfaces. And I’ve givenyou some pros and cons for using DCI as analternative to external termination resistorsin your next DDR2 design.

Going forward, as memory interfacespeeds continue to increase and I/O voltagelevels continue to decrease, you will be ableto apply the general principles learned hereto the design of more complex memoryinterfaces as those standards emerge.

For more information, visit www.jedec.com.

Termination Type Power Dissipation

External Source 41 mW Series Termination

External Parallel 49 mWTermination

Internal (DCI) 57 mWParallel Termination

Table 1 – Power dissipation versus termination type(for a discussion of output power calculations, see

“High-Speed Digital Design: A Handbook of BlackMagic” by Howard Johnson and Martin Graham)

...blindly following recommended use models, rules of thumb, or general guidelines is never a good substitute for simulating your design.


Æ

by Dean Armintrout Product Marketing Engineer, IP Solutions Division Xilinx [email protected]

Chris EbelingPrincipal Engineer, IP Solutions DivisionXilinx, [email protected]

System Packet Interface Level 4 Phase 2(SPI-4.2) is the Optical InternetworkingForum’s recommended interface for the inter-connection of devices for aggregate band-widths of OC-192 (ATM and POS) and 10Gbps (Ethernet), as illustrated in Figure 1.

The SPI-4.2 interface has become thestandard for interconnecting leading-edge10 Gbps framers, traffic managers, networkprocessors, and switch fabrics. SPI-4.2 ispopular because of its efficient interface,which offers high bandwidth and low pincount, along with seamless handling of typ-ical system requirements such as flow con-trol, error detection, synchronization, andbus realignment.

The Xilinx® Virtex™-5 architecture pro-vides an ideal platform for implementingSPI-4.2. The Xilinx SPI-4.2 LogiCORE™IP targeting Virtex-5 devices provides a sig-nificantly smaller solution with dramaticpower savings, 1.2 Gbps LVDS DDR I/O,and complete pin assignment flexibility.

SPI-4.2 LogiCORE IPContinually improving on its SPI-4.2 solu-tion, Xilinx has made the latest implemen-tation 25% smaller than previous versionsby leveraging the 65-nm ExpressFabric™technology and real six-input look-up tables(LUTs) of Virtex-5 FPGAs.

Enhanced ChipSync™ technology issupported on every pin of the Virtex-5device family, allowing you to target theSPI-4.2 LogiCORE solution to any devicepinout to meet your system and PCBrequirements. High-performance interfacesare supported by 1.2 Gbps LVDS data rates.

For applications requiring multipleSPI-4.2 interfaces, the Virtex-5 FPGA’slogic density, high pin count, and exten-sive clocking resources support four ormore full-duplex cores in a single device.


Improve SystemReliability with SPI-4.2 LogiCORE Solutions and Virtex-5 FPGAs

Improve SystemReliability with SPI-4.2 LogiCORE Solutions and Virtex-5 FPGAsVirtex-5 devices provide an ideal platform for source-synchronous designs like the widely adopted SPI-4.2 interface.

Virtex-5 devices provide an ideal platform for source-synchronous designs like the widely adopted SPI-4.2 interface.

SOURCE-SYNCHRONOUS INTERFACES

ChipSync Source-Synchronous TechnologyVirtex-5 devices build on ChipSync tech-nology to ensure reliable high-speed datatransfer for source-synchronous applica-tions like SPI-4.2 with these features:

• Built-in serializer/deserializer(SERDES) logic enables the fabric tointerface to the I/O at a fraction of thesource-synchronous clock rate. Theincluded bitslip function allows shift-ing of the deserialized data to achieveword alignment when linking multiplepins (bus deskew).

• Input delay (IDELAY) componentsallow the dynamic phase alignment(DPA) logic to independently adjustthe delay of each bit of a bus in 75-psincrements, providing a mechanism fortuning the interface timing to the sys-tem environment.

• DDR registers integrated into the I/Opins simplify the interface between theFPGA fabric and the I/O blocks bysupporting data transfer on a singleclock edge.

SPI-4.2 and ChipSync TechnologyThe SPI-4.2 interface has a DDR source-synchronous data bus that comprises 18LVDS pairs (16 data, 1 control, and 1 clock),operating at a minimum rate of 311 MHz.

The SPI-4.2 core uses ChipSync tech-nology to serialize/deserialize bus data to afour-word SPI-4.2 datastream at a lowerclock rate; thus, you can implement high-frequency SPI-4.2 interfaces in slowerspeed grade Virtex-5 devices.

The SERDES functions allow the corelogic to transfer these four words to andfrom the I/O logic without using any CLBlogic resources and operate at half thesource-synchronous DDR clock rate. Forexample, a SPI-4.2 interface with a 500-MHz DDR reference clock only requiresan FPGA fabric clock of 250 MHz – easilyachievable in the Virtex-5 architecture.

As the frequency of the source-synchro-nous clock increases, data recovery at thereceiving (sink) device becomes more chal-lenging. The SPI-4.2 protocol provides atraining pattern that permits a receivingdevice to adjust its data sampling to the sys-

shift with changes in operating conditions,such as voltage and temperature, as well asother variations (Figure 2b). ContinuousDPA addresses this by constantly monitor-ing the ingress data and adjusting the sam-ple point of each data bit to provide themaximum timing margin (Figure 2c).

Although the OIF SPI-4.2Implementation Agreement calls for theinsertion of periodic training patterns tomaintain clock-data alignment over time,Xilinx continuous DPA does not dependon the presence of training patterns. By

tem interface timing – a process referred to asdynamic phase alignment (DPA).

In Virtex-5 FPGAs, the IDELAY featurepresent in every I/O is ideally suited toadjust the clock-data phase relationship formaximum I/O timing margin. This hastwo primary benefits for the SPI-4.2 core:

• Integrating the IDELAY feature intothe input pin (ILOGIC) reduces theFPGA resources required for DPA toless than 350 slices.

• The IDELAY function’s ability toadjust the data sampling point enablesDPA to be implemented in the I/O –except for a small control state machineimplemented in the fabric. The statemachine portion is fully synchronousand does not require a complex macro.Thus, there are no restrictions on SPI-4.2 pin assignments.

Continuous DPAThe Xilinx SPI-4.2 LogiCORE solutionenhances communication system reliabilitywith continuous DPA, which monitors theclock-data alignment during operation andconstantly adjusts the data sampling pointsto adapt to system timing changes.

Following the initial clock-data align-ment phase, the sampling point of eachdata bit is aligned to the middle of the datavalid window (Figure 2a). This window can

SPI-4.2

PHY Layer

Device

or

NPU

User Logic

SPI-4.2

InterfaceUser

Interface

Rx Datapath

Rx Status Path

Tx Datapath

Tx Status Path

SPI-4.2

Sink

Interface

SPI-4.2 Sink Core

Virtex-5 Device

User

Sink

Interface

SPI-4.2

Source

Interface

SPI-4.2 Source Core

User

Source

Interface

Legend:

S1 = Slave IDELAY sample point 1

S2 = Slave IDELAY sample point 2

M = Master IDELAY sample point

S1S2

M

Fixed offset

of 2 taps

(a):

Initial Data Eye AlignmentSystem Jitter

S1S2

M

(c):

Adjust Master IDELAY Tap to

the Middle of Data Eye

S1S2

M

(b):

Data Window Drifts

Figure 1 – Typical SPI-4.2 application

Figure 2 – Continuous DPA operation




reducing/eliminating the need for periodictraining patterns, continuous DPA enablesthe maximum data bandwidth in your sys-tem while maintaining the optimal clock-data alignment at each pin.

DPA DiagnosticsIf your hardware operation encountersalignment issues, the Xilinx SPI-4.2 coreincludes DPA diagnostic ports to aidwith debugging. The DPA diagnosticdata monitors the data eye and final sam-pling point of the initial alignmentprocess, as well as a second sweep of thedata valid window to determine if anychanges have occurred.

You can connect the diagnostic ports tothe ChipScope™ analyzer or other logicprobes to analyze alignment conditionswhile the FPGA is on the board, interact-ing with the rest of the system

Clocking ResourcesVirtex-5 FPGAs provide an unprecedentednumber of clock resources for implementingmultiple SPI-4.2 interfaces in a singledevice. The abundance and flexibility ofclock distribution in the Virtex-5 familysolves this challenge, supporting as manySPI-4.2 interfaces as the device logic andI/O will accommodate.

In the Virtex-5 family, all devices have32 global clock resources, with any 10 of

the 32 total global buffers available in eachclock region. The global clock trees andassociated buffers are implemented differ-entially for best duty-cycle fidelity andgreater common-mode noise rejection.

In addition, each region in the device hasfour regional clock nets, which are ideal forsource-synchronous interface clocking atrates above 1 Gbps. You can configure theSPI-4.2 LogiCORE IP to use either globalor regional clock resources.

These high-performance clock resourcessupport as many as four SPI-4.2 interfaces ina mid-range device (LX85/LX110) and morethan four SPI-4.2 interfaces in the largerdevices (Figure 3). The Virtex-5 clockingcapability enables a whole new class of SPI-4.2 applications and provides an idealplatform for applications such as multiplexingand demultiplexing, bridges, and switches.

Higher Performance at Lower PowerVirtex-5 silicon is manufactured with a 65-nm triple-oxide process that reducespower consumption by as much as 35%.This has a positive impact for all designs,including the SPI-4.2 interface; the powersavings are summarized in Table 1.

With Virtex-5 devices, SPI-4.2 uses sig-nificantly less power than its predecessors,both because of the enhanced 65-nmprocess and because the LogiCORE solu-

tion uses 25% less fabric resources. At thesame time, Virtex-5 FPGAs support 20%higher performance for SPI-4.2, with high-speed 1.2 Gbps LVDS data rates on everyI/O of the device.

This means that not only can you placemultiple SPI-4.2 interfaces anywhere on thedevice, but for each interface, you can realizean aggregate bandwidth as high as 19 Gbps.Designs not requiring this level of perform-ance (such as more typical framer interfaces

running at 10-12 Gbps) automatically getadditional performance overhead, ensuringease of design integration and timing closure.

ConclusionXilinx SPI-4.2 LogiCORE IP coupled withVirtex-5 features provides a highly efficientand reliable SPI-4.2 solution. We devel-oped ChipSync technology and continuousDPA specifically for source-synchronousinterfaces like SPI-4.2.

This technology allows you to designthe most efficient and reliable SPI-4.2 solu-tions, which use significantly less resources(25% less), allow fully flexible device pinassignments (you choose the pinout), andsupport extremely high interface speeds(1.2 Gbps LVDS DDR I/O).

The higher performance is even moreremarkable because Virtex-5 FPGAsachieve this while consuming significantlyless power. The wealth of Virtex-5 clockingresources, combined with full pin assign-ment flexibility, enables a new class of appli-cations with multiple SPI-4.2 interfaces.

For more information about the SPI-4.2 LogiCORE IP targeting Virtex-5devices, visit the Xilinx IP Center atwww.xilinx.com/systemio/spi-4.2. A hard-ware demonstration is also available; formore information, contact your Xilinxrepresentative.

Virtex-4 FPGA Virtex-5 FPGA

Power: Static Alignment @ 700 Mbps per LVDS Pair 1.55W 1.42W

Power: Dynamic Alignment Performance per LVDS Pair 2.0W @ 1 Gbps 1.66W @ 1 Gbps

Speed Grades Supporting 800 Mbps per LVDS Pair -10, -11, -12 -1, -2, -3

Figure 3 – Illustration of four instances of SPI-4.2 LogiCORE IP implemented on

a Virtex-5 XC5VLX110 device

Table 1 – SPI-4.2 power estimates for Virtex-4 and Virtex-5 FPGAs


by Craig DaviesFirmware Engineer VMETRO Ltd. (High Wycombe, UK) [email protected]

Jeff BatemanSenior Systems EngineerVMETRO Inc. (Ithaca, NY)[email protected]

In the fast-paced world of FPGA develop-ment, Xilinx has struck again with its sec-ond-generation ASMBL™ architecturedevices, the Virtex™-5 family. This devicefamily has many upgrades from its prede-cessor, the Virtex-4 family, and likewisecontinues the evolution of the ASMBLarchitecture, with scalable FPGAs cateringto the application-specific marketplace. Forcommercial off-the-shelf (COTS) develop-ers, this means a platform that is low cost,light on power consumption, and opti-mized for high performance.

When creating the Virtex-4 family,Xilinx harnessed the flexibility of theASMBL architecture to build the first multi-platform FPGA family. Xilinx continues thisapproach with the Virtex-5 family. The ini-tial offering is the Virtex-5 LX platform,optimized for high-performance logic.

Seasoned FPGA users expect new FPGAgenerations to deliver more and the Virtex-5family certainly delivers, all while consum-ing less power. Compared to Virtex-4 LXdevices, Virtex-5 LX FPGAs offer:

• 65% higher logic capacity with asmany as 330,000 logic cells

• 70% more block RAM

• 100% more DSP slices

• 25% more SelectIO™ pins

For COTS board vendors, these fea-tures enable powerful products capable ofhandling the very high data rates and pro-cessing complexity required of modern

real-time DSP systems. In this article, we’llexamine those Virtex-5 architecture com-ponents that enable COTS designers todeliver more bang for the buck.

COTS FPGA BackgrounderFreed from the need to design hardwareand IP from scratch, COTS board-levelusers can focus their energies on imple-menting their specialist algorithms.COTS products incorporating user-pro-grammable FPGAs target a variety ofapplications, from simple customizabledigital I/O to RADAR, video, and signalsintelligence (SigInt).

Typically, the FPGA requires hardwareconnections to a real-world data source ordestination, plus a standardized interfaceto a host processor. COTS products mustusually follow an industry-standard formfactor (such as PMC, VME, VXS, andCompactPCI), enabling end users to inte-grate products from a range of vendors.


Using Virtex-5 FPGAs in COTS Board-Level ProductsUsing Virtex-5 FPGAs in COTS Board-Level ProductsOptimize your COTS designs with the many improvements in the Virtex-5 family.Optimize your COTS designs with the many improvements in the Virtex-5 family.

VERTICAL MARKET SOLUTIONS

With its parallel architecture and high-speed I/O capabilities, the Virtex-5 FPGA iscapable of streaming and processing data atthe gigabyte-per-second rates typicallyrequired for today’s applications. It is wellsuited to algorithms where a core “innerloop” can be parallelized to speed up opera-tion, employing the resources available inmodern devices. Many DSP algorithmsdovetail with this architecture. Conversely,even the fastest CPUs cannot easily processdata at gigabyte-per-second rates; they are,however, well suited to decision making anduser interaction tasks.

Given these trade-offs, FPGA-based DSPsystems often employ a hybrid approach, asillustrated in Figure 1. Here, a wide-band-width RADAR or video source is digitized atgigasample-per-second rates and fed to anFPGA. The FPGA performs some heavy-duty number crunching to eliminateunwanted data, focusing in on the key area ofinterest. Pre-processed data is fed at a moremanageable rate to a general-purpose CPUfor post-processing control and display.

Key COTS FPGA board requirements are:

• Large, reconfigurable FPGAs withample room for customer-programma-ble application logic

• Regular air-cooled and rugged conduc-tion-cooled options

• High-speed interface for efficient trans-fers to and from a host processor

• Flexible, fast I/O to and from a varietyof real-world interfaces

• Local memory interfaced directly tothe FPGA for I/O buffering as well astemporary storage during algorithmoperation

• Wide range of I/O and signal-process-ing IP cores to speed end-user develop-ment cycle times

• Flexible FPGA development tools cov-ering both budget-conscious andextreme-performance users

• Debugging interface capable of in-FPGA logic analysis

• Comprehensive board support firmwareand software

inputs. For many of today’s applications,especially those in DSP, this optimizationreduces significantly as system-level algo-rithms increase in complexity.

Configurable logic block storage densityimprovements increase the shift registerLUT (SRL) length from 16 bits to 32 bits(SRL32), while retaining a dual SRL16option. Distributed RAM now offers a 64-bit option, up from 16 bits. With improvedreduced-hop routing and more logic perslice (four LUTs/four flip-flops versus twoLUTs/two flip-flops), speed improvementsof as much as 45% are possible.

Improved FIR EfficiencyLet’s consider a finite impulse response filterimplemented in distributed logic.Distributed arithmetic filters are oftenselected because their operating frequency isnot tied to the length of the tap vector. Thischaracteristic is highly desirable becauseincreasing the tap vector length is funda-mental to improving the overall filterresponse. However, these types of filters are

With a proven track record in the high-end FPGA DSP arena and comprehensivetool and IP support from a variety ofsources, the Virtex-5 FPGA family is a nat-ural choice for COTS board-level vendors.

Optimization of Soft ComponentsA great addition to the Virtex-5 architec-ture is the replacement of traditional four-input look-up tables (LUTs) with newsix-input LUTs (6-LUTs) for more efficientmapping of wider functions. Because 6-LUTs are also configurable as dual five-input LUTs, design software tools canachieve greater efficiency in logic mappingwhen six-input functions are not required.

Most FPGA devices these days basetheir soft fabric components – those com-ponents configured to implement logicequations – on LUTs. Previously, the com-mon choice was the four-input LUT, as thiswas a nice binary base and was relativelyeasy to work with for optimizing a logicfunction. A given equation can be opti-mized to contain a sum of products of four


Capture Sensor DataGigasample/sec ADC

Pre-Process DataFind Targets Within NoiseHigh Speed, Parallel DSP

Post-Process DataIdentify Target:Friend/Enemy

Vehicle/Aircraft TypeSpeed, Position, Altitude

Threat Level

Display Result

Gbps to FPGA

Mbps

Filtering and Data ReductionLarge, Fast FPGA

User-Programmable IP

Interpretation/Display(CPU)

Figure 1 – Processing chain


serial in nature; most applications need validoutput at a marginally decimated rate withrespect to the sampling frequency. Thus, afully parallel architecture is required.

Parallel Distributed Arithmetic FIRfilters (DAFIRs) utilize significantly morelogic versus other FIR implementationsto perform the many partial products ona clock-to-clock basis (even when deci-mating the sampling rate). In the distrib-uted arithmetic architecture, acorresponding output product y(n) isproduced by summing the products of atime-delayed series of input x(n) andcoefficients a(m), where m, an integerbetween 1 and N, is the filter length.

For the sake of simplicity, let’s say thateach tap or filter coefficient is two bits wideand that the input vector is six bits. In total,our filter is 96 taps in length. If we calculatethis product using the partial productsmethod, we need 6 x 2 partial productsusing four-input LUTs. Each LUT is capa-ble of a 2 x 2 multiplication, which meansusing three 4-input LUTs. Using 6-LUTs,we can reduce this to just two LUTs. For 96taps, we have saved 96 LUTs of a possibletotal of 288. This is just the savings whenproducing the partial product.

LUTs and SRLs are also used for shiftregisters in the input delay pipe and for thescaling accumulator responsible for sum-mation and normalization of the output.Expanding our example input and tapwidths to a more applicable precision of 16bits increases the depth of our partial prod-uct multiplication tree, requiring evenmore LUTs. Using 6-LUTs as opposed tofour-input LUTs results in a LUT logicreduction of more than 33%.

Wider LUTs Improve Efficiency and SpeedSwitched fabric developers will also ben-efit from the 6-LUTs, as these are oftenused to implement multiplexers. 6-LUTsmean a reduction in the overall depth ofa logic equation. For implementing mul-tiplexers, this means an effective increasein speed for an equivalent multiplexerimplemented using 6-LUTs, as opposedto four-input LUTs.

Depending on the application, changingto 6-LUTs can make as much as a 1.6x

improvement in logic utilization over previ-ous generations of the Virtex device family.

Multiplying Computational PowerNot to be outdone by the soft-logic com-ponents, the hard-logic dedicated multipli-ers have also been optimized for theVirtex-5 FPGA. The 18 x 18-bit multipli-ers present in the Virtex-II and Virtex-4families have been upgraded to 25 x 18-bitmultipliers in the new family. Applicationdevelopers who implement beam-formingarrays or other advanced computations willbenefit from this enhancement.

Large multiplication arrays that requirea high degree of precision traditionallyrequired a large tree structure of multipli-ers. As output is carried between interme-diary stages of a large multiplication, themaximum allowable output value increaseswith each subsequent stage. To handle thisbit-width increase, typical solutions involvea precision reduction by truncation orsome other intelligent scheme such as con-vergent rounding or (less often) by break-ing down the multiplication into smallerstages and then rebuilding the final prod-uct by summation. Utilizing 25 x 18-bitmultipliers, more precision is carriedthrough intermediary stages of a multipli-cation and thus reduces the impact ofintermediary truncation/rounding errorswhile improving on overall speed and min-imizing pipeline latency.

Suppose convergent rounding isemployed to reduce the precision at eachstage of multiplication within an FFT. Ifwe implement an 8K FFT using a mixed-radix base of radix-4 and radix-2, that givesus six radix-4 butterfly stages and oneradix-2 butterfly stage. In an FFT, at eachsubsequent stage we perform calculationsthat produce partial products. These out-puts are fed into the multipliers of the nextstage until the time-domain data is trans-formed completely to the frequencydomain. However, each stage must employa scheme to reduce the precision of the out-put so that subsequent stages can acceptthem as inputs. After each stage of multi-plication, scaling is employed to reduce theprecision. Each stage of scaling introducesquantization errors.

But what if more precision is carriedinto the input side of each butterfly stage?Using 25 x 18-bit multipliers, we can carrymore precision from our partial productswhen multiplying them to new sample dataand in turn introduce less rounding errorsinto our results.

Improved Source-Synchronous Memory AccessMuch to the delight of many designersusing Virtex-4 FPGAs, Xilinx introduced aprimitive called IDELAY capable of syn-chronizing data and strobes to a sourceclock off the FPGA. This feature meantthat high-speed DDR and DDR2 SDRAMand QDR and QDR II SRAM memoriescould be accessed through controllersinside the Virtex-4 device at high data rates.

COTS developers are increasingly find-ing applications that require fast and deeponboard memory. For example, datarecording applications benefit greatly fromfast onboard memory to implement thesizeable buffers needed to sustain high-speed data transfers over PCI/PCI-X buses.Video processing applications also requirelarge, fast external memories to store thevery-high-resolution, high-frame-rateimages produced by today’s leading cameraequipment.

The introduction of the IDELAY prim-itive also benefits ruggedized applicationdevelopers, as the IDELAY taps can be con-stantly monitored by logic to perform run-time resynchronization to the source clock;this technique is known as dynamic clock-to-data centering.

Now, with the Virtex-5 family, Xilinxhas expanded the primitive to add ODE-LAY, enabling delay control on both inputand output signals. The key component ofthe IDELAY primitive is to delay the inputdata relative to the clock such that theinternal FPGA version of the source clockedge is centered with the input data. TheODELAY enables variable delays per out-put data line to better match trace-lengthdifferences.

Improving High-Speed I/O CommunicationAs the COTS marketplace moves more andmore into high-speed serial implementa-tions, clock and data recovery techniques




become more in demand. When imple-menting general-purpose high-speed seriallinks, transmission errors and data lossbecome a reality, especially when targetingdata rates beyond 1 Gbps.

For developers using previous genera-tions of Virtex devices, the choices forclock and data recovery (CDR) implemen-tation to de-serialize incoming streamswithout using multi-gigabit transceivers(MGTs) were limited.Although the delay-lockedloops (DLLs) used for clockgeneration in previous fami-lies are very stable in naturebecause of their first-orderloop architecture and digitalimplementation, they are notable to filter input jitter orhandle phase alignmentbeyond their discrete range.

With the phase-locked loop(PLL) blocks introduced withVirtex-5 family, jitter reduc-tion is an intrinsic feature,resulting in large improve-ments in higher data-rate sus-tainability. Filtering input jitterto produce stable internal ver-sions of source clocks is critical-ly important to correctlysample and store incomingdata at the FPGA I/O bound-ary. Using these new blocks,implementing SERDES com-ponents using regular SelectIOpins becomes practical even at1 Gbps and above.

Together with SelectIOperformance of as much as800 Mbps per pin single-ended and 1.25 Gbps differ-ential, the Virtex-5 device is able toinput, process, and output the high datarates generated by current real-worldinterfaces. For example, interfacingdirectly with ADCs and DACs runningin the gigasample-per-second range isnow perfectly feasible.

Looking to the future, new high-speedserial fabric interfaces will be a natural fitfor interfacing between FPGAs, externaldevices, and host systems.

65-nm Copper CMOSCOTS developers will greatly benefitfrom the move into the 65-nm copperCMOS process. One of the consequencesof process shrinks is that density and per-formance increase with the next genera-tion. This is true in the case of theVirtex-5 LX platform, which hasincreased the amount of CLBs by 65%over the Virtex-4 LX platform, block

RAM by 70%, DSP slices by 100%, andSelectIO pins by 25%.

With increased logic density in the over-all package, power consumption has beenreduced significantly. While the Virtex-4FPGA operates at 1.2V core voltage, theVirtex-5 FPGA improves power efficiencywith a core voltage of 1.0V. You can achievefurther power savings by optimizing thesoft and hard components, such as the 6-LUTs and 25 x 18-bit multipliers.

When implementing large designs –such as in software-defined radio (SDR)applications where multi-channel digitalfilters consume significant CLB space –the dynamic power dissipation is quitehigh because of the large amount ofswitching activity that occurs. This is inpart caused by the extensive signal rout-ing required to implement these designs.With the new components in the Virtex-

5 family, existing designsimplement in a smaller num-ber of primitives, reducing theoverall switching activity. Inaddition, the Virtex-5 architec-ture includes the enhancementof diagonally symmetric rout-ing for more efficient designimplementation.

COTS developers oftencomplain about violating powerspecifications when developingmezzanine cards in existingdesigns. With the Virtex-5device, their mezzanine cardswill be less power-hungry andmore desirable to end users. Inother words: less powerrequired, simplified cooling,and greater reliability.

COTS products are oftenemployed in environmentswhere power consumption is asignificant challenge – for exam-ple, high ambient temperaturesmay limit a cooling system’seffectiveness, so reducing heatoutput is an important motiva-tion. Other applications such asunmanned airborne vehicles(UAVs) have limited electricalpower availability, so using every

Watt available effectively is of paramountimportance.

The VMETRO PMC-FPGA05A good example of a current COTS prod-uct implementing the advanced Virtex-565-nm technology is the VMETRO PMC-FPGA05, a general-purpose high-endFPGA PCI mezzanine card (PMC) pic-tured in Figures 2 and 3 and illustrated inthe block diagram shown in Figure 4.

Figure 3 – Standard PCI option

Figure 2 – VMETRO PMC-FPGA05


A Virtex-5 FPGA at the heart of a PMC-FPGA05 means that it is a highly integrat-ed design. There is no need for externalbridges and controllers because the Virtex-5device is more than capable of handlingthese functions directly (see Figure 5) whileusing only a small amount of resources.This leaves plenty of space in the PMC-FPGA05 to include IP such as digitalreceivers, FFTs, or other DSP functions.

Implementing Flash and PCI-X inter-faces inside the FPGA introduces somedesign challenges. How is the FPGA con-figured and debugged without these inter-faces already in place – especially if the IP iscomplex? This is where the ChipScope™Pro Analyzer (version 8.2) comes in– it embeds a logic analyzer insidethe FPGA and connects the userinterface to the FPGA throughJTAG. It provides a debugging por-tal into the FPGA that can beinserted through HDL entry.

Once built, a ChipScope ProILA (integrated logic analyzer)port assignment can be changed inFPGA Editor. Therefore, changesto the ILA do not require anotherfull run through the ISE™ flowwhen debugging, as shown inFigure 6. Instead, you can makechanges at the map stage of thedesign flow, reducing the time

between generating bitstreams and improv-ing efficiency during the debugging cycle.The ChipScope Pro Analyzer consumes alimited amount of the FPGA’s resources,though this is largely a function of thedepth of the analyzer sample memory.Features such as the analyzer memorydepth and triggering functions are parame-terizable through the ChipScope Proinserter tool, eliminating unnecessaryresource waste.

The ChipScope Pro Analyzer is a valu-able asset when debugging designs such asthe VMETRO PMC-FPGA05. The PCI-Xinterface in particular represents a chal-lenge. In simple terms, this might be to

power up the PCI bus, configure the IPonto the FPGA, and re-initialize the PCIbus to detect the FPGA’s PCI controller.This low-level development requires a toollike the ChipScope Pro Analyzer to reducethe number of bus initialization cycles.Without the ChipScope Pro tool, debug-ging would require a sophisticated bus ana-lyzer – and even then the internal FPGAfunctionality would be inaccessible.Utilizing the ChipScope Pro Analyzer savesthe design team considerable time byreducing the number of iterations requiredto debug the PCI-X interface.

The ChipScope Pro tool is not only use-ful for VMETRO board development, butalso for customers integrating their IP andexternal interfaces. We designed the PMC-FPGA05 with a high-density parallel inter-face that you can use with a range of I/Omodules including analog I/O, RS485,LVDS, FPDP, and Camera Link. Using theChipScope Pro Analyzer makes these inter-esting projects manageable.

The PMC-FPGA05 also offers applica-tion developers a platform that can sustainhigh-bandwidth data transfers and imple-ment sophisticated DSP and processingalgorithms at a fraction of the power of pre-vious generations.

FPGA ResourcesMore than any other FPGA family, theVirtex series leads the way in IP availability.Through the Xilinx® Alliance Program, IPis available from both Xilinx directly as wellas third-party IP suppliers. With a strongworldwide network of IP vendors accessibledirectly from www.xilinx.com, COTSFPGA board users can choose IP coresfrom suppliers experienced with as many asfive generations of the Virtex family. This iskey to success when developing projects toaggressive timescales, and therefore is ahigh-priority reason for selecting theVirtex-5 family.

FPGA Firmware Development ProcessWith the Xilinx ISE tool chain offering aneasy-to-use GUI development environmentand an included VHDL/Verilog synthesistool (XST), many end users need only pur-chase ISE software for a complete cost-


Virtex-5 FPGA(LX50 - LX110)

64MbytesDDR SDRAM

64MbytesDDR SDRAM

256MbitFLASH

138 signals

16-bit, 200MHz

4MbytesQDR II SRAM

4MbytesQDR II SRAM

4MbytesQDR II SRAM

18-bit, 200MHz18-bit

G P I/O (64-bit)

64-bit, 125MHz PCI-X

8 MBQDR II SRAM

128 MBQDR II SRAM

128 MBQDR II SRAM

256 MbitFLASH

8 MBQDR II SRAM

8 MBQDR II SRAM

GP I/O (64-bit)

64-bit, 133 MHz PCI-X

138 Signals

16-bit, 200 MHz

18-bit, 200 MHz

Virtex-5 FPGALX110

Figure 4 – PMC-FPGA05 block diagram

8 MBQDR II SRAM

128 MBQDR II SRAM

128 MBQDR II SRAM

256 MbitFLASH

8 MBQDR II SRAM

8 MBQDR II SRAM

PCI-X

Virtex-5 FPGA

SRAMController

SDRAMController

SDRAMController

SRAMController

SRAMController

Cu

stom

User I/O

PC

I-X

Co

ntr

olle

rF

lash

Co

ntr

olle

r

Figure 5 – PMC-FPGA05 controller IP components


effective development solution. Thoseimplementing the most complex algorithmsmay benefit from high-end synthesis toolsfrom third-party vendors. These integratedirectly with ISE software to maintain asimple project management process.

Simulation may not reveal all errors inthe design, particularly for complex proj-ects. When an implementation does notfunction as expected, you must connect upto the actual hardware to see what is goingon. But with dense packaging coveringmany thousands of pins, there’s no practicalway to connect a traditional logic analyzer.

VMETRO engineers have many yearsof experience developing firmware forXilinx FPGAs. Using ISE software as theprimary synthesis/place and route tool,Model Technology’s ModelSim PE for sim-ulation, and the ChipScope Pro tool for in-circuit debugging, VMETRO developedthe IP necessary for interfacing with theboard hardware. This includes interfacesfor the SRAM and SDRAM memorydevices, to which users simply connecttheir address and data signals, and a high-performance bus mastering PCI-X inter-face core supporting customizable registersand simple FIFO-based DMA transfers forstreaming data.

Another good reason to choose theVirtex-5 family is a commitment to educa-

tion and support: from introductory cours-es in VHDL to DSP logic design courses todevelopment laboratories equipped withthe latest gear for testing high-speed serialinterfaces, Xilinx offers the resources neces-sary for successful FPGA deployment.

As the Virtex-5 device is brand new, thePMC-FPGA05 is still in development. Staytuned for an update on the challenges,solutions, and lessons learned as VMETROworks hard to bring you the world’s firstVirtex-5 COTS product.

ConclusionTo meet the growing demands being placedon the COTS marketplace, you must adaptand implement platforms with the righttools for the application. Today, this meansintegrating high-speed serial communica-tion, fast access memory, and plenty ofoptimized logic space for advanced algo-rithm development. Equally important areefficient development and debug tools, IPresources, and a commitment to high-speed DSP development.

The highest logic density available, thelowest power consumption, and the bestperformance are what COTS developersneed to meet the needs of their customers.Virtex-5 FPGAs, with their ASMBL archi-tecture and 65-nm process, deliver onthese demands.


Basic Design Flow

HDL Design Entry

NGD

Mapping, Placementand Routing

NCD

Bit Gen

BIT

Xilinx FPGA

Chipscope ILA

Chipscope ILA

interate

JTAG

Figure 6 – ChipScope debugging cycle

GetPublished

Would you like to be published

in XcellPublications?

It's easier than you think!

Submit an article draft for our Web-based

or printed Xcell Publications and we will

assign an editor and a graphic artist

to work with you to make your work

look as good as possible.

For more information on this

exciting and highly rewarding program,

please contact:

Forrest Couch

Publisher, Xcell Publications

[email protected]



P r o c e s s i n g a n d F P G A - I n p u t / O u t p u t - D a t a R e c o r d i n g - B u s A n a l y z e r s

Stay Ahead!Stay Ahead!

with the PMC-FPGAO5 Range of Virtex-5 PMCs

PCI based development system also available

Xilinx Virtex-5 LX FPGAA new generation of performance

Analog I/O, CameraLink, LVDS, FPDP-II &RS422/485 optionsFast, integrated I/O without bottlenecks

Multiple banks of fastmemoryDSP & I/O optimized memoryarchitecture

PCI-X Interface withmultiple DMA controllersMore than 1GB/s bandwidth to host

Libraries and ExampleCodeEasy to use with head-start time-to-market

For more information, please visit virtex5.vmetro.com or call (281) 584 0728

January 06 - 10, 2007 Int’l Conference on VLSI Design & Int’l Conference on Embedded Systems Bangalore, India

January 08 - 11, 2007 International Consumer Electronics Show Las Vegas, NV

February 12 - 15, 2007 3GSM Spain

February 13 - 15, 2007 Embedded World Germany

February 21 - 23, 2007 IDGA Software Radio Summit Vienna, VA

April 03 - 05, 2007 Embedded Systems Conference - Silicon Valley San Jose, CA

April 16 - 19, 2007 NAB Las Vegas, NV

Xilinx Worldwide Events Xilinx participates in numerous trade shows and events throughout the year.

This is a perfect opportunity to meet our silicon and software experts, ask questions, see demonstrations of new products, and hear other customer success stories.

For more information and the current schedule, visit www.xilinx.com/events/.

by Delfin RodillasSenior Manager, Wired CommunicationsXilinx, [email protected]

The rate of adoption of serial technology inhigh-end system design has reached criticalmass. As shown in Figure 1, 92% of respon-dents in a recent EE Times survey answered“yes” when asked if they were designingserial I/O systems in 2006, compared to64% serial design activity in 2005.

A good portion of this dramatic adop-tion rate is caused by the penetration ofserial technology in backplane applications.As system throughput requirementsincrease, the parallel backplane technologiesof old will be displaced by SerDes-basedbackplane subsystems that provide higherbandwidth, better signal integrity, lowerEMI and power, and simpler PCB designs.

Further promoting this growth is theemergence of standard serial protocols suchas XAUI and Gigabit Ethernet (GbE),

which allow reduced engineering effortsand interoperability. Standardization effortsfor serial backplane form factors such asAdvancedTCA and MicroTCA in the PCIIndustrial Computer Manufacturers Group(PICMG) have also contributed to theaccelerated adoption. The benefits of serialbackplanes are so compelling that they havebeen used as the backbone of not only com-munications, compute, and storage systemsbut also broadcast, medical, defense, andindustrial/test systems.

Persistent Design ChallengesRegardless of the increased rate of adop-tion, many design challenges still exist.Because the backplane subsystem is theheart of the system, it must be able to passsignals from card to card reliably. Thus,designing backplanes with high signalintegrity (SI) is of primary importance.

Also significant is the use of propersilicon ICs with SerDes technology,capable of driving backplanes with very

low bit-error rates. Silicon-basedapproaches to mitigating SI issues areparticularly important in “legacyupgrade” scenarios, in which designersre-use older backplanes with legacy com-ponents and design rules.

There are also challenges in developingserial backplane protocols and fabric inter-faces. The majority of backplane designsleverage legacy ASICs, which have propri-etary protocols. Even some newer back-plane designs require a proprietarybackplane protocol. Silicon solutions musttherefore be flexible and provide the nec-essary customizability. Although an ASICallows this, it can often be costly and risky,with unproven product demand/volumeand the possibility of design bugs andspecification changes.

An approach that has recently gainedtraction is the use of off-the-shelf stan-dards-based switch fabrics. This savesdevelopment time, but you must have sili-con solutions that conform to the standard


Tackling Serial Backplane Interface Design ChallengesTackling Serial Backplane Interface Design ChallengesThe Virtex-5 LXT FPGA enables robust, high-performance, and high-integration serial backplane interface solutions.The Virtex-5 LXT FPGA enables robust, high-performance, and high-integration serial backplane interface solutions.


protocol, as well as the flexibility to cus-tomize the end product and make it unique.

And of course, there are the ever-pres-ent challenges of cost, power, and time tomarket. To meet the challenges of serialbackplane design, Xilinx provides theVirtex™-5 LXT platform of FPGAs aswell as IP solutions.

Xilinx Solutions for Serial BackplanesThe key technology that enables the applica-tion of Xilinx® Virtex-5 LXT FPGAs in seri-al backplane applications is the embeddedRocketIO™ GTP low-power serial trans-ceiver. There are as many as 24 serial tran-ceivers in the largest Virtex-5 LXT FPGA;each serial tranceiver is capable of runningfrom 100 Mbps to 3.2 Gbps. Coupled withprogrammable fabric, the FPGA is capable ofsupporting virtually any serial protocol –proprietary or standard – up to 3.2 Gbps.

More important for serial backplaneapplications are built-in signal condition-ing features, including transmit pre-emphasis and receive equalization. Thesefeatures enable transmission of multi-gigabit signals over long distances, oftenreaching 40 inches or longer. Both equal-ization methods minimize the impact ofinter-symbol interference (ISI) by boost-ing high-frequency signal componentsand attenuating low-frequency compo-nents. The difference is that pre-emphasisis performed on the transmitted signal as

these IP cores are tested through consortiaplug-fests and independent third-party verifi-cation. To facilitate the creation of light-weight serial protocol designs, Xilinx alsocreated the Aurora protocol, which is idealfor simpler designs requiring minimal over-head and optimized slice/resource utilization.

With increased usage of Ethernet andPCIe, Virtex-5 LXT FPGAs also includeembedded tri-mode Ethernet MACs andPCIe Endpoint blocks. These allow signifi-cant savings of FPGA slice resources for cus-tomers needing interfaces in control planeapplications, for example.

Because many chips with parallel inter-faces are still used even in newer systems,Xilinx also offers IP cores for popular parallelinterfaces such as SPI-4.2, SPI-3, and PCI.These allow you to rapidly create serial-to-parallel bridges, which are still required inmany applications.

Besides serial and parallel interface IP,Xilinx offers more complete IP solutionsthat further reduce development time andtime to market. These solutions include aTraffic Manager for prioritizing trafficflows across backplanes, as well as a MeshFabric Reference Design that allows “every-to-every” connectivity between cards.Lastly, the ChipScope™ Pro Serial I/OTool Kit enables rapid serial tranceiversetup and debugging as well as BERT test-ing. Table 1 summarizes the serial back-plane-related IP available from Xilinx.

Application ExamplesLet’s look at how you could integrate all ofthe solution components to create a com-plete serial backplane fabric interfaceFPGA for both a star and mesh system.

it goes out of the line driver, while equal-ization occurs on the received signal afterit enters the IC package. Both pre-empha-sis and equalization features are program-mable to different states to allow foroptimum signal compensation.

Besides signal conditioning features, theserial tranceivers also provide additional fea-tures beneficial for backplanes, such as pro-grammable output swings that allowinterfacing to a variety of other currentmode logic (CML)-based devices and built-in AC coupling capacitors that simplifytransmission line design and reduce ISI.

IP CoresProprietary protocols still make up mostserial backplane implementations. However,some newer designs have used standards-based protocols such as XAUI and GbE.This growing acceptance has been drivenprimarily by the maturity of these standardsand the emergence of switch fabric ASSPsutilizing these protocols. Using ASSPs forswitching applications saves tremendousdevelopment time, but designers realize thatthey need to differentiate their products byadding value-added capabilities, primarilyon the line card.

FPGAs are the ideal platform for provid-ing customizability, as the serial tranceiversare designed to support a majority of stan-dard serial backplane protocols. Together, theserial tranceivers and fabric allow for stan-dards-compliant designs with value-addedfunctions – all in a single silicon device.

To reduce design time, Xilinx offers off-the-shelf available IP cores for key serial I/Ointerface standards such as XAUI, GbE,SRIO, and PCIe. To ensure interoperability,

IP Category Available IP

Serial Interfaces XAUI, GbE, PCI Express, Serial RapidIO, Aurora, CPRI, OBSAI

Parallel Interfaces SPI-4.2, SPI-3, Utopia, PCI, CSIX

System-Level Solutions 10G Traffic Manager, Mesh Fabric Reference Design

Serial Backplane Test Solutions ChipScope Pro Serial I/O Tool Kit

0%

25%

50%

75%

100%

2005 2006

Source: EE Times Survey, 2005

64%

92%

Table 1 – Xilinx IP for serial backplanes

Figure 1 – Percentage of engineers designing Serial I/O systems




Star Backplane Topology ApplicationStar fabric topologies prevail in high-endinfrastructure equipment because of theircost-effectiveness, particularly in systemswith a high number of cards. Figure 2 is anexample of a 10 GbE line card that imple-ments an FPGA-based star fabric interface.This FPGA instantiates the XAUILogiCORE™ IP and uses four serial tran-ceivers to connect to the 16-channel XAUIswitch fabric card. A LogiCORE SPI-4.2core is also realized in the FPGA to interfaceto the 10 Gbps network processing unit.

Between the serial and parallel interfaceis the Traffic Manager IP, which performsQoS-related functions on ingress andegress traffic. A memory controller controls

the external memory, which is used prima-rily as packet buffers. The benefits of thisarchitecture include increased integrationof SerDes and logic functions and quicktime to market through the use of IP, whileallowing an implementation that meetsyour exact system specifications. It alsoprovides solid SI as well as low SerDespower consumption (~400 mW total). Youcan implement all of this in the lowest costspeed grade XC5VLX50T device.

Mesh Fabric ArchitecturesStar topologies prevail in most cases, but insome smaller systems, a mesh topology isrequired. Take the case of the five-slot IPDSL access multiplexer shown in Figure 3,

which requires full connectivity between four24-port VDSL line cards and a 10 GbE back-haul card that connects to a metro Ethernetnetwork. Each card uses a Virtex-5 LXTdevice and four embedded serial tranceiversto realize the four independent channels ofthe mesh fabric physical layer. Implementingthe four link layers is the Aurora protocol,which runs at approximately 3 Gbps to trans-port the 2.4 Gbps payload – plus additionaloverhead such as the encoding.

A SPI-4.2 and SPI-3 LogiCORE IP isused on the trunk card and line cards, respec-tively, providing connectivity to the networkprocessor. The Mesh Fabric ReferenceDesign and Traffic Manager solution providethe distributed switching and QoS functionsrequired on all of the line cards.

The line card fabric interface could easilyfit in an XC5VLX30T device, while the trunkcard fabric could fit in an XC5VLX50Tdevice. Similar to the star example, you canrealize significant benefits in integration,time-to-market reduction, feature optimiza-tion, and power and cost reduction by usingthe Virtex-5 LXT solution.

ConclusionSerial backplane technology is now main-stream; its adoption will only continue toincrease with the rapidly growing demandfor bandwidth. Evolution of backplane sys-tem requirements in terms of rates and pro-tocols is inevitable and designers will facenew challenges.

However, with Xilinx Virtex-5 LXTFPGAs and off-the-shelf-available IP forserial backplanes, system architects have anoption that can accommodate legacy as wellas newer backplane designs. Virtex-5 LXTFPGAs with embedded SerDes have thecritical SI-improving features and integra-tion required to provide high reliability andarea- and cost-optimized designs.

Furthermore, Xilinx off-the-shelf IPreduces development time and time to mar-ket. Together, the powerful silicon and IPcores are what make the Virtex-5 solutionthe ideal vehicle for tackling even the tough-est serial backplane design challenge.

For more information, visit www.xilinx.com/backplanes and www.xilinx.com/qos.

10 GbE Line Card

10 GbE Line Card

MemoryMemoryMemory

MemoryController

10G TrafficManager Solutions

SPI-4.2LogiCORE

IP

XAUILogiCORE

IP

h. 0

Single XAUI Channel @ 4 x 3.125 Gbps

MemoryMemoryMemoryCPU10 GbE Line Card

XFP PHY10 GbE

MAC

MGT

MGT

MGT

MGT

16-Channel

XAUI-Based

Switch FabricASSP or ASIC

Switch Fabric

G

Network

Processor

G

Ne

Chh.

Virtex-5 Star

Fabric I/F

24-Port VDSL Line Card




MemoryController

MeshFabric

+TM

SPI-3LogiCORE

IP

Aurora

Aurora

Aurora

Aurora

. 2

Ch. 3

MemoryMemoryMemory

MemoryMemoryMemoryMemoryMemoryMemory

CPU

CPU

10 GbE Trunk Card

PHY10G

NPU

10GbE

MAC

DSL

Chipset

MGT

MGT

MGT

MGT

Virtex-5Mesh

Fabric I/F

2.5G

NPU

2.5G

Ch. 0

Ch. 1

Ch.

C

h.

C

XFPFour

Individual

Aurora

Channels @

1 x 3.125 Gpbs

Virtex-5

Mesh

Fabric I/F

Figure 2 – Star fabric I/F FPGA in a 10 GbE line card

Figure 3 – Mesh fabric I/F FPGA in a VDSL line card


Where Great Ideas Go to Work

A career with Xilinx® puts you at the Leading edge of technology.The world leader in programmable systems, Xilinx solutions are

found in numerous applications including wireless, networking,

storage, automotive, aerospace, and much more.

Visit our website today, www.xilinx.com/hr/ncg/index.htm,

and let’s talk about putting your ideas to work.

The programmable Logic CompanySM

©2006 Xilinx Inc. All rights reserved. XILINX, the Xilinx logo, are other designated brands included herein are trademarks of Xilinx, Inc.

All other trademarks are the property of their respective owners.

by Sriram R. ChelluriSenior Manager, Storage and ServersXilinx, [email protected]

As the data center network infrastructuremigrates to 10 Gbps, moving data trafficto an Ethernet-based solution becomeseconomically viable without sacrificingperformance and latency. Hardware-based host interfaces like PCI Express andmulti-Gigabit Ethernet (GbE) supportopen up design possibilities for low-cost,high-performance products in the com-puter and data-processing markets. TheXilinx® Virtex™-5 family of FPGAs setsthe stage for designing system-on-chip(SoC) solutions with higher functionalityand low power.

The Virtex-5 architecture brings tomarket critical features that make SoCdesigns easy to implement for TCP andiSCSI offload engines:

• Built-in PCI Express (PCIe) block – An integrated standards-compliant PCIeendpoint for supporting one to eight

lanes, capable of providing as much as32 Gbps full-duplex performance

• Built-in Gigabit Ethernet MAC(GEMAC) – four hardcore GEMACsenable multi-port gigabit solutions,reducing total real estate requirementsfor SoC designs

• Real six-input LUT (6-LUT) technology– improves slice utilization and reducesrouting latency for high performance

• 36-Kb dual-port block RAM – higher memory density with error-correction circuitry enables supportfor reliable computational logicstructures and increased on-chipTCP sessions for simultaneous transmit and receive operations

• DSP48E slices – enable massively parallel computations for image pro-cessing and multimedia applications

Because the Virtex family is a program-mable platform, you can adapt your designsto changing standards and market require-ments. Leveraging the resources available in

the Virtex-5 family, you can design cost-effective TCP and iSCSI offload solutionsfor the server, storage, multi-protocolswitch, and wireless base station marketswith extended product life cycles.

TCP Offload Engine (TOE) OverviewCurrent TCP offload solutions rely on acomplete software stack or on special net-work interface cards (NICs) based onASICs for handling TCP/IP processing. Anall-software solution is acceptable for low-bandwidth applications, but high-perform-ance applications would consume all of theCPU resources, creating a system bottle-neck for critical applications.

ASIC-based solutions are primarilyfrom start-ups looking to capitalize onthe high-performance 10 Gbps market.These solutions are still expensive andprone to vendor lock-in with an uncer-tain financial future.

Xilinx and its third-party IP partners pro-vide fully standards-compliant TCP/iSCSIoffload solutions that you can implement asis or customize for functionality, size, speed,or the target application.

Enabling Multi-Port 1Gbps and 10 GbpsTCP/iSCSI Protocol Offload SolutionsEnabling Multi-Port 1Gbps and 10 GbpsTCP/iSCSI Protocol Offload SolutionsThe Virtex-5 LXT platform enables low-footprint, system-level, multi-port 1Gbps and 10 Gbps TOE solutions. The Virtex-5 LXT platform enables low-footprint, system-level, multi-port 1Gbps and 10 Gbps TOE solutions.



FPGA-Based TCP/iSCSI EngineWith standards-compliant built-inGEMACs, a PCIe core, and increasedblock RAM, the Virtex-5 LXT device is aprogrammable platform chip that systemarchitects can exploit for TCP and iSCSIprotocol processing without worryingabout serial connectivity issues on thenetwork or host interface side. Some ofthe protocol offload design challenges are:

• The number of TCP connections to support

• TCP reassembly/reorder

• IP fragmentation and reassembly

• Latency

• On-chip versus off-chip TCP session management

These issues can be mitigated with theunique features of Virtex-5 devices andavailable IP cores. With built-in GEMACand PCIe interfaces, you can implementdirect memory access solutions with min-imal FPGA resources, reducing memorytransfer latencies and enabling TCPreassembly without using temporarymemory. Virtex-5 FPGAs also feature a36-Kb dual-port block RAM, allowingyou to support twice as many TCP con-nections than previous generations. Withthe Xilinx LogiCORE™ high-speedmemory controller, you can use externalDDR2 memory to scale TCP sessionmanagement. Let’s look at the resourcesyou could save in an FPGA-based net-work interface solution.

1 Gbps and 10 Gbps NIC SolutionAn integrated multi-port 1 Gbps and 10 Gbps TCP offload NIC for IP storageand bladed servers enables companies toleverage network infrastructure for storagetraffic. Figure 1 shows a typical FPGA-based NIC design.

Depending on the IP cores used, thisdesign can take as many as 20,000 slices toimplement. The Virtex-5 LXT platformcan reduce resource utilization by 50%,enabling you to develop lower cost solu-tions without sacrificing performance.Besides hardware efficiency, system archi-

Virtex-5 LXT platform – with hardenedGEMACs and PCIe Endpoint blocks, largerblock RAMs, and 6-LUTs – uses fewerFPGA resources to implement complex solu-tions for the server, storage, multi-protocolswitch, and wireless base station markets.

To learn more about Virtex-5 LXTFPGAs, visit www.xilinx.com/virtex5. Tolearn more about protocol offload solu-tions, visit www.xilinx.com/esp/storage/.And to learn how Xilinx FPGAs can helpyou in other applications, visit www.xilinx.com/esp.

tects can also reduce NRE costs becausethey are not required to implement high-speed I/O interfaces for GbE and PCIe.Figure 2 shows a redesign of the TCPoffload NIC leveraging the built-inresources of the Virtex-5 family.

ConclusionWith standards-compliant TCP and iSCSIoffload IP cores from third-party vendorsimplemented on Xilinx FPGAs, you cannow design a drop-in or custom SoC at amuch lower total cost of development. The

GbE PHY GbE PHY GbE PHY GbE PHY

GEMAC GEMAC GEMAC GEMAC

Access Controller

TCP/iSCSIOffload Core

MemoryController

DMAEngine

PCIe Core

Back-End I/O Interface

Programmable Logic

DDR2RLDRAM 2QDRII SRAM

GbE PHY GbE PHY GbE PHY GbE PHY

GEMAC GEMAC GEMAC GEMAC

Access Controller

TCP/iSCSIOffload Core

MemoryController

DMAEngine

PCIe Core

Back-End I/O Interface

Programmable Logic

DDR2RLDRAM 2QDRII SRAM

Built-In Hardcore Logic

Figure 1 – Designing a TCP offload solution with traditional FPGAs

Figure 2 – Designing a TCP offload solution with Virtex-5 LXT FPGAs



by Mike NelsonSr. Staff System Architect, Storage and Servers, Vertical MarketsXilinx, [email protected]

Encryption is a computationally inten-sive function, which makes extremelyhigh-performance implementations aserious system design challenge. TheXilinx® Virtex™-5 LXT platformmeets this challenge with performance-optimized features ideal for 10 Gbpsand faster implementations of leading-edge encryption algorithms.

A world-class programmable fabricprovides superior logic performance.Integrated GTP serial transceivers, hardPCI Express (PCIe) Endpoint blocks,and highly flexible SelectIO™ technol-ogy enable tremendous I/O bandwidth.And 65-nm device densities provide afamily of devices appropriate to almostany system design need.


Implementing Encryption Algorithmswith the Virtex-5 LXT PlatformImplementing Encryption Algorithmswith the Virtex-5 LXT PlatformThe Virtex-5 LXT platform makes encryption product development easy.The Virtex-5 LXT platform makes encryption product development easy.


The Virtex-5 architecture features severaladvances that enable the very high-perform-ance logic necessary for high-bandwidthencryption applications:

• Real six-input LUT-based fabricmeans that you can map circuits intodenser structures with fewer levels oflogic, increasing device utilization andperformance

• Improved routing architecture increasesthe reach of low-latency logic intercon-nection, providing more flexibility tosynthesis tools and also increasingdevice utilization and performance

• 36-Kb dual-port block RAMs withintegrated ECC allow extremely high-performance on-chip memoryresources for creating FIFOs andcomputational logic structures

Combined, these resources enable verycost-effective 10 Gbps and faster implemen-tations of IPsec AES-CBC/AES-XCBC-MAC-96, 802.1ae MACSec, LRW-AES,AES-GCM, SHA-256/384/512, and many other cryptographic algorithms.Furthermore, as the world of cryptographycontinuously evolves with additional modesand algorithmic refinements to these algo-rithms, your design can evolve with it –because the Virtex-5 family is a programma-ble logic platform.

I/O Bandwidth and FlexibilityComputationally intensive core logicrequires high-bandwidth I/O. But thenature of that I/O will vary based on yoursystem architecture. Figure 1 shows twocommon architectures for implementingencryption processing.

Look-aside co-processing is an attractiveoption widely used in x86-based systemappliances. This model leverages the excel-lent value of the commodity x86 platformto implement the application frameworkand selectively “looks aside” to an opti-mized accelerator to achieve high perform-

Virtex-5 LXT platform FPGAs addressthese limitations by combining embeddedRocketIO™ GTP transceivers and a hard-ened PCI Express Endpoint block in everydevice. With the LXT platform, extremelyhigh-performance co-processor I/O is easyand efficient, as shown in Figure 2.

Virtex-5 LXT platform FPGAs are alsoideal for in-line applications, as illustratedin Figure 3. A key requirement for in-lineencryption applications is flexibility.They may require identical – or different– input and output ports, port aggregation,

ance for the target application. FPGAshave always been well suited for this role,but scaling to approach to very high per-formance has been problematic.

PCI and PCI-X solutions require mod-est soft logic but have limited performanceand must share what bandwidth they dohave. PCI Express can implement a veryhigh-performance non-blocking switchedfabric, but traditionally requires extensivesoft-logic resources to implement the con-troller, and possibly an external PHY forthe electrical connection.

CPU

Applications

Host I/F

EncryptionAccelerationAPI

SystemI/O

System Memory

Look Aside Co-Processor• Offloads computationally intense workload for another processor

In-Line Processor• Performs encryption as a flow-through function in the datapath

OS Driver

PacketReader

PacketWriter

Por

t I/F

PacketWriter

In LineEncrypt &

Decrypt PacketReader

Por

t I/F

Chipset

ApplicationAcceleration

132 MB – 1 Gbps SimplexShared, Parallel Bus

Previous FPGA Application Co-Processor Options

User Programmable Soft LogicHard PCI Express ControllerRocketIO Multi-Gigabit Transcievers

Virtex-5 LXT FPGA

Up to 2 Gbps DuplexSwitched Fabric, Serial



PCI/PCI-X Controller


PCI Express Controller


PCI Express Controller


PHY

As the world of cryptography continuously evolves with additional modes and algorithmic refinements, your design can evolve with it...

Figure 1 – Look-aside and in-line encryption processing

Figure 2 – I/O bandwidth and soft logic progression for FPGA co-processor options




or local subsystem memory. The Virtex-5LXT platform meets this challenge with awide range of capabilities:

• Gigabit Ethernet (GbE) – Each devicein the Virtex-5 LXT platform includesfour independent hardened GbEMACs, making multi-port Ethernet avery efficient I/O option. You can addadditional ports as necessary with100% form-, fit-, and function-equiva-lent soft LogiCORE™ IP.

• 10 Gigabit Ethernet – A Xilinx softLogicCORE function is available thatcan be connected to four RocketIOMGTs for a XAUI interface or to aSelectIO pinout for an XGMII interface.

• 10 Gbps Fibre Channel (FC) – AXAUI-like Fibre Channel standard usesfour RocketIO MGTs operating at3.1875 Gbps in parallel to create a10.2 Gbps FC channel.

• PCI Express – Available to interface toa variety of industry-standard PCIe-based port controllers.

• SPI-4.2 – Soft LogicCORE IP sup-ports this networking industry stan-dard for chip-to-chip connectivity overhigh-performance SelectIO technology.

• Memory – In addition to port I/Ostandards, Virtex-5 SelectIO technolo-gy also supports a wide range of mem-ory interface technologies including

DDR2, RLDRAM II, and QDR IISRAM. These capabilities enable vir-tually any local memory subsystemthat an in-line processing enginemight require.

These features allow you to create in-line solutions that will connect to the portsyou need with the integrated encryptiontechnology you want.

Conclusion The Virtex-5 LXT platform expands thecapabilities of the Virtex-5 FPGA architec-ture with the addition of RocketIO GTPtransceivers, plus hard PCI ExpressEndpoint and tri-mode Ethernet MACblocks. The result is a platform ideally suit-ed to support very high-performance look-aside and in-line encryption functions.

Other applications where LXT platformdevices will excel include high-performancepacket handling and deep content inspec-tion for networking; high-speed data mining for databases; time-critical compu-tational processing for industrial, scientific,and medical applications; and real-timeimage processing for aerospace/defense andvideo graphic applications.

To learn more about Virtex-5 LXT platform FPGAs, visit www.xilinx.com/virtex5. To learn more about Xilinx inencryption, visit www.xilinx.com/esp/security/data_security/index.htm. And to learn howXilinx FPGAs can help you in other applica-tions, visit www.xilinx.com/esp.

PacketReader

PacketWriter

PacketWriter

PacketReader

Por

t I/F

Por

t I/F

Mul

ti-P

orte

dM

emor

y C

ontro

ller

Encryption Decryption

n X GE10GE10G FCPCIeSPI-4.2Etc.

n X GE10GE10G FCPCIeSPI-4.2Etc.

DDR2RLDRAM IIQDRII SRAMEtc.

Figure 3 – Virtex-5 LXT device in-line encryption platform flexibility


SystemBIST™ enables FPGA Configuration that is less filling for your PCB area, less filling for your BOM budget and less filling for your prototypeschedule. All the things you want less of and more of the things you do want for your PCB – like embedded JTAG tests and CPLD reconfiguration.

Typical FPGA configuration devices blindly “throw bits” at your FPGAs at power-up. SystemBIST is different – so different it has three US patentsgranted and more pending. SystemBIST’s associated software tools enable you to develop a complex power-up FPGA strategy and validate it.Using an interactive GUI, you determine what SystemBIST does in the event of a failure, what to program into the FPGA when that daughterboardis missing, or which FPGA bitstreams should be locked from further updates. You can easily add PCB 1149.1/JTAG tests to lower your down-stream production costs and enable in-the-field self-test. Some capabilities:

� User defined FPGA configuration/CPLD re-configuration

� Run Anytime-Anywhere embedded JTAG tests

� Add new FPGA designs to your products in the field

� “Failsafe” configuration – in the field FPGA updates without risk

� Small memory footprint offers lowest cost per bit FPGA configuration

� Smaller PCB real-estate, lower parts cost compared to other methods

� Industry proven software tools enable you to get-it-right before you embed

� FLASH memory locking and fast re-programming

� New: At-speed DDR and RocketIO™ MGT tests for V4/V2

If your design team is using PROMS, CPLD & FLASH or CPU and in-house software to configure FPGAs please visit our website at http://www.intellitech.com/xcell.asp to learn more.

Copyright © 2006 Intellitech Corp. All rights reserved. SystemBIST™ is a trademark of Intellitech Corporation. RocketIO™ is a registered trademark of Xilinx Corporation.

by Frank TothEasyPath FPGAs and Configuration SolutionsXilinx, [email protected]

System designers are always makingtrade-offs between alternative require-ments. Considerations include time tomarket, ease of use, total own-ership cost, and system speed.

Every alternative offers dif-ferent total-cost-of-ownership(TCO) considerations thatyou should examine whendesigning a configuration sys-tem, including design time,prototyping, manufacturingand test costs, and the per-bitcosts of the configurationdevice. The trade-offs of allthese factors should enter intoyour decision of which configurationmethod to use.

Designers using Xilinx® Virtex™-5devices have many additional choicesfor configuring the new FPGA family,including new configuration modesbuilt right into the chips; support for32-bit-wide high-performance parallelSelectMAP, which offers the ultimate inspeed; and both BPI (byte parallel inter-face) and SPI (serial peripheral inter-face) using industry-standard SPI andparallel flash memory devices (seeFigure 1). The easy-to-use, full-featured,configuration-engineered PlatformFlash PROMs offer a pre-engineeredway to flexibly configure Virtex-5devices and manage multiple bitstreams.

Platform Flash PROMsDropping a Platform Flash PROM into adesign provides a seamless solution with alow TCO, including minimum board space,high configuration speed, a guaranteedsource of supply, and value-added featureslike bitstream de-compression, design revi-sion management, JTAG Boundary Scan for

test and configuration, and additional stor-age for boot and scratch-pad memory.

Platform Flash features on-board decom-pression that can result in as much as 50%more configuration data into the same over-all memory space. Design revisioning allowsyou to switch between memory blocks forvarious configurations: for instance, usingthe board and system in different geograph-ical regions or loading a diagnostic followedby a mission load of configuration. In addi-tion, unused Platform Flash memory can beallocated to boot code or scratch-pad mem-ory. Both of these features work without anyadditional glue logic or special software. TheBoundary Scan JTAG port enables configu-ration and includes the Platform Flash inoverall board tests.

SPI Flash PROMsVirtex-5 FPGAs support direct connectionto SPI PROMs using the industry-standardfour-wire SPI interface. Many systems cur-rently use SPI PROMs; you can now easilytake advantage of the on-board SPI PROMwithout any additional circuitry or soft-ware. Designers should think about design

trade-offs, including different features offered by SPI PROMmanufacturers and the slowerconfiguration speed compared toparallel SelectMAP, PlatformFlash, and BPI.

BPI Flash PROMsVirtex-5 devices include on-board circuitry to directly con-nect – without any additionalglue logic or software – toindustry-standard parallel flash

devices. The parallel flash interface canbe directly connected to the FPGA andthe memory shared by the system bus.

ConclusionVirtex-5 FPGAs offer you the widest varietyof configuration alternatives in the industry,including Platform Flash, 32-bit SelectMAP,and interfaces that directly connect to SPIand BPI PROM devices. You should under-stand these alternatives and make aninformed decision on which one meets yourneeds based on the trade-offs between speed,complexity, and features.

For more information, please see XilinxApplication Note 483, “Multiple-Boot withPlatform Flash PROMs,” at www.xilinx.com/bvdocs/appnotes/xapp483.pdf.


Virtex-5 Configuration Options Offer Designers a Choice Xilinx provides a host of flexible choices in configuration memory to help you make the best decision for your design.

Figure 1 – Virtex-5 configuration modes

by Gokul Krishnan, Ph.DEasyPath MarketingXilinx, [email protected]

Derek JohnsonAPD MarketingXilinx, [email protected]

With increasing competition in many dif-ferent market segments, many companiesmust drive down their product develop-ment costs while at the same time addingmore and more complexity and features. Inaddition, companies must react to fast-changing market requirements and height-ened time-to-market pressures.

Xilinx® FPGAs can help you face thesechallenges by continually innovating toprovide increasingly complex functions at alower cost per logic cell. The recently intro-duced Virtex™-5 FPGAs, in combinationwith Virtex-5 EasyPath™ FPGAs, are thelatest generation of 65-nm devices thatprovide higher performance, lower systemcost, and greater embedded functionalitythan ever before.

Virtex-5 EasyPath FPGAs are the indus-try’s only 65-nm customer-specific FPGAcost-reduction solution, providing the lowesttotal cost of ownership (TCO) when com-pared to other solutions. EasyPath FPGAs areidentical to standard Xilinx FPGA offeringsbut use patented testing techniques and cus-tomer-specific test patterns to significantlyimprove FPGA yields for designs that nolonger require the full programmability of astandard FPGA. You can reap the benefits ofthese improved yields in the form of lowercosts. EasyPath technology is available acrossmultiple platforms, different product fami-lies, and 28 different devices over a range ofgate and memory counts.


Introducing Virtex-5 EasyPath FPGAsIntroducing Virtex-5 EasyPath FPGAsThe world’s first 65-nm FPGA cost-reduction solution.The world’s first 65-nm FPGA cost-reduction solution.

Lowest TCO with Virtex-5 EasyPath FPGAsVirtex-5 EasyPath FPGAs devices are man-ufactured using a 65-nm process, whichintrinsically offers a cost advantage (seeTable 1). In addition to a low unit price,EasyPath FPGAs provide many other costadvantages, such as:

• Low NRE

• No re-qualificationrequired

• No engineeringresources required

• Shorter lead times(12-16 weeks)

With Virtex-5 EasyPathFPGAs, you can realize a30%-75% price reductionwhen moving to high vol-ume as compared to stan-dard FPGAs. EasyPathFPGAs are identical totheir standard FPGAcounterparts, effectivelyeliminating any conversionwork. This has two important implications.The first is that you pay less yet incur very lit-tle risk, because every single feature in a stan-dard FPGA is supported and will work in anEasyPath FPGA. Second, you do not need tore-qualify your boards or systems when youmove to EasyPath FPGAs. This saves valu-able engineering time and resources and pro-vides cost savings of $500K or more.

Unlike structured ASICs, where cus-tomers have to go through multiple reviewswith the vendor and spend many monthsof valuable engineering resources, EasyPathFPGAs demand almost no resources fromyou. Once you have finalized the designand handed off the relevant files to Xilinx,you can get to full production directly in 8-12 weeks. No intermediate prototyping isrequired because the design has alreadybeen finalized (prototyped) in a standardFPGA. The lead time to get to productionis at least three to four months less thanwith structured ASICs.

Alternatively, those of you in fast-mov-ing markets can postpone the designfreeze milestone by three to four monthsto better address dynamic market condi-

you must take care to reduce parasiticcapacitance issues when signals are beingtransmitted simultaneously on adjoiningmetal lines.

Another major problem with 65-nmdesign is the issue of power consumption.Although Virtex-5 FPGAs take advantageof triple-oxide technology to reduce leak-

age power consumption significant-ly, each ASIC design must factor

in power consumptionand use techniques

such as clock gat-ing and selectivetransistors tomitigate leak-age current. So

although Virtex-5 EasyPath FPGAs

retain the simplicitythey had at 130 nm and 90

nm, 65-nm ASICs offer uniquechallenges that could significantlyreduce the chance of first-time successwith a design.

ConclusionThe Virtex-5 family is one of the mostcost-effective, high-performance FPGAfamilies in the industry. With advancedfeatures such as a higher utilization logicfabric, more integrated block memory,higher precision DSP slices, as well asadvanced connection and embedded pro-cessing blocks, you can reduce your overallsystem cost by fitting into a smaller FPGAor replacing external discrete devices onyour boards. This complements the naturalcost reduction that comes from fabricatingdevices on a 65-nm process.

Virtex-5 FPGAs are a faster time-to-mar-ket alternative to ASICs and other customlogic solutions, and enable a lower total sys-tem cost. In addition, Virtex-5 FPGAs aredesigned for the lowest overall power con-sumption, highest signal integrity, and high-est performance. All of these attributes canlead to lower overall system cost: lowerpower consumption and high signal integri-ty can cut design and debugging costs, andhigh performance can save device costs byallowing the design to be done in a lower,less-expensive speed grade.

tions. Getting to market faster can have asignificant impact on the market share aproduct can capture. Studies have indicat-ed that just a three-month delay in time tomarket can reduce market share by asmuch as 15%, according to research fromInternational Business Strategies, Inc.

Difficulties with 65-nm ASICsOne of the industry trends in place forsome time now is that ASIC design startsare decreasing every year. Part of the reasonfor this is that FPGAs have been able toprovide lower unit costs for higher func-tionality. The other driver has been the ris-ing cost of mask sets, design, andverification. By some recent estimates, thecost of developing a new 90-nm ASICdesign is in excess of $10 million(International Business Strategies, Inc). Asignificant portion of this cost occurs in theverification phase, which has becomelonger and longer with increases in chipcomplexity. At 65 nm, operating voltagesand transistor sizes are so small that verysmall process variations can have a bigeffect on the functionality of a design.

What this means to you is that youmust now factor in DFM (design for man-ufacturability) rules during the physicaldesign phase – something previouslyassumed to have been embedded in thelibrary itself. Furthermore, because theinterconnects are very closely spaced, signalintegrity becomes even more important;


EasyPathTotal Cost of Ownership

Advantage

Only Cost-Reduction PathSupporting Complex IP

100% FPGA Feature Support100% Package Support

Table 1 – EasyPath total cost of ownership advantage

Total Cost Driver Structured ASICs Xilinx EasyPath

Time to Cost Reduction 20 to 24 Weeks Yes

NRE Costs $100K to $400K No

Cost of Requalification $100K to $500K Yes

Engineering Costs $250K to $300K Identical

Cost of Design Tools $100K to $200K Identical

Unit Costs Lowest Low

Cosy of Respin High High

Memory InterfacesXAPP851 – DDR SDRAM ControllerUsing Virtex-5 FPGA Devices

By Toshihiko Moriyama and Rich Chiu

This application note describes a 200-MHz DDR SDRAM memory controllerimplemented in a Virtex™-5 device.This reference design uses the Virtex-5ChipSync™ features to calibrate andadjust read data timing.

A straightforward back-end user inter-face is provided to allow integration into acomplete FPGA design.

On the Web at www.xilinx.com/bvdocs/appnotes/xapp851.pdf

XAPP852 – Synthesizable CIO DDRRLDRAM II Controller for Virtex-5FPGAs

By Benoit Payette and Rodrigo Angel

This application note describes how to usea Virtex-5 device to interface to commonI/O (CIO) double data rate (DDR)reduced latency DRAM (RLDRAM II)devices. The reference design targets twoCIO DDR RLDRAM II devices at aclock rate of 200/300 MHz, with datatransfers at 400/600 Mbps per pin.


XAPP853 – QDR II SRAM Interface for Virtex-5 Devices

By Lakshmi Gopalakrishnan

This application note describes the imple-mentation and timing details of a four-word-burst quad data rate (QDR II)SRAM interface for Virtex-5 devices. Thesynthesizable reference design leverages theunique I/O and clocking capabilities of theVirtex-5 family to achieve performance

levels of 300 MHz (600 Mbps), resultingin an aggregate throughput for each 36-bitmemory interface of 43.2 Gbps.

The design greatly simplifies the taskof read data capture within the FPGAwhile minimizing the number ofresources used. A straightforward userinterface is provided to allow simple inte-gration into a complete FPGA designutilizing one or more QDR II interfaces.


XAPP858 – High-Performance DDR2SDRAM Interface in Virtex-5 Devices

by Karthi Palanisamy and Maria George

This application note describes the con-troller and data capture technique forhigh-performance DDR2 SDRAM inter-faces. This data capture technique uses theinput serializer/deserializer (ISERDES)and output double data rate (ODDR) fea-tures available in every Virtex-5 I/O.


Source-Synchronous InterfacesXAPP855 – 16-Channel DDR LVDSInterface with Per-Channel Alignment

by Greg Burton

This application note describes a 16-channelsource-synchronous DDR LVDS interface.The design takes advantage of the Virtex-5I/O ChipSync feature’s ability to adjust thedelay of the receiver datapaths, creatingdynamic setup and hold timing for eachdevice at initialization and compensating forskews associated with the manufacturingprocess. The receiver operates at 1:8 deserial-ization on each of the 16 data channels.


XAPP860 – 16-Channel DDR LVDS Interface with Real-Time Window Monitoring

by Greg Burton

This application note describes a 16-channel source-synchronous DDRLVDS interface. The receiver operates at1:6 deserialization on each of the 16data channels. Similar to XAPP855, thedesign also includes a real-time windowmonitoring circuit for added perform-ance. This reference design calibratesand compensates for skews associatedwith process, voltage, and temperature(PVT) at initialization and also dynami-cally during operation.


Serial ConnectivityXAPP861 – Efficient 8x OversamplingAsynchronous Serial Data RecoveryUsing IDELAY

by John Snow

Virtex-5 devices a have a high-precisionprogrammable delay element (IDELAY)associated with every input pin. Thisapplication note shows how to imple-ment 8x oversampling of many datastreams using a single DCM, two globalclock resources, and minimal FPGAlogic resources. This solution providesbetter jitter tolerance than techniquesusing multiple DCMs. When pairedwith a suitable data recovery scheme,this oversampling technique can be usedwith many different data protocols up to550 Mbps. A reference design is includ-ed that implements a SD-SDI (SMPTE259M) receiver running at 270 Mbps.



Connectivity SolutionsRealize the full potential of the solutions in our silicon with Xilinx application notes.

A P P L I C A T I O N N O T E S

The IP Center includes intellectual propertyfrom Xilinx and its third-party vendors. Toaccess the IP solutions highlighted here, visitwww.xilinx.com/ipcenter and type the key-words listed in the search box.

Source-Synchronous InterfacesSPI-4 Phase 2 Interface Solutions (DO-DI-POSL4MC) Xilinx IP Core

The Xilinx® SPI-4 Phase 2 core provides afully compliant packet-over-SONET/SDH(POS) solution, which can be quickly inte-grated into networking systems.

Through user-configurable options, theXilinx SPI-4.2 core provides ultimate flexi-bility while seamlessly interoperating withindustry-leading ASSPs to maximize the datatransfer bandwidth. The Xilinx SPI-4.2 coreis fully compliant with the OIF’s SystemPacket Interface Level 4 (SPI-4) Phase 2standard, as well as the Saturn DevelopmentGroup’s POS-PHY Level 4 (PL4) interfacespecification.

Type search keywords: SPI-4 Phase 2

Serial ConnectivityVirtex-5 RocketIO GTP Wizard

The Virtex™-5 RocketIO™ GTP Wizardautomates the task of creating HDL wrap-pers to configure Virtex-5 RocketIO GTPtransceivers. The wizard’s customizationGUI allows you to configure one or moreGTP transceivers using pre-defined tem-plates to support popular industry stan-dards, or from scratch to support a widevariety of custom protocols.

Type search keywords: GTP Wizard

Virtex-5 Embedded Tri-Mode Ethernet MAC WrapperThe CORE Generator™ tool supportsthe Virtex-5 Tri-Mode Ethernet MediaAccess Controller (MAC) Wrapper toautomate the generation of HDL wrapperfiles for the tri-mode Ethernet MAC inVirtex-5 LXT devices. Preconfigured HDLwrappers, testbenches, and implement andsimulation scripts are generated automati-cally based on user-defined options.

Type search keywords: Virtex-5Ethernet MAC

Virtex-5 PCI Express Endpoint Block WrapperThe Xilinx PCI Express Endpoint blockwrapper integrates and interfaces to theon-chip PCI Express Endpoint block,supporting 1-lane, 2-lane, 4-lane, and 8-lane complete endpoint core implemen-tations. In addition, a PCI ExpressEndpoint block development kit is alsoavailable. This solution is used in com-

munication, multimedia, server, storage,and mobile platforms and enables applica-tions such as high-end medical imaging,graphics-intensive video games, DVDquality streaming video on the desktop,and 10 Gigabit Ethernet interface cards.

Type search keywords: PCI Express Block

Virtex-5 LXT PCI Express Block PlusLogiCORE Xilinx IP CoreThe Xilinx PCI Express Plus LogiCORE IPintegrates and interfaces to the PCI ExpressEndpoint block, supporting 1-lane, 4-lane,and 8-lane complete endpoint core imple-mentations. In addition, a PCI Express devel-opment kit is also available. This solution isused in communication, multimedia, server,storage, and mobile platforms and enablesapplications such as high-end medical imag-ing, graphics-intensive video games, DVDquality streaming video on the desktop, and10 Gigabit Ethernet interface cards.

Type search keywords: PCI Express Plus

Intellectual Property Offerings


The Xilinx IP Center on the Web allows you to search for IP by function, type, vendor, or keywords.

I N T E L L E C T U A L P R O P E R T Y

Nu Horizons Virtex-5 LXT Evaluation KitNu Horizons creates a low-cost evaluationkit for Virtex-5 LXT platform FPGAs.

Nu Horizons’s newest evaluation kit isdesigned for customers interested in evaluatingVirtex-5 LXT FPGAs. This kit differs fromother Nu Horizons kits in that it has the addedability for high-speed serial communication.

On the Web at www.nuhorizons.com

Xilinx Virtex-5 ML505 Evaluation and Development PlatformA low-cost embedded system andRocketIO GTP transceiver developmentplatform.

The Xilinx Virtex-5 ML505, based onRocketIO™ technology, is a feature-rich,low-cost evaluation/development platformthat provides easy and practical access to theresources available in the on-board Virtex-5LXT FPGA.

Supported by industry-standard inter-faces/connectors, generous memoryresources, and companion chipsets, theML505 evaluation platform is a versatiledevelopment platform for multiple applica-tions including embedded systems.

On the Web at www.xilinx.com/XOB


Virtex-5 Boards and KitsT H E B O A R D R O O M

Jumpstart your Virtex-5 designs with these development platforms and tool kits.

Avnet Virtex-5 LX Development KitA complete development platform fordesigning and verifying applications basedon the Xilinx Virtex-5 LX FPGA family.

Available with the Xilinx® Virtex™-5XC5VLX50-1FF676 device, the AvnetVirtex-5 Development Kit allows you toprototype high-performance designs withease, while providing expandability and cus-tomization through the EXP expansion slot.

The system board includes DDR2SDRAM, flash memory, a 10/100/1000Ethernet PHY, and a serial port, makingit an ideal platform for MicroBlaze™development. Other board featuresinclude a USB port, programmableLVDS clock, 10-bit Tx/Rx high-speedLVDS interface, user switches and LEDs,and a 2 x 16-character LCD panel.

The board also provides a full EXPexpansion slot, providing a total of 168high-speed, single-ended, and differentialuser I/O. You can easily add EXP modulesto the board for additional application-specific functions.

On the Web at www.avnet.com

Xilinx Virtex-5 ML501 Evaluation and Development PlatformAn ideal general-purpose, low-cost development platform.

The Xilinx Virtex-5 ML501 evaluationand development platform is a feature-rich, low-cost evaluation/developmentplatform that provides easy and practicalaccess to the resources available in theon-board Virtex-5 LX FPGA. Supportedby industry-standard interfaces and connectors, the ML501 is a versatiledevelopment platform for multiple appli-cations. Video, audio, and communica-tion ports as well as generous memoryresources extend the functionality andflexibility of the ML501 evaluation plat-form beyond a typical FPGA develop-ment platform.

On the Web at www.xilinx.com/ML501

HiTech Global Virtex-5 PCI Express Development PlatformSeamless serial interface connectivityenabled by the Virtex-5 LXT FPGA.

Powered by a Xilinx Virtex-5 LXT FPGA,supported by mainstream peripherals, anddesigned with excellent signal integrity per-formance, the HiTech Global HTG-V5PCIEis the ideal platform for serial interface/con-nectivity developments, including PCIExpress subsystems, Serial ATA (SATA), FibreChannel, RapidIO, and XAUI.

On the Web at www.hitechglobal.com

Xilinx Virtex-5 ML550 Networking Interfaces Tool KitDesigning networking, telecom, servers, andcomputing systems with Virtex-5 FPGAs.

Many of today’s telecom and networkingsystems use high-bandwidth interfaces basedon LVDS or other differential I/O standards.Differential I/O standards simplify systemdesign by improving system performanceand signal integrity.

Protocols based on source-synchronousI/Os such as SPI-4.2 and SFI-4 are central toleading-edge system design. To take advan-tage of these technologies, you have to workthrough multiple challenges to ensure deviceinteroperability and standards compliance.Xilinx provides the Virtex-5 network interfaceboard, as well as standards-compliant IP coresand free reference designs, to help you tacklethese high-speed, source-synchronous inter-face challenges. This allows you to focus onuser application design and not worry aboutinteroperability and standards compliance.


Xilinx Virtex-5 ML555 PCI ExpressDevelopment Tool KitA highly configurable pre-verified development solution.

The Xilinx ML555 RoHS-compliantPCIe/PCI-X/PCI development boardprovides a pre-verified solution to paralleland serial PCI interface design chal-lenges. Using an established developmentenvironment can dramatically shortenthe design cycle. By using proven Xilinxdedicated blocks, you can focus yourefforts on specific application develop-ment and avoid time-consuming PCIe orPCI development.


Xilinx Virtex-5 ML561 Advanced Memory Development SystemAchieve your performance targets in the shortest development time.

Building interfaces to high-performancememory devices presents challenges suchas high-speed synchronous data capture,along with implementing complex physi-cal-layer interfaces and control logic. TheML561 advanced memory developmentsystem offers an excellent platform todevelop and verify high-performancememory interfaces using Virtex-5 FPGAs.



T H E B O A R D R O O M

HAPS – HARDI ASIC Prototyping Systema high performance, high capacity FPGA platform for ASIC prototyping and emulation composed of multi-FPGA boards and standard or custom-made daughter boards

HapsTraka set of rules for pinout and mechanical characteristics, which guarantees compatibility with previous and futuregeneration HAPS motherboards and daughter boards

HARDI Electronics Inc., 26831 Magdalena Lane, Mission Viejo, CA 92691, (949) 202-5572

www.hardi.com, [email protected]

Virtex-II, Virtex-II Pro, Virtex-4, and Virtex-5 are registered trademarks of Xilinx Inc.

Hap

sTrak

2003co

mpat i b

le

Hap

sTrak

2004co

mpat i b

le

Hap

sTrak2005c

o

mpat i b

le

Hap

sTrak

2007co

mpat i b

le

HAPS-10

HAPS-30

HAPS-20

HAPS-40

Low-Power TransceiversUltimate Connectivity . . .

Reduce serial I/O power, cost and complexity with the world’sfirst 65nm FPGAs.

With a unique combination of up to 24 low-power transceivers,

and built-in PCIe™ and Ethernet MAC blocks, Virtex-5 LXT FPGAs

get your system running fast. Whether you are an expert or

just starting out, only Xilinx delivers this complete solution to

simplify high-speed serial design.

Lowest-power, most area-efficient serial I/O solutionRocketIO™ GTP transceivers deliver up to 3.2 Gbps connectivity

at less than 100 mW to help you beat your power budget. The

embedded PCI Express® endpoint block ensures easy implemen-

tation and reduced development time. Embedded Ethernet MAC

blocks enable a single-chip UNH-verified implementation. And

the Xilinx solution is fully supported by development tools,

design kits, IP, characterization reports, and more.

Visit our website today, view the Webcast, and order your free

eval CD to give your next design the ultimate in connectivity.


www.xilinx.com/virtex5

Pow

er (

Wat

ts)

User Logic

PCIe

User Logic

PCIeStatic Power

Virtex-5 LXT FPGAs (65nm)

Nearest Competitor

(90nm)

Virtex-5 LXT FPGAs (65nm)

Nearest Competitor

(90nm)

5VLX30T vs. 2SGX60D. Target Frequency = 200 MHz. Worst-case process. 25K LUTs, 17K Flip-Flops,1 Mbit On-Chip RAM, 64 DSP Blocks, 128 2.5V I/Os.Based on Xilinx tool v8.2 and competitor tool v6.0.1

3.09

6.22

Are

a (L

UTs

)

Power consumption and area required to implement a typical design including 8-lane PCIe endpoint

25,100

34,600

The Ultimate System Integration Platform


PN 0010999

Documents

Virtex-5 Special Edition - Xilinx · 2018-08-02 · W Welcome to this special edition of Xcell Journal, featuring a broad array of articles on Xilinx® Virtex™-5 FPGAs. In this